
Technically Impossible

Let's look at the weak link in your statement. Anything "Technically Impossible" basically means we haven't figured out how yet.

Llama2.c on PC with 8GB RAM, No GPU, No Container - pretokenization, training and inference


If performance is not a concern, generative models, whether GPT or diffusion, run on a machine with only 8GB RAM and without a GPU, be it a PC or an Android smartphone. This is what I found while exploring generative AI this spring*1. The catch is that this applies to inference only. In other words, it depended on using a small-scale model provided by someone else, and fine-tuning with one's own data was not realistic. Llama2.c may expand what a user can do.


Andrej Karpathy provides a simple, minimal inference engine based on the Llama 2 architecture. It is the product of a weekend project, and it supports not only inference but also pretokenization of data and training of a model. All of these work on a resource-limited machine with 8GB RAM.

Leaving practicality aside, this post traces the entire process, from pretokenization and model training to inference with the trained model, and outlines the steps.

Inference works easily on both Windows and Linux, but pretokenization and training on Windows are challenging for the following reasons. For Windows, this post covers only inference.

  • Rewriting shell scripts called from programs
  • Handling encoding for training data text
  • Building sentencepiece for Windows
  • Compiling the model for training (torch.compile is not compatible with Windows)


This post uses the following environments.

Windows        Windows 11 Pro 22H2
Visual Studio  2022 17.6.5
Linux          Clear Linux 39850
Python         3.11.4

This post assumes the following folder structure.

work folder                 ~/work
llama2.c folder             ~/work/llama2.c
Python virtual environment  ~/work/llama2.c/myenv
sentencepiece folder        ~/work/sentencepiece

Inference engine

Move to the work folder and clone llama2.c to the local environment.

cd ~/work
git clone https://github.com/karpathy/llama2.c.git

Download the trained model "stories15M.bin" to the llama2.c folder.


Visual Studio or the C++ build tools are required to build an EXE file. Their installation on Windows is covered in past posts.

Run the following commands from "Developer PowerShell".

.\run.exe .\stories15M.bin
.\run.exe .\stories15M.bin -i "In Japan, there is a tradition of dancing in circles during summer nights, "

This run was on a 6th gen Intel Core i7. Inference performance is about 300 tokens/sec. Although the stories are randomly generated, the provided prompt is reflected in them to some extent.


All tasks below were performed on Clear Linux*2. However, none of the configuration or operational tasks depend on a particular distribution, so they should work on any distribution.

As on Windows, build the executable and run inference. "runomp" is the build target for parallel processing.

make runomp
./run ./stories15M.bin
./run ./stories15M.bin -i "In Japan, there is a tradition of dancing in circles during summer nights, "

Python and sentencepiece

From this chapter on, all topics are for Linux only. This chapter creates an environment for running the Python scripts and installs sentencepiece*3 for data preprocessing. The sentencepiece executable and libraries are stored in the following folders.

executable  /usr/local/bin
libraries   /usr/local/lib64

Depending on the environment, it may be necessary to register the library path as an environment variable in "~/.config/environment.d/envvars.conf"*4.

# Python
cd ~/work/llama2.c
python -m venv myenv
source myenv/bin/activate
python -m pip install --upgrade pip
pip install -r requirements.txt

# sentencepiece
cd ~/work
git clone https://github.com/google/sentencepiece.git 
cd sentencepiece
mkdir build
cd build
cmake ..
make -j $(nproc)
sudo make install
sudo ldconfig -v

# Library path
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/lib64

Machine learning: preprocessing

The commands introduced under "training" generate the same "stories15M.bin" that was downloaded for inference. However, according to GitHub, this takes a full day with 4 GPUs in the cloud.
The commands under "custom tokenizer", in contrast, took 30 minutes on a 7th gen Intel Core i5.

cd ~/work/llama2.c
python tinystories.py download
python tinystories.py train_vocab --vocab_size=4096
python tinystories.py pretokenize --vocab_size=4096

The vocabulary size (vocab size) in the "training" commands is 32000, whereas in "custom tokenizer" it is 4096. While a larger vocabulary can enhance the adequacy of the model, it can also be excessive depending on the use case. Choosing a constrained vocabulary size that still provides adequate accuracy makes the model smaller and inference faster.
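To get a rough feel for how much the vocabulary size alone contributes to model size, the token embedding table can be estimated as vocab_size × dim parameters. The sketch below assumes an embedding dimension of 288 and 4-byte float32 weights. Both values are assumptions for illustration, not figures taken from this post.

```python
# Rough estimate of the token embedding table size for two vocabulary sizes.
# DIM and BYTES_PER_PARAM are assumed values for illustration only.
DIM = 288
BYTES_PER_PARAM = 4  # float32


def embedding_bytes(vocab_size: int, dim: int = DIM) -> int:
    """Return the size in bytes of a vocab_size x dim embedding table."""
    return vocab_size * dim * BYTES_PER_PARAM


for vocab_size in (32000, 4096):
    size_mb = embedding_bytes(vocab_size) / 1e6
    print(f"vocab_size={vocab_size}: {size_mb:.1f} MB")
```

Under these assumptions the embedding table shrinks by roughly a factor of eight, which is the kind of saving a smaller vocabulary is meant to deliver.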

The command "python tinystories.py download" downloads "TinyStories_all_data.tar.gz". The archive can also be downloaded directly from the URL specified in "tinystories.py". In that case, all JSON files in it must be saved manually to the following folder.


Machine learning: generating a model


Before generating a model, prepare the machine learning environment. First, configure swap. In an environment with 8GB RAM, there is a risk that the process is killed due to insufficient RAM. To avoid this, create an 8GB swap file.

sudo dd if=/dev/zero of=~/work/llama2.c/swap.img bs=1M count=8096

sudo chown 0:0 ~/work/llama2.c/swap.img
sudo chmod 600 ~/work/llama2.c/swap.img

sudo mkswap ~/work/llama2.c/swap.img
sudo swapon ~/work/llama2.c/swap.img
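Before starting a long run, it can be worth confirming that the swap is actually active. The following is a minimal sketch that reads /proc/meminfo, which Linux provides. The helper name parse_meminfo is hypothetical and chosen for this example.

```python
from pathlib import Path


def parse_meminfo(text: str) -> dict[str, int]:
    """Parse /proc/meminfo-style text into {field: number} (kB for most fields)."""
    values = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        parts = rest.split()
        if parts and parts[0].isdigit():
            values[key.strip()] = int(parts[0])
    return values


meminfo = Path("/proc/meminfo")
if meminfo.exists():  # only present on Linux
    info = parse_meminfo(meminfo.read_text())
    print(f"SwapTotal: {info.get('SwapTotal', 0) / 1024:.0f} MiB")
    print(f"SwapFree:  {info.get('SwapFree', 0) / 1024:.0f} MiB")
```

If SwapTotal is reported as 0, the swap file has not been enabled.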


The "train.py" script responsible for the machine learning process comes with numerous parameters. Adjusting them carries the risk of operational issues and unexpected errors. In this post, practicality is sacrificed to keep the loop count and processing time realistic, and the parameters are edited as follows.

parameter      default  this post
batch_size     128      8
compile        True     False
device         cuda     cpu
dropout        0.0      0.1
eval_interval  2000     20
learning_rate  5e-4     4e-4
max_iters      100000   1000
warmup_iters   1000     10

With "compile = True", PyTorch compiles the model so that it runs faster. However, torch.compile is not compatible with Python 3.11, so it is set to False here as a workaround.

With "device = 'cpu'", the model can be generated in an environment without a GPU.

"eval_interval", "max_iters", and "warmup_iters" control the loop for model generation. Specifically, they relate to the following loop.

while True:
    # ...
    # evaluate the loss on train/val sets and write checkpoints
    if iter_num % eval_interval == 0 and master_process:
        # ...
        if losses["val"] < best_val_loss or always_save_checkpoint:
            # ...
            if iter_num > 0:
                # ...
                model_export(raw_model, os.path.join(out_dir, "model.bin"), version=0)

The 7th gen Intel Core i5 takes 4.5 seconds per iteration. At that rate, the default "max_iters=100000" takes about 5 days. Setting "max_iters=1000" limits the run to about 1.5 hours.
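The estimate can be reproduced with simple arithmetic. This sketch only multiplies the quoted 4.5 seconds per iteration by the iteration count. It ignores evaluation passes and checkpoint writes, so it is a lower bound rather than a prediction.

```python
SEC_PER_ITER = 4.5  # per-iteration time quoted above


def training_hours(max_iters: int, sec_per_iter: float = SEC_PER_ITER) -> float:
    """Return the pure iteration time in hours, excluding evaluation overhead."""
    return max_iters * sec_per_iter / 3600


for max_iters in (100000, 1000):
    hours = training_hours(max_iters)
    print(f"max_iters={max_iters}: {hours:.2f} hours ({hours / 24:.2f} days)")
```

The iteration time alone comes to 1.25 hours for 1000 iterations. The figure of roughly 1.5 hours presumably includes evaluation and checkpoint output, which this estimate leaves out.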

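If this change were combined with "eval_interval=2000", the nested if statements would be skipped entirely: within 1000 iterations, the outer condition holds only at iteration 0, where "iter_num > 0" is false, so no model is written. To enter the nested if statements every 20 iterations, "eval_interval=20" is set.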
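The effect of "eval_interval" on model output can be checked with a short simulation of the loop conditions in the excerpt above. The sketch assumes that iter_num runs from 0 to max_iters inclusive and that the validation-loss condition is met at every evaluation. The function name checkpoint_iters is hypothetical.

```python
def checkpoint_iters(max_iters: int, eval_interval: int) -> list[int]:
    """Iterations at which the innermost 'if iter_num > 0' branch is reached.

    Assumes iter_num runs from 0 to max_iters inclusive and that the
    validation-loss condition holds at every evaluation.
    """
    return [
        iter_num
        for iter_num in range(max_iters + 1)
        if iter_num % eval_interval == 0 and iter_num > 0
    ]


print(len(checkpoint_iters(1000, 2000)))  # no model is ever written
print(len(checkpoint_iters(1000, 20)))    # a model is written repeatedly
print(checkpoint_iters(1000, 20)[:3])
```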

Furthermore, "warmup_iters=10" is set to avoid division by zero in the following calculation. The denominator becomes zero when "lr_decay_iters" equals "warmup_iters".

decay_ratio = (it - warmup_iters) / (lr_decay_iters - warmup_iters)
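The division by zero can be reproduced with a small learning-rate schedule built around the line above. The sketch assumes a linear warmup followed by cosine decay, with lr_decay_iters equal to 1000 and a minimum learning rate of 0.0. The function body is an illustrative reconstruction, not a verbatim copy of train.py, and try_lr is a hypothetical helper.

```python
import math


def get_lr(it, learning_rate, min_lr, warmup_iters, lr_decay_iters):
    """Warmup plus cosine decay schedule (assumed form, for illustration)."""
    if it < warmup_iters:  # linear warmup
        return learning_rate * it / warmup_iters
    if it > lr_decay_iters:  # past the decay window
        return min_lr
    decay_ratio = (it - warmup_iters) / (lr_decay_iters - warmup_iters)
    coeff = 0.5 * (1.0 + math.cos(math.pi * decay_ratio))
    return min_lr + coeff * (learning_rate - min_lr)


def try_lr(warmup_iters, it=1000, lr_decay_iters=1000):
    """Return the learning rate at iteration `it`, or None on division by zero."""
    try:
        return get_lr(it, 4e-4, 0.0, warmup_iters, lr_decay_iters)
    except ZeroDivisionError:
        return None


print(try_lr(warmup_iters=1000))  # default warmup with a 1000-iteration run
print(try_lr(warmup_iters=10))    # value used in this post
```

With the default warmup of 1000 iterations, the last iteration of a 1000-iteration run falls through both early returns and hits a zero denominator. With a warmup of 10, the same iteration evaluates normally.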

Other parameters affect the characteristics of machine learning. As mentioned in the following quote, adjustments should be made by referring to the tables provided in the paper.

Look at the table at the very end of the Chinchilla paper to get a sense of how the Transformer parameters (dim, n_layers, n_heads) grow or shrink together.

📁the table at the very end of the Chinchilla paper


Execute the following command. It took 90 minutes on the 7th gen Intel Core i5. Although htop shows that 8GB RAM leaves sufficient headroom, the process has been "killed" before, as mentioned earlier, which indicates that RAM becomes constrained at some point in the process.

python train.py --vocab_source=custom --vocab_size=4096

Although this post does not do so, "train.py" can resume training from a previous checkpoint. For instance, even with a loop of 100000 iterations spanning over 5 days, it is possible to interrupt the process after a checkpoint is written and resume from the iteration following the latest checkpoint. To do this, edit the parameter as follows. After resuming, the existing checkpoint and model files are reused and continue to be updated.

parameter  start    continue
init_from  scratch  resume

Once the process is complete, the model and checkpoints will be generated in the folder "~/work/llama2.c/out".

The newly generated model is 28.7MB, in contrast to 57.9MB for "stories15M.bin". Using this model improves performance to over 400 tokens per second. However, the quality of the English sentences has regressed, and the given prompt is not reflected effectively.

All of these aspects are contingent on the number of training iterations during model generation.

python tokenizer.py --tokenizer-model=data/tok4096.model
./run out/model.bin -z data/tok4096.bin

Aside: Training with custom data and JSON file

"stories15M.bin" is a model trained on 140MB of short story-like text, consisting of 50 files. What if this text were customer support dialogues*5 or, for instance, the "Meditations" of Marcus Aurelius, one of the Five Good Emperors of Rome, a collection of his questions and answers to himself? Even without aiming at a general-purpose conversational AI service, the generated model might have practical applications.

Just because ChatGPT and other generative AI services are interactive does not mean that all services using language models need to be so, or that they need to be as generic.

The JSON files included in "TinyStories_all_data.tar.gz" consist of single-line text. Each record has the following structure.

{
  "story": "",
  "instruction": {
    "prompt": "",
    "words": [],
    "features": []
  },
  "summary": "",
  "source": ""
}

📁original data

{
  "story": "Once upon a time, in a small house, there lived a boy named Will. Will was a selfish boy who never shared his toys with other kids. One day, his mom told him, \"Will, you need to share your toys with your friends.\"\nWill didn't want to share, but he heard his mom and thought about it. The next day, he went to the park to play. He saw a girl named Sue who was sad because she had no toys to play with. Will thought of what his mom said and decided to share his toys with Sue.\nSue was very happy and said, \"Thank you, Will! You are a good friend.\" Will felt happy too, and from that day on, he was not selfish anymore. He always shared his toys with his friends and they all played happily together.",
  "instruction": {
    "prompt:": "Write a short story (3-5 paragraphs) which only uses very simple words that a 3 year old child would understand. The story should use the verb \"hear\", the noun \"will\" and the adjective \"selfish\". The story has the following features: the story should contain at least one dialogue. Remember to only use simple words!",
    "words": ["hear", "will", "selfish"],
    "features": ["Dialogue"]
  },
  "summary": "A selfish boy named Will learns to share his toys after his mom tells him to do so, and he makes a new friend at the park by sharing his toys with her.",
  "source": "GPT-4"
}

In other words, by training on custom data with this structure, users can generate their own models. Within the scope covered in this post, training a model is possible even in an environment with 8GB RAM.
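As a starting point, custom text can be wrapped into records of the same shape. The sketch below assumes that each data file is a single-line JSON array of such records and that only the "story" field needs real content. Neither assumption is confirmed in this post, and the helper names make_record and build_shard are hypothetical.

```python
import json


def make_record(story: str, source: str = "custom") -> dict:
    """Wrap one passage in the record structure shown above."""
    return {
        "story": story,
        # key spelled "prompt:" as in the sample record above
        "instruction": {"prompt:": "", "words": [], "features": []},
        "summary": "",
        "source": source,
    }


def build_shard(passages: list[str]) -> str:
    """Serialize non-empty passages as a single-line JSON array of records."""
    records = [make_record(p.strip()) for p in passages if p.strip()]
    return json.dumps(records, ensure_ascii=False)


passages = [
    "First passage of custom text.",
    "Second passage of custom text.",
]
shard = build_shard(passages)
print(shard[:80])
```

The resulting string could then be written to a JSON file placed alongside the original data files, so that the pretokenization step picks it up in the same way.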