Technically Impossible

Let's look at the weak link in your statement. Anything "technically impossible" basically means we haven't figured out how to do it yet.

GPT on Linux PC with 8GB RAM, No GPU, No Container, and No Python


The prediction performance shown in this post is not practical: a single token per minute. The next post covers another case whose performance is better thanks to a smaller language model. If you are interested, please refer to it.


One of the popular topics in "AI" is GPT (Generative Pre-trained Transformer). Although it usually requires rich computational resources, running it on a minimal environment such as a Raspberry Pi is now technically possible*1. And regardless of whether it is practical, it can run on a consumer PC as well. "ggerganov/ggml"*2, especially the "gpt-j" example there, made it possible to run GPT on a PC with 16GB of RAM.

With 8GB of swap, it can even run on a Linux PC with 8GB of RAM. This post introduces the how-to and the resulting performance. The specification of the test machine is as follows.

OS: Clear Linux 38590*3
CPU: Intel Core i5-8250U
Storage: SSD 256GB
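
The number of threads to assign later with the "-t" option depends on the CPU. As a quick check (a sketch, assuming a Linux machine), the logical CPU count can be read like this:

```shell
# Quick check of the CPU (assumes Linux; an i5-8250U reports 8 logical CPUs)
nproc                                        # number of logical CPUs
grep -m1 "model name" /proc/cpuinfo || true  # CPU model string (x86 naming)
```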

FYI, prediction performance on this machine is around a single token per minute, which is much slower even than the Raspberry Pi case mentioned above.

Build and installation of ggml

In an environment where cmake and make are already available, as is typical on Linux, building and installing ggml is quite simple. As described on the GitHub page, the required commands are for

  1. downloading (cloning) the source code
  2. creating a build directory if required
  3. running cmake and make

To download "ggml" under the home directory and build only gpt-j there, the commands will be

cd ~
git clone https://github.com/ggerganov/ggml
cd ggml
mkdir build
cd build
cmake ..
make -j4 gpt-j

Then, the "gpt-j" binary is built under the directory "~/ggml/build/bin/".
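
As a quick sanity check before moving on (a sketch; the path follows the build steps above), confirm the binary actually exists:

```shell
# Verify the build produced the gpt-j binary (path assumed from the steps above)
BIN="$HOME/ggml/build/bin/gpt-j"
if [ -x "$BIN" ]; then
  echo "gpt-j binary: ok"
else
  echo "gpt-j binary: missing - re-run cmake and make"
fi
```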

Download GPT-J 6B model

A GPT model is required to run "gpt-j". Although the GitHub page says to run a download script, this post takes a different way. The size of this model is around 12GB, so I think it is better to download it directly in a familiar way. Download "ggml-model-gpt-jt-6B.bin" from Hugging Face below, and save it under "~/ggml/build/bin/".
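
After the download finishes, a quick size check helps catch a truncated file. This is a sketch: the filename and path follow the steps above, and the ~12GB figure is the size stated in this post.

```shell
# Check the downloaded model file (path and name from the steps above)
MODEL="$HOME/ggml/build/bin/ggml-model-gpt-jt-6B.bin"
if [ -f "$MODEL" ]; then
  size_gb=$(( $(stat -c %s "$MODEL") / 1024 / 1024 / 1024 ))
  echo "model size: ${size_gb} GB"   # expect around 12 GB
else
  echo "model file not found - download it first"
fi
```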

Create swap

As described on the GitHub page, "gpt-j" requires 16GB of RAM.

No video card required. You just need to have 16 GB of RAM.

ggml/examples/gpt-j at master · ggerganov/ggml · GitHub

This is because it loads the whole GPT model into memory. If the required amount of memory cannot be acquired, a segmentation fault occurs. To avoid this, create swap with the commands below. In this case, the swap file "swap.img" is created under "~/ggml/build/bin/".

sudo dd if=/dev/zero of=~/ggml/build/bin/swap.img bs=1M count=8192

sudo chown 0:0 ~/ggml/build/bin/swap.img
sudo chmod 600 ~/ggml/build/bin/swap.img

sudo mkswap ~/ggml/build/bin/swap.img
sudo swapon ~/ggml/build/bin/swap.img
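
To confirm the swap actually took effect (a sketch assuming Linux's /proc/meminfo), the combined RAM plus swap should now be close to the 16GB requirement:

```shell
# Report RAM and swap totals from /proc/meminfo (values there are in kB)
awk '/^MemTotal/  {printf "ram:  %.1f GB\n", $2/1024/1024}' /proc/meminfo
awk '/^SwapTotal/ {printf "swap: %.1f GB\n", $2/1024/1024}' /proc/meminfo
# gpt-j needs roughly 16 GB in total to hold the ~12 GB model plus buffers
```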

Run "gpt-j"

Now run GPT. The options of "gpt-j" are defined in "examples/utils.cpp". The major ones are

-m  path to the model
-n  number of tokens to predict (roughly, the number of words in the response)
-p  prompt
-t  number of threads assigned for processing

In this post, the model and "gpt-j" are saved in the same folder. A typical command will then be

cd ~/ggml/build/bin
./gpt-j -m ggml-model-gpt-j-6B.bin -p "hello"

To measure and compare processing performance, run it with different options:

./gpt-j -n 5 -t 4 -m ggml-model-gpt-j-6B.bin -p "hello"
./gpt-j -n 5 -t 8 -m ggml-model-gpt-j-6B.bin -p "hello"
./gpt-j -n 10 -t 4 -m ggml-model-gpt-j-6B.bin -p "hello"

The results show that more threads reduce the predict time per token, but predicting more tokens has no impact on it. The test machine predicts a single token per around a minute.

And running the next command with default options took around 4 hours to finish.

./gpt-j -m ggml-model-gpt-j-6B.bin -p "hello"
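
That 4-hour figure is consistent with the per-token rate. Here is a back-of-the-envelope estimate; the 200-token default prediction length is an assumption, not confirmed by the post (check "examples/utils.cpp" for the actual default):

```shell
# Rough runtime estimate: tokens * seconds-per-token
tokens=200          # assumed default prediction length
sec_per_token=60    # observed above: about one token per minute
total_min=$(( tokens * sec_per_token / 60 ))
echo "estimated runtime: ${total_min} minutes (~$(( total_min / 60 )) hours)"
```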


Google Colab

Although "gpt-j" can be built and run on Google Colab*4 with the commands below, the process is forcibly terminated within a few minutes.

!git clone https://github.com/ggerganov/ggml
%cd /content/ggml
%mkdir build
%cd build
!cmake ..
!make -j4 gpt-j

!../examples/gpt-j/ 6B
!./bin/gpt-j -m models/gpt-j-6B/ggml-model.bin -p "who are you?"
!./bin/gpt-j -m models/gpt-j-6B/ggml-model.bin -p "Do you know what day today is?"