Technically Impossible

Let's look at the weak link in your statement. Anything "technically impossible" basically means we haven't figured out how to do it yet.

GPT on Linux PC with 8GB RAM, No GPU, No Container, and No Python


The prediction performance shown in this post is not practical: a single token per minute. The next post covers another case whose performance is better thanks to a smaller language model. If you are interested, please refer to it.


One of the popular topics in "AI" is GPT (Generative Pre-trained Transformer). Although it usually requires rich computational resources, running it on a minimal environment such as a Raspberry Pi is now technically possible*1. And regardless of whether it is practical, it can run on a consumer PC as well. "ggerganov/ggml"*2, especially the "gpt-j" example there, made it possible to run GPT on a PC with 16GB of RAM.

With 8GB of swap, it can even run on a Linux PC with 8GB of RAM. This post introduces the how-to and the resulting performance. The specification of the test machine is as follows.

OS: Clear Linux 38590*3
CPU: Intel Core i5-8250U
Storage: SSD 256GB
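
The number of threads to assign later with the "-t" option depends on the CPU. As a quick check (a sketch, assuming a Linux machine), the logical CPU count can be read like this:

```shell
# Quick check of the CPU (assumes Linux; an i5-8250U reports 8 logical CPUs)
nproc                                        # number of logical CPUs
grep -m1 "model name" /proc/cpuinfo || true  # CPU model string (x86 naming)
```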

FYI, prediction performance on this machine is around a single token per minute, which is much slower even than the Raspberry Pi case mentioned above.

Build and installation of ggml

In an environment where cmake and make are already available, as is typical on Linux, building and installing ggml is quite simple. As described on the GitHub page, the required commands are for

  1. downloading (cloning) the source code
  2. creating a build directory if required
  3. running cmake and make

To download "ggml" under the home directory and build only gpt-j there, the commands will be

cd ~
git clone https://github.com/ggerganov/ggml
cd ggml
mkdir build
cd build
cmake ..
make -j4 gpt-j

Then, the "gpt-j" binary is built under the directory "~/ggml/build/bin/".
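
As a quick sanity check before moving on (a sketch; the path follows the build steps above), confirm the binary actually exists:

```shell
# Verify the build produced the gpt-j binary (path assumed from the steps above)
BIN="$HOME/ggml/build/bin/gpt-j"
if [ -x "$BIN" ]; then
  echo "gpt-j binary: ok"
else
  echo "gpt-j binary: missing - re-run cmake and make"
fi
```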

Download GPT-J 6B model

A GPT model is required to run "gpt-j". Although the GitHub page says to run a download script, this post takes a different way. The size of this model is around 12GB, so I think it is better to download it directly in a familiar way. Download "ggml-model-gpt-jt-6B.bin" from Hugging Face below, and save it under "~/ggml/build/bin/".
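
After the download finishes, a quick size check helps catch a truncated file. This is a sketch: the filename and path follow the steps above, and the ~12GB figure is the size stated in this post.

```shell
# Check the downloaded model file (path and name from the steps above)
MODEL="$HOME/ggml/build/bin/ggml-model-gpt-jt-6B.bin"
if [ -f "$MODEL" ]; then
  size_gb=$(( $(stat -c %s "$MODEL") / 1024 / 1024 / 1024 ))
  echo "model size: ${size_gb} GB"   # expect around 12 GB
else
  echo "model file not found - download it first"
fi
```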

Create swap

As described on the GitHub page, "gpt-j" requires 16GB of RAM.

No video card required. You just need to have 16 GB of RAM.

ggml/examples/gpt-j at master · ggerganov/ggml · GitHub

This is because it loads the whole GPT model into memory. If the required amount of memory cannot be acquired, a segmentation fault occurs. To avoid this, create swap with the commands below. In this case, the swap file "swap.img" is created under "~/ggml/build/bin/".

sudo dd if=/dev/zero of=~/ggml/build/bin/swap.img bs=1M count=8192

sudo chown 0:0 ~/ggml/build/bin/swap.img
sudo chmod 600 ~/ggml/build/bin/swap.img

sudo mkswap ~/ggml/build/bin/swap.img
sudo swapon ~/ggml/build/bin/swap.img
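
To confirm the swap actually took effect (a sketch assuming Linux's /proc/meminfo), the combined RAM plus swap should now be close to the 16GB requirement:

```shell
# Report RAM and swap totals from /proc/meminfo (values there are in kB)
awk '/^MemTotal/  {printf "ram:  %.1f GB\n", $2/1024/1024}' /proc/meminfo
awk '/^SwapTotal/ {printf "swap: %.1f GB\n", $2/1024/1024}' /proc/meminfo
# gpt-j needs roughly 16 GB in total to hold the ~12 GB model plus buffers
```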

Run "gpt-j"

Now run GPT. The options of "gpt-j" are defined in "examples/utils.cpp". The major ones are

-m  path to the model
-n  number of tokens to predict (roughly, the number of words in the response)
-p  prompt
-t  number of threads assigned for processing

In this post, the model and "gpt-j" are saved in the same folder. A typical command will then be

cd ~/ggml/build/bin
./gpt-j -m ggml-model-gpt-j-6B.bin -p "hello"

To measure and compare processing performance, run it with different options:

./gpt-j -n 5 -t 4 -m ggml-model-gpt-j-6B.bin -p "hello"
./gpt-j -n 5 -t 8 -m ggml-model-gpt-j-6B.bin -p "hello"
./gpt-j -n 10 -t 4 -m ggml-model-gpt-j-6B.bin -p "hello"

The results show that more threads reduce the predict time per token, but predicting more tokens has no impact on it. The test machine predicts a single token per around a minute.

And running the next command with default options took around 4 hours to finish.

./gpt-j -m ggml-model-gpt-j-6B.bin -p "hello"
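
That 4-hour figure is consistent with the per-token rate. Here is a back-of-the-envelope estimate; the 200-token default prediction length is an assumption, not confirmed by the post (check "examples/utils.cpp" for the actual default):

```shell
# Rough runtime estimate: tokens * seconds-per-token
tokens=200          # assumed default prediction length
sec_per_token=60    # observed above: about one token per minute
total_min=$(( tokens * sec_per_token / 60 ))
echo "estimated runtime: ${total_min} minutes (~$(( total_min / 60 )) hours)"
```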


Google Colab

Although "gpt-j" can be built and run on Google Colab*4 with the commands below, the process is forcibly terminated within a few minutes.

!git clone https://github.com/ggerganov/ggml
%cd /content/ggml
%mkdir build
%cd build
!cmake ..
!make -j4 gpt-j

!../examples/gpt-j/ 6B
!./bin/gpt-j -m models/gpt-j-6B/ggml-model.bin -p "who are you?"
!./bin/gpt-j -m models/gpt-j-6B/ggml-model.bin -p "Do you know what day today is?"