Technically Impossible

Let's look at the weak link in your statement. Anything "Technically Impossible" basically means we haven't figured out how yet.

GPT on Windows PC with 8GB RAM, No GPU, No Container, and No Python - case of ggml


Abstract

One of the popular topics in "AI" is GPT (Generative Pre-trained Transformer). Although it usually requires rich computational resources, running it on an environment as minimal as a Raspberry Pi is now technically possible*1. And regardless of whether it is practical, it can run on a consumer PC as well. The "gpt-2" example in "ggerganov/ggml"*2 makes it possible to run GPT on a PC with 8GB of RAM; with 16GB of RAM, running the "gpt-j" example is also possible.

This post introduces the how-to and the resulting performance. The test machines are a Microsoft Surface Pro 4 and a desktop PC with the specs below; the "gpt-2" and "gpt-j" rows show prediction time per token.

          Surface Pro 4          desktop
OS        Windows 11 Pro 22H2    Windows 11 Pro 22H2
CPU       Intel Core i7-6650U    Intel Core i7-6700T
RAM       8GB                    32GB
gpt-2     21~22ms/token          -
gpt-j     -                      484~487ms/token

Download source code and model

Download the source code from GitHub and extract it to any folder. In this post, all files are extracted under the following folder, which is referred to hereafter as the working folder.

D:\user temp\ggml-master\
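If you prefer to script this step, the archive can be fetched and extracted from PowerShell. A minimal sketch, assuming the master-branch ZIP of "ggerganov/ggml" via GitHub's standard archive URL, with the folder above as the destination:

Invoke-WebRequest https://github.com/ggerganov/ggml/archive/refs/heads/master.zip -OutFile ggml-master.zip
Expand-Archive .\ggml-master.zip -DestinationPath 'D:\user temp\'

The ZIP contains a top-level ggml-master folder, so extracting into D:\user temp\ yields the working folder above.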

Then download from Hugging Face the language model matching the executable you intend to run.

for gpt-2.exe ggml-model-gpt-2-117M.bin
for gpt-j.exe ggml-model-gpt-j-6B.bin

The 117M model is for the 8GB RAM PC. The 6B model is for gpt-j and requires 16GB of RAM; 6 billion parameters at 2 bytes each in FP16 already amount to roughly 12GB of weights. There is no need to download both models.
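The models can be fetched from the command line as well. A sketch, assuming the files are hosted under the "ggerganov/ggml" repository on Hugging Face (check the model page for the actual location before relying on this URL):

Invoke-WebRequest https://huggingface.co/ggerganov/ggml/resolve/main/ggml-model-gpt-2-117M.bin -OutFile ggml-model-gpt-2-117M.bin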


Build the executables

Workaround for whisper.cpp

"whisper.c" has problems with UTF-8 encoding that prevents build process. For workaround, delete the next token from "non_speech_tokes" as following.

"「", "」", "『", "』",


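// original definition in whisper.cpp, kept commented out for reference: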
//static const std::vector<std::string> non_speech_tokens = {
//    "\"", "#", "(", ")", "*", "+", "/", ":", ";", "<", "=", ">", "@", "[", "\\", "]", "^",
//    "_", "`", "{", "|", "}", "~", "「", "」", "『", "』", "<<", ">>", "<<<", ">>>", "--",
//    "---", "-(", "-[", "('", "(\"", "((", "))", "(((", ")))", "[[", "]]", "{{", "}}", "♪♪",
//    "♪♪♪","♩", "♪", "♫", "♬", "♭", "♮", "♯"
//};

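// replacement: the same list minus the four full-width bracket tokens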
static const std::vector<std::string> non_speech_tokens = {
    "\"", "#", "(", ")", "*", "+", "/", ":", ";", "<", "=", ">", "@", "[", "\\", "]", "^",
    "_", "`", "{", "|", "}", "~", "<<", ">>", "<<<", ">>>", "--",
    "---", "-(", "-[", "('", "(\"", "((", "))", "(((", ")))", "[[", "]]", "{{", "}}", "♪♪",
    "♪♪♪","♩", "♪", "♫", "♬", "♭", "♮", "♯"
};
Build with Visual Studio

Build the executables with Visual Studio. The version used in this post is "Visual Studio 2022 Version 17.7.4". Proceed as follows.

  1. Open the working folder in Explorer and right-click to open the context menu
  2. Select "Open with Visual Studio" from the menu
  3. Select "CMake settings editor"

CMake configuration

  1. Open "CMake Settings"
  2. Select "Release" at "Configuration type"

Build

  1. Right-click "CMakeLists.txt" in Solution Explorer to open the context menu
  2. Select "Build"; the first run generates the CMake cache
  3. Select "Build" again; this produces the release build of the binaries

After the second build, the binaries are generated under the following folder; the folder name stays "x64-Debug" because only the configuration type, not the configuration name, was changed. Copy the previously downloaded models into this folder, too.

D:\user temp\ggml-master\out\build\x64-Debug\bin
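Both the copy and the build can be scripted. A sketch in PowerShell, run from the working folder and assuming the models were downloaded there; the two cmake commands are a command-line alternative to the IDE build and, with the Visual Studio generator, place the binaries under build\bin\Release instead of the folder above:

Copy-Item .\ggml-model-*.bin '.\out\build\x64-Debug\bin\'
cmake -S . -B build
cmake --build build --config Release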

Run gpt-2.exe

Run gpt-2.exe with the prompts "Actions speak louder than" and "When in Rome, do as the":

.\gpt-2.exe -m .\ggml-model-gpt-2-117M.bin -p "Actions speak louder than"
.\gpt-2.exe -m .\ggml-model-gpt-2-117M.bin -p "When in Rome, do as the"
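The examples accept a few more options beyond -m and -p. As of the version tested, -t sets the number of threads and -n the number of tokens to predict; these flags come from the shared example argument parser, so check examples/gpt-2/main.cpp if they have changed. For example:

.\gpt-2.exe -m .\ggml-model-gpt-2-117M.bin -p "Actions speak louder than" -t 4 -n 64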

Inference performance on the 6th-generation Intel Core i7 is around 21~22ms per token. The language model is 239MB, small enough to fit comfortably in RAM. Although the performance is decent, the output quality is relatively poor: the token immediately following the prompt is predicted correctly, but the generated sentences are not very consistent with the context of the given prompts.

Run gpt-j.exe

Run gpt-j.exe with the same prompts, this time against the 6B model:

.\gpt-j.exe -m .\ggml-model-gpt-j-6B.bin -p "Actions speak louder than"
.\gpt-j.exe -m .\ggml-model-gpt-j-6B.bin -p "When in Rome, do as the"

Inference performance on the 6th-generation Intel Core i7 is around 484~487ms per token. Although the language model still fits in RAM, that alone does not make the performance acceptable. A rough estimate suggests why: generating each token touches essentially all of the roughly 12GB of weights, and 12GB / 0.485s ≈ 25GB/s is close to the practical memory bandwidth of a dual-channel DDR4 desktop, so the run is likely memory-bound. The output quality is better than gpt-2's, but not by enough to make up for the slow inference.

*1: twitter.com
*2: github.com