Technically Impossible

Let's look at the weak link in your statement. Anything "technically impossible" basically means we haven't figured out how yet.

GPT on a Windows PC with 8GB RAM, no GPU, no container, and no Python - the ggml edition

Running so-called "AI" on a PC normally demands abundant computational resources, starting with a GPU and VRAM, yet inference alone can run even on a Raspberry Pi*1. Practicality aside, a mainstream consumer PC can handle it as well.

"ggerganov/ggml"*2, "gpt-2"は、RAM搭載量が8GBでも実行可能とし、 RAMを16GB搭載していれば、"gpt-j"も実行可能だ。

This post introduces the procedure and the resulting performance. The test machines are a Microsoft Surface Pro 4 and a self-built desktop PC. Along with their specs, the "gpt-2" and "gpt-j" rows show inference performance per token.

        Surface Pro 4        desktop
OS      Windows 11 Pro 22H2  Windows 11 Pro 22H2
CPU     Intel Core i7-6650U  Intel Core i7-6700T
RAM     8GB                  32GB
gpt-2   21~22ms
gpt-j   484~487ms
  • Download source code and model
  • Build the executable
  • Run gpt-2.exe
  • Run gpt-j.exe
  • Aside

*1: twitter.com

*2: github.com

Read more

GPT on Windows PC with 8GB RAM, No GPU, No Container, and No Python - case of ggml


Abstract

One of the popular topics in "AI" is GPT (Generative Pre-trained Transformer). Although it usually requires rich computational resources, running it on a minimal environment such as a Raspberry Pi is technically possible now*1. And regardless of whether it is practical, it can run on a consumer PC as well. The "gpt-2" example in "ggerganov/ggml"*2 makes it possible to run GPT on a PC with 8GB RAM. With 16GB RAM, running "gpt-j" is also possible.

This post introduces the how-to and the resulting performance. The test machines are a Microsoft Surface Pro 4 and a desktop PC with the specs below. The "gpt-2" and "gpt-j" rows show inference performance per token.

        Surface Pro 4        desktop
OS      Windows 11 Pro 22H2  Windows 11 Pro 22H2
CPU     Intel Core i7-6650U  Intel Core i7-6700T
RAM     8GB                  32GB
gpt-2   21~22ms
gpt-j   484~487ms
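For a rough sense of scale, per-token latency converts to throughput as 1000 / latency_ms tokens per second. A minimal sketch, using the midpoints of the latency ranges in the table above:

```python
# Convert per-token inference latency (ms) to throughput (tokens/second).
def tokens_per_second(latency_ms: float) -> float:
    return 1000.0 / latency_ms

# Midpoints of the measured latency ranges above:
print(round(tokens_per_second(21.5), 1))   # gpt-2: ~46.5 tokens/s
print(round(tokens_per_second(485.5), 2))  # gpt-j: ~2.06 tokens/s
```

So gpt-2 generates at a comfortably interactive pace, while gpt-j is roughly two tokens per second on this hardware.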
  • Abstract
  • Download source code and model
  • Build execution file
  • Run gpt-2.exe
  • Run gpt-j.exe

*1: twitter.com

*2: github.com

Read more

Llama2.c on PC with 8GB RAM, No GPU, No Container - pretokenization, training and inference


Abstract

If performance is not a concern, whether it is GPT or Diffusion, generative models work on a machine with only 8GB RAM and no GPU, be it a PC or an Android smartphone. This was my finding from exploring generative AI this spring*1. But there was a condition: inference only. In other words, it was conditional on using a small-scale model provided by someone else, and fine-tuning with one's own data was not realistic. Llama2.c may expand what a user can do.

github.com

Andrej Karpathy provides a simple and minimal inference engine based on the Llama 2 architecture. Although it is the outcome of his weekend leisure, it supports not only inference but also pretokenization and training of data and models. And they work on a resource-limited machine with as little as 8GB RAM.

Leaving aside the practical aspects, I traced the entire process, from pretokenization and model training to inference with the resulting model. This post outlines the steps.

While inference works easily on both Windows and Linux, pretokenization and training on Windows are challenging for the following reasons, so for Windows this post covers inference only.

  • Rewriting shell scripts called from programs
  • Handling encoding for training data text
  • Building sentencepiece for Windows
  • Compiling the training model (torch.compile is not compatible with Windows)
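The encoding issue in particular is easy to trip over: on Windows, Python's default text encoding follows the locale (e.g. cp932), so UTF-8 training text can be mangled unless the encoding is passed explicitly. A minimal sketch of the defensive pattern (the file name is a hypothetical placeholder):

```python
from pathlib import Path

# Hypothetical training-data file. On Windows, omitting encoding="utf-8"
# falls back to the locale encoding (e.g. cp932), which can raise
# UnicodeDecodeError or corrupt non-ASCII text.
path = Path("tiny_corpus.txt")
path.write_text("日本語のテキスト\n", encoding="utf-8")

# Explicit encoding makes the read locale-independent.
text = path.read_text(encoding="utf-8")
print(text.strip())
```

The same explicit `encoding="utf-8"` argument applies to every `open()` call that touches the training text.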

  • Abstract
  • Assumption
  • Inference engine
  • Python and sentencepiece
  • Machine learning: preprocessing
  • Machine learning: generating a model
    • Swap
    • train.py
    • Training
  • Aside: Training with custom data and JSON file
  • Reference
Read more