Technically Impossible

Let's look at the weak link in your statement. Anything "technically impossible" basically means we haven't figured out how yet.

GPT on a Windows PC with 8GB RAM, no GPU, no container, and no Python - the ggml edition

Running so-called "AI" on a PC normally demands abundant computational resources, starting with a GPU and VRAM, yet inference alone can run even on a Raspberry Pi*1. Practicality aside, a mainstream consumer PC can handle it as well.

"ggerganov/ggml"*2, "gpt-2"は、RAM搭載量が8GBでも実行可能とし、 RAMを16GB搭載していれば、"gpt-j"も実行可能だ。

This post introduces the procedure and the resulting performance. The test machines are a Microsoft Surface Pro 4 and a self-built desktop PC. Along with their specs, the "gpt-2" and "gpt-j" rows show inference performance per token.

        Surface Pro 4        desktop
OS      Windows 11 Pro 22H2  Windows 11 Pro 22H2
CPU     Intel Core i7-6650U  Intel Core i7-6700T
RAM     8GB                  32GB
gpt-2   21~22ms
gpt-j   484~487ms
  • Download source code and model
  • Build the executable
  • Run gpt-2.exe
  • Run gpt-j.exe
  • Aside

*1: twitter.com

*2: github.com

Read more

GPT on Windows PC with 8GB RAM, No GPU, No Container, and No Python - case of ggml


Abstract

One of the popular topics in "AI" is GPT (Generative Pre-trained Transformer). Although it usually requires rich computational resources, running it on a minimal environment such as a Raspberry Pi is technically possible now*1. And regardless of whether it is practical, it can run on a consumer PC as well. The "gpt-2" example in "ggerganov/ggml"*2 makes it possible to run GPT on a PC with 8GB RAM. With 16GB RAM, running "gpt-j" is also possible.

This post introduces the how-to and the resulting performance. The test machines are a Microsoft Surface Pro 4 and a desktop PC with the specs below. The "gpt-2" and "gpt-j" rows show inference performance per token.

        Surface Pro 4        desktop
OS      Windows 11 Pro 22H2  Windows 11 Pro 22H2
CPU     Intel Core i7-6650U  Intel Core i7-6700T
RAM     8GB                  32GB
gpt-2   21~22ms
gpt-j   484~487ms
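For a rough sense of scale, per-token latency converts to throughput as 1000 / latency_ms tokens per second. A minimal sketch, using the midpoints of the latency ranges in the table above:

```python
# Convert per-token inference latency (ms) to throughput (tokens/second).
def tokens_per_second(latency_ms: float) -> float:
    return 1000.0 / latency_ms

# Midpoints of the measured latency ranges above:
print(round(tokens_per_second(21.5), 1))   # gpt-2: ~46.5 tokens/s
print(round(tokens_per_second(485.5), 2))  # gpt-j: ~2.06 tokens/s
```

So gpt-2 generates at a comfortably interactive pace, while gpt-j is roughly two tokens per second on this hardware.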
  • Abstract
  • Download source code and model
  • Build execution file
  • Run gpt-2.exe
  • Run gpt-j.exe

*1: twitter.com

*2: github.com

Read more

Llama2.c on PC with 8GB RAM, No GPU, No Container - pretokenization, training and inference


Abstract

If performance is not a concern, whether it is GPT or Diffusion, generative models work on a machine with only 8GB RAM and no GPU, be it a PC or an Android smartphone. This was my finding from exploring generative AI this spring*1. But there was a condition: inference only. In other words, it was conditional on using a small-scale model provided by someone else, and fine-tuning with one's own data was not realistic. Llama2.c may expand what a user can do.

github.com

Andrej Karpathy provides a simple and minimal inference engine based on the Llama 2 architecture. Although it is the outcome of his weekend leisure, it supports not only inference but also pretokenization and training of data and models. And they work on a resource-limited machine with as little as 8GB RAM.

Leaving aside the practical aspects, I traced the entire process, from pretokenization and model training to inference with the resulting model. This post outlines the steps.

While inference works easily on both Windows and Linux, pretokenization and training on Windows are challenging for the following reasons, so for Windows this post covers inference only.

  • Rewriting shell scripts called from programs
  • Handling encoding for training data text
  • Building sentencepiece for Windows
  • Compiling the training model (torch.compile is not compatible with Windows)
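The encoding issue in particular is easy to trip over: on Windows, Python's default text encoding follows the locale (e.g. cp932), so UTF-8 training text can be mangled unless the encoding is passed explicitly. A minimal sketch of the defensive pattern (the file name is a hypothetical placeholder):

```python
from pathlib import Path

# Hypothetical training-data file. On Windows, omitting encoding="utf-8"
# falls back to the locale encoding (e.g. cp932), which can raise
# UnicodeDecodeError or corrupt non-ASCII text.
path = Path("tiny_corpus.txt")
path.write_text("日本語のテキスト\n", encoding="utf-8")

# Explicit encoding makes the read locale-independent.
text = path.read_text(encoding="utf-8")
print(text.strip())
```

The same explicit `encoding="utf-8"` argument applies to every `open()` call that touches the training text.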

  • Abstract
  • Assumption
  • Inference engine
  • Python and sentencepiece
  • Machine learning: preprocessing
  • Machine learning: generating a model
    • Swap
    • train.py
    • Training
  • Aside: Training with custom data and JSON file
  • Reference
Read more