This program is a direct C implementation of the Qwen3 / Llama 3.x / Mistral LLM transformer architecture, reusing the tokenizer and the sampler of Andrej Karpathy's llama2.c project and of its fork llama3.c by James Delancey (we warmly thank them!). Given an input prompt, StrasGPT generates text that continues it. It was initially designed as a parallel programming project for master's students in 2025 (students had to parallelize it with OpenMP + MPI). It is now being continued for fun and for (polyhedral) compiler research.
You just need git, a C compiler and make.
git clone git@gitlab.unistra.fr:bastoul/strasgpt.git
cd strasgpt
make
There are several other build targets:
- make parallel to build the faster parallel version (this target requires the mpicc compiler)
- make asan for Clang's address sanitizer support and debug mode (this target requires the Clang compiler)
- make debug for debug mode, ideal when using Valgrind
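If you are unsure whether the compilers needed by these targets are available on your machine, a quick optional sanity check of your environment is:
mpicc --version
clang --version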
You can use, e.g., Qwen3, Llama 3.x or Mistral checkpoints from HuggingFace. You will need to create a HuggingFace account and get an access token (click on your profile icon, then "Access Tokens"). Finally, log in and download the desired models; for example, here are some tested models:
pip install 'huggingface_hub[cli]'
huggingface-cli login
git clone https://huggingface.co/meta-llama/Llama-3.2-1B
git clone https://huggingface.co/meta-llama/Llama-3.2-3B
git clone https://huggingface.co/meta-llama/Llama-3.1-8B
git clone https://huggingface.co/mistralai/Mistral-Nemo-Base-2407
git clone https://huggingface.co/mistralai/Ministral-8B-Instruct-2410
git clone https://huggingface.co/mistralai/Mistral-Small-24B-Base-2501
git clone https://huggingface.co/Qwen/Qwen3-0.6B
git clone https://huggingface.co/Qwen/Qwen3-4B
git clone https://huggingface.co/Qwen/Qwen3-14B
git clone https://huggingface.co/Qwen/Qwen3-VL-2B-Instruct
git clone https://huggingface.co/Qwen/Qwen3-VL-4B-Instruct
git clone https://huggingface.co/Qwen/Qwen3-VL-8B-Instruct
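Note that these HuggingFace repositories store the large weight files through Git LFS; if a plain git clone only fetched small pointer files, you may need to install and enable Git LFS first (this is a general Git setup step on your machine, not something specific to StrasGPT):
git lfs install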
Run StrasGPT with the -h option to get all possible options. Here is an example of a command line with an 8-token-long prompt, asking to generate 16 tokens (beyond the one generated from prompt analysis) and using 10 threads:
./strasgpt -m ../model_zoo/Llama-3.2-1B/ -p "Once upon a time there were three" -n 17 -t 10
And here is the output on my M4 Mac:
...
Transformer:
- Configuration:
--- embedding_dim: 2048
--- hidden_dim: 8192
--- layer_count: 16
--- q_head_count: 32
--- kv_head_count: 8
--- vocabulary_len: 128256
--- context_len: 131072
--- aliased_out_weight: true
...
[Once upon a time there were three] little pigs.
Three little pigs went out for a pig walk. They heard music playing
Max memory used (RSS): 2.37 GB
Prompt processing (prefill): 8 tokens in 0.057 s (140.350877 token/s)
Token generation (decode): 16 tokens in 0.213 s (79.207921 token/s)
Actually not that bad!
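If you built the parallel version with make parallel, you would typically launch it through your MPI runtime. The invocation below is only an illustration: the binary name and option names are assumed to match the sequential example above, and 4 MPI ranks is an arbitrary choice, so adapt it to the actual Makefile output and to your machine:
mpirun -np 4 ./strasgpt -m ../model_zoo/Llama-3.2-1B/ -p "Once upon a time there were three" -n 17 -t 10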
