A pure C implementation of the Llama 2 model by Andrej Karpathy, designed for simple, portable, dependency-free inference.
Llama2.c is an open-source project by Andrej Karpathy that provides a minimalist, "fullstack" solution for training and running inference on Llama 2 Large Language Models (LLMs) directly in C. The project's primary goal is to offer a simple, educational, and portable way to work with the Llama 2 architecture without the heavy dependencies often associated with deep learning frameworks. It allows users to train the Llama 2 architecture (using PyTorch for the training part) and then run inference with a remarkably concise C implementation (around 700 lines in run.c).
This project is particularly appealing to those who want to understand the inner workings of LLMs at a fundamental level or who need a lightweight, dependency-free inference engine. Its focus on small, "baby" Llama 2 models is what makes a complete, from-scratch implementation practical.
Key features include:

- Pure C inference: the inference engine (run.c) is written in a single C file with no external dependencies beyond a standard C library and math.h, making it highly portable and easy to compile.
- Simple model format: checkpoints are converted into the custom .bin format required by the C inference engine.
- Readable float32 code: run.c focuses on float32 precision for clarity.
- Optional int8 quantization: a separate runq.c is provided for experimenting with int8 quantization (referencing ggml's quantize_q8_0 or using a custom quantize_q80_karpathy method), which can lead to smaller model sizes and faster inference, though it introduces more complexity (see the sketch below).
- Performance options: compilation with optimization flags such as -O3, -Ofast, -march=native, and multi-threading with OpenMP (-fopenmp).
- Compatibility with both small custom-trained models and official Meta Llama 2 checkpoints once exported to .bin files.
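To make the quantization idea concrete, here is a minimal, illustrative sketch of Q8_0-style group quantization in plain C. It is not the actual runq.c code; the group size, struct, and function names below are assumptions chosen for clarity.

```c
#include <math.h>
#include <stdint.h>

#define GROUP_SIZE 32  /* assumed group size for this sketch */

/* One quantized group: int8 payload plus a single float scale. */
typedef struct {
    int8_t q[GROUP_SIZE];
    float scale;
} QuantGroup;

/* Quantize GROUP_SIZE floats: scale = absmax / 127, q[i] = round(x[i] / scale). */
static void quantize_group(const float *x, QuantGroup *out) {
    float absmax = 0.0f;
    for (int i = 0; i < GROUP_SIZE; i++) {
        float a = fabsf(x[i]);
        if (a > absmax) absmax = a;
    }
    out->scale = absmax / 127.0f;
    float inv = (out->scale > 0.0f) ? 1.0f / out->scale : 0.0f;
    for (int i = 0; i < GROUP_SIZE; i++) {
        out->q[i] = (int8_t) roundf(x[i] * inv);
    }
}

/* Dequantize back to float32 (lossy): x[i] = q[i] * scale. */
static void dequantize_group(const QuantGroup *g, float *x) {
    for (int i = 0; i < GROUP_SIZE; i++) {
        x[i] = g->q[i] * g->scale;
    }
}
```

Stored this way, weights take roughly a quarter of the space of float32 (one byte per value plus one scale per group), which is where the smaller model sizes and faster memory-bound inference come from.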
Llama2.c is particularly well-suited for studying the run.c code to deeply understand the mechanics of LLM inference, including the Transformer architecture, matrix multiplications, tokenization, and sampling.

Here's a general guide to getting started with llama2.c, based on the information in the GitHub repository:
Clone the Repository:
git clone https://github.com/karpathy/llama2.c.git
cd llama2.c
Obtain Model Checkpoints:
- Train your own: use the provided Python training scripts (e.g., train_tinystories.py, train_shakespeare_char.py) to train small models from scratch. Pre-trained small models like stories15M.bin, stories42M.bin, or stories110M.bin might also be available for download directly (check the repository or Andrej Karpathy's announcements).
- Use Meta's Llama 2 weights: if you have obtained the official Llama 2 weights from Meta, use the export.py script provided in the llama2.c repository to convert these weights into the custom .bin format that run.c can load. This export step requires Python, PyTorch, and SentencePiece. (A sketch of how such a checkpoint might be read from C follows the example commands below.)
# Example command structure for exporting (paths are illustrative):
# python export.py llama2_7b.bin --checkpoint_dir /path/to/meta/llama2/7B --tokenizer_path /path/to/meta/llama2/tokenizer.model
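For orientation, the sketch below shows how a dependency-free C loader might read such an exported checkpoint, assuming the commonly described layout of a small integer config header followed by float32 weights. The struct fields and function name are illustrative; the authoritative layout is whatever export.py writes and run.c reads.

```c
#include <stdio.h>

/* Hypothetical header struct for the exported .bin checkpoint; the real
 * layout is defined by export.py and read back by run.c. */
typedef struct {
    int dim;         /* transformer embedding dimension */
    int hidden_dim;  /* feed-forward hidden dimension */
    int n_layers;    /* number of transformer layers */
    int n_heads;     /* number of attention heads */
    int n_kv_heads;  /* number of key/value heads */
    int vocab_size;  /* tokenizer vocabulary size */
    int seq_len;     /* maximum sequence length */
} Config;

/* Read just the header; the float32 weight tensors follow it in the file. */
int load_config(const char *path, Config *cfg) {
    FILE *f = fopen(path, "rb");
    if (!f) return -1;
    size_t ok = fread(cfg, sizeof(Config), 1, f);
    fclose(f);
    return ok == 1 ? 0 : -1;
}
```

A real loader (run.c included) can then read or memory-map the weights that follow this header and index into them directly, which is why no external tensor library is needed.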
Compile the C Code:
The project includes a Makefile. The simplest way to compile is:
make run
# basic optimized build, typically something like: gcc -O3 run.c -o run -lm
make runfast
# more aggressive optimization, typically using -Ofast
# For multi-threading, ensure your C compiler (like GCC) supports OpenMP;
# the Makefile may have a dedicated target, or you can compile manually (example):
# gcc -Ofast -march=native -fopenmp run.c -o run -lm
# (a sketch of the kind of loop OpenMP parallelizes appears below)
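As a sense of what OpenMP buys you here: the bulk of inference time goes into matrix-vector multiplications, and a single pragma can spread the outer loop across CPU cores. The function below is an illustrative sketch of that pattern, not a copy of run.c.

```c
/* Multiply a (d x n) weight matrix W by a length-n vector x, writing the
 * length-d result into xout. Each output row is independent, so the outer
 * loop parallelizes cleanly across threads when built with -fopenmp
 * (without it, the pragma is simply ignored). */
void matmul(float *xout, const float *x, const float *w, int n, int d) {
    #pragma omp parallel for
    for (int i = 0; i < d; i++) {
        float val = 0.0f;
        for (int j = 0; j < n; j++) {
            val += w[i * n + j] * x[j];
        }
        xout[i] = val;
    }
}
```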
Run Inference:
Execute the compiled program, providing the path to your .bin model checkpoint file:
./run stories15M.bin
Command-line Options:
- -t <float>: Temperature for sampling (e.g., 0.8; default is 1.0). Lower values make the output more deterministic; higher values increase randomness.
- -n <int>: Number of tokens (steps) to generate (default is 256).
- -i "<string>": Initial prompt to start generation from.
- -s <long>: Random seed for reproducibility.
- -m <string>: Mode, e.g., "chat" for Llama 2 Chat models.
- -p <float>: Top-p (nucleus) sampling probability (e.g., 0.9; default is 0.9). See the sampling sketch after the example below.
- -z <string>: Path to the tokenizer binary file (default is tokenizer.bin).
- -y "<string>": System prompt to use in chat mode.

Example with a prompt:
./run stories42M.bin -t 0.8 -n 100 -i "Once upon a time, there was a brave knight who"
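To illustrate what the -t and -p flags control, here is a compact sketch of temperature scaling followed by top-p (nucleus) sampling in C. It shows the general technique rather than run.c's exact implementation; the function and struct names are made up for this example.

```c
#include <math.h>
#include <stdlib.h>

typedef struct { float prob; int index; } ProbIndex;

/* Scale logits by 1/temperature, then softmax them into probabilities. */
static void softmax_temp(float *logits, int n, float temperature) {
    for (int i = 0; i < n; i++) logits[i] /= temperature;
    float maxv = logits[0];
    for (int i = 1; i < n; i++) if (logits[i] > maxv) maxv = logits[i];
    float sum = 0.0f;
    for (int i = 0; i < n; i++) { logits[i] = expf(logits[i] - maxv); sum += logits[i]; }
    for (int i = 0; i < n; i++) logits[i] /= sum;
}

static int cmp_desc(const void *a, const void *b) {
    float pa = ((const ProbIndex *)a)->prob, pb = ((const ProbIndex *)b)->prob;
    return (pa < pb) - (pa > pb); /* sort by probability, descending */
}

/* Sample a token id from probs[0..n) using top-p (nucleus) sampling. */
static int sample_top_p(const float *probs, int n, float top_p) {
    ProbIndex *sorted = malloc(n * sizeof(ProbIndex));
    for (int i = 0; i < n; i++) { sorted[i].prob = probs[i]; sorted[i].index = i; }
    qsort(sorted, n, sizeof(ProbIndex), cmp_desc);

    /* keep the smallest prefix whose cumulative probability exceeds top_p */
    float cumulative = 0.0f;
    int last = n - 1;
    for (int i = 0; i < n; i++) {
        cumulative += sorted[i].prob;
        if (cumulative > top_p) { last = i; break; }
    }

    /* draw uniformly within the kept mass and pick the matching token */
    float r = (float)rand() / (float)RAND_MAX * cumulative;
    float acc = 0.0f;
    int token = sorted[last].index;
    for (int i = 0; i <= last; i++) {
        acc += sorted[i].prob;
        if (r < acc) { token = sorted[i].index; break; }
    }
    free(sorted);
    return token;
}
```

Note that this sketch assumes temperature > 0; greedy decoding (always taking the most likely token) corresponds to the limit of very low temperature.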
Chat Mode (for chat-finetuned Llama 2 models):
# Ensure you have the corresponding chat model .bin file
# ./run llama2_7b_chat.bin -m chat -y "You are a helpful storytelling assistant."
Additional notes:

- Dependencies: the C inference engine (run.c) has no external library dependencies beyond a standard C library and math.h. The Python scripts for training and model export require PyTorch, SentencePiece, and other packages listed in requirements.txt.
- Performance on large models: while llama2.c can load and run larger Llama 2 models (e.g., 7B), inference speed for these models using the float32 C code on typical consumer CPUs will be slow. The project is more practically demonstrated with smaller, custom-trained models or used for educational exploration of larger ones. Significant speedups for larger models typically require GPU acceleration or highly optimized CPU libraries, which are beyond the scope of this minimalist project.
- Tokenizer: inference requires a tokenizer.bin file, which is a binary export of the SentencePiece BPE tokenizer model used by Llama 2. The export_tokenizer.py script is provided to create this from Meta's tokenizer.model.
- Simplicity over speed: the primary run.c file is intentionally written for clarity and educational value. While make runfast and OpenMP provide some performance improvements, llama2.c is not designed to compete with heavily optimized, production-grade LLM inference frameworks for raw speed on very large models.
- License: the llama2.c project itself is released under the MIT License. However, the use of official Llama 2 models from Meta is subject to Meta's own Llama 2 license and acceptable use policy.

Last updated: May 16, 2025