
llama2.c

A pure C implementation of the Llama 2 model by Andrej Karpathy, designed for simple, portable inference.

Llama2.c: Train and Run Llama 2 Models in Pure C

Introduction

Llama2.c is an open-source project by Andrej Karpathy that provides a minimalist, "fullstack" solution for training and running inference on Llama 2 Large Language Models (LLMs). The project's primary goal is to offer a simple, educational, and portable way to work with the Llama 2 architecture without the heavy dependencies often associated with deep learning frameworks. It lets users train Llama 2 style models (using PyTorch for the training part) and then run inference with a remarkably concise C implementation (roughly 700 lines in run.c).
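
To give a sense of what those ~700 lines do, the sketch below mirrors the overall control flow of a run.c-style generation loop: run one forward pass per position, sample the next token, and feed it back in. The forward_stub and sample_stub functions here are trivial stand-ins (a fake two-token model and greedy argmax) added purely for illustration; the real run.c implements a full Llama 2 Transformer forward pass and a temperature/top-p sampler in their place.

    /* Minimal, runnable illustration of the generate-loop shape used by run.c-style engines. */
    #include <stdio.h>

    #define VOCAB 2

    /* Stand-in for the Transformer forward pass: fills logits for the next token. */
    static void forward_stub(int token, int pos, float logits[VOCAB]) {
        (void)pos;                                /* a real model would use the position */
        logits[0] = (token == 1) ? 1.0f : 0.0f;   /* this toy model just alternates tokens */
        logits[1] = (token == 0) ? 1.0f : 0.0f;
    }

    /* Stand-in for the sampler: greedy argmax over the logits. */
    static int sample_stub(const float logits[VOCAB]) {
        return (logits[1] > logits[0]) ? 1 : 0;
    }

    int main(void) {
        int token = 1;                            /* start token; run.c starts from BOS  */
        for (int pos = 0; pos < 10; pos++) {
            float logits[VOCAB];
            forward_stub(token, pos, logits);     /* 1) forward pass at this position    */
            int next = sample_stub(logits);       /* 2) pick the next token              */
            printf("%d ", next);                  /* 3) emit it (run.c decodes to text)  */
            token = next;                         /* 4) feed it back in                  */
        }
        printf("\n");
        return 0;
    }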

This project is particularly appealing to those who want to understand the inner workings of LLMs at a fundamental level, or who need a lightweight, dependency-free inference engine. In Karpathy's words, it lets you train and run a "baby Llama 2" with a complete, from-scratch implementation.

Key Features

  • Minimalist C Implementation: The core inference engine (run.c) is written in a single C file with no external dependencies beyond a standard C library and math.h, making it highly portable and easy to compile.
  • Fullstack Approach (Train + Inference): While the C inference is a highlight, the project includes Python scripts (using PyTorch) to:
    • Train custom Llama 2 style models from scratch (e.g., on the TinyStories dataset).
    • Export Meta's official Llama 2 model weights into the custom .bin format required by the C inference engine.
  • Llama 2 Architecture Support: Implements the Llama 2 LLM architecture, allowing it to run both custom-trained "baby" Llama models and official Llama 2 models (after conversion).
  • Educational Value: Provides a clear, straightforward, and relatively simple C implementation of LLM inference, making it an excellent resource for learning how these models (specifically the Transformer architecture) work internally.
  • Portability: Due to its minimal C nature, the inference engine can be compiled and run on various systems where a C compiler is available.
  • Performance Considerations:
    • The primary run.c focuses on float32 precision for clarity.
    • Includes runq.c, a variant that adds grouped int8 quantization of the weights (similar in spirit to the Q8_0 scheme used by ggml/llama.cpp), which yields smaller checkpoints and faster inference at the cost of somewhat more complex code.
    • Supports performance optimization through standard C compilation flags like -O3, -Ofast, -march=native, and multi-threading with OpenMP (-fopenmp).
  • Direct Inference from Custom Binary Format: Loads model weights and the tokenizer directly from custom .bin files (a sketch of the checkpoint header follows this list).
  • Interactive Modes: Supports direct prompting for text generation, allowing users to specify temperature, steps, and an initial prompt. It also includes a chat mode for interacting with chat-finetuned models.
  • "No Black Magic": The C code is intentionally kept simple and understandable, avoiding complex abstractions to reveal the core logic.

Specific Use Cases

Llama2.c is particularly well-suited for:

  • Learning LLM Internals: Developers, students, and researchers can study the run.c code to deeply understand the mechanics of LLM inference, including the Transformer architecture, matrix multiplications, tokenization, and sampling (see the matmul sketch after this list).
  • Embedding LLMs in C/C++ Applications: Useful for integrating LLM capabilities into existing C or C++ projects where adding large Python dependencies is undesirable or impractical.
  • Resource-Constrained Environments: Running smaller LLM inference on devices or systems with limited resources, although performance will heavily depend on the model size and CPU capabilities.
  • Experimenting with Custom-Trained Models: Training small Llama 2 style models on domain-specific datasets (like the included TinyStories examples) and then running them efficiently with the C engine.
  • Prototyping and Research: Quickly testing new ideas or modifications to the LLM architecture or inference process with a simple, modifiable C codebase.
  • Educational Demonstrations: Providing a tangible example of how LLMs operate at a low level.
  • Personal Projects & Exploration: For hobbyists and enthusiasts looking for a hands-on, accessible way to run and interact with language models.
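
For a taste of how simple the core math is, the kernel below follows the shape of run.c's matmul routine, which is where most of the inference time is spent: W is a (d, n) matrix stored row-major, x is an n-vector, and the result is a d-vector. The tiny main driver is added here purely for illustration.

    /* Matrix-vector multiply in the style of run.c's matmul: xout = W @ x. */
    #include <stdio.h>

    void matmul(float* xout, const float* x, const float* w, int n, int d) {
        int i;
        #pragma omp parallel for private(i)   /* parallel over output rows when built with -fopenmp */
        for (i = 0; i < d; i++) {
            float val = 0.0f;
            for (int j = 0; j < n; j++) {
                val += w[i * n + j] * x[j];   /* dot product of row i of W with x */
            }
            xout[i] = val;
        }
    }

    int main(void) {
        /* Tiny 2x3 example: W @ x */
        float w[6] = {1, 2, 3,
                      4, 5, 6};
        float x[3] = {1, 0, -1};
        float out[2];
        matmul(out, x, w, 3, 2);
        printf("%.1f %.1f\n", out[0], out[1]); /* expected output: -2.0 -2.0 */
        return 0;
    }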

Usage Guide (Compiling & Running)

Here’s a general guide to getting started with llama2.c, based on the information in the GitHub repository:

  1. Clone the Repository:

    git clone https://github.com/karpathy/llama2.c.git
    cd llama2.c
    
  2. Obtain Model Checkpoints:

    • Custom "Baby" Llama Models: The repository provides Python scripts (train_tinystories.py, train_shakespeare_char.py) to train your own small models. Pre-trained small models like stories15M.bin, stories42M.bin, or stories110M.bin might also be available for download directly (check the repository or Andrej Karpathy's announcements).
    • Official Llama 2 Models: To run Meta's official Llama 2 models (e.g., Llama 2 7B), you first need to obtain the original model weights from Meta. Then, use the export.py script provided in the llama2.c repository to convert these weights into the custom .bin format that run.c can load. This export step requires Python, PyTorch, and SentencePiece.
      # Example command structure for exporting (paths are illustrative;
      # check export.py --help for the exact flags):
      # python export.py llama2_7b.bin --meta-llama /path/to/meta/llama-2-7b
      
  3. Compile the C Code:

    • The repository includes a Makefile. The simplest way to compile is:
      make run
      
    • For better performance, compile with more aggressive optimizations:
      make runfast
      # This uses -Ofast, roughly: gcc -Ofast -o run run.c -lm
      
    • To enable multi-threading with OpenMP (which can significantly speed up inference on multi-core CPUs), use the runomp target and set the thread count at run time. Your C compiler (e.g., GCC) must support OpenMP:
      make runomp
      # roughly equivalent to: gcc -Ofast -fopenmp -march=native run.c -o run -lm
      # then run with, e.g.: OMP_NUM_THREADS=4 ./run stories15M.bin
      
  4. Run Inference:

    • Execute the compiled program, providing the path to your .bin model checkpoint file:

      ./run stories15M.bin
      
    • Command-line Options:

      • -t <float>: Temperature for sampling (e.g., 0.8; default is 1.0). Lower values make the output more deterministic; higher values increase randomness (see the sampling sketch at the end of this section).
      • -n <int>: Number of tokens (steps) to generate (default is 256).
      • -i "<string>": Initial prompt to start generation from.
      • -s <long>: Random seed for reproducibility.
      • -m <string>: Mode: "generate" (the default) or "chat" for chat-finetuned Llama 2 models.
      • -p <float>: Top-p (nucleus) sampling threshold (default is 0.9).
      • -z <string>: Path to the tokenizer binary file (default is tokenizer.bin).
      • -y "<string>": System prompt to use in chat mode.
    • Example with a prompt:

      ./run stories42M.bin -t 0.8 -n 100 -i "Once upon a time, there was a brave knight who"
      
    • Chat Mode (for chat-finetuned Llama 2 models):

      # Ensure you have the corresponding chat model .bin file
      # ./run llama2_7b_chat.bin -m chat -y "You are a helpful storytelling assistant."
      

Key Aspects & Considerations

  • Dependencies: The core C inference engine (run.c) has no external library dependencies beyond a standard C library and math.h. The Python scripts for training and model export require PyTorch, SentencePiece, and other packages listed in requirements.txt.
  • Model Size & Performance: While llama2.c can load and run larger Llama 2 models (e.g., 7B), inference speed for these large models using the float32 C code on typical consumer CPUs will be slow. The project is more practically demonstrated with smaller, custom-trained models or used for educational exploration of larger ones. Significant speedups for larger models typically require GPU acceleration or highly optimized CPU libraries, which are beyond the scope of this minimalist project.
  • Tokenizer: The C inference engine requires a tokenizer.bin file, a binary export of the SentencePiece BPE tokenizer used by Llama 2. The repository ships a ready-made tokenizer.bin, and the tokenizer.py script can regenerate it from Meta's tokenizer.model.
  • Simplicity Over Absolute Speed (for run.c): The primary run.c file is intentionally written for clarity and educational value. While make runfast and OpenMP provide some performance improvements, llama2.c is not designed to compete with heavily optimized, production-grade LLM inference frameworks for raw speed on very large models.
  • License: The llama2.c project itself is released under the MIT License. However, the use of official Llama 2 models from Meta is subject to Meta's own Llama 2 license and acceptable use policy.

Resources

  • GitHub repository (llama2.c): https://github.com/karpathy/llama2.c
  • Andrej Karpathy's website and social media (e.g., Twitter/X, YouTube): a source for updates, educational content, and context on his projects.
  • Meta's Llama 2 resources: information on the official Llama 2 models, access requests, and their usage policies.
  • TinyStories dataset paper: referenced in the repository for training the smaller models.

Last updated: May 16, 2025
