FastChat: Open Platform for Training, Serving, and Evaluating LLM Chatbots

Introduction

FastChat (github.com/lm-sys/FastChat) is an open-source platform developed by the Large Model Systems Organization (LMSYS Org), a research organization founded by students and faculty from UC Berkeley. Its core mission is to provide a comprehensive and accessible framework for training, serving, and evaluating large language model (LLM)-based chatbots. FastChat is best known for developing and releasing capable open-source chatbots such as Vicuna (fine-tuned from Llama models) and for powering Chatbot Arena, a platform for crowdsourced, human-preference-based evaluation of LLMs.

The platform is designed for AI researchers, developers, and organizations looking to fine-tune their own chatbots, deploy various open-source LLMs efficiently, or contribute to the ongoing evaluation and understanding of LLM capabilities. FastChat emphasizes openness, providing tools and codebases that are compatible with many popular LLMs and offering an OpenAI-compatible API for ease of integration.

Key Features

FastChat offers a suite of tools and capabilities for the LLM chatbot lifecycle:

  • LLM Chatbot Serving Framework:
    • Distributed Multi-Model Serving: Architecture includes a central controller, one or more model workers (each hosting an LLM), and web servers (API and UI). This allows for serving multiple different models simultaneously or scaling a single model with multiple workers.
    • OpenAI-Compatible API: Provides RESTful API endpoints that mimic the OpenAI API schema (e.g., for chat completions), making it a local drop-in replacement for applications already using OpenAI's SDKs (see the client sketch after this feature list).
    • Gradio Web UI: Includes a user-friendly web interface built with Gradio for direct interaction with the hosted chatbots. Supports features like conversation history and model selection.
  • Training & Fine-tuning Capabilities:
    • Provides scripts and methodologies for fine-tuning various LLMs on conversational datasets.
    • Vicuna Model: Famous for releasing Vicuna, a high-quality chatbot fine-tuned from Llama models using user-shared conversations from ShareGPT.com. FastChat provides the recipe for training such models.
    • Supports techniques like LoRA (Low-Rank Adaptation) for more efficient fine-tuning.
  • Broad Model Support:
    • Designed to be compatible with a wide range of open-source LLMs, especially those based on the Llama architecture (e.g., Llama 2, Llama 3, Vicuna, Alpaca).
    • Supports models from various sources, often in formats compatible with Hugging Face Transformers.
    • Integrates with different model backends and quantization techniques (e.g., ExLlamaV2, GPTQ, AWQ) for optimized inference and reduced memory footprint.
  • Chatbot Arena (lmarena.ai):
    • A flagship project powered by FastChat, Chatbot Arena is a crowdsourced platform for evaluating LLMs.
    • Users engage in anonymous, randomized side-by-side battles between two different chatbot models and vote for the one that provides a better response.
    • This human preference data is used to calculate Elo ratings and generate a live leaderboard, offering valuable insights into the relative performance of various LLMs.
  • MT-Bench:
    • A benchmark featuring challenging, multi-turn, and open-ended questions designed by LMSYS Org for evaluating the conversational and instruction-following capabilities of chatbots.
  • Open Source & Community Driven:
    • The entire FastChat framework is open-source (Apache 2.0 license), encouraging contributions and adaptations from the research and developer community.
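
As an illustration of the OpenAI-compatible API mentioned above, the sketch below points the official openai Python client (v1.x interface) at a locally running FastChat API server. The base URL, port, and model name (vicuna-7b-v1.5) are assumptions that depend on how you launch the serving components described in the installation guide; treat this as a minimal sketch rather than a definitive integration.

  # Minimal sketch: querying a FastChat-served model through its OpenAI-compatible API.
  # Assumes the controller, a model worker, and the API server (port 8000) are running,
  # and that a model registered as "vicuna-7b-v1.5" is being served.
  from openai import OpenAI

  client = OpenAI(
      base_url="http://localhost:8000/v1",  # FastChat's OpenAI-compatible endpoint
      api_key="EMPTY",                      # FastChat does not require a real key by default
  )

  response = client.chat.completions.create(
      model="vicuna-7b-v1.5",  # must match the name the model worker registered
      messages=[{"role": "user", "content": "Explain what FastChat is in one sentence."}],
      temperature=0.7,
  )
  print(response.choices[0].message.content)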

Specific Use Cases

FastChat and its associated projects are valuable for a wide range of applications and research endeavors:

  • Research in Conversational AI & LLMs: Providing a platform for studying LLM behavior, developing new training techniques, and evaluating model performance.
  • Training and Fine-tuning Custom Chatbots: Enabling users to fine-tune base LLMs on their own datasets to create specialized chatbots for specific tasks or domains.
  • Serving Open-Source LLMs: Offering a scalable and flexible system for deploying open-source LLMs for use in applications, research, or internal tools.
  • Evaluating Chatbot Performance: Chatbot Arena provides a unique and widely recognized method for assessing the quality and helpfulness of different LLMs based on human preferences.
  • Developing Local or Private AI Chat Solutions: The framework can be deployed locally, allowing for private interaction with LLMs.
  • Backend for Custom Chatbot Interfaces: The OpenAI-compatible API allows developers to connect FastChat-served models to custom frontends or applications.
  • Educational Purposes: Helping students and researchers learn about LLM architecture, fine-tuning, serving, and evaluation.

Installation and Setup Guide

Setting up FastChat typically involves installing the Python package and then configuring the serving components or training scripts:

  1. Prerequisites:

    • Python (versions 3.8 - 3.10 are commonly used; check the latest requirements).
    • pip for package installation.
    • Git for cloning the repository if installing from source.
    • Sufficient hardware (see Hardware Requirements section).
    • CUDA or ROCm for GPU acceleration if desired.
  2. Installation:

    • Method 1: With pip (Recommended):
      pip3 install "fschat[model_worker,webui]"
      
      This installs FastChat along with the dependencies for running model workers and the web UI. If you only need the core library, you can install plain fschat without the extras.
    • Method 2: From Source:
      git clone https://github.com/lm-sys/FastChat.git
      cd FastChat
      pip3 install --upgrade pip  # To ensure pip is up-to-date
      pip3 install -e ".[model_worker,webui]"
      
  3. Downloading Base LLM Weights:

    • FastChat itself provides the framework; you need to download the weights for the LLMs you intend to serve or fine-tune (e.g., Llama 3, Vicuna, Mistral) from sources like Hugging Face Hub. Ensure you comply with the license terms of these models.
  4. Setting Up the Serving System (Example for Vicuna-7B): The FastChat serving system consists of three main components: Controller, Model Worker(s), and a Web Server (API and/or UI).

    • Step 1: Launch the Controller:
      python3 -m fastchat.serve.controller
      
      The controller manages the distributed model workers. It typically runs on http://localhost:21001.
    • Step 2: Launch the Model Worker(s): This worker hosts the LLM. You need to specify the path to your downloaded model weights.
      # Example for Vicuna-7B (replace with your actual model path)
      python3 -m fastchat.serve.model_worker --model-path /path/to/your/vicuna-7b-v1.5
      
      You can launch multiple model workers to serve different models or to scale a single model; each worker registers itself with the controller. When running multiple workers on one machine, give each a distinct --port (and --worker-address) and, if needed, pin them to different GPUs (e.g., via CUDA_VISIBLE_DEVICES).
    • Step 3: Launch the Web UI Server (Gradio):
      python3 -m fastchat.serve.gradio_web_server
      
      This starts the Gradio web interface, usually accessible at http://localhost:7860. You should see the connected models available for chat.
    • Step 4: Launch the OpenAI-Compatible RESTful API Server:
      python3 -m fastchat.serve.openai_api_server --host localhost --port 8000
      
      This makes the LLMs accessible via an API endpoint (e.g., http://localhost:8000/v1/) that mimics the OpenAI API structure.
  5. Fine-tuning (Example - Vicuna from Llama):

    • The FastChat repository contains scripts and instructions for fine-tuning models like Llama to create Vicuna-like chatbots. This typically involves:
      • Preparing a conversational dataset (e.g., in ShareGPT format; a minimal format sketch follows this list).
      • Running the fine-tuning script (e.g., fastchat/train/train_mem.py or similar) with appropriate parameters for the base model, dataset, and training configuration.
    • Detailed instructions are usually found in the docs/training.md file within the GitHub repository.
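
As a concrete illustration of the dataset-preparation step above, the hedged sketch below writes a tiny conversational dataset in the ShareGPT-style JSON layout that FastChat's fine-tuning scripts commonly consume: a list of records, each with an id and alternating human/assistant turns. The exact fields expected can vary between script versions, so verify the schema against the training documentation before use.

  # Illustrative sketch of a ShareGPT-style dataset file for fine-tuning.
  # Field names ("id", "conversations", "from", "value") follow the widely used
  # ShareGPT convention; confirm them against the training script version you run.
  import json

  dataset = [
      {
          "id": "example-0001",
          "conversations": [
              {"from": "human", "value": "What is FastChat?"},
              {"from": "gpt", "value": "FastChat is an open platform for training, "
                                       "serving, and evaluating LLM-based chatbots."},
          ],
      },
  ]

  with open("my_sharegpt_data.json", "w", encoding="utf-8") as f:
      json.dump(dataset, f, ensure_ascii=False, indent=2)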

Hardware Requirements

Running and especially fine-tuning LLMs with FastChat can be resource-intensive:

  • GPU (Graphics Processing Unit):
    • Serving:
      • For 7B parameter models (like Vicuna-7B): Approximately 14GB of GPU VRAM for full precision (FP16). This can be reduced by about half (to ~7-8GB VRAM) with 8-bit quantization.
      • For 13B parameter models (like Vicuna-13B): Approximately 28GB of GPU VRAM (FP16), reducible to ~14-15GB with 8-bit quantization.
      • Larger models will require proportionally more VRAM (a rough estimation sketch follows this list).
    • Fine-tuning: Requires significantly more VRAM than inference, often necessitating multiple high-end GPUs for larger models or full fine-tuning. LoRA fine-tuning can reduce VRAM requirements.
  • RAM (System Memory):
    • If running in CPU-only mode (very slow for inference, impractical for training):
      • Vicuna-7B: Around 30GB of CPU RAM.
      • Vicuna-13B: Around 60GB of CPU RAM.
    • Even with GPU usage, having ample system RAM (e.g., 32GB, 64GB, or more) is beneficial for data loading and other processes.
  • Storage: Sufficient disk space for the FastChat installation, Python environment, downloaded base model weights (which can range from ~13GB for a 7B model to over 100GB for very large models), datasets for fine-tuning, and saved fine-tuned models. SSDs are highly recommended.
  • CPU: A modern multi-core CPU is beneficial, but the GPU does the heavy lifting for model operations.
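
The serving figures above follow a simple rule of thumb: model weights occupy roughly 2 bytes per parameter at FP16 and roughly 1 byte per parameter with 8-bit quantization, with a few extra gigabytes needed on top for activations and the KV cache. The sketch below is only a back-of-the-envelope estimator built on that assumption, not a precise measurement.

  # Back-of-the-envelope estimate of the GPU memory needed just to hold model weights.
  # Real serving needs a few extra GB on top of this for activations and the KV cache.
  def weight_vram_gb(num_params_billions: float, bytes_per_param: float) -> float:
      return num_params_billions * 1e9 * bytes_per_param / 1e9

  for name, params in [("7B", 7), ("13B", 13)]:
      fp16 = weight_vram_gb(params, bytes_per_param=2.0)  # FP16 = 2 bytes/parameter
      int8 = weight_vram_gb(params, bytes_per_param=1.0)  # 8-bit = 1 byte/parameter
      print(f"{name}: ~{fp16:.0f} GB (FP16 weights), ~{int8:.0f} GB (8-bit weights)")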

License

FastChat is released under the Apache 2.0 License. This is a permissive open-source license that allows for commercial use, modification, and distribution, subject to the terms of the license.

Note: The models you use with FastChat (e.g., Llama 2, Llama 3, Vicuna) have their own separate licenses that you must also comply with.

Frequently Asked Questions (FAQ)

Q1: What is FastChat? A1: FastChat is an open-source platform by LMSYS Org for training, serving, and evaluating large language model-based chatbots. It provides tools and code to work with models like Vicuna and powers the popular Chatbot Arena for LLM evaluation.

Q2: What is Vicuna? A2: Vicuna is a series of open-source chatbots fine-tuned by LMSYS Org, typically by fine-tuning Llama base models on user-shared conversations from ShareGPT.com. Vicuna models are known for their strong conversational abilities and are often used with the FastChat platform.

Q3: What is Chatbot Arena? A3: Chatbot Arena (https://lmarena.ai/) is a research project powered by FastChat where users can anonymously chat with two different LLMs side-by-side and vote for which one provides a better response. This crowdsourced human feedback is used to rank LLMs using an Elo rating system, providing a valuable public leaderboard.
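
To illustrate the Elo mechanism mentioned above, here is a minimal sketch of how a single pairwise vote could update two models' ratings. Chatbot Arena's actual methodology is more involved (LMSYS has also used Bradley-Terry-style models and confidence intervals), so this only conveys the basic idea of turning votes into rankings; the starting ratings and K-factor below are arbitrary choices for the example.

  # Minimal sketch of an Elo update from one head-to-head vote between two models.
  # K controls how strongly a single vote moves the ratings.
  def elo_update(rating_a: float, rating_b: float, a_won: bool, k: float = 32.0):
      expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))
      score_a = 1.0 if a_won else 0.0
      new_a = rating_a + k * (score_a - expected_a)
      new_b = rating_b + k * ((1.0 - score_a) - (1.0 - expected_a))
      return new_a, new_b

  # Example: both models start at 1000; model A wins one battle.
  print(elo_update(1000.0, 1000.0, a_won=True))  # -> (1016.0, 984.0)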

Q4: Can I run FastChat and its models locally? A4: Yes, FastChat is designed to allow users to serve and fine-tune LLMs on their own hardware, enabling local and private AI chatbot applications.

Q5: Does FastChat provide an API similar to OpenAI? A5: Yes, FastChat includes an OpenAI-compatible RESTful API server. This allows developers to use FastChat-served models as a local drop-in replacement for OpenAI APIs in their applications, often by just changing the API base URL.

Q6: What hardware do I need to run FastChat with models like Vicuna-7B? A6: For inference with a 7B parameter model like Vicuna-7B, you'd ideally want an NVIDIA GPU with at least 14GB of VRAM (or ~7-8GB with 8-bit quantization). For CPU-only, around 30GB of RAM is needed, but it will be much slower. Fine-tuning requires more substantial GPU resources.

Q7: Is FastChat free? A7: Yes, FastChat is free and open-source software, licensed under Apache 2.0. You will incur costs for hardware and, if applicable, for any proprietary base models you might choose to use (though FastChat primarily focuses on open models).

Q8: How can I contribute to or get support for FastChat? A8: You can contribute to the project via its GitHub repository. For support, refer to the GitHub issues, discussions, and the LMSYS Org community channels (like Discord, if available).

Resources

Here are examples of official and community resources to learn more about FastChat, Vicuna, and Chatbot Arena:

  • LMSYS Org Official Website & Blog: The primary source for research announcements, project updates, and insights from the creators.
    • LMSYS Org Website: https://lmsys.org/
    • Vicuna Release Blog Post: "Vicuna: An Open-Source Chatbot Impressing GPT-4 with 90%* ChatGPT Quality" (https://lmsys.org/blog/2023-03-30-vicuna/), the initial Vicuna announcement from LMSYS Org.
    • Chatbot Arena Blog Posts: LMSYS Org publishes updates and analyses from Chatbot Arena on their blog. Example: "Announcing a New Site for Chatbot Arena" (https://lmsys.org/blog/2024-09-20-arena-new-site/) or "From Live Data to High-Quality Benchmarks: The Arena-Hard Pipeline" (https://lmsys.org/blog/2024-04-19-arena-hard/).
  • Chatbot Arena Leaderboard: The live leaderboard and further details about Chatbot Arena, hosted on Hugging Face Spaces (huggingface.co/spaces/lmsys/chatbot-arena-leaderboard) and at lmarena.ai.
  • Tutorials & Integration Guides: The FastChat GitHub repository's README and docs/ folder cover setup and integration topics, including how to connect FastChat-served models to LangChain (docs/langchain_integration.md).
  • Community Discussions & GitHub:
    • The FastChat GitHub repository's "Discussions" and "Issues" tabs are valuable resources.
    • Subreddits like r/LocalLLaMA often feature discussions about FastChat, Vicuna, and other local LLM solutions.

Ethical Considerations & Safety

  • Model Responsibility: Users are responsible for the LLMs they train or serve using FastChat and must adhere to the licenses and use policies of those models.
  • Data for Fine-tuning: When fine-tuning models (like Vicuna using ShareGPT data), the ethical implications of using publicly shared conversational data should be considered.
  • Output Quality & Bias: LLMs can generate incorrect, biased, or harmful content. Applications built with FastChat should incorporate safety measures and users should be aware of these limitations.
  • Chatbot Arena: While the Arena provides valuable insights, its rankings rest on human preference, which can be subjective, and factors such as the prompt distribution and differing access to evaluation data can influence a model's position on the leaderboard.

Last updated: May 16, 2025
