A training-free framework for converting text to video by reusing pre-trained text-to-image diffusion models.
Text2Video-Zero is an innovative open-source project developed by Picsart AI Research (PAIR) that enables zero-shot video generation and editing by leveraging the capabilities of existing pre-trained text-to-image diffusion models, such as Stable Diffusion. Its core contribution is a low-cost approach that adapts these powerful image models for the video domain without requiring any task-specific training on video datasets. This means it can generate or modify video clips based on textual prompts by intelligently manipulating latent codes and attention mechanisms across frames.
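As a rough illustration of the latent-code manipulation, here is a conceptual sketch only (not the project's actual code, and the helper name is hypothetical): the first frame's initial latent is translated a little further for each subsequent frame before denoising, which induces coherent global motion.

```python
# Conceptual sketch of the "motion dynamics" idea: translate the first frame's
# latent progressively for each later frame. torch.roll stands in for the
# warping operator described in the paper.
import torch

def motion_enriched_latents(x0: torch.Tensor, num_frames: int, dx: int = 1, dy: int = 1) -> torch.Tensor:
    """x0: (channels, height, width) latent of the first frame; dx/dy are illustrative per-frame shifts."""
    frames = []
    for k in range(num_frames):
        # The shift grows linearly with the frame index k.
        frames.append(torch.roll(x0, shifts=(k * dy, k * dx), dims=(-2, -1)))
    return torch.stack(frames)

# Example: 8 frames of a 4x64x64 Stable Diffusion latent
latents = motion_enriched_latents(torch.randn(4, 64, 64), num_frames=8)
```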
The project, available on GitHub, is primarily targeted at researchers, developers, and AI enthusiasts interested in the mechanics of video synthesis and in exploring novel ways to create and edit video content using the strong priors learned by large-scale text-to-image models. Because all processing can occur locally when the underlying text-to-image models are run locally, it also suits privacy-sensitive workflows.
Text2Video-Zero offers a unique set of features for zero-shot video manipulation, controlled through a handful of generation parameters (see the pipeline sketch after this list for how they are typically passed):

- `motion_field_strength_x`, `motion_field_strength_y`: Control the intensity of generated motion.
- `t0`, `t1`: Timesteps that influence how latent codes are manipulated between frames, affecting motion and consistency.
- `video_length`: The number of frames to generate (typically short clips, e.g., 8 frames by default, extendable by chunking).
- `chunk_size`: Generate longer videos chunk by chunk to manage memory.

Text2Video-Zero is primarily a research project and tool for developers and enthusiasts, with use cases including text-to-video generation, text-guided video editing (Video Instruct-Pix2Pix), and conditional generation guided by poses, edges, or depth maps.
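As a quick illustration of the parameters listed above, here is a minimal sketch using the `TextToVideoZeroPipeline` from the Hugging Face `diffusers` integration described later on this page; the numeric values are illustrative, not authoritative defaults.

```python
import torch
from diffusers import TextToVideoZeroPipeline

pipe = TextToVideoZeroPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

frames = pipe(
    prompt="A panda playing guitar on Times Square",
    video_length=8,              # number of frames (short clip)
    motion_field_strength_x=12,  # horizontal motion intensity (illustrative value)
    motion_field_strength_y=12,  # vertical motion intensity (illustrative value)
    t0=44,                       # latent-manipulation timesteps (illustrative values)
    t1=47,
).images
```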
Using Text2Video-Zero typically involves setting up a Python environment, downloading pre-trained text-to-image models, and running the provided scripts:
Prerequisites & Setup:

- Hardware: a CUDA-capable GPU with roughly 12GB of VRAM is recommended (see the FAQ); memory use can be reduced with the `chunk_size` parameter for generating longer videos. Performance is significantly better on more powerful GPUs.
- Clone the repository:

```bash
git clone https://github.com/Picsart-AI-Research/Text2Video-Zero.git
cd Text2Video-Zero
```

- Install the dependencies listed in the `requirements.txt` file. This will include PyTorch, `diffusers`, `transformers`, and other libraries:

```bash
pip install -r requirements.txt
```
Download Pre-trained Models: The scripts build on existing text-to-image checkpoints such as `runwayml/stable-diffusion-v1-5` (and `timbrooks/instruct-pix2pix` for Video Instruct-Pix2Pix editing). Checkpoints referenced by a Hugging Face Hub `model_id` are typically downloaded automatically on first use.
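If you prefer to pre-fetch the weights into the local cache (for example before going offline), a small sketch using `huggingface_hub` is shown below; this assumes the `huggingface_hub` package is available, which normally comes in as a dependency of `diffusers`.

```python
# Optional: pre-download base checkpoints from the Hugging Face Hub into the
# local cache so later runs do not need to fetch them.
from huggingface_hub import snapshot_download

snapshot_download("runwayml/stable-diffusion-v1-5")  # base text-to-image model
snapshot_download("timbrooks/instruct-pix2pix")      # for Video Instruct-Pix2Pix editing
```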
Running Inference Scripts:

- Text-to-video: run the text-to-video script (e.g., `run_t2v.py` or similar) with your text prompt and parameters:

```bash
python run_t2v.py --prompt "A panda playing guitar on Times Square" --model_id "runwayml/stable-diffusion-v1-5" --output_path "output_video.mp4"
```

- Video Instruct-Pix2Pix (text-guided video editing):

```bash
python run_vid2vid_instruct.py --input_video "path/to/camel.mp4" --edit_prompt "Make it look like it's sunset time" --model_id "timbrooks/instruct-pix2pix" --output_path "edited_video.mp4"
```

- Conditional generation (e.g., pose-guided) uses its own script (e.g., `run_pose_guided_t2v.py`).
- Common parameters:
  - `prompt`: your text description.
  - `video_length`: number of frames to generate.
  - `motion_field_strength_x`, `motion_field_strength_y`: control motion intensity.
  - `t0`, `t1`: latent interpolation timesteps.
  - `chunk_size`: process longer videos in chunks to save memory.

Using with Hugging Face Diffusers Library:
Text2Video-Zero is also integrated into the Hugging Face `diffusers` library, which provides a `TextToVideoZeroPipeline`:

```python
import imageio
import torch
from diffusers import TextToVideoZeroPipeline

model_id = "runwayml/stable-diffusion-v1-5"  # Example base model
pipe = TextToVideoZeroPipeline.from_pretrained(model_id, torch_dtype=torch.float16).to("cuda")

prompt = "A panda is playing guitar on Times Square"
result_frames = pipe(prompt=prompt).images  # list of frames (NumPy arrays in [0, 1] in the documented example)

# Process frames into a video (e.g., using imageio)
frames_uint8 = [(frame * 255).astype("uint8") for frame in result_frames]
imageio.mimsave("output_video.mp4", frames_uint8, fps=4)
```
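For clips longer than the default 8 frames, the diffusers documentation shows a chunk-by-chunk pattern that re-includes the first frame in every chunk and fixes the random seed so appearance stays consistent across chunks. A sketch along those lines follows; argument names such as `frame_ids` should be checked against the diffusers version you have installed.

```python
import imageio
import numpy as np
import torch
from diffusers import TextToVideoZeroPipeline

model_id = "runwayml/stable-diffusion-v1-5"
pipe = TextToVideoZeroPipeline.from_pretrained(model_id, torch_dtype=torch.float16).to("cuda")

prompt = "A panda is playing guitar on Times Square"
video_length = 24   # total frames to generate
chunk_size = 8      # frames processed per forward pass (controls peak VRAM)
seed = 0

# Generate the video chunk by chunk; each chunk re-includes frame 0 so that
# cross-frame attention keeps the appearance consistent across chunks.
result = []
chunk_starts = np.arange(0, video_length, chunk_size - 1)
generator = torch.Generator(device="cuda")
for i, start in enumerate(chunk_starts):
    end = video_length if i == len(chunk_starts) - 1 else chunk_starts[i + 1]
    frame_ids = [0] + list(range(start, end))
    generator.manual_seed(seed)  # same seed in every chunk for temporal consistency
    output = pipe(prompt=prompt, video_length=len(frame_ids),
                  generator=generator, frame_ids=frame_ids)
    result.append(output.images[1:])  # drop the duplicated first frame

frames = np.concatenate(result)
imageio.mimsave("long_video.mp4", [(f * 255).astype("uint8") for f in frames], fps=4)
```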
Text2Video-Zero is an open-source research project released by Picsart AI Research.
Text2Video-Zero is published under the CreativeML Open RAIL-M license. This license is designed for open and responsible AI development: it permits free use, modification, and redistribution of the code and weights, but attaches use-based restrictions that prohibit harmful or illegal applications and must be passed on to downstream users.
Users must review the full CreativeML Open RAIL-M license text available in the GitHub repository to understand all terms, conditions, and restrictions before using the software or models derived from it, especially for any public-facing or commercial applications.
Q1: What is Text2Video-Zero? A1: Text2Video-Zero is an open-source AI method and codebase from Picsart AI Research that enables the generation and editing of short video clips from text prompts by leveraging pre-trained text-to-image diffusion models (like Stable Diffusion) without requiring specific video training data.
Q2: How does "zero-shot" video generation work in Text2Video-Zero? A2: It "adapts" existing text-to-image models for video tasks. Key techniques include enriching the latent codes of generated frames with motion dynamics (to ensure temporal consistency in background/scene) and reprogramming frame-level self-attention to a cross-frame attention mechanism (focusing on the first frame to maintain object appearance and context across the sequence).
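As a conceptual sketch of the cross-frame attention idea (not the project's actual code, and the function name is hypothetical), each frame's queries attend to the keys and values of the first frame instead of its own:

```python
# Conceptual sketch: cross-frame attention reuses the FIRST frame's keys and
# values for every frame, which keeps object appearance consistent.
import torch

def cross_frame_attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    """q, k, v: (frames, tokens, dim) projections from a UNet attention layer."""
    dim = q.shape[-1]
    # Ordinary self-attention would use each frame's own k/v; here every frame
    # attends to frame 0's k/v (broadcast over the frame dimension).
    attn = torch.softmax(q @ k[:1].transpose(1, 2) / dim**0.5, dim=-1)
    return attn @ v[:1]
```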
Q3: Do I need to train a new model to use Text2Video-Zero? A3: No, the "zero-shot" aspect means it's designed to work with existing, pre-trained text-to-image diffusion models (like Stable Diffusion v1.5) without needing to fine-tune them on video data for basic text-to-video generation or Video Instruct-Pix2Pix editing.
Q4: What kind of videos can Text2Video-Zero create? A4: It can generate short video clips (typically a few seconds, e.g., 8 frames by default, extendable by chunking) from text prompts. It also supports text-guided video editing (Video Instruct-Pix2Pix) and conditional generation using poses, edges, or depth maps.
Q5: What hardware is required to run Text2Video-Zero?
A5: A GPU with at least 12GB of VRAM is recommended for reasonable performance. The system can be configured to use less VRAM by processing videos in smaller chunks (the `chunk_size` parameter), but this will increase generation time.
Q6: Is Text2Video-Zero free? A6: Yes, the code and research are open-source, released under the CreativeML Open RAIL-M license, making it free to use and modify according to the license terms.
Q7: Can I use Text2Video-Zero for commercial purposes? A7: The CreativeML Open RAIL-M license has specific use-based restrictions aimed at preventing harmful applications. While it might not explicitly forbid all commercial uses, users must carefully review the license to ensure their intended application complies with all its clauses and restrictions. The license often prioritizes responsible and ethical use over unrestricted commercial exploitation.
Q8: Where can I find the research paper for Text2Video-Zero? A8: The paper is titled "Text2Video-Zero: Text-to-Image Diffusion Models are Zero-Shot Video Generators." It is available on arXiv at https://arxiv.org/abs/2303.13439 and is linked from the project's GitHub page.
Resources that can help you understand and use Text2Video-Zero include the project's GitHub repository, the arXiv paper, and the documentation for the Text2Video-Zero integration in the Hugging Face `diffusers` library.
Last updated: May 16, 2025