chatterbox-vllm: Accelerating Chatterbox TTS with vLLM for Enhanced Performance

Summary
chatterbox-vllm is a high-performance port of the Chatterbox Text-to-Speech (TTS) model to vLLM, designed to significantly improve generation speed and GPU memory efficiency. This personal project aims to provide a more efficient, easier-to-integrate solution for speech synthesis, offering substantial speedups over the original implementation. While currently usable and demonstrating benchmark-topping throughput, it relies on internal vLLM APIs and hacky workarounds, with further refactoring planned.
Introduction
chatterbox-vllm is a project that ports the Chatterbox Text-to-Speech (TTS) model to vLLM, a high-performance inference engine. Developed by randombk, this repository aims to dramatically enhance the performance and efficiency of the Chatterbox model, making it faster and more memory-friendly on GPUs. It's a personal project focused on leveraging vLLM's capabilities for state-of-the-art speech synthesis. Early benchmarks indicate significant speedups, making it an exciting development for anyone working with TTS models.
Installation
This project primarily supports Linux and WSL2 with Nvidia hardware. While AMD might work with minor adjustments, it has not been tested.
Prerequisites: Ensure git and uv (a fast Python package installer and resolver) are installed on your system.
git clone https://github.com/randombk/chatterbox-vllm.git
cd chatterbox-vllm
uv venv
source .venv/bin/activate
uv sync
The necessary model weights should be downloaded automatically from the Hugging Face Hub. If you encounter CUDA-related issues, try resetting your virtual environment and using uv pip install -e . instead of uv sync.
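If you need to reset, one possible sequence (assuming the repository was cloned and the .venv was created as above) is:
rm -rf .venv
uv venv
source .venv/bin/activate
uv pip install -e .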
Examples
To quickly generate audio samples, you can run the provided example-tts.py script. This example demonstrates how to generate speech for multiple prompts using different voices.
import torchaudio as ta
from chatterbox_vllm.tts import ChatterboxTTS

if __name__ == "__main__":
    model = ChatterboxTTS.from_pretrained(
        gpu_memory_utilization=0.4,
        max_model_len=1000,
        # Disable CUDA graphs to reduce startup time for one-off generation.
        enforce_eager=True,
    )

    for i, audio_prompt_path in enumerate([None, "docs/audio-sample-01.mp3", "docs/audio-sample-03.mp3"]):
        prompts = [
            "You are listening to a demo of the Chatterbox TTS model running on VLLM.",
            "This is a separate prompt to test the batching implementation.",
            "And here is a third prompt. It's a bit longer than the first one, but not by much.",
        ]
        audios = model.generate(prompts, audio_prompt_path=audio_prompt_path, exaggeration=0.8)
        for audio_idx, audio in enumerate(audios):
            ta.save(f"test-{i}-{audio_idx}.mp3", audio, model.sr)
Why Use It
The primary motivation behind chatterbox-vllm is to overcome the performance bottlenecks and improve the GPU memory utilization of the original Chatterbox TTS model. By porting it to vLLM, the project achieves:
- Improved Performance: Early benchmarks show significant speedups, with generation tokens/s increasing by approximately 4x without batching and over 10x with batching (see the batching sketch after this list). This is a substantial improvement over the original implementation, which was often bottlenecked by CPU-GPU synchronization.
- Efficient GPU Memory Use: vLLM's optimized inference infrastructure allows for more efficient use of GPU memory, enabling higher throughput and potentially larger batch sizes.
- Easier Integration: The vLLM port facilitates easier integration with modern, high-performance inference systems, streamlining deployment and scaling of TTS applications.
- Benchmark-Topping Throughput: The project currently boasts impressive throughput, particularly for the T3 Llama token generation component, which is no longer the bottleneck in the TTS pipeline.
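To illustrate the batching point, here is a minimal sketch built on the API shown in the example above. The prompt list size, the gpu_memory_utilization value, and the timing code are illustrative assumptions, not settings or results from the repository:

import time
import torchaudio as ta
from chatterbox_vllm.tts import ChatterboxTTS

if __name__ == "__main__":
    # Illustrative memory budget; tune for your GPU.
    model = ChatterboxTTS.from_pretrained(
        gpu_memory_utilization=0.8,
        max_model_len=1000,
    )

    # A single generate() call with many prompts lets vLLM batch the
    # T3 token generation instead of synthesizing one utterance at a time.
    prompts = [f"This is batched test sentence number {n}." for n in range(32)]

    start = time.perf_counter()
    audios = model.generate(prompts, audio_prompt_path=None, exaggeration=0.8)
    elapsed = time.perf_counter() - start
    print(f"Generated {len(audios)} utterances in {elapsed:.1f}s")

    for idx, audio in enumerate(audios):
        ta.save(f"batch-{idx}.mp3", audio, model.sr)

Since the reported gains from batching exceed 10x, submitting all prompts in a single generate() call is the main lever for throughput.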
Links
- GitHub Repository: https://github.com/randombk/chatterbox-vllm