Overseas access: www.kdjingpai.com
Bookmark Us

RealtimeVoiceChat is an open source project that focuses on real-time, natural conversations with artificial intelligence via voice. Users use the microphone to input voice , the system captures the audio through the browser , quickly converted to text , a large language model (LLM) to generate a reply , and then convert the text to speech output , the whole process is close to real-time . The project uses a client-server architecture that emphasizes low latency and supports WebSocket streaming and dynamic conversation management. It provides Docker deployment, is recommended to run on Linux system and NVIDIA GPU environment, and integrates RealtimeSTT, RealtimeTTS and Ollama and other technologies that are suitable for developers to build voice interaction applications.

RealtimeVoiceChat:低延迟与AI进行自然口语对话-1

 

Function List

  • real time voice interaction: Users input their voice through the browser microphone, and the system transcribes and generates a voice response in real time.
  • Low latency processing: Optimize speech-to-text and text-to-speech latency to 0.5-1 seconds by streaming audio using WebSocket.
  • Speech to text (STT): Quickly convert speech to text using RealtimeSTT (based on Whisper), with dynamic transcription support.
  • Text-to-speech (TTS): Generate natural speech via RealtimeTTS (supports Coqui, Kokoro, Orpheus) with a choice of voice styles.
  • Intelligent Dialog Management: Integrate language models from Ollama or OpenAI to support flexible dialog generation and interrupt handling.
  • Dynamic Speech Detection: By turndetect.py Enables intelligent silence detection and adapts to the rhythm of the conversation.
  • web interface: Provides a clean browser interface, uses Vanilla JS and Web Audio API, and supports real-time feedback.
  • Docker Deployment: Simplified installation with Docker Compose, support for GPU acceleration and model management.
  • Model Customization: Support for switching between STT, TTS and LLM models to adjust speech and dialog parameters.
  • open source and scalable: The code is publicly available and developers are free to modify or extend the functionality.

 

Using Help

Installation process

RealtimeVoiceChat supports both Docker deployment (recommended) and manual installation; Docker for Linux systems, especially with NVIDIA GPUs, and manual installation for Windows or scenarios where more control is needed. Below are the detailed steps:

Docker Deployment (recommended)

Requires Docker Engine, Docker Compose v2+, and NVIDIA Container Toolkit (GPU users). Linux systems are recommended for optimal GPU support.

  1. clone warehouse::
    git clone https://github.com/KoljaB/RealtimeVoiceChat.git
    cd RealtimeVoiceChat
  1. Building a Docker Image::
    docker compose build
    

    This step downloads the base image, installs Python and machine learning dependencies, and pre-downloads the default STT models (Whisper base.en). It takes a long time, so please make sure your network is stable.

  2. Starting services::
    docker compose up -d
    

    Start the application and the Ollama service with the container running in the background. Wait about 1-2 minutes for initialization to complete.

  3. Pull Ollama model::
    docker compose exec ollama ollama pull hf.co/bartowski/huihui-ai_Mistral-Small-24B-Instruct-2501-abliterated-GGUF:Q4_K_M
    

    This command pulls the default language model. Users can find the default language model in the code/server.py modifications LLM_START_MODEL to use other models.

  4. Validating Model Usability::
    docker compose exec ollama ollama list
    

    Make sure the model is loaded correctly.

  5. Stop or restart the service::
    docker compose down  # 停止服务
    docker compose up -d  # 重启服务
    
  6. View Log::
    docker compose logs -f app  # 查看应用日志
    docker compose logs -f ollama  # 查看 Ollama 日志
    

Manual installation (Windows/Linux/macOS)

Manual installation requires Python 3.9+, CUDA 12.1 (GPU users) and FFmpeg. Windows users can use the provided install.bat Scripts simplify the process.

  1. Installation of basic dependencies::
    • Make sure Python 3.9+ is installed.
    • GPU users install NVIDIA CUDA Toolkit 12.1 and cuDNN.
    • Install FFmpeg:
      # Ubuntu/Debian
      sudo apt update && sudo apt install ffmpeg
      # Windows (使用 Chocolatey)
      choco install ffmpeg
      
  2. Clone the repository and create a virtual environment::
    git clone https://github.com/KoljaB/RealtimeVoiceChat.git
    cd RealtimeVoiceChat
    python -m venv venv
    # Linux/macOS
    source venv/bin/activate
    # Windows
    .\venv\Scripts\activate
    
  3. Installing PyTorch (matching hardware)::
    • GPU (CUDA 12.1):
      pip install torch==2.5.1+cu121 torchaudio==2.5.1+cu121 torchvision --index-url https://download.pytorch.org/whl/cu121
      
    • CPU (slower performance):
      pip install torch torchaudio torchvision
      
  4. Installing additional dependencies::
    cd code
    pip install -r requirements.txt
    

    Note: DeepSpeed can be complicated to install, Windows users can install it via the install.bat Automatic processing.

  5. Install Ollama (non-Docker users)::
    • Refer to the official Ollama documentation for installation.
    • Pull the model:
      ollama pull hf.co/bartowski/huihui-ai_Mistral-Small-24B-Instruct-2501-abliterated-GGUF:Q4_K_M
      
  6. Running the application::
    python server.py
    

caveat

  • hardware requirement: NVIDIA GPUs (at least 8GB of RAM) are recommended to ensure low latency. running on a CPU can cause significant performance degradation.
  • Docker Configuration: Modification code/*.py maybe docker-compose.yml Before you do so, you need to re-run the docker compose buildThe
  • License Compliance: TTS engines (e.g. Coqui XTTSv2) and LLM models have separate licenses and are subject to their terms.

workflow

  1. Accessing the Web Interface::
    • Open your browser and visit http://localhost:8000(or remote server IP).
    • Grant microphone privileges and click "Start" to begin the conversation.
  2. voice interaction::
    • Speak into the microphone and the system captures the audio via the Web Audio API.
    • Audio is transferred to the backend via WebSocket, RealtimeSTT is converted to text, Ollama/OpenAI generates replies, and RealtimeTTS is converted to speech and played back through the browser.
    • Conversation delays are typically 0.5-1 seconds, with support for interrupting and continuing at any time.
  3. Real-time feedback::
    • The interface displays partially transcribed text and AI responses, making it easy for users to follow the conversation.
    • You can click "Stop" to end the dialog and "Reset" to clear the history.
  4. Configuration adjustments::
    • TTS Engine: in code/server.py set up in START_ENGINE(e.g. coqui,kokoro,orpheus), adjusting the voice style.
    • LLM model: Modification LLM_START_PROVIDER cap (a poem) LLM_START_MODELThe program supports either Ollama or OpenAI.
    • STT Parameters: in code/transcribe.py mid-range adjustment Whisper modeling, language, or silence thresholds.
    • Silence detection: in code/turndetect.py modifications silence_limit_seconds(default 0.2 seconds) to optimize conversation pacing.
  5. Debugging and Optimization::
    • View Log:docker compose logs -f(Docker) or just check out the server.py Output.
    • Performance Issues: Ensure CUDA version matching, reduce realtime_batch_size Or use a lightweight model.
    • Network Configuration: If HTTPS is required, set the USE_SSL = True and provide the certificate path (refer to the official SSL configuration).

Featured Function Operation

  • Low Latency Streaming Processing: Audio chunking via WebSocket, combined with RealtimeSTT and RealtimeTTS, with latency as low as 0.5 seconds. Users can have smooth conversations without waiting.
  • Dynamic dialogue management::turndetect.py Intelligently detects the end of speech and supports natural interruptions. For example, a user can interrupt at any time and the system will pause to generate and process new input.
  • Web Interface Interaction: The browser interface uses Vanilla JS and Web Audio API to provide real-time transcription and response display. Users can control the dialog with "Start/Stop/Reset" buttons.
  • Model Flexibility: Support for switching TTS engines (Coqui/Kokoro/Orpheus) and LLM backends (Ollama/OpenAI). For example, switching TTS:
    START_ENGINE = "kokoro"  # 在 code/server.py 中修改
    
  • Docker Management: The service is managed through Docker Compose, and updating the model is only required:
    docker compose exec ollama ollama pull <new_model>
    

 

application scenario

  1. AI Voice Interaction Research
    Developers can test the integration effect of RealtimeSTT, RealtimeTTS, and LLM to explore the optimization of low-latency speech interaction. The open source code supports customized parameters and is suitable for academic research.
  2. Intelligent Customer Service Prototype
    Enterprises can develop voice customer service systems based on the project. Users ask questions by voice and the system answers common questions in real time, such as technical support or product inquiries.
  3. Language Learning Tools
    Educational institutions can utilize the multilingual TTS functionality to develop voice conversation practice tools. Students talk to an AI to practice pronunciation and conversation, and the system provides real-time feedback.
  4. Personal Voice Assistants
    Technology enthusiasts can deploy projects to experience natural voice interaction with AI and simulate intelligent assistants for personal entertainment or small projects.

 

QA

  1. What hardware support is required?
    Recommended Linux system with NVIDIA GPU (at least 8GB of video memory) and CUDA 12.1. CPU runs feasibly, but with high latency. Minimum requirements: Python 3.9+, 8GB RAM.
  2. How to solve Docker deployment problems?
    Ensure that Docker, Docker Compose, and NVIDIA Container Toolkit are installed correctly. Check docker-compose.yml of the GPU configuration to view the logs:docker compose logs -fThe
  3. How do I switch between voice or model?
    modifications code/server.py hit the nail on the head START_ENGINE(TTS) or LLM_START_MODEL(LLM). the Docker user needs to re-pull the model:docker compose exec ollama ollama pull <model>The
  4. What languages are supported?
    RealtimeTTS supports multiple languages (e.g., English, Chinese, and Japanese), and needs to be configured in the code/audio_module.py Specify the language and speech model in the
0Bookmarked
0kudos

Recommended

Can't find AI tools? Try here!

Just type in the keyword Accessibility Bing SearchYou can quickly find all the AI tools on this site.

Top

en_USEnglish