AI-Chatbox is a voice-interaction project based on the ESP32S3 development board. Users talk to a large language model (LLM) by voice: the device converts speech to text, sends it to the LLM, and can read the answer back aloud once it arrives. The project is written in Rust and integrates the Vosk speech recognition toolkit, making it suitable for people who cannot conveniently use phone apps, such as children, the elderly, or visually impaired users. The hardware is a XIAO ESP32S3 Sense combined with voice-codec hardware, and the source code is open and hosted on GitHub. The project aims to provide a convenient voice-interaction experience for embedded development enthusiasts and smart-hardware developers.
Feature List
- Wake word and command recognition: recording is triggered by the wake-up word "Hi, Lexin" (the `wn9_hilexin` model) followed by the command word "I have a question".
- Speech-to-text: recorded WAV audio is converted to text with the Vosk toolkit; Chinese recognition is supported.
- Large model interaction: the transcribed question is sent to the DeepSeek API, which returns an intelligent answer.
- Logging: real-time log viewing makes it easy to debug and monitor device status.
- Cross-device access: a Flask-based REST service lets other devices on the LAN call the speech-to-text function (see the sketch after this list).
- Embedded optimization: the Rust code is tuned for embedded devices and configured to generate at most 512 tokens, balancing performance against resources.
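The repository's actual `vosk_server.py` may differ, but a minimal sketch of such a Flask + Vosk service, assuming the unzipped `vosk-model-cn-0.22` directory sits next to the script, could look like this:

```python
# Hypothetical minimal vosk_server.py; the project's real script may differ.
import json
import wave

from flask import Flask, jsonify, request
from flask_cors import CORS
from vosk import KaldiRecognizer, Model

app = Flask(__name__)
CORS(app)  # allow calls from other devices on the LAN
model = Model("vosk-model-cn-0.22")  # path to the unzipped Chinese model

@app.route("/transcribe", methods=["POST"])
def transcribe():
    # Expect a WAV upload in the "file" form field
    wav = wave.open(request.files["file"], "rb")
    rec = KaldiRecognizer(model, wav.getframerate())
    while True:
        data = wav.readframes(4000)
        if not data:
            break
        rec.AcceptWaveform(data)
    return jsonify(json.loads(rec.FinalResult()))  # {"text": "..."}

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)
```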
Usage Guide
Installation and Configuration
- Prepare the hardware
  Requires a XIAO ESP32S3 Sense development board with a microphone and voice-codec support; external voice-codec hardware can improve audio processing. Make sure an SD card is attached to the board for storing the voice models.
- Configure the development environment
- Install the Rust on ESP toolchain; see the official guide (Rust on ESP).
- Install a Python environment for running the Vosk speech-to-text service.
- Download the Vosk Chinese model (`vosk-model-cn-0.22.zip`) from the Vosk official site and unzip it to a local directory.
- Copy the speech model files (`mn7_cn`, `nsnet2`, `vadnet1_medium`, `wn9_hilexin`) to the root directory of the SD card (see the layout sketch below).
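Assuming the models are placed directly at the card's root as described above (whether each is a single file or a directory depends on how the project ships them), the SD card would look roughly like:

```
/  (SD card root)
├── mn7_cn
├── nsnet2
├── vadnet1_medium
└── wn9_hilexin
```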
- Install dependencies
  Install the Python dependencies: `pip install vosk flask flask-cors`
  Make sure the Rust environment is configured and load the espup environment: `source $HOME/export-esp.sh`
- Compile and upload the firmware
  - Clone the project repository: `git clone https://github.com/paul356/ai-chatbox.git`
  - Enter the project directory and run the build: `cargo build`
  - After a successful build, flash the firmware to the ESP32S3 board: `cargo espflash flash -p /dev/ttyACM0 --flash-size 8mb`
  - Set the environment variables (Wi-Fi credentials and DeepSeek API key):

    ```
    export WIFI_SSID=<your-ssid>
    export WIFI_PASS=<your-password>
    export LLM_AUTH_TOKEN=<your-deepseek-token>
    ```
- Run the speech-to-text service
  - From the directory one level above `vosk-model-cn-0.22`, run: `python vosk_server.py`
  - Once started, the service listens at `http://0.0.0.0:5000/transcribe`; it accepts a POSTed WAV file and returns the recognized text.
- Test the service
  Test the speech-to-text service with: `curl -X POST -F "file=@record.wav" http://127.0.0.1:5000/transcribe`
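For callers that prefer Python over curl, a hypothetical equivalent test (same `record.wav` as above) could be:

```python
import requests

# POST a local WAV file to the transcription endpoint and print the JSON reply
with open("record.wav", "rb") as f:
    resp = requests.post("http://127.0.0.1:5000/transcribe", files={"file": f})
print(resp.json())
```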
Workflow
- Start the device
  Connect the development board, run the firmware, and watch the logs with: `cargo espflash monitor`
- Voice interaction
  - Say the wake-up word "Hi, Lexin" to activate the device.
  - Say the command word "I have a question" to enter recording mode.
  - Ask your question; the device stops recording automatically after it detects 2 seconds of silence (a concept sketch of the cutoff follows this list).
  - The speech is converted to text by the Vosk service and sent to the DeepSeek API; the answer is recorded in the log (request sketch after "View the log").
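The firmware implements the 2-second silence cutoff in Rust; the following Python sketch only illustrates the idea, with an assumed sample rate, frame size, and amplitude threshold:

```python
# Concept sketch of a 2-second silence cutoff over 16-bit mono PCM frames.
# All constants are illustrative, not the firmware's actual values.
import math
import struct

SAMPLE_RATE = 16000      # assumed recording rate (Hz)
FRAME_SAMPLES = 1600     # 100 ms per frame
SILENCE_RMS = 500        # hypothetical quiet-frame amplitude threshold
SILENCE_SECONDS = 2.0    # stop after this much continuous silence

def frame_rms(frame: bytes) -> float:
    """RMS amplitude of one 16-bit little-endian mono PCM frame."""
    n = len(frame) // 2
    samples = struct.unpack(f"<{n}h", frame[: n * 2])
    return math.sqrt(sum(s * s for s in samples) / max(n, 1))

def should_stop(frames) -> bool:
    """True once SILENCE_SECONDS of consecutive quiet frames have arrived."""
    quiet = 0.0
    for frame in frames:
        if frame_rms(frame) < SILENCE_RMS:
            quiet += FRAME_SAMPLES / SAMPLE_RATE
            if quiet >= SILENCE_SECONDS:
                return True
        else:
            quiet = 0.0  # any loud frame resets the timer
    return False
```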
- View the log
  The log shows device status, speech recognition results, and LLM responses. For example, asking "What is a large model?" may return a detailed definition and description of its capabilities.
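The firmware makes the LLM call in Rust; for reference, roughly the same request expressed in Python against DeepSeek's OpenAI-compatible chat completions API (endpoint and model name per DeepSeek's public documentation; error handling kept minimal) might look like:

```python
import os

import requests

API_URL = "https://api.deepseek.com/chat/completions"

def ask_llm(question: str) -> str:
    resp = requests.post(
        API_URL,
        headers={"Authorization": f"Bearer {os.environ['LLM_AUTH_TOKEN']}"},
        json={
            "model": "deepseek-chat",
            "messages": [{"role": "user", "content": question}],
            "max_tokens": 512,  # matches the firmware's 512-token cap
        },
        timeout=60,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]

print(ask_llm("What is a large model?"))
```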
Caveats
- Speak clearly: the Vosk model is small, so clear pronunciation improves recognition accuracy.
- Network connection: the device needs a Wi-Fi connection to reach the DeepSeek API.
- Model storage: make sure the SD card has enough space for the speech models (several hundred MB).
- Check components while testing: use the logs to verify that speech-to-text and LLM interaction work; error messages are logged with an `Error:` prefix.
Application Scenarios
- Intelligent assistant
  Users interact with the device by voice to get answers or help with tasks; well suited to children and the elderly. For example, a child can ask "Why does the sun shine?" and get an easy-to-understand answer.
- Screenless device interaction
  Visually impaired users, or anyone who cannot conveniently use a phone, can look up information or hold everyday conversations entirely by voice.
- Embedded development lab
  Developers can build on this project to learn how Rust is used on embedded devices and to explore integrating speech recognition with large models.
- Education and learning
  Students can ask academic questions by voice; the device connects to a large model to provide well-informed answers, suitable for classroom or self-study use.
QA
- Which languages does the Vosk model support?
  The current project uses `vosk-model-cn-0.22` (Chinese). Vosk's official site provides models for other languages, which can be swapped in as needed.
- How can speech recognition accuracy be improved?
  Speak clearly and avoid background noise. A higher-quality microphone or a larger model (such as `vosk-model-cn-0.22-large`) can also improve results.
- How do I get a DeepSeek API key?
  Register on the DeepSeek website, request an API key, and set it in the `LLM_AUTH_TOKEN` environment variable.
- Does the device support offline operation?
The speech-to-text service (Vosk) works offline, but LLM interactions require network access to the DeepSeek API.