Voxtral is the first open audio model from French AI startup Mistral AI, released on July 15, 2025. It is designed to provide production-ready, out-of-the-box speech understanding for commercial applications at a highly competitive price. Voxtral is available in two versions: a 24B-parameter version for production-scale applications and a 3B-parameter "Mini" version for local and edge deployments. Both are released under the Apache 2.0 license and can be downloaded from Hugging Face and run locally, or integrated into applications via an API. Voxtral does more than transcribe speech: it also understands audio content, supporting direct question answering, summarization, and task execution on audio. The model supports multiple languages, including English, Spanish, French, and Hindi, and can handle up to 30 minutes of audio for transcription or up to 40 minutes for comprehension.
Feature List
- Two model sizes: a 24B-parameter version for large-scale production applications and a 3B-parameter "Mini" version for local and edge computing deployments.
- Open source and API access: both models are released under the Apache 2.0 license and can be downloaded from Hugging Face. Mistral AI also provides an API that lets developers integrate Voxtral's speech intelligence into their applications with simple API calls.
- Competitive pricing: API pricing starts at $0.001 per minute, designed to make high-quality speech transcription and understanding affordable at scale.
- Long audio processing: a 32k-token context length allows up to 30 minutes of audio for transcription or up to 40 minutes for comprehension tasks.
- Built-in Q&A and summarization: ask questions about audio content or generate structured summaries directly, with no need to chain multiple models together.
- Multi-language support: validated on benchmarks such as FLEURS and Mozilla Common Voice, Voxtral performs strongly across many languages, and is particularly strong in European languages, supporting English, French, German, Spanish, Italian, Portuguese, Dutch, and Hindi, among others.
- Local deployment and customization: offers enterprise customers on-premises deployment options, as well as fine-tuning and extensions for specific domains such as speaker recognition, emotion detection, and speaker diarization.
- Retained text-processing capabilities: Voxtral keeps the text abilities of its language-model backbone (Mistral Small 3.1), allowing seamless switching between speech and text tasks.
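As a rough sanity check of the stated limits (my own back-of-the-envelope arithmetic, not a published figure), a 32k-token context spread over 40 minutes of audio implies about 800 tokens per minute of audio, and a 30-minute transcription at that rate would still leave several thousand tokens of headroom for text:

```python
# Back-of-the-envelope arithmetic on the context budget.
# Illustrative only; the actual tokens-per-second rate of the audio
# encoder is not stated in this article.
CONTEXT_TOKENS = 32_000
COMPREHENSION_MINUTES = 40
TRANSCRIPTION_MINUTES = 30

tokens_per_minute = CONTEXT_TOKENS / COMPREHENSION_MINUTES  # 800.0
tokens_per_second = tokens_per_minute / 60                  # ~13.3

# If transcription consumed audio tokens at the same rate, a 30-minute
# file would leave this much of the context for prompt + output text:
audio_tokens_30min = tokens_per_minute * TRANSCRIPTION_MINUTES  # 24000.0
text_budget = CONTEXT_TOKENS - audio_tokens_30min               # 8000.0

print(tokens_per_minute, round(tokens_per_second, 1), text_budget)
```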
Usage Guide
Voxtral is designed to give developers and organizations flexible, powerful speech understanding capabilities. Depending on your needs, there are several ways to use it.
1. Rapid Integration via the API
For developers looking to quickly add voice intelligence to existing applications, the API provided by Mistral AI is the most straightforward option.
Procedure:
- Get an API key: first, register on Mistral AI's official platform and obtain an API key.
- Read the API documentation: visit the official Mistral AI documentation and find the section on the Voxtral API. It explains in detail how to call the API, including the request format, required parameters, and the structure of the returned data.
- Make API requests:
- Transcription endpoint: if you simply need to convert speech to text, use the highly optimized transcription-only endpoint provided by Mistral AI. This is usually the most cost-effective option. You send the audio file to the specified URL as part of the request.
- Understanding and Q&A: for more advanced functionality, such as asking questions about audio content or generating summaries, call the API endpoints that support these features. In addition to the audio file, the request may need extra parameters such as the question to ask or an instruction to generate a summary.
- Process the response: the API returns JSON containing the transcribed text, the answer to your question, or the generated summary. Your application parses this JSON to extract the information it needs.
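The request/response flow above can be sketched in Python. This is a minimal, hypothetical sketch: the endpoint path (`/audio/transcriptions`), the model id (`voxtral-mini-latest`), and the `"text"` response field are my assumptions for illustration, so consult Mistral AI's official API documentation for the real contract.

```python
# Hypothetical sketch of a transcription API call. Endpoint path,
# model id, and response field are ASSUMPTIONS for illustration only.
import os

API_BASE = "https://api.mistral.ai/v1"  # assumed base URL

def build_transcription_request(api_key: str, model: str = "voxtral-mini-latest"):
    """Assemble the headers and form fields for a multipart audio upload."""
    headers = {"Authorization": f"Bearer {api_key}"}
    data = {"model": model}  # assumed model id
    return headers, data

def transcribe(audio_path: str, api_key: str) -> str:
    """POST the audio file and return the transcribed text."""
    import requests  # imported lazily so the sketch loads without requests installed
    headers, data = build_transcription_request(api_key)
    with open(audio_path, "rb") as f:
        resp = requests.post(
            f"{API_BASE}/audio/transcriptions",  # assumed endpoint path
            headers=headers,
            data=data,
            files={"file": (os.path.basename(audio_path), f)},
            timeout=120,
        )
    resp.raise_for_status()
    return resp.json()["text"]  # assumed response shape
```

Your application would then parse or store the returned text, or send a follow-up request with a question or summarization instruction for the understanding endpoints.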
Example scenario: a customer-service application can use the Voxtral API to transcribe a customer's voice message to text in real time, then use the summarization function to quickly generate a service ticket, greatly improving processing efficiency.
2. Local Deployment and Operation
For enterprises and researchers who need data privacy, offline operation, or deep customization, Voxtral's open-source models can be downloaded and run directly on local servers or edge devices.
Installation and deployment:
- Environment preparation:
- You will need a server or workstation with sufficient compute resources (especially GPUs). Exact hardware requirements depend on the model version you choose (the 24B version requires a higher-end configuration).
- Install a Python environment and the necessary machine-learning libraries, such as PyTorch and Transformers.
- Download the model:
- Visit the Hugging Face website (huggingface.co).
- Search for "Voxtral" or "Mistral AI".
- Select the model version you need (Voxtral 24B or Voxtral Mini 3B) and download the model weights.
- Write loading and inference code:
- Using Hugging Face's Transformers library, you can easily load the downloaded model.
- Write Python scripts that load audio files, preprocess them, and feed them to the model for inference.
- The inference output will be the transcribed text or the model's understanding of the audio content.
- A typical workflow:
- Load audio: use a library such as librosa to load your audio files.
- Preprocess: convert the sample rate and format of the audio data to match the model's requirements.
- Run inference: call the loaded Voxtral model's forward pass to get the output.
- Post-process: decode the model's output into human-readable text.
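The preprocessing step of this workflow can be sketched with a naive resampler (speech models commonly expect 16 kHz mono input, an assumption here; production code would use librosa or torchaudio). The checkpoint id and class names in the commented model-loading section are likewise assumptions based on Hugging Face conventions, so verify them against the model card:

```python
# Sketch of the local workflow: load -> preprocess -> infer -> decode.
import numpy as np

TARGET_SR = 16_000  # common input rate for speech models (assumed here)

def resample_linear(audio: np.ndarray, orig_sr: int, target_sr: int = TARGET_SR) -> np.ndarray:
    """Naive linear-interpolation resampler for the preprocessing step.
    (Production code would use librosa.resample or torchaudio instead.)"""
    if orig_sr == target_sr:
        return audio
    duration = len(audio) / orig_sr
    n_target = int(round(duration * target_sr))
    old_t = np.linspace(0.0, duration, num=len(audio), endpoint=False)
    new_t = np.linspace(0.0, duration, num=n_target, endpoint=False)
    return np.interp(new_t, old_t, audio).astype(audio.dtype)

if __name__ == "__main__":
    # Model loading and inference (requires a GPU and downloaded weights).
    # Class and checkpoint names are assumptions; check the model card:
    # from transformers import AutoProcessor, AutoModel
    # processor = AutoProcessor.from_pretrained("mistralai/Voxtral-Mini-3B-2507")
    # model = AutoModel.from_pretrained("mistralai/Voxtral-Mini-3B-2507",
    #                                   device_map="auto")
    # ...preprocess with resample_linear, run model.generate, then decode
    # the output ids back into text with the processor.
    pass
```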
Example scenario: a news organization can deploy Voxtral on its internal servers to rapidly transcribe recorded interviews, letting journalists work entirely locally without uploading sensitive recordings to the cloud.
3. Experience It in Le Chat
For everyday users, the easiest way to try Voxtral is through Mistral AI's chat app, Le Chat.
Procedure:
- Visit the web version of Le Chat or download its mobile app.
- Switch to voice mode.
- You can record your voice directly, or upload an existing audio file.
- Le Chat will use Voxtral to transcribe your speech into text and display it. You can then ask it to summarize the content or answer questions about the audio.
This approach is ideal for quickly testing the model's capabilities or for lightweight personal tasks such as capturing meeting notes or organizing class notes.
Application Scenarios
- Customer service automation: transcribe customer-service calls or voice messages and automatically generate summaries or work orders, improving response speed and efficiency.
- Content creation and media: quickly transcribe audio from interviews, podcasts, or conferences into transcripts for post-processing and distribution by reporters, editors, and content creators.
- Meeting transcription and analysis: transcribe meetings in real time, with the ability to generate minutes and extract key decisions and to-do lists on instruction.
- Edge computing and IoT devices: deploy the Voxtral Mini model on smart-home, in-vehicle, or industrial IoT devices to enable local voice control and interaction without relying on cloud connectivity.
- Multilingual content processing: process and analyze audio data from different countries and regions, e.g. analyzing multilingual user feedback in international market research.
FAQ
- How is Voxtral different from other speech-recognition tools on the market?
Voxtral's biggest difference is that it not only performs highly accurate transcription but also natively supports deep semantic understanding of audio. Users can ask questions about audio directly or have it generate summaries, with no need to transcribe first and then feed the text into a separate language model. It also offers strong performance under an open-source license at a highly competitive price, lowering the barrier to adopting high-quality speech intelligence.
- Do I need strong programming skills to use Voxtral?
Not necessarily. Non-technical users can try Voxtral directly through Mistral AI's Le Chat app. For developers, the API is also relatively easy to use by following the documentation. Deploying the open-source model locally, on the other hand, does require some programming and machine-learning background.
- What languages does Voxtral support?
Voxtral supports multiple languages, including English, French, German, Spanish, Italian, Portuguese, Dutch, and Hindi. According to benchmark results published by Mistral AI, it performs very well across many languages, especially European ones.
- Is the Voxtral API expensive?
No. Mistral AI's pricing is very competitive: the transcription API starts at $0.001 per minute, well below some mainstream closed-source APIs on the market. This makes high-quality speech transcription and understanding economically feasible at scale.