qwen3-livetranslate-flash-realtime is a vision-enhanced model that translates between 18 languages in real time. It processes both audio and image input from a real-time video stream or a local video file, uses visual context to improve accuracy, and outputs high-quality translated text and audio.
For an online demo, try the One-click deployment using Function Compute.
Features
Multi-language support: Supports 18 languages and 6 Chinese dialects, including Chinese, English, French, German, Russian, Japanese, and Korean, as well as Mandarin, Cantonese, and Sichuanese.
Visual enhancement: Uses visual context to improve translation accuracy. The model analyzes visual cues like lip movements, gestures, and on-screen text to produce more accurate translations in noisy environments or for words with multiple meanings.
3-second latency: Delivers simultaneous interpretation in as little as 3 seconds.
Lossless simultaneous interpretation: Uses semantic unit prediction to resolve word order differences between languages, delivering real-time translation quality comparable to offline translation.
Natural voice: Generates a natural, human-like voice by automatically adapting its intonation and emotion to match the source audio.
Hotword configuration: Offers hotword configuration to improve translation accuracy for specific terms.
Procedure
1. Configure the connection
The qwen3-livetranslate-flash-realtime model uses the WebSocket protocol. To establish a connection, you need the following:
| Parameter | Description |
| --- | --- |
| Endpoint | Chinese Mainland: `wss://dashscope.aliyuncs.com/api-ws/v1/realtime`<br>International: `wss://dashscope-intl.aliyuncs.com/api-ws/v1/realtime` |
| Query parameter | `model`, which specifies the model name. Example: `model=qwen3-livetranslate-flash-realtime` |
| Message header | Use a bearer token for authentication: `Authorization: Bearer DASHSCOPE_API_KEY`, where DASHSCOPE_API_KEY is the API key from Model Studio. |
The following Python sample shows how to establish a connection.
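A minimal sketch of the handshake inputs, using only the standard library. The international endpoint comes from the table above; the `model` query-parameter name is an assumption based on the common Realtime WebSocket convention, so confirm it against the API reference.

```python
import os

# Endpoint from the connection table above (International region).
ENDPOINT = "wss://dashscope-intl.aliyuncs.com/api-ws/v1/realtime"
MODEL = "qwen3-livetranslate-flash-realtime"

def connection_params(api_key: str) -> tuple[str, list[str]]:
    """Build the URL and header list in the shape websocket-client expects."""
    url = f"{ENDPOINT}?model={MODEL}"
    headers = [f"Authorization: Bearer {api_key}"]
    return url, headers

url, headers = connection_params(os.environ.get("DASHSCOPE_API_KEY", "sk-xxx"))
print(url)
```

Pass the URL and header list to your WebSocket library, for example `websocket.WebSocketApp(url, header=headers, ...)` with the websocket-client package installed later in this guide.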
2. Set language, output modality, and voice
Send the session.update client event:
Language

- Source language: Use the `session.input_audio_transcription.language` parameter. Default: `en` (English).
- Target language: Use the `session.translation.language` parameter. Default: `en` (English).

For supported values, see Supported languages.

Output source language recognition results

Use the `session.input_audio_transcription.model` parameter to configure this feature. If you set the parameter to `qwen3-asr-flash-realtime`, the server returns the speech recognition result of the input audio (the original text in the source language) along with the translation.

After this feature is enabled, the server returns the following events:

- `conversation.item.input_audio_transcription.text`: Streams the speech recognition results.
- `conversation.item.input_audio_transcription.completed`: Returns the final result when speech recognition is complete.

Output modality

Set the `session.modalities` parameter to `["text"]` (text only) or `["text", "audio"]` (text and audio).

Voice

Configure the voice with the `session.voice` parameter. For more information, see Supported voices.

Hotword

Use the `session.translation.corpus.phrases` parameter to configure hotwords. Hotwords improve translation accuracy for specific terms by specifying key-value pairs that map source language terms to their target language translations. For example, you can map "Inteligencia Artificial" to "Artificial Intelligence".
3. Input audio and images
Send Base64-encoded audio (required) and images (optional) using the input_audio_buffer.append and input_image_buffer.append events.
You can use images from local files or capture them in real time from a video stream.
The server automatically detects speech boundaries and triggers the model to respond.
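The two append events can be built as in the following sketch. The event type names come from this section; the names of the Base64 payload fields (`audio` and `image`) are assumptions, so check the API reference.

```python
import base64
import json

def audio_append_event(pcm_chunk: bytes) -> str:
    """Wrap a raw PCM chunk in an input_audio_buffer.append event."""
    return json.dumps({
        "type": "input_audio_buffer.append",
        "audio": base64.b64encode(pcm_chunk).decode("ascii"),  # assumed field name
    })

def image_append_event(image_bytes: bytes) -> str:
    """Wrap an encoded image frame in an input_image_buffer.append event."""
    return json.dumps({
        "type": "input_image_buffer.append",
        "image": base64.b64encode(image_bytes).decode("ascii"),  # assumed field name
    })
```

Send each JSON string as a text frame over the established WebSocket connection.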
4. Receive the model response
When the server detects the end of the audio input, the model begins to respond. The response format depends on the configured output modality.
Text-only output
The server returns the complete translated text in a response.done event.
Text and audio output
Text
The server returns the complete translated text in a response.audio_transcript.done event.
Audio
The server returns incremental, Base64-encoded audio data in response.audio.delta events.
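A hedged dispatch sketch for these server events follows. The event type names are from this section; the payload field names (`delta` for audio chunks, `transcript` for the final text) are assumptions modeled on common Realtime schemas, so verify them against the API reference.

```python
import base64
import json

def handle_event(message: str, play_audio, emit_text) -> str:
    """Route one server message to an audio or text callback; returns the event type."""
    event = json.loads(message)
    etype = event.get("type", "")
    if etype == "response.audio.delta":
        # Incremental Base64-encoded audio: decode and hand to the player.
        play_audio(base64.b64decode(event["delta"]))
    elif etype == "response.audio_transcript.done":
        # Complete translated text when text-and-audio output is configured.
        emit_text(event.get("transcript", ""))
    return etype
```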
Supported models
| Model | Version | Context window (tokens) | Max input (tokens) | Max output (tokens) |
| --- | --- | --- | --- | --- |
| qwen3-livetranslate-flash-realtime<br>Same capabilities as qwen3-livetranslate-flash-realtime-2025-09-22 | Stable | 53,248 | 49,152 | 4,096 |
| qwen3-livetranslate-flash-realtime-2025-09-22 | Snapshot | 53,248 | 49,152 | 4,096 |
Getting started
Prepare the environment
Requires Python 3.10 or later.
Install PyAudio:

macOS

brew install portaudio && pip install pyaudio

Debian/Ubuntu

sudo apt-get install python3-pyaudio or pip install pyaudio

CentOS

sudo yum install -y portaudio portaudio-devel && pip install pyaudio

Windows

pip install pyaudio

Then, install the WebSocket dependencies:

pip install websocket-client==1.8.0 websockets

Create the client

Create a file named livetranslate_client.py:

Interact with the model

In the same directory as livetranslate_client.py, create a file named main.py:

Run main.py and speak into your microphone. The model outputs translated audio and text in real time.
Improve translation with images
The qwen3-livetranslate-flash-realtime model uses images to improve audio translation. This capability is ideal for scenarios with homonyms or rare proper nouns. You can send a maximum of 2 images per second.
Download the following sample images: medical mask.png and masquerade mask.png.
Download and run the following code in the same directory as livetranslate_client.py. Say "What is mask?" into your microphone. The model uses the provided image to disambiguate the word "mask". For example, using medical mask.png translates the phrase as "What is a medical mask", while using masquerade mask.png translates it as "What is a masquerade mask".
import os
import time
import asyncio
import contextlib
from livetranslate_client import LiveTranslateClient

IMAGE_PATH = "medical mask.png"
# IMAGE_PATH = "masquerade mask.png"

def print_banner():
    print("=" * 60)
    print(" Powered by Qwen qwen3-livetranslate-flash-realtime - Single-turn interaction example (mask)")
    print("=" * 60 + "\n")

async def stream_microphone_once(client: LiveTranslateClient, image_bytes: bytes):
    pa = client.pyaudio_instance
    stream = pa.open(
        format=client.input_format,
        channels=client.input_channels,
        rate=client.input_rate,
        input=True,
        frames_per_buffer=client.input_chunk,
    )
    print("[INFO] Recording started. Please speak...")
    loop = asyncio.get_event_loop()
    last_img_time = 0.0
    frame_interval = 0.5  # 2 fps
    try:
        while client.is_connected:
            # Read microphone data in an executor so the event loop is not blocked.
            data = await loop.run_in_executor(None, stream.read, client.input_chunk)
            await client.send_audio_chunk(data)
            # Append an image frame every 0.5 seconds
            now = time.time()
            if now - last_img_time >= frame_interval:
                await client.send_image_frame(image_bytes)
                last_img_time = now
    finally:
        stream.stop_stream()
        stream.close()

async def main():
    print_banner()
    api_key = os.environ.get("DASHSCOPE_API_KEY")
    if not api_key:
        print("[ERROR] First, configure the API key in the DASHSCOPE_API_KEY environment variable.")
        return
    client = LiveTranslateClient(api_key=api_key, target_language="zh", voice="Cherry", audio_enabled=True)

    def on_text(text: str):
        print(text, end="", flush=True)

    message_task = None
    try:
        await client.connect()
        client.start_audio_player()
        message_task = asyncio.create_task(client.handle_server_messages(on_text))
        with open(IMAGE_PATH, "rb") as f:
            img_bytes = f.read()
        await stream_microphone_once(client, img_bytes)
        await asyncio.sleep(15)
    finally:
        await client.close()
        # Guard against the case where connect() fails before the task is created.
        if message_task is not None and not message_task.done():
            message_task.cancel()
            with contextlib.suppress(asyncio.CancelledError):
                await message_task

if __name__ == "__main__":
    asyncio.run(main())

One-click deployment
An online demo is not available in the console. To deploy the application with one click, follow these steps:
Open the Function Compute template, enter your API key, and click Create and Deploy Default Environment.
Wait for about a minute. In Environment Details > Environment Context, retrieve the endpoint, change the protocol in the URL from http to https (for example, https://qwen-livetranslate-flash-realtime-intl.fcv3.xxx.ap-southeast-1.fc.devsapp.net/), and use the link to access the application.

Important: This link uses a self-signed certificate and is for temporary testing only. Your browser will display a security warning on your first visit. This is expected behavior. Do not use this link in a production environment. To proceed, follow the on-screen instructions, such as clicking Advanced → Proceed to (unsafe).
If you need to grant Resource Access Management permissions, follow the on-screen instructions.
To view the project source code, go to Resource Information > Function Resources.
Both Function Compute and Alibaba Cloud Model Studio provide new users with free quotas sufficient for simple debugging. After the free quotas are used up, pay-as-you-go charges apply only when you access the service.
Interaction flow
Real-time speech translation follows a standard WebSocket event-driven model. The server automatically detects speech boundaries and responds.
| Lifecycle | Client event | Server event |
| --- | --- | --- |
| Session initialization | `session.update`: Update the session configuration. | `session.created`: Session created.<br>`session.updated`: Session configuration updated. |
| User audio input | `input_audio_buffer.append`: Append audio to the buffer.<br>`input_image_buffer.append`: Append an image to the buffer. | None |
| Server audio output | None | `response.created`: The server has started to generate a response.<br>`response.output_item.added`: A new output item is available.<br>`response.content_part.added`: A new content part was added to the assistant message.<br>`response.audio_transcript.text`: An incremental update to the text transcript.<br>`response.audio.delta`: An incremental chunk of the synthesized audio.<br>`response.audio_transcript.done`: The full text transcript is complete.<br>`response.audio.done`: The synthesized audio is complete.<br>`response.content_part.done`: A text or audio content part for the assistant message is complete.<br>`response.output_item.done`: The entire output item for the assistant message is complete.<br>`response.done`: The entire response is complete. |
API reference
For details, see Qwen-Livetranslate-Realtime.
Billing
Audio: Each second of audio input or output consumes 12.5 tokens.
Image: Every 28×28 pixels consumes 0.5 tokens.
Text: When source language speech recognition is enabled, the service returns a transcript of the input audio (the original source language text) in addition to the translation. This transcript is billed as output tokens.
For token pricing, see the Model list.
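The billing rules above can be sketched as simple helpers. The rates (12.5 tokens per second of audio, 0.5 tokens per 28×28 pixel patch) come from this section; rounding partial patches up is an assumption, so confirm it against the official billing rules.

```python
import math

def audio_tokens(seconds: float) -> float:
    """Tokens consumed by audio input or output at 12.5 tokens/second."""
    return seconds * 12.5

def image_tokens(width: int, height: int) -> float:
    """Tokens consumed by one image at 0.5 tokens per 28x28 patch (assumed ceil)."""
    patches = math.ceil(width / 28) * math.ceil(height / 28)
    return patches * 0.5

print(audio_tokens(60))        # one minute of audio -> 750.0
print(image_tokens(224, 224))  # an 8 x 8 grid of patches -> 32.0
```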
Supported languages
Use the following language codes to specify source and target languages.
Some target languages support text output only.
| Language code | Language | Output |
| --- | --- | --- |
en | English | Audio + text |
zh | Chinese | Audio + text |
ru | Russian | Audio + text |
fr | French | Audio + text |
de | German | Audio + text |
pt | Portuguese | Audio + text |
es | Spanish | Audio + text |
it | Italian | Audio + text |
id | Indonesian | Text |
ko | Korean | Audio + text |
ja | Japanese | Audio + text |
vi | Vietnamese | Text |
th | Thai | Text |
ar | Arabic | Text |
yue | Cantonese | Audio + text |
hi | Hindi | Text |
el | Greek | Text |
tr | Turkish | Text |
Supported voices
| Name | Description | Languages |
| --- | --- | --- |
| Cherry | A friendly, conversational female voice with a sunny, positive tone. | Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean |
| Nofish | A casual male voice with a non-retroflex Mandarin accent. | Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean |
| Jada | An energetic female voice with a Shanghai accent. | Chinese |
| Dylan | A young male voice with a Beijing accent. | Chinese |
| Sunny | A warm, sweet female voice with a Sichuan accent. | Chinese |
| Peter | A male comedic voice with a Tianjin accent. | Chinese |
| Kiki | A sweet female voice in Cantonese. | Cantonese |
| Eric | An upbeat male voice with a Chengdu (Sichuan) accent. | Chinese |