
Alibaba Cloud Model Studio: Real-time audio and video translation - Qwen

Last Updated: Mar 26, 2026

qwen3-livetranslate-flash-realtime is a vision-enhanced model that translates between 18 languages in real time. It processes both audio and image input from a real-time video stream or a local video file, uses visual context to improve accuracy, and outputs high-quality translated text and audio.

To try an online demo, see One-click deployment, which uses Function Compute.

Features

  • Multi-language support: Supports 18 languages and 6 Chinese dialects, including Chinese, English, French, German, Russian, Japanese, and Korean, as well as Mandarin, Cantonese, and Sichuanese.

  • Visual enhancement: Uses visual context to improve translation accuracy. The model analyzes visual cues like lip movements, gestures, and on-screen text to produce more accurate translations in noisy environments or for words with multiple meanings.

  • 3-second latency: Delivers simultaneous interpretation in as little as 3 seconds.

  • Lossless simultaneous interpretation: Uses semantic unit prediction to resolve word order differences between languages, delivering real-time translation quality comparable to offline translation.

  • Natural voice: Generates a natural, human-like voice by automatically adapting its intonation and emotion to match the source audio.

  • Hotword configuration: Offers hotword configuration to improve translation accuracy for specific terms.

Procedure

1. Configure the connection

The qwen3-livetranslate-flash-realtime model uses the WebSocket protocol. To establish a connection, you need the following:

  • Endpoint:

    • Chinese Mainland: wss://dashscope.aliyuncs.com/api-ws/v1/realtime

    • International: wss://dashscope-intl.aliyuncs.com/api-ws/v1/realtime

  • Query parameter: model. Set it to the name of the model you want to access. Example: ?model=qwen3-livetranslate-flash-realtime

  • Message header: Use a bearer token for authentication: Authorization: Bearer DASHSCOPE_API_KEY, where DASHSCOPE_API_KEY is the API key from Model Studio.

The following Python sample shows how to establish a connection.

Python sample code for WebSocket connection

# pip install websocket-client
import json
import websocket
import os

API_KEY = os.getenv("DASHSCOPE_API_KEY")
API_URL = "wss://dashscope-intl.aliyuncs.com/api-ws/v1/realtime?model=qwen3-livetranslate-flash-realtime"

headers = [
    "Authorization: Bearer " + API_KEY
]

def on_open(ws):
    print(f"Connected to server: {API_URL}")
def on_message(ws, message):
    data = json.loads(message)
    print("Received event:", json.dumps(data, indent=2))
def on_error(ws, error):
    print("Error:", error)

ws = websocket.WebSocketApp(
    API_URL,
    header=headers,
    on_open=on_open,
    on_message=on_message,
    on_error=on_error
)

ws.run_forever()

2. Set language, output modality, and voice

Send the session.update client event:

  • Language

    • Source language: Use the session.input_audio_transcription.language parameter. Default: en (English).

    • Target language: Use the session.translation.language parameter. Default: en (English).

    For supported values, see Supported languages.

  • Output source language recognition results

    Use the session.input_audio_transcription.model parameter to configure this feature. If you set the parameter to qwen3-asr-flash-realtime, the server returns the speech recognition result of the input audio (the original text in the source language) along with the translation.

    After this feature is enabled, the server returns the following events:

    • conversation.item.input_audio_transcription.text: Streams the speech recognition results.

    • conversation.item.input_audio_transcription.completed: Returns the final result when speech recognition is complete.

  • Output modality

    Set the session.modalities parameter to ["text"] (text only) or ["text","audio"] (text and audio).

  • Voice

    Configure this with the session.voice parameter. For more information, see Supported voices.

  • Hotword

    Use the session.translation.corpus.phrases parameter to configure hotwords. Hotwords improve the translation accuracy for specific terms by specifying key-value pairs that map source language terms to their target language translations.

    For example, you can map "Inteligencia Artificial" to "Artificial Intelligence".
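Putting these options together, a complete session.update payload might look like the following sketch. All keys shown here appear elsewhere on this page; the specific language choices and hotword pair are illustrative:

```python
import json

# Illustrative session.update event combining the options described above.
# Source language "es" and the hotword pair are example values; replace them
# with your own configuration.
session_update = {
    "type": "session.update",
    "session": {
        "modalities": ["text", "audio"],          # return text and audio
        "voice": "Cherry",                        # see Supported voices
        "input_audio_format": "pcm",
        "output_audio_format": "pcm",
        "input_audio_transcription": {
            "model": "qwen3-asr-flash-realtime",  # also return source-language text
            "language": "es",                     # source language (default: en)
        },
        "translation": {
            "language": "en",                     # target language (default: en)
            "corpus": {
                "phrases": {"Inteligencia Artificial": "Artificial Intelligence"}
            },
        },
    },
}

# Send over an open WebSocket connection, for example:
# await ws.send(json.dumps(session_update))
print(json.dumps(session_update, indent=2, ensure_ascii=False))
```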

3. Input audio and images

Send Base64-encoded audio (required) and images (optional) using the input_audio_buffer.append and input_image_buffer.append events.

You can use images from local files or capture them in real time from a video stream.
The server automatically detects speech boundaries and triggers the model to respond.
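As a minimal sketch, each audio chunk (raw PCM bytes) and each image frame (encoded image bytes, such as PNG or JPEG) is Base64-encoded and wrapped in an append event before being sent over the WebSocket. The helper names below are illustrative:

```python
import base64
import json

def audio_append_event(pcm_chunk: bytes) -> str:
    """Wrap a chunk of raw PCM audio in an input_audio_buffer.append event."""
    return json.dumps({
        "type": "input_audio_buffer.append",
        "audio": base64.b64encode(pcm_chunk).decode(),
    })

def image_append_event(image_bytes: bytes) -> str:
    """Wrap encoded image bytes (e.g. PNG/JPEG) in an input_image_buffer.append event."""
    return json.dumps({
        "type": "input_image_buffer.append",
        "image": base64.b64encode(image_bytes).decode(),
    })

# Both events are sent with ws.send(...). Audio is required; images are optional.
event = json.loads(audio_append_event(b"\x00\x01" * 800))
print(event["type"])  # input_audio_buffer.append
```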

4. Receive the model response

When the server detects the end of the audio input, the model begins to respond. The response format depends on the configured output modality.

Supported models

  • qwen3-livetranslate-flash-realtime (Stable): Same capabilities as qwen3-livetranslate-flash-realtime-2025-09-22. Context window: 53,248 tokens; maximum input: 49,152 tokens; maximum output: 4,096 tokens.

  • qwen3-livetranslate-flash-realtime-2025-09-22 (Snapshot): Context window: 53,248 tokens; maximum input: 49,152 tokens; maximum output: 4,096 tokens.

Getting started

  1. Prepare the environment

    Requires Python 3.10 or later.

    Install pyaudio:

    macOS

    brew install portaudio && pip install pyaudio

    Debian/Ubuntu

    sudo apt-get install python3-pyaudio
    
    or
    
    pip install pyaudio

    CentOS

    sudo yum install -y portaudio portaudio-devel && pip install pyaudio

    Windows

    pip install pyaudio

    Then, install the WebSocket dependencies:

    pip install websocket-client==1.8.0 websockets
  2. Create the client

    Create a file named livetranslate_client.py:

    Client code - livetranslate_client.py

    import os
    import time
    import base64
    import asyncio
    import json
    import websockets
    import pyaudio
    import queue
    import threading
    import traceback
    
    class LiveTranslateClient:
        def __init__(self, api_key: str, target_language: str = "en", voice: str | None = "Cherry", *, audio_enabled: bool = True):
            if not api_key:
                raise ValueError("API key cannot be empty.")
                
            self.api_key = api_key
            self.target_language = target_language
            self.audio_enabled = audio_enabled
            self.voice = voice if audio_enabled else "Cherry"
            self.ws = None
            self.api_url = "wss://dashscope-intl.aliyuncs.com/api-ws/v1/realtime?model=qwen3-livetranslate-flash-realtime"
            
            # Audio input configuration (from microphone)
            self.input_rate = 16000
            self.input_chunk = 1600
            self.input_format = pyaudio.paInt16
            self.input_channels = 1
            
            # Audio output configuration (for playback)
            self.output_rate = 24000
            self.output_chunk = 2400
            self.output_format = pyaudio.paInt16
            self.output_channels = 1
            
            # State management
            self.is_connected = False
            self.audio_player_thread = None
            self.audio_playback_queue = queue.Queue()
            self.pyaudio_instance = pyaudio.PyAudio()
    
        async def connect(self):
            """Establish a WebSocket connection to the translation service."""
            headers = {"Authorization": f"Bearer {self.api_key}"}
            try:
                self.ws = await websockets.connect(self.api_url, additional_headers=headers)
                self.is_connected = True
                print(f"Successfully connected to the server: {self.api_url}")
                await self.configure_session()
            except Exception as e:
                print(f"Connection failed: {e}")
                self.is_connected = False
                raise
    
        async def configure_session(self):
            """Configure the translation session, setting the target language, voice, and other parameters."""
            config = {
                "event_id": f"event_{int(time.time() * 1000)}",
                "type": "session.update",
                "session": {
                    # 'modalities' controls the output type.
                    # ["text", "audio"]: Returns both translated text and synthesized audio (recommended).
                    # ["text"]: Returns only the translated text.
                    "modalities": ["text", "audio"] if self.audio_enabled else ["text"],
                    **({"voice": self.voice} if self.audio_enabled and self.voice else {}),
                    "input_audio_format": "pcm",
                    "output_audio_format": "pcm",
                    # 'input_audio_transcription' configures source language recognition.
                    # Set 'model' to 'qwen3-asr-flash-realtime' to also output the source language recognition result.
                    # "input_audio_transcription": {
                    #     "model": "qwen3-asr-flash-realtime",
                    #     "language": "zh"  # Source language, default is 'en'
                    # },
                    "translation": {
                        "language": self.target_language,
                        # 'corpus' configures hotwords (custom vocabulary) to improve the translation accuracy of specific terms.
                        # "corpus": {
                        #     "phrases": {
                        #         "人工智能": "Artificial Intelligence",
                        #         "机器学习": "Machine Learning"
                        #     }
                        # }
                    }
                }
            }
            print(f"Sending session configuration: {json.dumps(config, indent=2, ensure_ascii=False)}")
            await self.ws.send(json.dumps(config))
    
        async def send_audio_chunk(self, audio_data: bytes):
            """Encode and send an audio chunk to the server."""
            if not self.is_connected:
                return
                
            event = {
                "event_id": f"event_{int(time.time() * 1000)}",
                "type": "input_audio_buffer.append",
                "audio": base64.b64encode(audio_data).decode()
            }
            await self.ws.send(json.dumps(event))
    
        async def send_image_frame(self, image_bytes: bytes, *, event_id: str | None = None):
            """Send an image frame to the server."""
            if not self.is_connected:
                return
    
            if not image_bytes:
                raise ValueError("image_bytes cannot be empty")
    
            # Encode to Base64
            image_b64 = base64.b64encode(image_bytes).decode()
    
            event = {
                "event_id": event_id or f"event_{int(time.time() * 1000)}",
                "type": "input_image_buffer.append",
                "image": image_b64,
            }
    
            await self.ws.send(json.dumps(event))
    
        def _audio_player_task(self):
            stream = self.pyaudio_instance.open(
                format=self.output_format,
                channels=self.output_channels,
                rate=self.output_rate,
                output=True,
                frames_per_buffer=self.output_chunk,
            )
            try:
                while self.is_connected or not self.audio_playback_queue.empty():
                    try:
                        audio_chunk = self.audio_playback_queue.get(timeout=0.1)
                        if audio_chunk is None: # Termination signal
                            break
                        stream.write(audio_chunk)
                        self.audio_playback_queue.task_done()
                    except queue.Empty:
                        continue
            finally:
                stream.stop_stream()
                stream.close()
    
        def start_audio_player(self):
            """Start the audio player thread (only when audio output is enabled)."""
            if not self.audio_enabled:
                return
            if self.audio_player_thread is None or not self.audio_player_thread.is_alive():
                self.audio_player_thread = threading.Thread(target=self._audio_player_task, daemon=True)
                self.audio_player_thread.start()
    
        async def handle_server_messages(self, on_text_received):
            """Handle incoming messages from the server."""
            try:
                async for message in self.ws:
                    event = json.loads(message)
                    event_type = event.get("type")
                    if event_type == "response.audio.delta" and self.audio_enabled:
                        audio_b64 = event.get("delta", "")
                        if audio_b64:
                            audio_data = base64.b64decode(audio_b64)
                            self.audio_playback_queue.put(audio_data)
    
                    elif event_type == "response.done":
                        print("\n[INFO] Response round complete.")
                        usage = event.get("response", {}).get("usage", {})
                        if usage:
                            print(f"[INFO] token usage: {json.dumps(usage, indent=2, ensure_ascii=False)}")
                    # Process source language recognition results (requires enabling input_audio_transcription.model)
                    # elif event_type == "conversation.item.input_audio_transcription.text":
                    #     stash = event.get("stash", "")  # Pending recognition text
                    #     print(f"[Recognizing] {stash}")
                    # elif event_type == "conversation.item.input_audio_transcription.completed":
                    #     transcript = event.get("transcript", "")  # Complete recognition result
                    #     print(f"[Source language] {transcript}")
                    elif event_type == "response.audio_transcript.done":
                        print("\n[INFO] Text translation complete.")
                        text = event.get("transcript", "")
                        if text:
                            print(f"[INFO] Translated text: {text}")
                            on_text_received(text)
                    elif event_type == "response.text.done":
                        print("\n[INFO] Text translation complete.")
                        text = event.get("text", "")
                        if text:
                            print(f"[INFO] Translated text: {text}")
                            on_text_received(text)
    
            except websockets.exceptions.ConnectionClosed as e:
                print(f"[WARNING] Connection closed: {e}")
                self.is_connected = False
            except Exception as e:
                print(f"[ERROR] An unknown error occurred while processing messages: {e}")
                traceback.print_exc()
                self.is_connected = False
    
        async def start_microphone_streaming(self):
            """Capture audio from the microphone and stream it to the server."""
            stream = self.pyaudio_instance.open(
                format=self.input_format,
                channels=self.input_channels,
                rate=self.input_rate,
                input=True,
                frames_per_buffer=self.input_chunk
            )
            print("Microphone is on. Start speaking...")
            try:
                while self.is_connected:
                    audio_chunk = await asyncio.get_event_loop().run_in_executor(
                        None, stream.read, self.input_chunk
                    )
                    await self.send_audio_chunk(audio_chunk)
            finally:
                stream.stop_stream()
                stream.close()
    
        async def close(self):
            """Gracefully close the connection and release resources."""
            self.is_connected = False
            if self.ws:
                await self.ws.close()
                print("WebSocket connection closed.")
            
            if self.audio_player_thread:
                self.audio_playback_queue.put(None) # Send termination signal
                self.audio_player_thread.join(timeout=1)
                print("Audio player thread stopped.")
                
            self.pyaudio_instance.terminate()
            print("PyAudio instance released.")
  3. Interact with the model

    In the same directory as livetranslate_client.py, create a file named main.py:

    main.py

    import os
    import asyncio
    from livetranslate_client import LiveTranslateClient
    
    def print_banner():
        print("=" * 60)
        print("  Powered by Qwen qwen3-livetranslate-flash-realtime")
        print("=" * 60 + "\n")
    
    def get_user_config():
        """Prompt the user for settings and return the configuration."""
        print("Select a mode:")
        print("1. Voice + Text [Default] | 2. Text Only")
        mode_choice = input("Enter your choice (press Enter for Voice + Text): ").strip()
        audio_enabled = (mode_choice != "2")
    
        if audio_enabled:
            lang_map = {
                "1": "en", "2": "zh", "3": "ru", "4": "fr", "5": "de", "6": "pt",
                "7": "es", "8": "it", "9": "ko", "10": "ja", "11": "yue"
            }
            print("Select the target translation language (Voice + Text mode):")
            print("1. English | 2. Chinese | 3. Russian | 4. French | 5. German | 6. Portuguese | 7. Spanish | 8. Italian | 9. Korean | 10. Japanese | 11. Cantonese")
        else:
            lang_map = {
                "1": "en", "2": "zh", "3": "ru", "4": "fr", "5": "de", "6": "pt", "7": "es", "8": "it",
                "9": "id", "10": "ko", "11": "ja", "12": "vi", "13": "th", "14": "ar",
                "15": "yue", "16": "hi", "17": "el", "18": "tr"
            }
            print("Select the target translation language (Text Only mode):")
            print("1. English | 2. Chinese | 3. Russian | 4. French | 5. German | 6. Portuguese | 7. Spanish | 8. Italian | 9. Indonesian | 10. Korean | 11. Japanese | 12. Vietnamese | 13. Thai | 14. Arabic | 15. Cantonese | 16. Hindi | 17. Greek | 18. Turkish")
    
        choice = input("Enter your choice (defaults to the first option): ").strip()
        target_language = lang_map.get(choice, next(iter(lang_map.values())))
    
        voice = None
        if audio_enabled:
            print("\nSelect a speech synthesis voice:")
            voice_map = {"1": "Cherry", "2": "Nofish", "3": "Sunny", "4": "Jada", "5": "Dylan", "6": "Peter", "7": "Eric", "8": "Kiki"}
            print("1. Cherry (Female) [Default] | 2. Nofish (Male) | 3. Sunny (Sichuan Female) | 4. Jada (Shanghai Female) | 5. Dylan (Beijing Male) | 6. Peter (Tianjin Male) | 7. Eric (Sichuan Male) | 8. Kiki (Cantonese Female)")
            voice_choice = input("Enter your choice (press Enter for Cherry): ").strip()
            voice = voice_map.get(voice_choice, "Cherry")
        return target_language, voice, audio_enabled
    
    async def main():
        """Main program entry point."""
        print_banner()
        
        api_key = os.environ.get("DASHSCOPE_API_KEY")
        if not api_key:
            print("[ERROR] The DASHSCOPE_API_KEY environment variable is not set.")
            print("  For example: export DASHSCOPE_API_KEY='your_api_key_here'")
            return
            
        target_language, voice, audio_enabled = get_user_config()
        print("\nConfiguration complete:")
        print(f"  - Target language: {target_language}")
        if audio_enabled:
            print(f"  - Synthesized voice: {voice}")
        else:
            print("  - Output mode: Text Only")
        
        client = LiveTranslateClient(api_key=api_key, target_language=target_language, voice=voice, audio_enabled=audio_enabled)
        
        # Define the callback function
        def on_translation_text(text):
            print(text, end="", flush=True)
    
        try:
            print("Connecting to the translation service...")
            await client.connect()
            
            # Start the audio player thread; playback occurs only if audio output is enabled.
            client.start_audio_player()
            
            print("\n" + "-" * 60)
            print("Connection successful! Speak into the microphone.")
            print("The program will translate your speech in real time and play the result. Press Ctrl+C to exit.")
            print("-" * 60 + "\n")
    
            # Run message handling and microphone recording concurrently
            message_handler = asyncio.create_task(client.handle_server_messages(on_translation_text))
            tasks = [message_handler]
            # Capture audio from the microphone for translation, regardless of whether audio output is enabled
            microphone_streamer = asyncio.create_task(client.start_microphone_streaming())
            tasks.append(microphone_streamer)
    
            await asyncio.gather(*tasks)
    
        except KeyboardInterrupt:
            print("\n\nUser interrupted. Exiting...")
        except Exception as e:
            print(f"\nA critical error occurred: {e}")
        finally:
            print("\nCleaning up resources...")
            await client.close()
            print("Program exited.")
    
    if __name__ == "__main__":
        asyncio.run(main())

    Run main.py and speak into your microphone. The model outputs translated audio and text in real time.

Improve translation with images

The qwen3-livetranslate-flash-realtime model uses images to improve audio translation. This capability is ideal for scenarios with homonyms or rare proper nouns. You can send a maximum of 2 images per second.

Download the following sample images: medical mask.png and masquerade mask.png.

Download and run the following code in the same directory as livetranslate_client.py. Say "What is mask?" into your microphone. The model uses the provided image to disambiguate the word "mask". For example, using medical mask.png translates the phrase as "What is a medical mask", while using masquerade mask.png translates it as "What is a masquerade mask".

import os
import time
import asyncio
import contextlib

from livetranslate_client import LiveTranslateClient

IMAGE_PATH = "medical mask.png"
# IMAGE_PATH = "masquerade mask.png"

def print_banner():
    print("=" * 60)
    print("  Powered by Qwen qwen3-livetranslate-flash-realtime - Single-turn interaction example (mask)")
    print("=" * 60 + "\n")

async def stream_microphone_once(client: LiveTranslateClient, image_bytes: bytes):
    pa = client.pyaudio_instance
    stream = pa.open(
        format=client.input_format,
        channels=client.input_channels,
        rate=client.input_rate,
        input=True,
        frames_per_buffer=client.input_chunk,
    )
    print("[INFO] Recording started. Please speak...")
    loop = asyncio.get_event_loop()
    last_img_time = 0.0
    frame_interval = 0.5  # 2 fps
    try:
        while client.is_connected:
            data = await loop.run_in_executor(None, stream.read, client.input_chunk)
            await client.send_audio_chunk(data)

            # Append an image frame every 0.5 seconds
            now = time.time()
            if now - last_img_time >= frame_interval:
                await client.send_image_frame(image_bytes)
                last_img_time = now
    finally:
        stream.stop_stream()
        stream.close()

async def main():
    print_banner()
    api_key = os.environ.get("DASHSCOPE_API_KEY")
    if not api_key:
        print("[ERROR] First, configure the API KEY in the DASHSCOPE_API_KEY environment variable.")
        return

    client = LiveTranslateClient(api_key=api_key, target_language="zh", voice="Cherry", audio_enabled=True)

    def on_text(text: str):
        print(text, end="", flush=True)

    message_task = None
    try:
        await client.connect()
        client.start_audio_player()
        message_task = asyncio.create_task(client.handle_server_messages(on_text))
        with open(IMAGE_PATH, "rb") as f:
            img_bytes = f.read()
        await stream_microphone_once(client, img_bytes)
        await asyncio.sleep(15)
    finally:
        await client.close()
        # message_task is None if connect() failed before the task was created.
        if message_task and not message_task.done():
            message_task.cancel()
            with contextlib.suppress(asyncio.CancelledError):
                await message_task

if __name__ == "__main__":
    asyncio.run(main())

One-click deployment

An online demo is not available in the console. To deploy the application with one click, follow these steps:

  1. Open the Function Compute template, enter your API key, and click Create and Deploy Default Environment.

  2. Wait for about a minute. In Environment Details > Environment Context, retrieve the endpoint, change the protocol in the URL from http to https (for example, https://qwen-livetranslate-flash-realtime-intl.fcv3.xxx.ap-southeast-1.fc.devsapp.net/), and use the link to access the application.

    Important

    This link uses a self-signed certificate and is for temporary testing only. Your browser will display a security warning on your first visit. This is expected behavior. Do not use this link in a production environment. To proceed, follow the on-screen instructions, such as clicking Advanced → Proceed to (unsafe).

If you need to grant Resource Access Management permissions, follow the on-screen instructions.
To view the project source code, go to Resource Information > Function Resources.
Both Function Compute and Alibaba Cloud Model Studio provide new users with free quotas sufficient for simple debugging. After the free quotas are used up, pay-as-you-go charges apply only when you access the service.

Interaction flow

Real-time speech translation follows a standard WebSocket event-driven model. The server automatically detects speech boundaries and responds.

Session initialization

  • Client event: session.update (update the session configuration)

  • Server events:

    • session.created: Session created.

    • session.updated: Session configuration updated.

User audio input

  • Client events:

    • input_audio_buffer.append: Append audio to the buffer.

    • input_image_buffer.append: Append an image to the buffer.

  • Server event: None

Server audio output

  • Client event: None

  • Server events:

    • response.created: Indicates that the server has started to generate a response.

    • response.output_item.added: Indicates that a new output item is available.

    • response.content_part.added: Indicates that a new content part was added to the assistant message.

    • response.audio_transcript.text: Contains an incremental update to the text transcript.

    • response.audio.delta: Contains an incremental chunk of the synthesized audio.

    • response.audio_transcript.done: Signals that the full text transcript is complete.

    • response.audio.done: Signals that the synthesized audio is complete.

    • response.content_part.done: Signals that a text or audio content part for the assistant message is complete.

    • response.output_item.done: Signals that the entire output item for the assistant message is complete.

    • response.done: Signals that the entire response is complete.

API reference

For details, see Qwen-Livetranslate-Realtime.

Billing

  • Audio: Each second of audio input or output consumes 12.5 tokens.

  • Image: Every 28×28 pixels consumes 0.5 tokens.

  • Text: When source language speech recognition is enabled, the service returns a transcript of the input audio (the original source language text) in addition to the translation. This transcript is billed as output tokens.

For token pricing, see the Model list.
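The audio and image rules above can be sketched as simple arithmetic. This is a hypothetical helper based only on the rates stated here; the service's actual accounting, including how image dimensions are rounded into 28×28 blocks, may differ:

```python
def audio_tokens(seconds: float) -> float:
    """Audio input or output: 12.5 tokens per second."""
    return 12.5 * seconds

def image_tokens(width: int, height: int) -> float:
    """Images: 0.5 tokens per 28x28-pixel block (simple proportional estimate)."""
    return 0.5 * (width * height) / (28 * 28)

print(audio_tokens(60))                 # 750.0 tokens for one minute of audio
print(round(image_tokens(1280, 720)))   # roughly 588 tokens for a 720p frame
```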

Supported languages

Use the following language codes to specify source and target languages.

Some target languages support text output only.

  • en (English): Audio + text

  • zh (Chinese): Audio + text

  • ru (Russian): Audio + text

  • fr (French): Audio + text

  • de (German): Audio + text

  • pt (Portuguese): Audio + text

  • es (Spanish): Audio + text

  • it (Italian): Audio + text

  • id (Indonesian): Text only

  • ko (Korean): Audio + text

  • ja (Japanese): Audio + text

  • vi (Vietnamese): Text only

  • th (Thai): Text only

  • ar (Arabic): Text only

  • yue (Cantonese): Audio + text

  • hi (Hindi): Text only

  • el (Greek): Text only

  • tr (Turkish): Text only

Supported voices

Set the voice parameter to one of the following names.

  • Cherry: A friendly, conversational female voice with a sunny, positive tone. Languages: Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean

  • Nofish: A casual male voice with a non-retroflex Mandarin accent. Languages: Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean

  • Jada: An energetic female voice with a Shanghai accent. Languages: Chinese

  • Dylan: A young male voice with a Beijing accent. Languages: Chinese

  • Sunny: A warm, sweet female voice with a Sichuan accent. Languages: Chinese

  • Peter: A male comedic voice with a Tianjin accent. Languages: Chinese

  • Kiki: A sweet female voice in Cantonese. Languages: Cantonese

  • Eric: An upbeat male voice with a Chengdu (Sichuan) accent. Languages: Chinese