qwen3-livetranslate-flash-realtime is a vision-enhanced model that translates between 18 languages in real time. It processes both audio and image input from a real-time video stream or a local video file, uses visual context to improve accuracy, and outputs high-quality translated text and audio.
For an online demo, try the One-click deployment using Function Compute.
Features
Multi-language support: Supports 18 languages and 6 Chinese dialects, including Chinese, English, French, German, Russian, Japanese, and Korean, as well as Mandarin, Cantonese, and Sichuanese.
Visual enhancement: Uses visual context to improve translation accuracy. The model analyzes visual cues like lip movements, gestures, and on-screen text to produce more accurate translations in noisy environments or for words with multiple meanings.
3-second latency: Delivers simultaneous interpretation in as little as 3 seconds.
Lossless simultaneous interpretation: Uses semantic unit prediction to resolve word order differences between languages, delivering real-time translation quality comparable to offline translation.
Natural voice: Generates a natural, human-like voice by automatically adapting its intonation and emotion to match the source audio.
Hotword configuration: Offers hotword configuration to improve translation accuracy for specific terms.
Procedure
1. Configure the connection
The qwen3-livetranslate-flash-realtime model uses the WebSocket protocol. To establish a connection, you need the following:
| Parameter | Description |
| --- | --- |
| Endpoint | Chinese Mainland: `wss://dashscope.aliyuncs.com/api-ws/v1/realtime`<br>International: `wss://dashscope-intl.aliyuncs.com/api-ws/v1/realtime` |
| Query parameter | `model`, which specifies the model name. Example: `model=qwen3-livetranslate-flash-realtime` |
| Message header | Use a bearer token for authentication: `Authorization: Bearer DASHSCOPE_API_KEY`, where DASHSCOPE_API_KEY is the API key from Model Studio. |
The following Python sample shows how to establish a connection.
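A minimal sketch of the handshake inputs, using only the standard library. The international endpoint comes from the table above; the `model` query-parameter name is an assumption based on the common Realtime WebSocket convention, so confirm it against the API reference.

```python
import os

# Endpoint from the connection table above (International region).
ENDPOINT = "wss://dashscope-intl.aliyuncs.com/api-ws/v1/realtime"
MODEL = "qwen3-livetranslate-flash-realtime"

def connection_params(api_key: str) -> tuple[str, list[str]]:
    """Build the URL and header list in the shape websocket-client expects."""
    url = f"{ENDPOINT}?model={MODEL}"
    headers = [f"Authorization: Bearer {api_key}"]
    return url, headers

url, headers = connection_params(os.environ.get("DASHSCOPE_API_KEY", "sk-xxx"))
print(url)
```

Pass the URL and header list to your WebSocket library, for example `websocket.WebSocketApp(url, header=headers, ...)` with the websocket-client package installed later in this guide.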
2. Set language, output modality, and voice
Send the session.update client event:
Language

- Source language: Use the `session.input_audio_transcription.language` parameter. Default: `en` (English).
- Target language: Use the `session.translation.language` parameter. Default: `en` (English).

For supported values, see Supported languages.

Output source language recognition results

Use the `session.input_audio_transcription.model` parameter to configure this feature. If you set the parameter to `qwen3-asr-flash-realtime`, the server returns the speech recognition result of the input audio (the original text in the source language) along with the translation.

After this feature is enabled, the server returns the following events:

- `conversation.item.input_audio_transcription.text`: Streams the speech recognition results.
- `conversation.item.input_audio_transcription.completed`: Returns the final result when speech recognition is complete.

Output modality

Set the `session.modalities` parameter to `["text"]` (text only) or `["text", "audio"]` (text and audio).

Voice

Configure the voice with the `session.voice` parameter. For more information, see Supported voices.

Hotword

Use the `session.translation.corpus.phrases` parameter to configure hotwords. Hotwords improve translation accuracy for specific terms by specifying key-value pairs that map source language terms to their target language translations. For example, you can map "Inteligencia Artificial" to "Artificial Intelligence".
3. Input audio and images
Send Base64-encoded audio (required) and images (optional) using the input_audio_buffer.append and input_image_buffer.append events.
You can use images from local files or capture them in real time from a video stream.
The server automatically detects speech boundaries and triggers the model to respond.
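The two append events can be built as in the following sketch. The event type names come from this section; the names of the Base64 payload fields (`audio` and `image`) are assumptions, so check the API reference.

```python
import base64
import json

def audio_append_event(pcm_chunk: bytes) -> str:
    """Wrap a raw PCM chunk in an input_audio_buffer.append event."""
    return json.dumps({
        "type": "input_audio_buffer.append",
        "audio": base64.b64encode(pcm_chunk).decode("ascii"),  # assumed field name
    })

def image_append_event(image_bytes: bytes) -> str:
    """Wrap an encoded image frame in an input_image_buffer.append event."""
    return json.dumps({
        "type": "input_image_buffer.append",
        "image": base64.b64encode(image_bytes).decode("ascii"),  # assumed field name
    })
```

Send each JSON string as a text frame over the established WebSocket connection.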
4. Receive the model response
When the server detects the end of the audio input, the model begins to respond. The response format depends on the configured output modality.
Text-only output
The server returns the complete translated text in a response.done event.
Text and audio output
Text
The server returns the complete translated text in a response.audio_transcript.done event.
Audio
The server returns incremental, Base64-encoded audio data in response.audio.delta events.
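A hedged dispatch sketch for these server events follows. The event type names are from this section; the payload field names (`delta` for audio chunks, `transcript` for the final text) are assumptions modeled on common Realtime schemas, so verify them against the API reference.

```python
import base64
import json

def handle_event(message: str, play_audio, emit_text) -> str:
    """Route one server message to an audio or text callback; returns the event type."""
    event = json.loads(message)
    etype = event.get("type", "")
    if etype == "response.audio.delta":
        # Incremental Base64-encoded audio: decode and hand to the player.
        play_audio(base64.b64decode(event["delta"]))
    elif etype == "response.audio_transcript.done":
        # Complete translated text when text-and-audio output is configured.
        emit_text(event.get("transcript", ""))
    return etype
```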
Supported models
| Model | Version | Context window (tokens) | Max input (tokens) | Max output (tokens) |
| --- | --- | --- | --- | --- |
| qwen3-livetranslate-flash-realtime<br>Same capabilities as qwen3-livetranslate-flash-realtime-2025-09-22 | Stable | 53,248 | 49,152 | 4,096 |
| qwen3-livetranslate-flash-realtime-2025-09-22 | Snapshot | 53,248 | 49,152 | 4,096 |
Getting started
Prepare the environment
Requires Python 3.10 or later.
Install PyAudio:

macOS

brew install portaudio && pip install pyaudio

Debian/Ubuntu

sudo apt-get install python3-pyaudio or pip install pyaudio

CentOS

sudo yum install -y portaudio portaudio-devel && pip install pyaudio

Windows

pip install pyaudio

Then, install the WebSocket dependencies:

pip install websocket-client==1.8.0 websockets

Create the client

Create a file named livetranslate_client.py:

Interact with the model

In the same directory as livetranslate_client.py, create a file named main.py:

Run main.py and speak into your microphone. The model outputs translated audio and text in real time.
Improve translation with images
The qwen3-livetranslate-flash-realtime model uses images to improve audio translation. This capability is ideal for scenarios with homonyms or rare proper nouns. You can send a maximum of 2 images per second.
Download the following sample images: medical mask.png and masquerade mask.png.
Download and run the following code in the same directory as livetranslate_client.py. Say "What is mask?" into your microphone. The model uses the provided image to disambiguate the word "mask". For example, using medical mask.png translates the phrase as "What is a medical mask", while using masquerade mask.png translates it as "What is a masquerade mask".
import os
import time
import asyncio
import contextlib
from livetranslate_client import LiveTranslateClient

IMAGE_PATH = "medical mask.png"
# IMAGE_PATH = "masquerade mask.png"

def print_banner():
    print("=" * 60)
    print(" Powered by Qwen qwen3-livetranslate-flash-realtime - Single-turn interaction example (mask)")
    print("=" * 60 + "\n")

async def stream_microphone_once(client: LiveTranslateClient, image_bytes: bytes):
    pa = client.pyaudio_instance
    stream = pa.open(
        format=client.input_format,
        channels=client.input_channels,
        rate=client.input_rate,
        input=True,
        frames_per_buffer=client.input_chunk,
    )
    print("[INFO] Recording started. Please speak...")
    loop = asyncio.get_event_loop()
    last_img_time = 0.0
    frame_interval = 0.5  # 2 fps
    try:
        while client.is_connected:
            # Read microphone data in an executor so the event loop is not blocked.
            data = await loop.run_in_executor(None, stream.read, client.input_chunk)
            await client.send_audio_chunk(data)
            # Append an image frame every 0.5 seconds
            now = time.time()
            if now - last_img_time >= frame_interval:
                await client.send_image_frame(image_bytes)
                last_img_time = now
    finally:
        stream.stop_stream()
        stream.close()

async def main():
    print_banner()
    api_key = os.environ.get("DASHSCOPE_API_KEY")
    if not api_key:
        print("[ERROR] First, configure the API key in the DASHSCOPE_API_KEY environment variable.")
        return
    client = LiveTranslateClient(api_key=api_key, target_language="zh", voice="Cherry", audio_enabled=True)

    def on_text(text: str):
        print(text, end="", flush=True)

    message_task = None
    try:
        await client.connect()
        client.start_audio_player()
        message_task = asyncio.create_task(client.handle_server_messages(on_text))
        with open(IMAGE_PATH, "rb") as f:
            img_bytes = f.read()
        await stream_microphone_once(client, img_bytes)
        await asyncio.sleep(15)
    finally:
        await client.close()
        # Guard against the case where connect() fails before the task is created.
        if message_task is not None and not message_task.done():
            message_task.cancel()
            with contextlib.suppress(asyncio.CancelledError):
                await message_task

if __name__ == "__main__":
    asyncio.run(main())

One-click deployment
An online demo is not available in the console. To deploy the application with one click, follow these steps:
Open the Function Compute template, enter your API key, and click Create and Deploy Default Environment.
Wait for about a minute. In Environment Details > Environment Context, retrieve the endpoint, change the protocol in the URL from http to https (for example, https://qwen-livetranslate-flash-realtime-intl.fcv3.xxx.ap-southeast-1.fc.devsapp.net/), and use the link to access the application.

Important: This link uses a self-signed certificate and is for temporary testing only. Your browser will display a security warning on your first visit. This is expected behavior. Do not use this link in a production environment. To proceed, follow the on-screen instructions, such as clicking Advanced → Proceed to (unsafe).
If you need to grant Resource Access Management permissions, follow the on-screen instructions.
To view the project source code, go to Resource Information > Function Resources.
Both Function Compute and Alibaba Cloud Model Studio provide new users with free quotas sufficient for simple debugging. After the free quotas are used up, pay-as-you-go charges apply only when you access the service.
Interaction flow
Real-time speech translation follows a standard WebSocket event-driven model. The server automatically detects speech boundaries and responds.
| Lifecycle | Client event | Server event |
| --- | --- | --- |
| Session initialization | `session.update`: Update the session configuration. | `session.created`: Session created.<br>`session.updated`: Session configuration updated. |
| User audio input | `input_audio_buffer.append`: Append audio to the buffer.<br>`input_image_buffer.append`: Append an image to the buffer. | None |
| Server audio output | None | `response.created`: The server has started to generate a response.<br>`response.output_item.added`: A new output item is available.<br>`response.content_part.added`: A new content part was added to the assistant message.<br>`response.audio_transcript.text`: An incremental update to the text transcript.<br>`response.audio.delta`: An incremental chunk of the synthesized audio.<br>`response.audio_transcript.done`: The full text transcript is complete.<br>`response.audio.done`: The synthesized audio is complete.<br>`response.content_part.done`: A text or audio content part for the assistant message is complete.<br>`response.output_item.done`: The entire output item for the assistant message is complete.<br>`response.done`: The entire response is complete. |
API reference
For details, see Qwen-Livetranslate-Realtime.
Billing
Audio: Each second of audio input or output consumes 12.5 tokens.
Image: Every 28×28 pixels consumes 0.5 tokens.
Text: When source language speech recognition is enabled, the service returns a transcript of the input audio (the original source language text) in addition to the translation. This transcript is billed as output tokens.
For token pricing, see the Model list.
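The billing rules above can be sketched as simple helpers. The rates (12.5 tokens per second of audio, 0.5 tokens per 28×28 pixel patch) come from this section; rounding partial patches up is an assumption, so confirm it against the official billing rules.

```python
import math

def audio_tokens(seconds: float) -> float:
    """Tokens consumed by audio input or output at 12.5 tokens/second."""
    return seconds * 12.5

def image_tokens(width: int, height: int) -> float:
    """Tokens consumed by one image at 0.5 tokens per 28x28 patch (assumed ceil)."""
    patches = math.ceil(width / 28) * math.ceil(height / 28)
    return patches * 0.5

print(audio_tokens(60))        # one minute of audio -> 750.0
print(image_tokens(224, 224))  # an 8 x 8 grid of patches -> 32.0
```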
Supported languages
Use the following language codes to specify source and target languages.
Some target languages support text output only.
| Language code | Language | Output |
| --- | --- | --- |
en | English | Audio + text |
zh | Chinese | Audio + text |
ru | Russian | Audio + text |
fr | French | Audio + text |
de | German | Audio + text |
pt | Portuguese | Audio + text |
es | Spanish | Audio + text |
it | Italian | Audio + text |
id | Indonesian | Text |
ko | Korean | Audio + text |
ja | Japanese | Audio + text |
vi | Vietnamese | Text |
th | Thai | Text |
ar | Arabic | Text |
yue | Cantonese | Audio + text |
hi | Hindi | Text |
el | Greek | Text |
tr | Turkish | Text |
Supported voices
| Name | Description | Languages |
| --- | --- | --- |
| Cherry | A friendly, conversational female voice with a sunny, positive tone. | Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean |
| Nofish | A casual male voice with a non-retroflex Mandarin accent. | Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean |
| Jada | An energetic female voice with a Shanghai accent. | Chinese |
| Dylan | A young male voice with a Beijing accent. | Chinese |
| Sunny | A warm, sweet female voice with a Sichuan accent. | Chinese |
| Peter | A male comedic voice with a Tianjin accent. | Chinese |
| Kiki | A sweet female voice in Cantonese. | Cantonese |
| Eric | An upbeat male voice with a Chengdu (Sichuan) accent. | Chinese |