
Alibaba Cloud Model Studio: Qwen voice design API reference

Last Updated: Mar 31, 2026

Voice design creates custom voices from text descriptions. It supports multilingual and multidimensional voice feature definitions. Voice design and speech synthesis are two sequential steps. This document covers voice design parameters and API details. For speech synthesis, see Real-time speech synthesis - Qwen or Speech synthesis - Qwen.

User guide: For model introductions and selection recommendations, see Real-time speech synthesis - Qwen or Speech synthesis - Qwen.

Language support

Voice design supports multilingual voice creation and speech synthesis for the following languages: Chinese (zh), English (en), German (de), Italian (it), Portuguese (pt), Spanish (es), Japanese (ja), Korean (ko), French (fr), Russian (ru).
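The codes above are the values accepted by the `language` input field. As a minimal client-side sketch (the constant and helper names are illustrative, not part of the API):

```python
# Language codes supported by voice design, per the list above.
SUPPORTED_LANGUAGES = {
    "zh": "Chinese", "en": "English", "de": "German", "it": "Italian",
    "pt": "Portuguese", "es": "Spanish", "ja": "Japanese",
    "ko": "Korean", "fr": "French", "ru": "Russian",
}

def is_supported_language(code: str) -> bool:
    """Return True if the code can be passed as the `language` input field."""
    return code in SUPPORTED_LANGUAGES
```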

How to write high-quality voice descriptions?

Requirements and limitations

When writing a voice description (voice_prompt), follow these technical constraints:

  • Length limit: The voice_prompt content must not exceed 2,048 characters.

  • Supported languages: Description text supports Chinese and English only.
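Both constraints can be checked client-side before calling the API. A minimal sketch, assuming only the documented 2,048-character limit (the function name is illustrative):

```python
MAX_VOICE_PROMPT_CHARS = 2048  # documented upper limit for voice_prompt

def validate_voice_prompt(prompt: str) -> str:
    """Raise ValueError if the description violates the documented limits."""
    if not prompt or not prompt.strip():
        raise ValueError("voice_prompt must not be empty")
    if len(prompt) > MAX_VOICE_PROMPT_CHARS:
        raise ValueError(
            f"voice_prompt exceeds {MAX_VOICE_PROMPT_CHARS} characters "
            f"(got {len(prompt)})"
        )
    return prompt
```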

Core principles

A high-quality voice description (voice_prompt) is key to creating your ideal voice. Think of it as the "blueprint" for voice design—it guides the model to generate voices with specific features.

Follow these core principles when describing a voice:

  1. Be specific, not vague: Use words that describe concrete voice qualities, such as "deep," "crisp," or "fast-paced." Avoid subjective, low-information terms like "nice" or "normal."

  2. Be multidimensional, not single-dimensional: Strong descriptions combine multiple dimensions (e.g., gender, age, emotion). A single-dimension description like "female voice" is too broad to produce a distinctive voice.

  3. Be objective, not subjective: Focus on physical and perceptual voice features—not personal preferences. For example, use "high-pitched and energetic" instead of "my favorite voice."

  4. Be original, not imitative: Describe voice qualities—not requests to mimic specific people (e.g., celebrities or actors). Such requests carry copyright risk and are not supported by the model.

  5. Be concise, not redundant: Ensure every word adds meaning. Avoid repeating synonyms or meaningless intensifiers (e.g., "a very, very great voice").

Reference dimensions for descriptions

  • Gender: Male, female, neutral

  • Age: Child (5–12), teenager (13–18), young adult (19–35), middle-aged (36–55), elderly (55+)

  • Pitch: High, medium, low, high-pitched, low-pitched

  • Pace: Fast, medium, slow, fast-paced, slow-paced

  • Emotion: Cheerful, calm, gentle, serious, lively, composed, soothing

  • Characteristics: Magnetic, crisp, hoarse, mellow, sweet, rich, powerful

  • Purpose: News broadcast, ad voice-over, audiobook, animation character, voice assistant, documentary narration
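These dimensions can be composed into a description programmatically. The following helper is purely illustrative (not part of the API) and assumes one value per dimension:

```python
def build_voice_prompt(gender: str, age: str, pace: str,
                       emotion: str, characteristic: str, purpose: str) -> str:
    """Combine several voice dimensions into a single description string."""
    return (f"A {emotion}, {age} {gender} voice with a {pace} pace "
            f"and a {characteristic} tone, suitable for {purpose}.")
```

For example, `build_voice_prompt("male", "middle-aged", "slow", "calm", "magnetic", "news broadcasting")` yields a multidimensional description in line with the principles above.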

Example comparison

✅ Good cases

  • "A young, lively female voice with a fast pace and noticeable upward intonation, suitable for fashion product introductions."

    Analysis: Combines age, personality, pace, and intonation, and specifies the use case, creating a vivid, three-dimensional voice.

  • "A calm, middle-aged male voice with a slow pace and deep, magnetic tone, suitable for news or documentary narration."

    Analysis: Clearly defines gender, age group, pace, vocal traits, and application area.

  • "A cute child's voice, around 8 years old, with a slightly childish tone, suitable for animation character voice-overs."

    Analysis: Precisely identifies age and vocal trait ("childish"), with a well-defined purpose.

  • "A gentle, intellectual woman in her early 30s with a calm voice, ideal for audiobook narration."

    Analysis: Words like "intellectual" and "calm" clearly communicate emotional tone and stylistic intent.

❌ Bad cases and improvement suggestions

  • "A nice voice"

    Main issue: Too vague and subjective; lacks actionable features.

    Improvement: Add specific dimensions, e.g., "A young female voice with a clear vocal line and gentle tone."

  • "A voice like a certain celebrity"

    Main issue: Involves copyright risk; the model cannot directly mimic celebrities.

    Improvement: Extract and describe the voice traits instead, e.g., "A mature, magnetic male voice with a calm pace."

  • "A very, very, very nice female voice"

    Main issue: Redundant; repetition does not help define voice quality.

    Improvement: Remove repeated words and add effective descriptors, e.g., "A female voice aged 20–24, with a light tone, lively pitch, and sweet quality."

  • "123456"

    Main issue: Invalid input; cannot be parsed into voice features.

    Improvement: Provide a meaningful text description, following the recommended examples above.

Getting started: From voice design to speech synthesis


1. Workflow

Voice design and speech synthesis are two sequential steps. Follow a create-then-use workflow:

  1. Prepare the voice description and preview text for voice design.

    • Voice description (voice_prompt): Defines the target voice’s features (for how to write one, see "How to write high-quality voice descriptions?").

    • Preview text (preview_text): Text for the preview audio generated by the target voice (e.g., "Hello everyone, welcome to listen.").

  2. Call the Create voice API to create a custom voice and get its name and preview audio.

    You must set target_model to the speech synthesis model that drives this voice.

    Listen to the preview audio to check if it meets expectations. If satisfied, proceed to the next step. Otherwise, redesign.

    If you already have a created voice (check via the List voices API), skip this step and go straight to the next.

  3. Use the voice for speech synthesis.

    Call the speech synthesis API and pass in the voice name obtained in the previous step. The speech synthesis model used here must match the target_model from the previous step.
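As a sketch of step 2's output handling, the create-voice response fields used throughout this document (output.voice and the Base64-encoded output.preview_audio.data) can be extracted with a small helper (the function name is illustrative):

```python
import base64

def parse_create_voice_response(result: dict) -> tuple:
    """Extract the voice name and decoded preview audio bytes from a
    create-voice response, per the fields shown in this document."""
    voice_name = result["output"]["voice"]
    audio_bytes = base64.b64decode(result["output"]["preview_audio"]["data"])
    return voice_name, audio_bytes
```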

2. Model configuration and preparations

Select appropriate models and complete preparations.

Model configuration

Voice design involves two models: qwen-voice-design, which creates the custom voice, and the target speech synthesis model (target_model), which must match the model used for subsequent speech synthesis.

Preparations

  1. Get an API key: Get an API key. For security, we recommend storing the API key in an environment variable.

  2. Install the SDK: Make sure you have installed the latest DashScope SDK.
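For reference, a typical setup on macOS or Linux looks like the following ("sk-xxx" is a placeholder for your own key; the Python SDK is shown, as in the samples below):

```shell
# Export the API key for the current shell session
export DASHSCOPE_API_KEY="sk-xxx"

# Install or upgrade the DashScope Python SDK
pip install -U dashscope
```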

3. Sample code

Bidirectional streaming synthesis

Applies to Qwen3-TTS-VD-Realtime series models. See Real-time speech synthesis - Qwen.

  1. Create a custom voice and preview it. If satisfied, proceed. Otherwise, recreate.

    Python

    import requests
    import base64
    import os
    
    def create_voice_and_play():
        # API keys differ between Singapore and Beijing regions. Get an API key: https://www.alibabacloud.com/help/zh/model-studio/get-api-key
        # If the environment variable is not set, replace the following line with your Model Studio API key: api_key = "sk-xxx"
        api_key = os.getenv("DASHSCOPE_API_KEY")
        
        if not api_key:
            print("Error: DASHSCOPE_API_KEY environment variable not found. Please set the API key first.")
            return None, None, None
        
        # Prepare request data
        headers = {
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json"
        }
        
        data = {
            "model": "qwen-voice-design",
            "input": {
                "action": "create",
                "target_model": "qwen3-tts-vd-realtime-2026-01-15",
                "voice_prompt": "A composed middle-aged male announcer with a deep, rich and magnetic voice, a steady speaking speed and clear articulation, is suitable for news broadcasting or documentary commentary.",
                "preview_text": "Dear listeners, hello everyone. Welcome to the evening news.",
                "preferred_name": "announcer",
                "language": "en"
            },
            "parameters": {
                "sample_rate": 24000,
                "response_format": "wav"
            }
        }
        
        # The following is the URL for the Singapore region. If you use a model in the Beijing region, replace the URL with: https://dashscope.aliyuncs.com/api/v1/services/audio/tts/customization
        url = "https://dashscope-intl.aliyuncs.com/api/v1/services/audio/tts/customization"
        
        try:
            # Send the request
            response = requests.post(
                url,
                headers=headers,
                json=data,
                timeout=60  # Add a timeout setting
            )
            
            if response.status_code == 200:
                result = response.json()
                
                # Get the voice name
                voice_name = result["output"]["voice"]
                print(f"Voice name: {voice_name}")
                
                # Get the preview audio data
                base64_audio = result["output"]["preview_audio"]["data"]
                
                # Decode the Base64 audio data
                audio_bytes = base64.b64decode(base64_audio)
                
                # Save the audio file locally
                filename = f"{voice_name}_preview.wav"
                
                # Write the audio data to a local file
                with open(filename, 'wb') as f:
                    f.write(audio_bytes)
                
                print(f"Audio saved to local file: {filename}")
                print(f"File path: {os.path.abspath(filename)}")
                
                return voice_name, audio_bytes, filename
            else:
                print(f"Request failed with status code: {response.status_code}")
                print(f"Response content: {response.text}")
                return None, None, None
                
        except requests.exceptions.RequestException as e:
            print(f"A network request error occurred: {e}")
            return None, None, None
        except KeyError as e:
            print(f"Response data format error, missing required field: {e}")
            print(f"Response content: {response.text if 'response' in locals() else 'No response'}")
            return None, None, None
        except Exception as e:
            print(f"An unknown error occurred: {e}")
            return None, None, None
    
    if __name__ == "__main__":
        print("Starting to create voice...")
        voice_name, audio_data, saved_filename = create_voice_and_play()
        
        if voice_name:
            print(f"\nSuccessfully created voice '{voice_name}'")
            print(f"Audio file saved as: '{saved_filename}'")
            print(f"File size: {os.path.getsize(saved_filename)} bytes")
        else:
            print("\nVoice creation failed")

    Java

    Add the Gson dependency to your project:

    Maven

    Add the following to your pom.xml:

    <!-- https://mvnrepository.com/artifact/com.google.code.gson/gson -->
    <dependency>
        <groupId>com.google.code.gson</groupId>
        <artifactId>gson</artifactId>
        <version>2.13.1</version>
    </dependency>

    Gradle

    Add the following to your build.gradle:

    // https://mvnrepository.com/artifact/com.google.code.gson/gson
    implementation("com.google.code.gson:gson:2.13.1")

    import com.google.gson.JsonObject;
    import com.google.gson.JsonParser;
    import java.io.*;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.util.Base64;
    
    public class Main {
        public static void main(String[] args) {
            Main example = new Main();
            example.createVoice();
        }
    
        public void createVoice() {
            // API keys differ between Singapore and Beijing regions. Get an API key: https://www.alibabacloud.com/help/zh/model-studio/get-api-key
            // If the environment variable is not set, replace the following line with your Model Studio API key: String apiKey = "sk-xxx"
            String apiKey = System.getenv("DASHSCOPE_API_KEY");
    
            // Create the JSON request body string
            String jsonBody = "{\n" +
                    "    \"model\": \"qwen-voice-design\",\n" +
                    "    \"input\": {\n" +
                    "        \"action\": \"create\",\n" +
                    "        \"target_model\": \"qwen3-tts-vd-realtime-2026-01-15\",\n" +
                    "        \"voice_prompt\": \"A composed middle-aged male announcer with a deep, rich and magnetic voice, a steady speaking speed and clear articulation, is suitable for news broadcasting or documentary commentary.\",\n" +
                    "        \"preview_text\": \"Dear listeners, hello everyone. Welcome to the evening news.\",\n" +
                    "        \"preferred_name\": \"announcer\",\n" +
                    "        \"language\": \"en\"\n" +
                    "    },\n" +
                    "    \"parameters\": {\n" +
                    "        \"sample_rate\": 24000,\n" +
                    "        \"response_format\": \"wav\"\n" +
                    "    }\n" +
                    "}";
    
            HttpURLConnection connection = null;
            try {
                // The following is the URL for the Singapore region. If you use a model in the Beijing region, replace the URL with: https://dashscope.aliyuncs.com/api/v1/services/audio/tts/customization
                URL url = new URL("https://dashscope-intl.aliyuncs.com/api/v1/services/audio/tts/customization");
                connection = (HttpURLConnection) url.openConnection();
    
                // Set the request method and headers
                connection.setRequestMethod("POST");
                connection.setRequestProperty("Authorization", "Bearer " + apiKey);
                connection.setRequestProperty("Content-Type", "application/json");
                connection.setDoOutput(true);
                connection.setDoInput(true);
    
                // Send the request body
                try (OutputStream os = connection.getOutputStream()) {
                    byte[] input = jsonBody.getBytes("UTF-8");
                    os.write(input, 0, input.length);
                    os.flush();
                }
    
                // Get the response
                int responseCode = connection.getResponseCode();
                if (responseCode == HttpURLConnection.HTTP_OK) {
                    // Read the response content
                    StringBuilder response = new StringBuilder();
                    try (BufferedReader br = new BufferedReader(
                            new InputStreamReader(connection.getInputStream(), "UTF-8"))) {
                        String responseLine;
                        while ((responseLine = br.readLine()) != null) {
                            response.append(responseLine.trim());
                        }
                    }
    
                    // Parse the JSON response
                    JsonObject jsonResponse = JsonParser.parseString(response.toString()).getAsJsonObject();
                    JsonObject outputObj = jsonResponse.getAsJsonObject("output");
                    JsonObject previewAudioObj = outputObj.getAsJsonObject("preview_audio");
    
                    // Get the voice name
                    String voiceName = outputObj.get("voice").getAsString();
                    System.out.println("Voice name: " + voiceName);
    
                    // Get the Base64-encoded audio data
                    String base64Audio = previewAudioObj.get("data").getAsString();
    
                    // Decode the Base64 audio data
                    byte[] audioBytes = Base64.getDecoder().decode(base64Audio);
    
                    // Save the audio to a local file
                    String filename = voiceName + "_preview.wav";
                    saveAudioToFile(audioBytes, filename);
    
                    System.out.println("Audio saved to local file: " + filename);
    
                } else {
                    // Read the error response
                    StringBuilder errorResponse = new StringBuilder();
                    try (BufferedReader br = new BufferedReader(
                            new InputStreamReader(connection.getErrorStream(), "UTF-8"))) {
                        String responseLine;
                        while ((responseLine = br.readLine()) != null) {
                            errorResponse.append(responseLine.trim());
                        }
                    }
    
                    System.out.println("Request failed with status code: " + responseCode);
                    System.out.println("Error response: " + errorResponse.toString());
                }
    
            } catch (Exception e) {
                System.err.println("An error occurred during the request: " + e.getMessage());
                e.printStackTrace();
            } finally {
                if (connection != null) {
                    connection.disconnect();
                }
            }
        }
    
        private void saveAudioToFile(byte[] audioBytes, String filename) {
            try {
                File file = new File(filename);
                try (FileOutputStream fos = new FileOutputStream(file)) {
                    fos.write(audioBytes);
                }
                System.out.println("Audio saved to: " + file.getAbsolutePath());
            } catch (IOException e) {
                System.err.println("An error occurred while saving the audio file: " + e.getMessage());
                e.printStackTrace();
            }
        }
    }
  2. Use the custom voice created in the previous step for speech synthesis.

    This example follows the "server commit mode" sample code for system voices in the DashScope SDK. Replace the voice parameter with the custom voice generated by voice design.

    Key principle: The model used for voice design (target_model) must match the model used for subsequent speech synthesis (model). Otherwise, synthesis fails.

    Python

    # coding=utf-8
    # Installation instructions for pyaudio:
    # APPLE Mac OS X
    #   brew install portaudio
    #   pip install pyaudio
    # Debian/Ubuntu
    #   sudo apt-get install python-pyaudio python3-pyaudio
    #   or
    #   pip install pyaudio
    # CentOS
    #   sudo yum install -y portaudio portaudio-devel && pip install pyaudio
    # Microsoft Windows
    #   python -m pip install pyaudio
    
    import pyaudio
    import os
    import base64
    import threading
    import time
    import dashscope  # DashScope Python SDK version must be 1.23.9 or later
    from dashscope.audio.qwen_tts_realtime import QwenTtsRealtime, QwenTtsRealtimeCallback, AudioFormat
    
    # ======= Constant configuration =======
    TEXT_TO_SYNTHESIZE = [
        'Right? I really like this kind of supermarket,',
        'especially during the New Year.',
        'Going to the supermarket',
        'just makes me feel',
        'super, super happy!',
        'I want to buy so many things!'
    ]
    
    def init_dashscope_api_key():
        """
        Initialize the API key for the DashScope SDK.
        """
        # API keys differ between Singapore and Beijing regions. Get an API key: https://www.alibabacloud.com/help/zh/model-studio/get-api-key
        # If the environment variable is not set, replace the following line with your Model Studio API key: dashscope.api_key = "sk-xxx"
        dashscope.api_key = os.getenv("DASHSCOPE_API_KEY")
    
    # ======= Callback class =======
    class MyCallback(QwenTtsRealtimeCallback):
        """
        Custom TTS streaming callback.
        """
        def __init__(self):
            self.complete_event = threading.Event()
            self._player = pyaudio.PyAudio()
            self._stream = self._player.open(
                format=pyaudio.paInt16, channels=1, rate=24000, output=True
            )
    
        def on_open(self) -> None:
            print('[TTS] Connection established')
    
        def on_close(self, close_status_code, close_msg) -> None:
            self._stream.stop_stream()
            self._stream.close()
            self._player.terminate()
            print(f'[TTS] Connection closed, code={close_status_code}, msg={close_msg}')
    
        def on_event(self, response: dict) -> None:
            try:
                event_type = response.get('type', '')
                if event_type == 'session.created':
                    print(f'[TTS] Session started: {response["session"]["id"]}')
                elif event_type == 'response.audio.delta':
                    audio_data = base64.b64decode(response['delta'])
                    self._stream.write(audio_data)
                elif event_type == 'response.done':
                    print(f'[TTS] Response complete, Response ID: {qwen_tts_realtime.get_last_response_id()}')
                elif event_type == 'session.finished':
                    print('[TTS] Session finished')
                    self.complete_event.set()
            except Exception as e:
                print(f'[Error] Exception processing callback event: {e}')
    
        def wait_for_finished(self):
            self.complete_event.wait()
    
    # ======= Main execution logic =======
    if __name__ == '__main__':
        init_dashscope_api_key()
        print('[System] Initializing Qwen TTS Realtime ...')
    
        callback = MyCallback()
        qwen_tts_realtime = QwenTtsRealtime(
            # Use the same model for voice design and speech synthesis
            model="qwen3-tts-vd-realtime-2026-01-15",
            callback=callback,
            # The following is the URL for the Singapore region. If you use a model in the Beijing region, replace the URL with: wss://dashscope.aliyuncs.com/api-ws/v1/realtime
            url='wss://dashscope-intl.aliyuncs.com/api-ws/v1/realtime'
        )
        qwen_tts_realtime.connect()
        
        qwen_tts_realtime.update_session(
            voice="myvoice", # Replace the voice parameter with the custom voice generated by voice design
            response_format=AudioFormat.PCM_24000HZ_MONO_16BIT,
            mode='server_commit'
        )
    
        for text_chunk in TEXT_TO_SYNTHESIZE:
            print(f'[Sending text]: {text_chunk}')
            qwen_tts_realtime.append_text(text_chunk)
            time.sleep(0.1)
    
        qwen_tts_realtime.finish()
        callback.wait_for_finished()
    
        print(f'[Metric] session_id={qwen_tts_realtime.get_session_id()}, '
              f'first_audio_delay={qwen_tts_realtime.get_first_audio_delay()}s')

    Java

    import com.alibaba.dashscope.audio.qwen_tts_realtime.*;
    import com.alibaba.dashscope.exception.NoApiKeyException;
    import com.google.gson.JsonObject;
    
    import javax.sound.sampled.*;
    import java.io.*;
    import java.util.Base64;
    import java.util.Queue;
    import java.util.concurrent.CountDownLatch;
    import java.util.concurrent.atomic.AtomicReference;
    import java.util.concurrent.ConcurrentLinkedQueue;
    import java.util.concurrent.atomic.AtomicBoolean;
    
    public class Main {
        // ===== Constant definitions =====
        private static String[] textToSynthesize = {
                "Right? I really like this kind of supermarket,",
                "especially during the New Year.",
                "Going to the supermarket",
                "just makes me feel",
                "super, super happy!",
                "I want to buy so many things!"
        };
    
        // Real-time audio player class
        public static class RealtimePcmPlayer {
            private int sampleRate;
            private SourceDataLine line;
            private AudioFormat audioFormat;
            private Thread decoderThread;
            private Thread playerThread;
            private AtomicBoolean stopped = new AtomicBoolean(false);
            private Queue<String> b64AudioBuffer = new ConcurrentLinkedQueue<>();
            private Queue<byte[]> RawAudioBuffer = new ConcurrentLinkedQueue<>();
    
            // Constructor initializes audio format and audio line
            public RealtimePcmPlayer(int sampleRate) throws LineUnavailableException {
                this.sampleRate = sampleRate;
                this.audioFormat = new AudioFormat(this.sampleRate, 16, 1, true, false);
                DataLine.Info info = new DataLine.Info(SourceDataLine.class, audioFormat);
                line = (SourceDataLine) AudioSystem.getLine(info);
                line.open(audioFormat);
                line.start();
                decoderThread = new Thread(new Runnable() {
                    @Override
                    public void run() {
                        while (!stopped.get()) {
                            String b64Audio = b64AudioBuffer.poll();
                            if (b64Audio != null) {
                                byte[] rawAudio = Base64.getDecoder().decode(b64Audio);
                                RawAudioBuffer.add(rawAudio);
                            } else {
                                try {
                                    Thread.sleep(100);
                                } catch (InterruptedException e) {
                                    throw new RuntimeException(e);
                                }
                            }
                        }
                    }
                });
                playerThread = new Thread(new Runnable() {
                    @Override
                    public void run() {
                        while (!stopped.get()) {
                            byte[] rawAudio = RawAudioBuffer.poll();
                            if (rawAudio != null) {
                                try {
                                    playChunk(rawAudio);
                                } catch (IOException e) {
                                    throw new RuntimeException(e);
                                } catch (InterruptedException e) {
                                    throw new RuntimeException(e);
                                }
                            } else {
                                try {
                                    Thread.sleep(100);
                                } catch (InterruptedException e) {
                                    throw new RuntimeException(e);
                                }
                            }
                        }
                    }
                });
                decoderThread.start();
                playerThread.start();
            }
    
            // Plays an audio chunk and blocks until playback is complete
            private void playChunk(byte[] chunk) throws IOException, InterruptedException {
                if (chunk == null || chunk.length == 0) return;
    
                int bytesWritten = 0;
                while (bytesWritten < chunk.length) {
                    bytesWritten += line.write(chunk, bytesWritten, chunk.length - bytesWritten);
                }
                int audioLength = chunk.length / (this.sampleRate * 2 / 1000);
                // Wait for the buffered audio to finish playing; guard against a negative sleep for very short chunks
                Thread.sleep(Math.max(0, audioLength - 10));
            }
    
            public void write(String b64Audio) {
                b64AudioBuffer.add(b64Audio);
            }
    
            public void cancel() {
                b64AudioBuffer.clear();
                RawAudioBuffer.clear();
            }
    
            public void waitForComplete() throws InterruptedException {
                while (!b64AudioBuffer.isEmpty() || !RawAudioBuffer.isEmpty()) {
                    Thread.sleep(100);
                }
                line.drain();
            }
    
            public void shutdown() throws InterruptedException {
                stopped.set(true);
                decoderThread.join();
                playerThread.join();
                if (line != null && line.isRunning()) {
                    line.drain();
                    line.close();
                }
            }
        }
    
        public static void main(String[] args) throws Exception {
            QwenTtsRealtimeParam param = QwenTtsRealtimeParam.builder()
                    // Use the same model for voice design and speech synthesis
                    .model("qwen3-tts-vd-realtime-2026-01-15")
                    // The following is the URL for the Singapore region. If you use a model in the Beijing region, replace the URL with: wss://dashscope.aliyuncs.com/api-ws/v1/realtime
                    .url("wss://dashscope-intl.aliyuncs.com/api-ws/v1/realtime")
                    // API keys differ between Singapore and Beijing regions. Get an API key: https://www.alibabacloud.com/help/zh/model-studio/get-api-key
                    // If the environment variable is not set, replace the following line with your Model Studio API key: .apikey("sk-xxx")
                    .apikey(System.getenv("DASHSCOPE_API_KEY"))
                    .build();
            AtomicReference<CountDownLatch> completeLatch = new AtomicReference<>(new CountDownLatch(1));
            final AtomicReference<QwenTtsRealtime> qwenTtsRef = new AtomicReference<>(null);
    
            // Create a real-time audio player instance
            RealtimePcmPlayer audioPlayer = new RealtimePcmPlayer(24000);
    
            QwenTtsRealtime qwenTtsRealtime = new QwenTtsRealtime(param, new QwenTtsRealtimeCallback() {
                @Override
                public void onOpen() {
                    // Handling for when the connection is established
                }
                @Override
                public void onEvent(JsonObject message) {
                    String type = message.get("type").getAsString();
                    switch(type) {
                        case "session.created":
                            // Handling for when the session is created
                            break;
                        case "response.audio.delta":
                            String recvAudioB64 = message.get("delta").getAsString();
                            // Play audio in real time
                            audioPlayer.write(recvAudioB64);
                            break;
                        case "response.done":
                            // Handling for when the response is complete
                            break;
                        case "session.finished":
                            // Handling for when the session is finished
                            completeLatch.get().countDown();
                            break;
                        default:
                            break;
                    }
                }
                @Override
                public void onClose(int code, String reason) {
                    // Handling for when the connection is closed
                }
            });
            qwenTtsRef.set(qwenTtsRealtime);
            try {
                qwenTtsRealtime.connect();
            } catch (NoApiKeyException e) {
                throw new RuntimeException(e);
            }
            QwenTtsRealtimeConfig config = QwenTtsRealtimeConfig.builder()
                    .voice("myvoice") // Replace the voice parameter with the custom voice generated by voice design
                    .responseFormat(QwenTtsRealtimeAudioFormat.PCM_24000HZ_MONO_16BIT)
                    .mode("server_commit")
                    .build();
            qwenTtsRealtime.updateSession(config);
            for (String text:textToSynthesize) {
                qwenTtsRealtime.appendText(text);
                Thread.sleep(100);
            }
            qwenTtsRealtime.finish();
            completeLatch.get().await();
    
            // Wait for audio playback to complete and shut down the player
            audioPlayer.waitForComplete();
            audioPlayer.shutdown();
            System.exit(0);
        }
    }

Non-streaming and unidirectional streaming synthesis

Applies to Qwen3-TTS-VD series models. See Speech synthesis - Qwen.

  1. Create a custom voice and preview it. If satisfied, proceed. Otherwise, recreate.

    Python

    import requests
    import base64
    import os
    
    def create_voice_and_play():
        # API keys differ between Singapore and Beijing regions. Get an API key: https://www.alibabacloud.com/help/zh/model-studio/get-api-key
        # If the environment variable is not set, replace the following line with your Model Studio API key: api_key = "sk-xxx"
        api_key = os.getenv("DASHSCOPE_API_KEY")
        
        if not api_key:
            print("Error: DASHSCOPE_API_KEY environment variable not found. Please set the API key first.")
            return None, None, None
        
        # Prepare request data
        headers = {
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json"
        }
        
        data = {
            "model": "qwen-voice-design",
            "input": {
                "action": "create",
                "target_model": "qwen3-tts-vd-2026-01-26",
                "voice_prompt": "A composed middle-aged male announcer with a deep, rich and magnetic voice, a steady speaking speed and clear articulation, is suitable for news broadcasting or documentary commentary.",
                "preview_text": "Dear listeners, hello everyone. Welcome to the evening news.",
                "preferred_name": "announcer",
                "language": "en"
            },
            "parameters": {
                "sample_rate": 24000,
                "response_format": "wav"
            }
        }
        
        # The following is the URL for the Singapore region. If you use a model in the Beijing region, replace the URL with: https://dashscope.aliyuncs.com/api/v1/services/audio/tts/customization
        url = "https://dashscope-intl.aliyuncs.com/api/v1/services/audio/tts/customization"
        
        try:
            # Send the request
            response = requests.post(
                url,
                headers=headers,
                json=data,
                timeout=60  # Add a timeout setting
            )
            
            if response.status_code == 200:
                result = response.json()
                
                # Get the voice name
                voice_name = result["output"]["voice"]
                print(f"Voice name: {voice_name}")
                
                # Get the preview audio data
                base64_audio = result["output"]["preview_audio"]["data"]
                
                # Decode the Base64 audio data
                audio_bytes = base64.b64decode(base64_audio)
                
                # Save the audio file locally
                filename = f"{voice_name}_preview.wav"
                
                # Write the audio data to a local file
                with open(filename, 'wb') as f:
                    f.write(audio_bytes)
                
                print(f"Audio saved to local file: {filename}")
                print(f"File path: {os.path.abspath(filename)}")
                
                return voice_name, audio_bytes, filename
            else:
                print(f"Request failed with status code: {response.status_code}")
                print(f"Response content: {response.text}")
                return None, None, None
                
        except requests.exceptions.RequestException as e:
            print(f"A network request error occurred: {e}")
            return None, None, None
        except KeyError as e:
            print(f"Response data format error, missing required field: {e}")
            print(f"Response content: {response.text if 'response' in locals() else 'No response'}")
            return None, None, None
        except Exception as e:
            print(f"An unknown error occurred: {e}")
            return None, None, None
    
    if __name__ == "__main__":
        print("Starting to create voice...")
        voice_name, audio_data, saved_filename = create_voice_and_play()
        
        if voice_name:
            print(f"\nSuccessfully created voice '{voice_name}'")
            print(f"Audio file saved as: '{saved_filename}'")
            print(f"File size: {os.path.getsize(saved_filename)} bytes")
        else:
            print("\nVoice creation failed")

    Java

    Add the Gson dependency to your project:

    Maven

    Add the following to your pom.xml:

    <!-- https://mvnrepository.com/artifact/com.google.code.gson/gson -->
    <dependency>
        <groupId>com.google.code.gson</groupId>
        <artifactId>gson</artifactId>
        <version>2.13.1</version>
    </dependency>

    Gradle

    Add the following to your build.gradle:

    // https://mvnrepository.com/artifact/com.google.code.gson/gson
    implementation("com.google.code.gson:gson:2.13.1")
    Important

    To use a custom voice generated by voice design for speech synthesis, configure the voice as follows:

    MultiModalConversationParam param = MultiModalConversationParam.builder()
                    .parameter("voice", "your_voice") // Replace the voice parameter with the custom voice generated by voice design
                    .build();
    import com.google.gson.JsonObject;
    import com.google.gson.JsonParser;
    import java.io.*;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.util.Base64;
    
    public class Main {
        public static void main(String[] args) {
            Main example = new Main();
            example.createVoice();
        }
    
        public void createVoice() {
            // API keys differ between Singapore and Beijing regions. Get an API key: https://www.alibabacloud.com/help/zh/model-studio/get-api-key
            // If the environment variable is not set, replace the following line with your Model Studio API key: String apiKey = "sk-xxx"
            String apiKey = System.getenv("DASHSCOPE_API_KEY");
    
            // Create the JSON request body string
            String jsonBody = "{\n" +
                    "    \"model\": \"qwen-voice-design\",\n" +
                    "    \"input\": {\n" +
                    "        \"action\": \"create\",\n" +
                    "        \"target_model\": \"qwen3-tts-vd-2026-01-26\",\n" +
                    "        \"voice_prompt\": \"A composed middle-aged male announcer with a deep, rich and magnetic voice, a steady speaking speed and clear articulation, is suitable for news broadcasting or documentary commentary.\",\n" +
                    "        \"preview_text\": \"Dear listeners, hello everyone. Welcome to the evening news.\",\n" +
                    "        \"preferred_name\": \"announcer\",\n" +
                    "        \"language\": \"en\"\n" +
                    "    },\n" +
                    "    \"parameters\": {\n" +
                    "        \"sample_rate\": 24000,\n" +
                    "        \"response_format\": \"wav\"\n" +
                    "    }\n" +
                    "}";
    
            HttpURLConnection connection = null;
            try {
                // The following is the URL for the Singapore region. If you use a model in the Beijing region, replace the URL with: https://dashscope.aliyuncs.com/api/v1/services/audio/tts/customization
                URL url = new URL("https://dashscope-intl.aliyuncs.com/api/v1/services/audio/tts/customization");
                connection = (HttpURLConnection) url.openConnection();
    
                // Set the request method and headers
                connection.setRequestMethod("POST");
                connection.setRequestProperty("Authorization", "Bearer " + apiKey);
                connection.setRequestProperty("Content-Type", "application/json");
                connection.setDoOutput(true);
                connection.setDoInput(true);
    
                // Send the request body
                try (OutputStream os = connection.getOutputStream()) {
                    byte[] input = jsonBody.getBytes("UTF-8");
                    os.write(input, 0, input.length);
                    os.flush();
                }
    
                // Get the response
                int responseCode = connection.getResponseCode();
                if (responseCode == HttpURLConnection.HTTP_OK) {
                    // Read the response content
                    StringBuilder response = new StringBuilder();
                    try (BufferedReader br = new BufferedReader(
                            new InputStreamReader(connection.getInputStream(), "UTF-8"))) {
                        String responseLine;
                        while ((responseLine = br.readLine()) != null) {
                            response.append(responseLine.trim());
                        }
                    }
    
                    // Parse the JSON response
                    JsonObject jsonResponse = JsonParser.parseString(response.toString()).getAsJsonObject();
                    JsonObject outputObj = jsonResponse.getAsJsonObject("output");
                    JsonObject previewAudioObj = outputObj.getAsJsonObject("preview_audio");
    
                    // Get the voice name
                    String voiceName = outputObj.get("voice").getAsString();
                    System.out.println("Voice name: " + voiceName);
    
                    // Get the Base64-encoded audio data
                    String base64Audio = previewAudioObj.get("data").getAsString();
    
                    // Decode the Base64 audio data
                    byte[] audioBytes = Base64.getDecoder().decode(base64Audio);
    
                    // Save the audio to a local file
                    String filename = voiceName + "_preview.wav";
                    saveAudioToFile(audioBytes, filename);
    
                    System.out.println("Audio saved to local file: " + filename);
    
                } else {
                    // Read the error response
                    StringBuilder errorResponse = new StringBuilder();
                    try (BufferedReader br = new BufferedReader(
                            new InputStreamReader(connection.getErrorStream(), "UTF-8"))) {
                        String responseLine;
                        while ((responseLine = br.readLine()) != null) {
                            errorResponse.append(responseLine.trim());
                        }
                    }
    
                    System.out.println("Request failed with status code: " + responseCode);
                    System.out.println("Error response: " + errorResponse.toString());
                }
    
            } catch (Exception e) {
                System.err.println("An error occurred during the request: " + e.getMessage());
                e.printStackTrace();
            } finally {
                if (connection != null) {
                    connection.disconnect();
                }
            }
        }
    
        private void saveAudioToFile(byte[] audioBytes, String filename) {
            try {
                File file = new File(filename);
                try (FileOutputStream fos = new FileOutputStream(file)) {
                    fos.write(audioBytes);
                }
                System.out.println("Audio saved to: " + file.getAbsolutePath());
            } catch (IOException e) {
                System.err.println("An error occurred while saving the audio file: " + e.getMessage());
                e.printStackTrace();
            }
        }
    }
  2. Use the custom voice created in the previous step for non-streaming speech synthesis.

    This example follows the "non-streaming output" sample code for system voices in the DashScope SDK. Replace the voice parameter with the custom voice generated by voice design. For unidirectional streaming synthesis, see Speech synthesis - Qwen.

    Key principle: The model used for voice design (target_model) must match the model used for subsequent speech synthesis (model). Otherwise, synthesis fails.
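    This match can be verified programmatically before starting synthesis. The following is a minimal sketch; `check_target_model` is a hypothetical client-side helper, not part of the DashScope SDK:

    ```python
    def check_target_model(create_response: dict, synthesis_model: str) -> None:
        """Raise if the voice was designed for a different synthesis model."""
        target = create_response["output"]["target_model"]
        if target != synthesis_model:
            raise ValueError(
                f"voice was designed for '{target}' but synthesis uses "
                f"'{synthesis_model}'; the two must match"
            )

    # Example with the response shape returned by the create-voice API:
    create_response = {"output": {"target_model": "qwen3-tts-vd-2026-01-26",
                                  "voice": "myvoice"}}
    check_target_model(create_response, "qwen3-tts-vd-2026-01-26")  # passes silently
    ```

    Running this check once after voice creation catches a model mismatch early, instead of at synthesis time.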

    Python

    import os
    import dashscope
    
    
    if __name__ == '__main__':
        # The following is the URL for the Singapore region. If you use a model in the Beijing region, replace the URL with: https://dashscope.aliyuncs.com/api/v1
        dashscope.base_http_api_url = 'https://dashscope-intl.aliyuncs.com/api/v1'
    
        text = "What's the weather like today?"
        # How to use SpeechSynthesizer: dashscope.audio.qwen_tts.SpeechSynthesizer.call(...)
        response = dashscope.MultiModalConversation.call(
            model="qwen3-tts-vd-2026-01-26",
            # API keys differ between Singapore and Beijing regions. Get an API key: https://www.alibabacloud.com/help/zh/model-studio/get-api-key
            # If the environment variable is not set, replace the following line with your Model Studio API key: api_key = "sk-xxx"
            api_key=os.getenv("DASHSCOPE_API_KEY"),
            text=text,
            voice="myvoice", # Replace the voice parameter with the custom voice generated by voice design
            stream=False
        )
        print(response)

    Java

    import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversation;
    import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationParam;
    import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationResult;
    import com.alibaba.dashscope.exception.ApiException;
    import com.alibaba.dashscope.exception.NoApiKeyException;
    import com.alibaba.dashscope.exception.UploadFileException;
    
    import com.alibaba.dashscope.utils.Constants;
    import java.io.FileOutputStream;
    import java.io.InputStream;
    import java.net.URL;
    
    public class Main {
        private static final String MODEL = "qwen3-tts-vd-2026-01-26";
        public static void call() throws ApiException, NoApiKeyException, UploadFileException {
            MultiModalConversation conv = new MultiModalConversation();
            MultiModalConversationParam param = MultiModalConversationParam.builder()
                    // API keys differ between Singapore and Beijing regions. Get an API key: https://www.alibabacloud.com/help/zh/model-studio/get-api-key
                    // If the environment variable is not set, replace the following line with your Model Studio API key: .apiKey("sk-xxx")
                    .apiKey(System.getenv("DASHSCOPE_API_KEY"))
                    .model(MODEL)
                    .text("Today is a wonderful day to build something people love!")
                    .parameter("voice", "myvoice") // Replace the voice parameter with the custom voice generated by voice design
                    .build();
            MultiModalConversationResult result = conv.call(param);
            String audioUrl = result.getOutput().getAudio().getUrl();
            System.out.print(audioUrl);
    
            // Download the audio file locally
            try (InputStream in = new URL(audioUrl).openStream();
                 FileOutputStream out = new FileOutputStream("downloaded_audio.wav")) {
                byte[] buffer = new byte[1024];
                int bytesRead;
                while ((bytesRead = in.read(buffer)) != -1) {
                    out.write(buffer, 0, bytesRead);
                }
                System.out.println("\nAudio file downloaded locally: downloaded_audio.wav");
            } catch (Exception e) {
                System.out.println("\nError downloading audio file: " + e.getMessage());
            }
        }
        public static void main(String[] args) {
            try {
                // The following is the URL for the Singapore region. If you use a model in the Beijing region, replace the URL with: https://dashscope.aliyuncs.com/api/v1
                Constants.baseHttpApiUrl = "https://dashscope-intl.aliyuncs.com/api/v1";
                call();
            } catch (ApiException | NoApiKeyException | UploadFileException e) {
                System.out.println(e.getMessage());
            }
            System.exit(0);
        }
    }

API reference

Use the same account for all API operations.

Create voice

Submit a voice description and preview text to create a custom voice.

  • URL

    Chinese mainland:

    POST https://dashscope.aliyuncs.com/api/v1/services/audio/tts/customization

    International:

    POST https://dashscope-intl.aliyuncs.com/api/v1/services/audio/tts/customization
  • Request headers

    | Parameter | Type | Required | Description |
    | --- | --- | --- | --- |
    | Authorization | string | Yes | Authentication token, formatted as Bearer <your_api_key>. Replace <your_api_key> with your actual API key. |
    | Content-Type | string | Yes | Media type of data transmitted in the request body. Fixed value: application/json. |

  • Request body

    The request body includes all parameters. Optional fields can be omitted based on your needs.

    Important

    Distinguish the following parameters:

    • model: Voice design model. Fixed value: qwen-voice-design.

    • target_model: Speech synthesis model driving this voice. Must match the speech synthesis model used in subsequent calls, or synthesis fails.

    {
        "model": "qwen-voice-design",
        "input": {
            "action": "create",
            "target_model": "qwen3-tts-vd-realtime-2026-01-15",
            "voice_prompt": "A composed middle-aged male announcer with a deep, rich and magnetic voice, a steady speaking speed and clear articulation, is suitable for news broadcasting or documentary commentary.",
            "preview_text": "Dear listeners, hello everyone. Welcome to the evening news.",
            "preferred_name": "announcer",
            "language": "zh"
        },
        "parameters": {
            "sample_rate": 24000,
            "response_format": "wav"
        }
    }
  • Request parameters

    | Parameter | Type | Default | Required | Description |
    | --- | --- | --- | --- | --- |
    | model | string | - | Yes | Voice design model. Fixed value: qwen-voice-design. |
    | action | string | - | Yes | Action type. Fixed value: create. |
    | target_model | string | - | Yes | Speech synthesis model driving this voice: qwen3-tts-vd-2026-01-26 (non-streaming and unidirectional streaming) or qwen3-tts-vd-realtime-2026-01-15 (real-time), as used in the sample code. Must match the speech synthesis model used in subsequent calls, or synthesis fails. |
    | voice_prompt | string | - | Yes | Voice description. Maximum length: 2,048 characters. Supports Chinese and English only. For guidance on writing voice descriptions, see "How to write high-quality voice descriptions?". |
    | preview_text | string | - | Yes | Text for the preview audio. Maximum length: 1,024 characters. Supported languages: Chinese (zh), English (en), German (de), Italian (it), Portuguese (pt), Spanish (es), Japanese (ja), Korean (ko), French (fr), Russian (ru). |
    | preferred_name | string | - | Yes | Name to identify the voice (alphanumeric characters and underscores only, up to 16 characters). Choose a name related to the role or scenario. This keyword appears in the generated voice name. Example: keyword "announcer" → voice name "qwen-tts-vd-announcer-voice-20251201102800-a1b2". |
    | language | string | zh | No | Language code specifying the language preference for the generated voice. This affects language-specific features and pronunciation tendencies. If specified, this language must match the preview_text language. Valid values: zh (Chinese), en (English), de (German), it (Italian), pt (Portuguese), es (Spanish), ja (Japanese), ko (Korean), fr (French), ru (Russian). |
    | sample_rate | int | 24000 | No | Sample rate (Hz) for the preview audio generated by voice design. Valid values: 8000, 16000, 24000, 48000. |
    | response_format | string | wav | No | Audio format for the preview audio generated by voice design. Valid values: pcm, wav, mp3, opus. |
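    The documented limits can be checked client-side before sending the request. The sketch below is a convenience helper of our own (`validate_create_input` is not an SDK function); it mirrors the constraints listed above:

    ```python
    import re

    SUPPORTED_LANGUAGES = {"zh", "en", "de", "it", "pt", "es", "ja", "ko", "fr", "ru"}

    def validate_create_input(inp: dict) -> list:
        """Return a list of violations of the documented input limits (empty if OK)."""
        errors = []
        if not 0 < len(inp.get("voice_prompt", "")) <= 2048:
            errors.append("voice_prompt must be 1-2,048 characters")
        if not 0 < len(inp.get("preview_text", "")) <= 1024:
            errors.append("preview_text must be 1-1,024 characters")
        if not re.fullmatch(r"[A-Za-z0-9_]{1,16}", inp.get("preferred_name", "")):
            errors.append("preferred_name must be 1-16 alphanumeric/underscore characters")
        if inp.get("language", "zh") not in SUPPORTED_LANGUAGES:
            errors.append("unsupported language code")
        return errors

    ok = {"voice_prompt": "A deep, composed male announcer.",
          "preview_text": "Welcome to the evening news.",
          "preferred_name": "announcer", "language": "en"}
    print(validate_create_input(ok))  # []
    ```

    Failing fast on these limits locally avoids a round trip that the server would reject anyway.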

  • Response parameters

    Response example:

    {
        "output": {
            "preview_audio": {
                "data": "{base64_encoded_audio}",
                "sample_rate": 24000,
                "response_format": "wav"
            },
            "target_model": "qwen3-tts-vd-realtime-2026-01-15",
            "voice": "yourVoice"
        },
        "usage": {
            "count": 1
        },
        "request_id": "yourRequestId"
    }

    Key parameters:

    | Parameter | Type | Description |
    | --- | --- | --- |
    | voice | string | Voice name. Use directly as the voice parameter in speech synthesis APIs. |
    | data | string | Preview audio data generated by voice design, returned as a Base64-encoded string. |
    | sample_rate | int | Sample rate (Hz) of the preview audio. Matches the sample rate requested when creating the voice; defaults to 24000 Hz if unspecified. |
    | response_format | string | Audio format of the preview audio. Matches the format requested when creating the voice; defaults to wav if unspecified. |
    | target_model | string | Speech synthesis model driving this voice. Must match the speech synthesis model used in subsequent calls, or synthesis fails. |
    | request_id | string | Request ID. |
    | count | integer | Number of billable "Create voice" operations performed by this request. For voice creation, count is always 1. |
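    Putting the fields above together, the following is a minimal sketch of parsing the response. The Base64 payload here is a stand-in, not real audio data:

    ```python
    import base64
    import json

    # Stand-in response mirroring the example above; "data" holds fake payload bytes.
    raw = json.dumps({
        "output": {
            "preview_audio": {
                "data": base64.b64encode(b"RIFF-fake-wav-bytes").decode("ascii"),
                "sample_rate": 24000,
                "response_format": "wav",
            },
            "target_model": "qwen3-tts-vd-realtime-2026-01-15",
            "voice": "yourVoice",
        },
        "usage": {"count": 1},
        "request_id": "yourRequestId",
    })

    result = json.loads(raw)
    voice = result["output"]["voice"]  # reuse as the `voice` parameter in synthesis calls
    audio = base64.b64decode(result["output"]["preview_audio"]["data"])
    ext = result["output"]["preview_audio"]["response_format"]
    # To save the preview: open(f"{voice}_preview.{ext}", "wb").write(audio)
    ```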

  • Sample code

    Important

    Distinguish the following parameters:

    • model: Voice design model. Fixed value: qwen-voice-design.

    • target_model: Speech synthesis model driving this voice. Must match the speech synthesis model used in subsequent calls, or synthesis fails.

    cURL

    If you have not configured the API key in an environment variable, replace $DASHSCOPE_API_KEY in the example with your actual API key.

    # ======= Important note =======
    # The following is the URL for the Singapore region. If you use a model in the Beijing region, replace the URL with: https://dashscope.aliyuncs.com/api/v1/services/audio/tts/customization
    # API keys differ between Singapore and Beijing regions. Get an API key: https://www.alibabacloud.com/help/zh/model-studio/get-api-key
    # === Delete this comment before execution ===
    
    curl -X POST https://dashscope-intl.aliyuncs.com/api/v1/services/audio/tts/customization \
    -H "Authorization: Bearer $DASHSCOPE_API_KEY" \
    -H "Content-Type: application/json" \
    -d '{
        "model": "qwen-voice-design",
        "input": {
            "action": "create",
            "target_model": "qwen3-tts-vd-realtime-2026-01-15",
            "voice_prompt": "A composed middle-aged male announcer with a deep, rich and magnetic voice, a steady speaking speed and clear articulation, is suitable for news broadcasting or documentary commentary.",
            "preview_text": "Dear listeners, hello everyone. Welcome to the evening news.",
            "preferred_name": "announcer",
            "language": "zh"
        },
        "parameters": {
            "sample_rate": 24000,
            "response_format": "wav"
        }
    }'

    Python

    import requests
    import base64
    import os
    
    def create_voice_and_play():
        # API keys differ between Singapore and Beijing regions. Get an API key: https://www.alibabacloud.com/help/zh/model-studio/get-api-key
        # If the environment variable is not set, replace the following line with your Model Studio API key: api_key = "sk-xxx"
        api_key = os.getenv("DASHSCOPE_API_KEY")
        
        if not api_key:
            print("Error: DASHSCOPE_API_KEY environment variable not found. Please set the API key first.")
            return None, None, None
        
        # Prepare request data
        headers = {
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json"
        }
        
        data = {
            "model": "qwen-voice-design",
            "input": {
                "action": "create",
                "target_model": "qwen3-tts-vd-realtime-2026-01-15",
                "voice_prompt": "A composed middle-aged male announcer with a deep, rich and magnetic voice, a steady speaking speed and clear articulation, is suitable for news broadcasting or documentary commentary.",
                "preview_text": "Dear listeners, hello everyone. Welcome to the evening news.",
                "preferred_name": "announcer",
                "language": "en"
            },
            "parameters": {
                "sample_rate": 24000,
                "response_format": "wav"
            }
        }
        
        # The following is the URL for the Singapore region. If you use a model in the Beijing region, replace the URL with: https://dashscope.aliyuncs.com/api/v1/services/audio/tts/customization
        url = "https://dashscope-intl.aliyuncs.com/api/v1/services/audio/tts/customization"
        
        try:
            # Send the request
            response = requests.post(
                url,
                headers=headers,
                json=data,
                timeout=60  # Add a timeout setting
            )
            
            if response.status_code == 200:
                result = response.json()
                
                # Get the voice name
                voice_name = result["output"]["voice"]
                print(f"Voice name: {voice_name}")
                
                # Get the preview audio data
                base64_audio = result["output"]["preview_audio"]["data"]
                
                # Decode the Base64 audio data
                audio_bytes = base64.b64decode(base64_audio)
                
                # Save the audio file locally
                filename = f"{voice_name}_preview.wav"
                
                # Write the audio data to a local file
                with open(filename, 'wb') as f:
                    f.write(audio_bytes)
                
                print(f"Audio saved to local file: {filename}")
                print(f"File path: {os.path.abspath(filename)}")
                
                return voice_name, audio_bytes, filename
            else:
                print(f"Request failed with status code: {response.status_code}")
                print(f"Response content: {response.text}")
                return None, None, None
                
        except requests.exceptions.RequestException as e:
            print(f"A network request error occurred: {e}")
            return None, None, None
        except KeyError as e:
            print(f"Response data format error, missing required field: {e}")
            print(f"Response content: {response.text if 'response' in locals() else 'No response'}")
            return None, None, None
        except Exception as e:
            print(f"An unknown error occurred: {e}")
            return None, None, None
    
    if __name__ == "__main__":
        print("Starting to create voice...")
        voice_name, audio_data, saved_filename = create_voice_and_play()
        
        if voice_name:
            print(f"\nSuccessfully created voice '{voice_name}'")
            print(f"Audio file saved as: '{saved_filename}'")
            print(f"File size: {os.path.getsize(saved_filename)} bytes")
        else:
            print("\nVoice creation failed")

    Java

    import com.google.gson.JsonObject;
    import com.google.gson.JsonParser;
    import java.io.*;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.util.Base64;
    
    public class Main {
        public static void main(String[] args) {
            Main example = new Main();
            example.createVoice();
        }
    
        public void createVoice() {
            // API keys differ between Singapore and Beijing regions. Get an API key: https://www.alibabacloud.com/help/zh/model-studio/get-api-key
            // If the environment variable is not set, replace the following line with your Model Studio API key: String apiKey = "sk-xxx"
            String apiKey = System.getenv("DASHSCOPE_API_KEY");
    
            // Create the JSON request body string
            String jsonBody = "{\n" +
                    "    \"model\": \"qwen-voice-design\",\n" +
                    "    \"input\": {\n" +
                    "        \"action\": \"create\",\n" +
                    "        \"target_model\": \"qwen3-tts-vd-realtime-2026-01-15\",\n" +
                    "        \"voice_prompt\": \"A composed middle-aged male announcer with a deep, rich and magnetic voice, a steady speaking speed and clear articulation, is suitable for news broadcasting or documentary commentary.\",\n" +
                    "        \"preview_text\": \"Dear listeners, hello everyone. Welcome to the evening news.\",\n" +
                    "        \"preferred_name\": \"announcer\",\n" +
                    "        \"language\": \"en\"\n" +
                    "    },\n" +
                    "    \"parameters\": {\n" +
                    "        \"sample_rate\": 24000,\n" +
                    "        \"response_format\": \"wav\"\n" +
                    "    }\n" +
                    "}";
    
            HttpURLConnection connection = null;
            try {
                // The following is the URL for the Singapore region. If you use a model in the Beijing region, replace the URL with: https://dashscope.aliyuncs.com/api/v1/services/audio/tts/customization
                URL url = new URL("https://dashscope-intl.aliyuncs.com/api/v1/services/audio/tts/customization");
                connection = (HttpURLConnection) url.openConnection();
    
                // Set the request method and headers
                connection.setRequestMethod("POST");
                connection.setRequestProperty("Authorization", "Bearer " + apiKey);
                connection.setRequestProperty("Content-Type", "application/json");
                connection.setDoOutput(true);
                connection.setDoInput(true);
    
                // Send the request body
                try (OutputStream os = connection.getOutputStream()) {
                    byte[] input = jsonBody.getBytes("UTF-8");
                    os.write(input, 0, input.length);
                    os.flush();
                }
    
                // Get the response
                int responseCode = connection.getResponseCode();
                if (responseCode == HttpURLConnection.HTTP_OK) {
                    // Read the response content
                    StringBuilder response = new StringBuilder();
                    try (BufferedReader br = new BufferedReader(
                            new InputStreamReader(connection.getInputStream(), "UTF-8"))) {
                        String responseLine;
                        while ((responseLine = br.readLine()) != null) {
                            response.append(responseLine.trim());
                        }
                    }
    
                    // Parse the JSON response
                    JsonObject jsonResponse = JsonParser.parseString(response.toString()).getAsJsonObject();
                    JsonObject outputObj = jsonResponse.getAsJsonObject("output");
                    JsonObject previewAudioObj = outputObj.getAsJsonObject("preview_audio");
    
                    // Get the voice name
                    String voiceName = outputObj.get("voice").getAsString();
                    System.out.println("Voice name: " + voiceName);
    
                    // Get the Base64-encoded audio data
                    String base64Audio = previewAudioObj.get("data").getAsString();
    
                    // Decode the Base64 audio data
                    byte[] audioBytes = Base64.getDecoder().decode(base64Audio);
    
                    // Save the audio to a local file
                    String filename = voiceName + "_preview.wav";
                    saveAudioToFile(audioBytes, filename);
    
                    System.out.println("Audio saved to local file: " + filename);
    
                } else {
                    // Read the error response
                    StringBuilder errorResponse = new StringBuilder();
                    try (BufferedReader br = new BufferedReader(
                            new InputStreamReader(connection.getErrorStream(), "UTF-8"))) {
                        String responseLine;
                        while ((responseLine = br.readLine()) != null) {
                            errorResponse.append(responseLine.trim());
                        }
                    }
    
                    System.out.println("Request failed with status code: " + responseCode);
                    System.out.println("Error response: " + errorResponse.toString());
                }
    
            } catch (Exception e) {
                System.err.println("An error occurred during the request: " + e.getMessage());
                e.printStackTrace();
            } finally {
                if (connection != null) {
                    connection.disconnect();
                }
            }
        }
    
        private void saveAudioToFile(byte[] audioBytes, String filename) {
            try {
                File file = new File(filename);
                try (FileOutputStream fos = new FileOutputStream(file)) {
                    fos.write(audioBytes);
                }
                System.out.println("Audio saved to: " + file.getAbsolutePath());
            } catch (IOException e) {
                System.err.println("An error occurred while saving the audio file: " + e.getMessage());
                e.printStackTrace();
            }
        }
    }

List voices

Returns a paginated list of all voices created under your account.

  • URL

    Chinese mainland:

    POST https://dashscope.aliyuncs.com/api/v1/services/audio/tts/customization

    International:

    POST https://dashscope-intl.aliyuncs.com/api/v1/services/audio/tts/customization
  • Request headers

    Authorization (string, required): Authentication token, formatted as Bearer <your_api_key>. Replace <your_api_key> with your actual API key.

    Content-Type (string, required): Media type of the data transmitted in the request body. Fixed value: application/json.

  • Request body

    The request body contains all parameters. Omit optional fields as needed.

    Important

    model: Voice design model. Fixed to qwen-voice-design. Do not change this value.

    {
        "model": "qwen-voice-design",
        "input": {
            "action": "list",
            "page_size": 10,
            "page_index": 0
        }
    }
  • Request parameters

    model (string, required): Voice design model. Fixed value: qwen-voice-design.

    action (string, required): Action type. Fixed value: list.

    page_index (integer, optional, default 0): Page number. Range: 0–200.

    page_size (integer, optional, default 10): Number of entries per page. Must be greater than 0.

  • Response parameters

    Response example:

    {
        "output": {
            "page_index": 0,
            "page_size": 2,
            "total_count": 26,
            "voice_list": [
                {
                    "gmt_create": "2025-12-10 17:04:54",
                    "gmt_modified": "2025-12-10 17:04:54",
                    "language": "zh",
                    "preview_text": "Dear listeners, hello everyone. Welcome to today's program.",
                    "target_model": "qwen3-tts-vd-realtime-2026-01-15",
                    "voice": "yourVoice1",
                    "voice_prompt": "A composed middle-aged male announcer with a deep, rich and magnetic voice, a steady speaking speed and clear articulation, suitable for news broadcasting or documentary commentary. Deep and magnetic, steady speaking speed."
                },
                {
                    "gmt_create": "2025-12-10 15:31:35",
                    "gmt_modified": "2025-12-10 15:31:35",
                    "language": "zh",
                    "preview_text": "Dear listeners, hello everyone.",
                    "target_model": "qwen3-tts-vd-realtime-2026-01-15",
                    "voice": "yourVoice2",
                    "voice_prompt": "A composed middle-aged male announcer with a deep, rich and magnetic voice, a steady speaking speed and clear articulation, suitable for news broadcasting or documentary commentary."
                }
            ]
        },
        "usage": {},
        "request_id": "yourRequestId"
    }

    Key parameters:

    voice (string): Voice name. Use it directly as the voice parameter in speech synthesis APIs.

    target_model (string): Speech synthesis model that drives this voice (one of two supported model types). It must match the speech synthesis model used in subsequent calls; otherwise, synthesis fails.

    language (string): Language code. Valid values: zh (Chinese), en (English), de (German), it (Italian), pt (Portuguese), es (Spanish), ja (Japanese), ko (Korean), fr (French), ru (Russian).

    voice_prompt (string): Voice description.

    preview_text (string): Preview text.

    gmt_create (string): Time when the voice was created.

    gmt_modified (string): Time when the voice was last modified.

    page_index (integer): Page number.

    page_size (integer): Number of entries per page.

    total_count (integer): Total number of records returned by the query.

    request_id (string): Request ID.

  • Sample code

    Important

    model: Voice design model. Fixed to qwen-voice-design. Do not change this value.

    cURL

    If you have not set the API key as an environment variable, you must replace $DASHSCOPE_API_KEY in the example with your actual API key.

    # ======= Important notice =======
    # This URL is for the Singapore region. If you use the China (Beijing) region, replace the URL with: https://dashscope.aliyuncs.com/api/v1/services/audio/tts/customization
    # API keys differ between the Singapore and China (Beijing) regions. Get your API key: https://www.alibabacloud.com/help/zh/model-studio/get-api-key
    # === Remove this comment before running ===
    
    curl -X POST https://dashscope-intl.aliyuncs.com/api/v1/services/audio/tts/customization \
    -H "Authorization: Bearer $DASHSCOPE_API_KEY" \
    -H "Content-Type: application/json" \
    -d '{
        "model": "qwen-voice-design",
        "input": {
            "action": "list",
            "page_size": 10,
            "page_index": 0
        }
    }'

    Python

    import os
    import requests
    
    # API keys differ between the Singapore and China (Beijing) regions. Get your API key: https://www.alibabacloud.com/help/zh/model-studio/get-api-key
    # If you have not set an environment variable, replace the next line with: api_key = "sk-xxx"
    api_key = os.getenv("DASHSCOPE_API_KEY")
    # This URL is for the Singapore region. If you use the China (Beijing) region, replace the URL with: https://dashscope.aliyuncs.com/api/v1/services/audio/tts/customization
    url = "https://dashscope-intl.aliyuncs.com/api/v1/services/audio/tts/customization"
    
    payload = {
        "model": "qwen-voice-design", # Do not change this value
        "input": {
            "action": "list",
            "page_size": 10,
            "page_index": 0
        }
    }
    
    headers = {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json"
    }
    
    response = requests.post(url, json=payload, headers=headers)
    
    print("HTTP status code:", response.status_code)
    
    if response.status_code == 200:
        data = response.json()
        voice_list = data["output"]["voice_list"]
    
        print("List of voices:")
        for item in voice_list:
            print(f"- Voice: {item['voice']}  Created: {item['gmt_create']}  Model: {item['target_model']}")
    else:
        print("Request failed:", response.text)

    Java

    import com.google.gson.Gson;
    import com.google.gson.JsonArray;
    import com.google.gson.JsonObject;
    
    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.io.OutputStream;
    import java.net.HttpURLConnection;
    import java.net.URL;
    
    public class Main {
        public static void main(String[] args) {
            // API keys differ between the Singapore and China (Beijing) regions. Get your API key: https://www.alibabacloud.com/help/zh/model-studio/get-api-key
            // If you have not set an environment variable, replace the next line with: String apiKey = "sk-xxx"
            String apiKey = System.getenv("DASHSCOPE_API_KEY");
            // This URL is for the Singapore region. If you use the China (Beijing) region, replace the URL with: https://dashscope.aliyuncs.com/api/v1/services/audio/tts/customization
            String apiUrl = "https://dashscope-intl.aliyuncs.com/api/v1/services/audio/tts/customization";
    
            // JSON request body (string concatenation is used because text blocks (""") require Java 15 or later)
            String jsonPayload =
                    "{"
                            + "\"model\": \"qwen-voice-design\"," // Do not change this value
                            + "\"input\": {"
                            +     "\"action\": \"list\","
                            +     "\"page_size\": 10,"
                            +     "\"page_index\": 0"
                            + "}"
                            + "}";
    
            try {
                HttpURLConnection con = (HttpURLConnection) new URL(apiUrl).openConnection();
                con.setRequestMethod("POST");
                con.setRequestProperty("Authorization", "Bearer " + apiKey);
                con.setRequestProperty("Content-Type", "application/json");
                con.setDoOutput(true);
    
                try (OutputStream os = con.getOutputStream()) {
                    os.write(jsonPayload.getBytes("UTF-8"));
                }
    
                int status = con.getResponseCode();
                BufferedReader br = new BufferedReader(new InputStreamReader(
                        status >= 200 && status < 300 ? con.getInputStream() : con.getErrorStream(), "UTF-8"));
    
                StringBuilder response = new StringBuilder();
                String line;
                while ((line = br.readLine()) != null) {
                    response.append(line);
                }
                br.close();
    
                System.out.println("HTTP status code: " + status);
                System.out.println("Response JSON: " + response.toString());
    
                if (status == 200) {
                    Gson gson = new Gson();
                    JsonObject jsonObj = gson.fromJson(response.toString(), JsonObject.class);
                    JsonArray voiceList = jsonObj.getAsJsonObject("output").getAsJsonArray("voice_list");
    
                    System.out.println("\nList of voices:");
                    for (int i = 0; i < voiceList.size(); i++) {
                        JsonObject voiceItem = voiceList.get(i).getAsJsonObject();
                        String voice = voiceItem.get("voice").getAsString();
                        String gmtCreate = voiceItem.get("gmt_create").getAsString();
                        String targetModel = voiceItem.get("target_model").getAsString();
    
                        System.out.printf("- Voice: %s  Created: %s  Model: %s\n",
                                voice, gmtCreate, targetModel);
                    }
                }
    
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    }
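The samples above fetch a single page. Because each response carries total_count, you can walk page_index from 0 upward until every record is covered. The following Python sketch keeps the paging arithmetic separate from the HTTP call: fetch_page is a stand-in for the POST request shown in the Python sample above (it must return the response's "output" object), so the helper names here are illustrative, not part of the API.

```python
import math


def page_count(total_count: int, page_size: int) -> int:
    """Number of pages needed to cover total_count records."""
    return math.ceil(total_count / page_size) if total_count > 0 else 0


def list_all_voices(fetch_page, page_size: int = 10) -> list:
    """Collect every voice by requesting page_index 0, 1, ... in turn.

    fetch_page(page_index, page_size) must return the "output" object of a
    list response: a dict with "total_count" and "voice_list" keys.
    """
    first = fetch_page(0, page_size)
    voices = list(first["voice_list"])
    for page_index in range(1, page_count(first["total_count"], page_size)):
        voices.extend(fetch_page(page_index, page_size)["voice_list"])
    return voices
```

In practice, fetch_page would POST {"model": "qwen-voice-design", "input": {"action": "list", "page_size": page_size, "page_index": page_index}} to the customization endpoint, exactly as in the Python sample above, and return response.json()["output"].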

Query a specific voice

Queries detailed information about a specific voice by its name.

  • URL

    Chinese mainland:

    POST https://dashscope.aliyuncs.com/api/v1/services/audio/tts/customization

    International:

    POST https://dashscope-intl.aliyuncs.com/api/v1/services/audio/tts/customization
  • Request headers

    Authorization (string, required): Authentication token, formatted as Bearer <your_api_key>. Replace <your_api_key> with your actual API key.

    Content-Type (string, required): Media type of the data transmitted in the request body. Fixed value: application/json.

  • Request body

    The request body contains all request parameters. You can omit optional fields as needed.

    Important

    model: Voice design model. Fixed to qwen-voice-design. Do not change this value.

    {
        "model": "qwen-voice-design",
        "input": {
            "action": "query",
            "voice": "voiceName"
        }
    }
  • Request parameters

    model (string, required): Voice design model. Fixed value: qwen-voice-design.

    action (string, required): Action type. Fixed value: query.

    voice (string, required): The name of the voice to query.

  • Response parameters

    Response examples:

    Data found

    {
        "output": {
            "gmt_create": "2025-12-10 14:54:09",
            "gmt_modified": "2025-12-10 17:47:48",
            "language": "zh",
            "preview_text": "Hello, dear listeners.",
            "target_model": "qwen3-tts-vd-realtime-2026-01-15",
            "voice": "yourVoice",
            "voice_prompt": "A calm, middle-aged male announcer with a deep, rich, and magnetic voice. His speaking rate is steady and his articulation is clear. Suitable for news broadcasts or documentary narration."
        },
        "usage": {},
        "request_id": "yourRequestId"
    }

    No data found

    If the queried voice does not exist, the API returns an HTTP 400 status code and the response body contains the VoiceNotFound error code.

    {
        "request_id":"yourRequestId",
        "code":"VoiceNotFound",
        "message":"Voice not found: qwen-tts-vd-announcer-voice-xxxx"
    }

    Key parameters:

    voice (string): Voice name. Use it directly as the voice parameter in speech synthesis APIs.

    target_model (string): Speech synthesis model that drives this voice (one of two supported model types). It must match the speech synthesis model used in subsequent calls; otherwise, synthesis fails.

    language (string): Language code. Valid values: zh (Chinese), en (English), de (German), it (Italian), pt (Portuguese), es (Spanish), ja (Japanese), ko (Korean), fr (French), ru (Russian).

    voice_prompt (string): The voice description.

    preview_text (string): The preview text.

    gmt_create (string): The time when the voice was created.

    gmt_modified (string): The time when the voice was last modified.

    request_id (string): The request ID.

  • Sample code

    Important

    model: Voice design model. Fixed to qwen-voice-design. Do not change this value.

    cURL

    If you have not set the API key as an environment variable, you must replace $DASHSCOPE_API_KEY in the example with your actual API key.

    # ======= Important =======
    # The following URL is for the Singapore region. If you use a model in the China (Beijing) region, replace the URL with: https://dashscope.aliyuncs.com/api/v1/services/audio/tts/customization
    # The API keys for the Singapore and China (Beijing) regions are different. To get an API key, see https://www.alibabacloud.com/help/en/model-studio/get-api-key
    # === Delete this comment before running the command. ===
    
    curl -X POST https://dashscope-intl.aliyuncs.com/api/v1/services/audio/tts/customization \
    -H "Authorization: Bearer $DASHSCOPE_API_KEY" \
    -H "Content-Type: application/json" \
    -d '{
        "model": "qwen-voice-design",
        "input": {
            "action": "query",
            "voice": "voiceName"
        }
    }'

    Python

    import requests
    import os
    
    def query_voice(voice_name):
        """
        Queries information about a specific voice.
        :param voice_name: The name of the voice.
        :return: A dictionary that contains the voice information, or None if the voice is not found.
        """
        # The API keys for the Singapore and China (Beijing) regions are different. To get an API key, see https://www.alibabacloud.com/help/en/model-studio/get-api-key
        # If you have not configured the environment variable, replace the following line with your Model Studio API key: api_key = "sk-xxx"
        api_key = os.getenv("DASHSCOPE_API_KEY")
        
        # Prepare the request data.
        headers = {
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json"
        }
        
        data = {
            "model": "qwen-voice-design",
            "input": {
                "action": "query",
                "voice": voice_name
            }
        }
        
        # The following URL is for the Singapore region. If you use a model in the China (Beijing) region, replace the URL with: https://dashscope.aliyuncs.com/api/v1/services/audio/tts/customization
        url = "https://dashscope-intl.aliyuncs.com/api/v1/services/audio/tts/customization"
        # Send the request.
        response = requests.post(
            url,
            headers=headers,
            json=data
        )
        
        if response.status_code == 200:
            result = response.json()
            
            # Check for error messages.
            if "code" in result and result["code"] == "VoiceNotFound":
                print(f"Voice not found: {voice_name}")
                print(f"Error message: {result.get('message', 'Voice not found')}")
                return None
            
            # Get the voice information.
            voice_info = result["output"]
            print(f"Successfully queried voice information:")
            print(f"  Voice name: {voice_info.get('voice')}")
            print(f"  Creation time: {voice_info.get('gmt_create')}")
            print(f"  Modification time: {voice_info.get('gmt_modified')}")
            print(f"  Language: {voice_info.get('language')}")
            print(f"  Preview text: {voice_info.get('preview_text')}")
            print(f"  Model: {voice_info.get('target_model')}")
            print(f"  Voice description: {voice_info.get('voice_prompt')}")
            
            return voice_info
        else:
            # The API returns HTTP 400 with the VoiceNotFound error code when the voice does not exist.
            try:
                result = response.json()
                if result.get("code") == "VoiceNotFound":
                    print(f"Voice not found: {voice_name}")
                    print(f"Error message: {result.get('message', 'Voice not found')}")
                    return None
            except ValueError:
                pass
            print(f"Request failed, status code: {response.status_code}")
            print(f"Response content: {response.text}")
            return None
    
    def main():
        # Example: Query a voice.
        voice_name = "myvoice"  # Replace with the actual name of the voice you want to query.
        
        print(f"Querying voice: {voice_name}")
        voice_info = query_voice(voice_name)
        
        if voice_info:
            print("\nVoice queried successfully!")
        else:
            print("\nFailed to query the voice or the voice does not exist.")
    
    if __name__ == "__main__":
        main()

    Java

    import com.google.gson.JsonObject;
    import com.google.gson.JsonParser;
    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.io.OutputStream;
    import java.net.HttpURLConnection;
    import java.net.URL;
    
    public class Main {
    
        public static void main(String[] args) {
            Main example = new Main();
            // Example: Query a voice.
            String voiceName = "myvoice"; // Replace with the actual name of the voice you want to query.
            System.out.println("Querying voice: " + voiceName);
            example.queryVoice(voiceName);
        }
    
        public void queryVoice(String voiceName) {
            // The API keys for the Singapore and China (Beijing) regions are different. To get an API key, see https://www.alibabacloud.com/help/en/model-studio/get-api-key
            // If you have not configured the environment variable, replace the following line with your Model Studio API key: String apiKey = "sk-xxx"
            String apiKey = System.getenv("DASHSCOPE_API_KEY");
    
            // Create the JSON request body string.
            String jsonBody = "{\n" +
                    "    \"model\": \"qwen-voice-design\",\n" +
                    "    \"input\": {\n" +
                    "        \"action\": \"query\",\n" +
                    "        \"voice\": \"" + voiceName + "\"\n" +
                    "    }\n" +
                    "}";
    
            HttpURLConnection connection = null;
            try {
                // The following URL is for the Singapore region. If you use a model in the China (Beijing) region, replace the URL with: https://dashscope.aliyuncs.com/api/v1/services/audio/tts/customization
                URL url = new URL("https://dashscope-intl.aliyuncs.com/api/v1/services/audio/tts/customization");
                connection = (HttpURLConnection) url.openConnection();
    
                // Set the request method and headers.
                connection.setRequestMethod("POST");
                connection.setRequestProperty("Authorization", "Bearer " + apiKey);
                connection.setRequestProperty("Content-Type", "application/json");
                connection.setDoOutput(true);
                connection.setDoInput(true);
    
                // Send the request body.
                try (OutputStream os = connection.getOutputStream()) {
                    byte[] input = jsonBody.getBytes("UTF-8");
                    os.write(input, 0, input.length);
                    os.flush();
                }
    
                // Get the response.
                int responseCode = connection.getResponseCode();
                if (responseCode == HttpURLConnection.HTTP_OK) {
                    // Read the response content.
                    StringBuilder response = new StringBuilder();
                    try (BufferedReader br = new BufferedReader(
                            new InputStreamReader(connection.getInputStream(), "UTF-8"))) {
                        String responseLine;
                        while ((responseLine = br.readLine()) != null) {
                            response.append(responseLine.trim());
                        }
                    }
    
                    // Parse the JSON response.
                    JsonObject jsonResponse = JsonParser.parseString(response.toString()).getAsJsonObject();
    
                    // Check for error messages.
                    if (jsonResponse.has("code") && "VoiceNotFound".equals(jsonResponse.get("code").getAsString())) {
                        String errorMessage = jsonResponse.has("message") ?
                                jsonResponse.get("message").getAsString() : "Voice not found";
                        System.out.println("Voice not found: " + voiceName);
                        System.out.println("Error message: " + errorMessage);
                        return;
                    }
    
                    // Get the voice information.
                    JsonObject outputObj = jsonResponse.getAsJsonObject("output");
    
                    System.out.println("Successfully queried voice information:");
                    System.out.println("  Voice name: " + outputObj.get("voice").getAsString());
                    System.out.println("  Creation time: " + outputObj.get("gmt_create").getAsString());
                    System.out.println("  Modification time: " + outputObj.get("gmt_modified").getAsString());
                    System.out.println("  Language: " + outputObj.get("language").getAsString());
                    System.out.println("  Preview text: " + outputObj.get("preview_text").getAsString());
                    System.out.println("  Model: " + outputObj.get("target_model").getAsString());
                    System.out.println("  Voice description: " + outputObj.get("voice_prompt").getAsString());
    
                } else {
                    // Read the error response.
                    StringBuilder errorResponse = new StringBuilder();
                    try (BufferedReader br = new BufferedReader(
                            new InputStreamReader(connection.getErrorStream(), "UTF-8"))) {
                        String responseLine;
                        while ((responseLine = br.readLine()) != null) {
                            errorResponse.append(responseLine.trim());
                        }
                    }
    
                    System.out.println("Request failed, status code: " + responseCode);
                    System.out.println("Error response: " + errorResponse.toString());
                }
    
            } catch (Exception e) {
                System.err.println("An error occurred during the request: " + e.getMessage());
                e.printStackTrace();
            } finally {
                if (connection != null) {
                    connection.disconnect();
                }
            }
        }
    }
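As noted in the key parameters above, synthesis fails when the synthesis model does not match the voice's target_model. You can guard against this by validating the query response before calling a synthesis API. A minimal Python sketch follows; the helper name is illustrative, and it operates on the "output" object returned by the query action.

```python
def check_target_model(voice_info: dict, synthesis_model: str) -> None:
    """Raise ValueError if synthesis_model does not match the voice's target_model.

    voice_info is the "output" object of a query response, for example:
    {"voice": "yourVoice", "target_model": "qwen3-tts-vd-realtime-2026-01-15", ...}
    """
    target = voice_info.get("target_model")
    if target != synthesis_model:
        raise ValueError(
            f"Voice {voice_info.get('voice')!r} is bound to {target!r}, "
            f"but the synthesis call uses {synthesis_model!r}; synthesis would fail."
        )
```

Call this with the "output" object from query_voice and the model name you intend to pass to the speech synthesis API; a ValueError here is cheaper than a failed synthesis request.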

Delete a voice

Deletes a specified voice and releases the corresponding quota.

  • URL

    Chinese mainland:

    POST https://dashscope.aliyuncs.com/api/v1/services/audio/tts/customization

    International:

    POST https://dashscope-intl.aliyuncs.com/api/v1/services/audio/tts/customization
  • Request headers

    Authorization (string, required): Authentication token, formatted as Bearer <your_api_key>. Replace <your_api_key> with your actual API key.

    Content-Type (string, required): Media type of the data transmitted in the request body. Fixed value: application/json.

  • Request body

    The request body includes all parameters. Optional fields can be omitted.

    Important

    model: Voice design model. Fixed to qwen-voice-design. Do not change this value.

    {
        "model": "qwen-voice-design",
        "input": {
            "action": "delete",
            "voice": "yourVoice"
        }
    }
  • Request parameters

    model (string, required): Voice design model. Fixed value: qwen-voice-design.

    action (string, required): Action type. Fixed value: delete.

    voice (string, required): The name of the voice to delete.

  • Response parameters

    Response example:

    {
        "output": {
            "voice": "yourVoice"
        },
        "usage": {},
        "request_id": "yourRequestId"
    }

    Key parameters:

    request_id (string): The request ID.

    voice (string): The name of the deleted voice.

  • Sample code

    Important

    model: Voice design model. Fixed to qwen-voice-design. Do not change this value.

    cURL

    If you have not set the API key as an environment variable, you must replace $DASHSCOPE_API_KEY in the example with your actual API key.

    # ======= Important =======
    # The following URL is for the Singapore region. If you use a model in the China (Beijing) region, replace the URL with: https://dashscope.aliyuncs.com/api/v1/services/audio/tts/customization
    # The API keys for the Singapore and China (Beijing) regions are different. To get an API key, see https://www.alibabacloud.com/help/zh/model-studio/get-api-key
    # === Delete this comment before you run the command ===
    
    curl -X POST https://dashscope-intl.aliyuncs.com/api/v1/services/audio/tts/customization \
    -H "Authorization: Bearer $DASHSCOPE_API_KEY" \
    -H "Content-Type: application/json" \
    -d '{
        "model": "qwen-voice-design",
        "input": {
            "action": "delete",
            "voice": "yourVoice"
        }
    }'

    Python

    import requests
    import os
    
    def delete_voice(voice_name):
        """
        Deletes a specified voice.
        :param voice_name: The name of the voice.
        :return: True if the voice is deleted, or if it does not exist (the target is already gone); False if the operation fails.
        """
        # The API keys for the Singapore and China (Beijing) regions are different. To get an API key, see https://www.alibabacloud.com/help/zh/model-studio/get-api-key
        # If you have not configured an environment variable, replace the following line with your Model Studio API key: api_key = "sk-xxx"
        api_key = os.getenv("DASHSCOPE_API_KEY")
        
        # Prepare the request data.
        headers = {
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json"
        }
        
        data = {
            "model": "qwen-voice-design",
            "input": {
                "action": "delete",
                "voice": voice_name
            }
        }
        
        # The following URL is for the Singapore region. If you use a model in the China (Beijing) region, replace the URL with: https://dashscope.aliyuncs.com/api/v1/services/audio/tts/customization
        url = "https://dashscope-intl.aliyuncs.com/api/v1/services/audio/tts/customization"
        # Send the request.
        response = requests.post(
            url,
            headers=headers,
            json=data
        )
        
        if response.status_code == 200:
            result = response.json()
            
            # Check for an error message.
            if "code" in result and "VoiceNotFound" in result["code"]:
                print(f"Voice does not exist: {voice_name}")
                print(f"Error message: {result.get('message', 'Voice not found')}")
                return True  # The operation is considered successful if the voice does not exist because the target is already gone.
            
            # Check if the deletion was successful.
            if "usage" in result:
                print(f"Voice deleted successfully: {voice_name}")
                print(f"Request ID: {result.get('request_id', 'N/A')}")
                return True
            else:
                print(f"The deletion operation returned an unexpected format: {result}")
                return False
        else:
            # The API returns HTTP 400 with the VoiceNotFound error code when the voice does not exist.
            try:
                result = response.json()
                if result.get("code") == "VoiceNotFound":
                    print(f"Voice does not exist: {voice_name}")
                    return True  # The target is already gone, so the deletion goal is met.
            except ValueError:
                pass
            print(f"Failed to delete the voice. Status code: {response.status_code}")
            print(f"Response content: {response.text}")
            return False
    
    def main():
        # Example: Delete a voice.
        voice_name = "myvoice"  # Replace with the actual name of the voice that you want to delete.
        
        print(f"Deleting voice: {voice_name}")
        success = delete_voice(voice_name)
        
        if success:
            print(f"\nDeletion of voice '{voice_name}' is complete!")
        else:
            print(f"\nFailed to delete voice '{voice_name}'!")
    
    if __name__ == "__main__":
        main()

    Java

    import com.google.gson.JsonObject;
    import com.google.gson.JsonParser;
    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.io.OutputStream;
    import java.net.HttpURLConnection;
    import java.net.URL;
    
    public class Main {
    
        public static void main(String[] args) {
            Main example = new Main();
            // Example: Delete a voice.
            String voiceName = "myvoice"; // Replace with the actual name of the voice that you want to delete.
            System.out.println("Deleting voice: " + voiceName);
            example.deleteVoice(voiceName);
        }
    
        public void deleteVoice(String voiceName) {
            // The API keys for the Singapore and China (Beijing) regions are different. To get an API key, see https://www.alibabacloud.com/help/zh/model-studio/get-api-key
            // If you have not configured an environment variable, replace the following line with your Model Studio API key: String apiKey = "sk-xxx"
            String apiKey = System.getenv("DASHSCOPE_API_KEY");
    
            // Create the JSON request body string.
            String jsonBody = "{\n" +
                    "    \"model\": \"qwen-voice-design\",\n" +
                    "    \"input\": {\n" +
                    "        \"action\": \"delete\",\n" +
                    "        \"voice\": \"" + voiceName + "\"\n" +
                    "    }\n" +
                    "}";
    
            HttpURLConnection connection = null;
            try {
                // The following URL is for the Singapore region. If you use a model in the China (Beijing) region, replace the URL with: https://dashscope.aliyuncs.com/api/v1/services/audio/tts/customization
                URL url = new URL("https://dashscope-intl.aliyuncs.com/api/v1/services/audio/tts/customization");
                connection = (HttpURLConnection) url.openConnection();
    
                // Set the request method and headers.
                connection.setRequestMethod("POST");
                connection.setRequestProperty("Authorization", "Bearer " + apiKey);
                connection.setRequestProperty("Content-Type", "application/json");
                connection.setDoOutput(true);
                connection.setDoInput(true);
    
                // Send the request body.
                try (OutputStream os = connection.getOutputStream()) {
                    byte[] input = jsonBody.getBytes("UTF-8");
                    os.write(input, 0, input.length);
                    os.flush();
                }
    
                // Get the response.
                int responseCode = connection.getResponseCode();
                if (responseCode == HttpURLConnection.HTTP_OK) {
                    // Read the response content.
                    StringBuilder response = new StringBuilder();
                    try (BufferedReader br = new BufferedReader(
                            new InputStreamReader(connection.getInputStream(), "UTF-8"))) {
                        String responseLine;
                        while ((responseLine = br.readLine()) != null) {
                            response.append(responseLine.trim());
                        }
                    }
    
                    // Parse the JSON response.
                    JsonObject jsonResponse = JsonParser.parseString(response.toString()).getAsJsonObject();
    
                    // Check for an error message.
                    if (jsonResponse.has("code") && jsonResponse.get("code").getAsString().contains("VoiceNotFound")) {
                        String errorMessage = jsonResponse.has("message") ?
                                jsonResponse.get("message").getAsString() : "Voice not found";
                        System.out.println("Voice does not exist: " + voiceName);
                        System.out.println("Error message: " + errorMessage);
                        // The operation is considered successful if the voice does not exist because the target is already gone.
                    } else if (jsonResponse.has("usage")) {
                        // Check if the deletion was successful.
                        System.out.println("Voice deleted successfully: " + voiceName);
                        String requestId = jsonResponse.has("request_id") ?
                                jsonResponse.get("request_id").getAsString() : "N/A";
                        System.out.println("Request ID: " + requestId);
                    } else {
                        System.out.println("The deletion operation returned an unexpected format: " + response.toString());
                    }
    
                } else {
                    // Read the error response.
                    StringBuilder errorResponse = new StringBuilder();
                    try (BufferedReader br = new BufferedReader(
                            new InputStreamReader(connection.getErrorStream(), "UTF-8"))) {
                        String responseLine;
                        while ((responseLine = br.readLine()) != null) {
                            errorResponse.append(responseLine.trim());
                        }
                    }
    
                    System.out.println("Failed to delete the voice. Status code: " + responseCode);
                    System.out.println("Error response: " + errorResponse.toString());
                }
    
            } catch (Exception e) {
                System.err.println("An error occurred during the request: " + e.getMessage());
                e.printStackTrace();
            } finally {
                if (connection != null) {
                    connection.disconnect();
                }
            }
        }
    }

Speech synthesis

To synthesize audio with a custom voice generated by voice design, see Getting started: From voice design to speech synthesis.

The speech synthesis model for voice design—such as qwen3-tts-vd-realtime-2026-01-15—is a dedicated model. It supports only voices generated by voice design. It does not support system voices such as Chelsie, Serena, Ethan, or Cherry.
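As a minimal illustration of this constraint, a client-side guard (a hypothetical helper, not part of the Model Studio API) can reject system voices before a synthesis request is sent to the dedicated model:

```python
# System voices are supported only by the general TTS models, not by the
# dedicated voice-design synthesis models described above.
SYSTEM_VOICES = {"Chelsie", "Serena", "Ethan", "Cherry"}

def is_valid_custom_voice(voice_name: str) -> bool:
    """Return True if the voice can be passed to a voice-design
    synthesis model such as qwen3-tts-vd-realtime-2026-01-15.

    System voices are rejected because the dedicated model accepts
    only voices generated by voice design.
    """
    return voice_name not in SYSTEM_VOICES

print(is_valid_custom_voice("myvoice"))  # custom voice -> True
print(is_valid_custom_voice("Cherry"))   # system voice -> False
```

This check only covers the known system voice names; the service itself remains the authority on whether a given voice is usable.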

Voice quota and automatic cleanup rules

  • Quota limit: 1,000 voices per account.

    You can check the current count via the total_count field in the List voices response.

  • Automatic cleanup: Voices unused for over one year are automatically deleted.
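The quota check can be sketched as follows. The exact shape of the List voices response is an assumption here (a total_count field nested under output, mirroring the response style of the other operations in this document); adjust the lookup to match the actual response.

```python
VOICE_QUOTA = 1000  # per-account limit stated above

def remaining_voice_quota(list_response: dict) -> int:
    """Given a parsed List voices response containing a total_count
    field (response shape assumed, see lead-in), return how many
    more voices the account can still create."""
    total = list_response.get("output", {}).get("total_count", 0)
    return max(VOICE_QUOTA - total, 0)

# Hypothetical response fragment for illustration:
sample = {"output": {"total_count": 997}}
print(remaining_voice_quota(sample))  # -> 3
```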

Billing

Voice design and speech synthesis are billed separately.

  • Voice design: Creating a voice is billed at USD 0.2 per voice. Creation failures are not billed.

    Note

    Free quota details (available only in the Singapore region):

    • 10 free voice creations within 90 days after activating Alibaba Cloud Model Studio.

    • Failed creations do not consume free quota.

    • Deleting a voice does not restore free quota.

    • After the free quota is used up or the 90-day validity period expires, voice creation is billed at USD 0.2 per voice.

  • Speech synthesis using custom voices: Billed per character. For pricing details, see Real-time speech synthesis - Qwen or Speech synthesis - Qwen.
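The voice-design side of the billing rules above can be expressed as a simple estimator. This is an illustrative sketch, not an official billing tool: the free quota and per-voice price are taken from this section, and failed creations are excluded because they are not billed.

```python
def estimate_design_cost(successful_creations: int,
                         free_quota_remaining: int = 10,
                         price_per_voice: float = 0.2) -> float:
    """Estimate voice-design cost in USD using the rules above:
    failures are not billed, and only successful creations beyond
    the remaining free quota are charged. Pass free_quota_remaining=0
    if the 90-day validity period has expired."""
    billable = max(successful_creations - free_quota_remaining, 0)
    return round(billable * price_per_voice, 2)

print(estimate_design_cost(15))      # 5 billable voices -> 1.0
print(estimate_design_cost(8))       # within free quota -> 0.0
print(estimate_design_cost(15, 0))   # quota expired -> 3.0
```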

Error messages

If you encounter errors, see Error messages for troubleshooting.