Alibaba Cloud Model Studio: Visual understanding (Qwen-VL)

Last Updated: Dec 15, 2025

The Qwen-VL model answers questions based on the images or videos that you provide. It supports single or multiple image inputs and is suitable for various tasks, such as image captioning, visual question answering, and object detection.

Try it online: Vision model (Singapore or China (Beijing))

Getting started

Prerequisites

You must have obtained an API key and set it as the DASHSCOPE_API_KEY environment variable. The following examples show how to call the model to describe image content. For more information about local files and image limits, see How to pass local files and Image limits.

OpenAI compatible

Python

from openai import OpenAI
import os

client = OpenAI(
    # The API keys for the Singapore and China (Beijing) regions are different. To obtain an API key, see https://www.alibabacloud.com/help/en/model-studio/get-api-key
    api_key=os.getenv("DASHSCOPE_API_KEY"),
    # The following base_url is for the Singapore region. If you use a model in the China (Beijing) region, replace the base_url with https://dashscope.aliyuncs.com/compatible-mode/v1
    base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1"
)

completion = client.chat.completions.create(
    model="qwen3-vl-plus",  # This example uses qwen3-vl-plus. You can replace the model name as needed. For a list of models, see https://www.alibabacloud.com/help/model-studio/getting-started/models
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241022/emyrja/dog_and_girl.jpeg"
                    },
                },
                {"type": "text", "text": "What is depicted in the image?"},
            ],
        },
    ],
)
print(completion.choices[0].message.content)

Response

This is a photo taken on a beach. In the photo, a person and a dog are sitting on the sand, with the sea and sky in the background. The person and dog appear to be interacting, with the dog's front paw resting on the person's hand. Sunlight is coming from the right side of the frame, adding a warm atmosphere to the scene.

Node.js

import OpenAI from "openai";

const openai = new OpenAI({
  // The API keys for the Singapore and China (Beijing) regions are different. To obtain an API key, see https://www.alibabacloud.com/help/en/model-studio/get-api-key
  // If you have not configured the environment variable, replace the following line with your Model Studio API key: apiKey: "sk-xxx"
  apiKey: process.env.DASHSCOPE_API_KEY,
  // The following base_url is for the Singapore region. If you use a model in the China (Beijing) region, replace the base_url with https://dashscope.aliyuncs.com/compatible-mode/v1
  baseURL: "https://dashscope-intl.aliyuncs.com/compatible-mode/v1"
});

async function main() {
  const response = await openai.chat.completions.create({
    model: "qwen3-vl-plus",   // This example uses qwen3-vl-plus. You can replace the model name as needed. For a list of models, see https://www.alibabacloud.com/help/model-studio/getting-started/models 
    messages: [
      {
        role: "user",
        content: [{
            type: "image_url",
            image_url: {
              "url": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241022/emyrja/dog_and_girl.jpeg"
            }
          },
          {
            type: "text",
            text: "What is depicted in the image?"
          }
        ]
      }
    ]
  });
  console.log(response.choices[0].message.content);
}
main()

Response

This is a photo taken on a beach. In the photo, a person and a dog are sitting on the sand, with the sea and sky in the background. The person and dog appear to be interacting, with the dog's front paw resting on the person's hand. Sunlight is coming from the right side of the frame, adding a warm atmosphere to the scene.

curl

# ======= Important =======
# The following base_url is for the Singapore region. If you use a model in the China (Beijing) region, replace the base_url with https://dashscope.aliyuncs.com/compatible-mode/v1/chat/completions
# The API keys for the Singapore and China (Beijing) regions are different. To obtain an API key, see https://www.alibabacloud.com/help/en/model-studio/get-api-key
# === Delete this comment before execution ===

curl --location 'https://dashscope-intl.aliyuncs.com/compatible-mode/v1/chat/completions' \
--header "Authorization: Bearer $DASHSCOPE_API_KEY" \
--header 'Content-Type: application/json' \
--data '{
  "model": "qwen3-vl-plus",
  "messages": [
    {"role": "user",
     "content": [
        {"type": "image_url", "image_url": {"url": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241022/emyrja/dog_and_girl.jpeg"}},
        {"type": "text", "text": "What is depicted in the image?"}
    ]
  }]
}'

Response

{
  "choices": [
    {
      "message": {
        "content": "This is a photo taken on a beach. In the photo, a person and a dog are sitting on the sand, with the sea and sky in the background. The person and dog appear to be interacting, with the dog's front paw resting on the person's hand. Sunlight is coming from the right side of the frame, adding a warm atmosphere to the scene.",
        "role": "assistant"
      },
      "finish_reason": "stop",
      "index": 0,
      "logprobs": null
    }
  ],
  "object": "chat.completion",
  "usage": {
    "prompt_tokens": 1270,
    "completion_tokens": 54,
    "total_tokens": 1324
  },
  "created": 1725948561,
  "system_fingerprint": null,
  "model": "qwen3-vl-plus",
  "id": "chatcmpl-0fd66f46-b09e-9164-a84f-3ebbbedbac15"
}

DashScope

Python

import os
import dashscope

# The following base_url is for the Singapore region. If you use a model in the China (Beijing) region, replace the base_url with https://dashscope.aliyuncs.com/api/v1
dashscope.base_http_api_url = 'https://dashscope-intl.aliyuncs.com/api/v1'

messages = [
{
    "role": "user",
    "content": [
    {"image": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241022/emyrja/dog_and_girl.jpeg"},
    {"text": "What is depicted in the image?"}]
}]

response = dashscope.MultiModalConversation.call(
    # The API keys for the Singapore and China (Beijing) regions are different. To obtain an API key, see https://www.alibabacloud.com/help/en/model-studio/get-api-key
    # If you have not configured the environment variable, replace the following line with your Model Studio API key: api_key="sk-xxx"
    api_key=os.getenv('DASHSCOPE_API_KEY'),
    model='qwen3-vl-plus',   # This example uses qwen3-vl-plus. You can replace the model name as needed. For a list of models, see https://www.alibabacloud.com/help/model-studio/getting-started/models
    messages=messages
)

print(response.output.choices[0].message.content[0]["text"])

Response

This is a photo taken on a beach. In the photo, there is a woman and a dog. The woman is sitting on the sand, smiling and interacting with the dog. The dog is wearing a collar and appears to be shaking hands with the woman. The background is the sea and the sky, and the sunlight shining on them creates a warm atmosphere.

Java

import java.util.Arrays;
import java.util.Collections;

import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversation;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationParam;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationResult;
import com.alibaba.dashscope.common.MultiModalMessage;
import com.alibaba.dashscope.common.Role;
import com.alibaba.dashscope.exception.ApiException;
import com.alibaba.dashscope.exception.NoApiKeyException;
import com.alibaba.dashscope.exception.UploadFileException;
import com.alibaba.dashscope.utils.Constants;

public class Main {
    
    // The following base_url is for the Singapore region. If you use a model in the China (Beijing) region, replace the base_url with https://dashscope.aliyuncs.com/api/v1
    static {
        Constants.baseHttpApiUrl="https://dashscope-intl.aliyuncs.com/api/v1";
    }
    
    public static void simpleMultiModalConversationCall()
            throws ApiException, NoApiKeyException, UploadFileException {
        MultiModalConversation conv = new MultiModalConversation(); 
        MultiModalMessage userMessage = MultiModalMessage.builder().role(Role.USER.getValue())
                .content(Arrays.asList(
                        Collections.singletonMap("image", "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241022/emyrja/dog_and_girl.jpeg"),
                        Collections.singletonMap("text", "What is depicted in the image?"))).build();
        MultiModalConversationParam param = MultiModalConversationParam.builder()
                // The API keys for the Singapore and China (Beijing) regions are different. To obtain an API key, see https://www.alibabacloud.com/help/en/model-studio/get-api-key
                // If you have not configured the environment variable, replace the following line with your Model Studio API key: .apiKey("sk-xxx")
                .apiKey(System.getenv("DASHSCOPE_API_KEY"))
                .model("qwen3-vl-plus")  //  This example uses qwen3-vl-plus. You can replace the model name as needed. For a list of models, see https://www.alibabacloud.com/help/model-studio/getting-started/models
                .messages(Arrays.asList(userMessage))
                .build();
        MultiModalConversationResult result = conv.call(param);
        System.out.println(result.getOutput().getChoices().get(0).getMessage().getContent().get(0).get("text"));
    }
    public static void main(String[] args) {
        try {
            simpleMultiModalConversationCall();
        } catch (ApiException | NoApiKeyException | UploadFileException e) {
            System.out.println(e.getMessage());
        }
        System.exit(0);
    }
}

Response

This is a photo taken on a beach. In the photo, there is a person in a plaid shirt and a dog with a collar. The person and the dog are sitting face to face, seemingly interacting. The background is the sea and the sky, and the sunlight shining on them creates a warm atmosphere.

curl

# ======= Important =======
# The following base_url is for the Singapore region. If you use a model in the China (Beijing) region, replace the base_url with https://dashscope.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation
# The API keys for the Singapore and China (Beijing) regions are different. To obtain an API key, see https://www.alibabacloud.com/help/en/model-studio/get-api-key
# === Delete this comment before execution ===

curl -X POST https://dashscope-intl.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation \
-H "Authorization: Bearer $DASHSCOPE_API_KEY" \
-H 'Content-Type: application/json' \
-d '{
    "model": "qwen3-vl-plus",
    "input":{
        "messages":[
            {
                "role": "user",
                "content": [
                    {"image": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241022/emyrja/dog_and_girl.jpeg"},
                    {"text": "What is depicted in the image?"}
                ]
            }
        ]
    }
}'

Response

{
  "output": {
    "choices": [
      {
        "finish_reason": "stop",
        "message": {
          "role": "assistant",
          "content": [
            {
              "text": "This is a photo taken on a beach. In the photo, there is a person in a plaid shirt and a dog with a collar. They are sitting on the sand, with the sea and sky in the background. Sunlight is coming from the right side of the frame, adding a warm atmosphere to the scene."
            }
          ]
        }
      }
    ]
  },
  "usage": {
    "output_tokens": 55,
    "input_tokens": 1271,
    "image_tokens": 1247
  },
  "request_id": "ccf845a3-dc33-9cda-b581-20fe7dc23f70"
}

Model selection

  • For tasks such as high-precision object recognition and localization (including 3D localization), agent tool calling, document and webpage parsing, complex problem-solving, and long video understanding, Qwen3-VL is the preferred choice. A comparison of the models in this series is as follows:

    • qwen3-vl-plus: The most powerful model.

    • qwen3-vl-flash: Faster and more cost-effective. It balances performance and cost and is suitable for scenarios that are sensitive to response speed.

  • For simple tasks such as image captioning and short video summary extraction, you can choose Qwen2.5-VL. A comparison of the models in this series is as follows:

    • qwen-vl-max: The best-performing model in the Qwen2.5-VL series.

    • qwen-vl-plus: This model is faster and provides a good balance between performance and cost.

For more information about model names, context, pricing, and snapshot versions, see Model list. For more information about concurrent request limits, see Throttling.

Model feature comparison

Qwen3-VL series

  • Deep thinking: Supported

  • Tool calling: Supported

  • Context cache: Supported by the stable versions of qwen3-vl-plus and qwen3-vl-flash

  • Structured output: Supported in non-thinking mode

  • Supported languages (33): Chinese, Japanese, Korean, Indonesian, Vietnamese, Thai, English, French, German, Russian, Portuguese, Spanish, Italian, Swedish, Danish, Czech, Norwegian, Dutch, Finnish, Turkish, Polish, Swahili, Romanian, Serbian, Greek, Kazakh, Uzbek, Cebuano, Arabic, Urdu, Persian, Hindi/Devanagari, and Hebrew.

Qwen2.5-VL series

  • Deep thinking: Not supported

  • Tool calling: Not supported

  • Context cache: Supported by the stable versions of qwen-vl-max and qwen-vl-plus

  • Structured output: Supported by the stable and latest versions of qwen-vl-max and qwen-vl-plus

  • Supported languages (11): Chinese, English, Japanese, Korean, Arabic, Vietnamese, French, German, Italian, Spanish, and Russian.

Performance overview

Visual question answering

You can describe the content of an image or classify and label it, such as identifying people, places, animals, and plants.


If the sun is glaring, what item from this picture should I use?

When the sun is glaring, you should use the pink sunglasses from the picture. Sunglasses can effectively block strong light, reduce UV damage to your eyes, and help protect your vision and improve visual comfort in bright sunlight.

Creative writing

You can generate vivid text descriptions based on image or video content. This is suitable for creative scenarios such as story writing, copywriting, and short video scripts.


Please help me write an interesting social media post based on the content of the picture.

Sure, this picture is full of a rich Mid-Autumn Festival atmosphere, with a clever combination of classical and modern elements. Based on the main elements in the picture, I have prepared several social media posts in different styles for you to choose from.

Poetic and beautiful style

Tonight, the moon is bright and everyone is watching, wondering where autumn thoughts will fall. Chang'e flies to the moon, and the Jade Rabbit pounds medicine—the romance of the ancients is lit up tonight. May this bright moon light your way home and carry my deepest thoughts. Happy Mid-Autumn Festival!

Warm blessing style

The moon is full, and people are reunited; the Mid-Autumn night is the gentlest. Watch the fireworks bloom, admire the full moon in the sky, take a bite of a mooncake, and wish for peace and health. May everything you and I wish for in our hearts come true. Wishing everyone a happy Mid-Autumn Festival and a happy family!

OCR and information extraction

You can recognize text and formulas in images or extract information from receipts, certificates, and forms, with support for formatted text output. The Qwen3-VL model has expanded its language support to 33 languages. For a list of supported languages, see Model feature comparison.


Extract the following from the image: ['Invoice Code', 'Invoice Number', 'Destination', 'Fuel Surcharge', 'Fare', 'Travel Date', 'Departure Time', 'Train Number', 'Seat Number']. Please output in JSON format.

{
  "Invoice Code": "221021325353",
  "Invoice Number": "10283819",
  "Destination": "Development Zone",
  "Fuel Surcharge": "2.0",
  "Fare": "8.00<Full>",
  "Travel Date": "2013-06-29",
  "Departure Time": "Serial",
  "Train Number": "040",
  "Seat Number": "371"
}
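
Such an extraction prompt is typically paired with JSON parsing on the client side. The following is a minimal sketch (not an official example) that sends the prompt through the OpenAI-compatible endpoint and parses the reply. The receipt image URL is a placeholder, and the fence-stripping step is a precaution in case the model wraps the JSON in a Markdown code block.

import json
import os
import re

from openai import OpenAI

client = OpenAI(
    api_key=os.getenv("DASHSCOPE_API_KEY"),
    base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
)

prompt = ("Extract the following from the image: ['Invoice Code', 'Invoice Number', 'Fare']. "
          "Please output in JSON format.")

completion = client.chat.completions.create(
    model="qwen3-vl-plus",
    messages=[{
        "role": "user",
        "content": [
            # Placeholder receipt image; replace with your own file or URL.
            {"type": "image_url", "image_url": {"url": "https://example.com/receipt.jpg"}},
            {"type": "text", "text": prompt},
        ],
    }],
)

reply = completion.choices[0].message.content
# Strip ```json ... ``` fences if the model adds them, then parse the JSON.
reply = re.sub(r"^```(?:json)?\s*|\s*```$", "", reply.strip())
data = json.loads(reply)
print(data.get("Fare"))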

Multi-disciplinary problem solving

You can solve problems in images, such as math, physics, and chemistry problems. This feature is suitable for primary, secondary, university, and adult education.


Please solve the math problem in the image step by step.


Visual coding

You can generate code from images or videos. This can be used to create HTML, CSS, and JS code from design drafts, website screenshots, and more.


Design a webpage using HTML and CSS based on my sketch, with black as the main color.


Webpage preview

Object detection

The model supports 2D and 3D localization, which can be used to determine object orientation, perspective changes, and occlusion relationships. 3D localization is a new feature of the Qwen3-VL model.

For the Qwen2.5-VL model, object detection is robust within a resolution range of 480 × 480 to 2560 × 2560. Outside this range, detection accuracy may decrease, with occasional bounding box drift.
For more information about how to draw the localization results on the original image, see FAQ.

2D localization


  • Return Box (bounding box) coordinates: Detect all food items in the image and output their bbox coordinates in JSON format.

  • Return Point (centroid) coordinates: Locate all food items in the image as points and output their point coordinates in XML format.

Visualization of 2D localization results

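To draw the returned coordinates on the original image (see FAQ for details), you can parse the model's JSON reply and render it with a library such as Pillow. The following is a minimal sketch; the image URL and the model reply are placeholders, and it assumes each detected object carries a bbox_2d field with absolute pixel coordinates [x1, y1, x2, y2] and a label field, which you should verify against your actual output.

import json
from io import BytesIO

import requests
from PIL import Image, ImageDraw

# Placeholder values; replace with your image URL and the model's JSON reply.
image_url = "https://example.com/food.jpg"
model_reply = '[{"bbox_2d": [120, 80, 360, 300], "label": "apple"}]'

# Download the original image and prepare a drawing context.
image = Image.open(BytesIO(requests.get(image_url).content))
draw = ImageDraw.Draw(image)

# Draw each bounding box and its label.
for obj in json.loads(model_reply):
    x1, y1, x2, y2 = obj["bbox_2d"]
    draw.rectangle([x1, y1, x2, y2], outline="red", width=3)
    draw.text((x1, max(y1 - 12, 0)), obj["label"], fill="red")

image.save("detection_result.jpg")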

3D localization


Detect the car in the image and predict its 3D position. Output JSON: [{"bbox_3d": [x_center, y_center, z_center, x_size, y_size, z_size, roll, pitch, yaw], "label": "category"}].

Visualization of 3D localization results


Document parsing

You can parse image-based documents, such as scans or image PDFs, into QwenVL HTML or QwenVL Markdown format. The model not only accurately recognizes the text but also captures the position information of elements such as images and tables. The Qwen3-VL model adds the ability to parse documents into Markdown format.

Recommended prompts: qwenvl html (to parse into HTML format) or qwenvl markdown (to parse into Markdown format).


qwenvl markdown.


Visualization of results
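
The parsing prompt is passed like any other text prompt. The following is a minimal sketch with the DashScope SDK, using a placeholder document image URL; change the prompt to qwenvl html if you want HTML output instead.

import os
import dashscope

# Singapore endpoint; for the China (Beijing) region, use https://dashscope.aliyuncs.com/api/v1
dashscope.base_http_api_url = 'https://dashscope-intl.aliyuncs.com/api/v1'

messages = [{
    "role": "user",
    "content": [
        {"image": "https://example.com/scanned_page.png"},  # placeholder document image
        {"text": "qwenvl markdown"},  # ask the model to parse the page into Markdown
    ],
}]

response = dashscope.MultiModalConversation.call(
    api_key=os.getenv('DASHSCOPE_API_KEY'),
    model='qwen3-vl-plus',
    messages=messages,
)
print(response.output.choices[0].message.content[0]["text"])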

Video understanding

You can analyze video content, such as locating specific events and obtaining timestamps, or generating summaries of key time periods.

Please describe the series of actions of the person in the video. Output in JSON format with start_time, end_time, and event. Please use HH:mm:ss for the timestamp.

{
  "events": [
    {
      "start_time": "00:00:00",
      "end_time": "00:00:05",
      "event": "The person walks towards the table holding a cardboard box and places it on the table."
    },
    {
      "start_time": "00:00:05",
      "end_time": "00:00:15",
      "event": "The person picks up a scanner and scans the label on the cardboard box."
    },
    {
      "start_time": "00:00:15",
      "end_time": "00:00:21",
      "event": "The person puts the scanner back in its place and then picks up a pen to write information in a notebook."
    }
  ]
}

Core features

Enable or disable thinking mode

  • The qwen3-vl-plus and qwen3-vl-flash series models are hybrid thinking models. They can respond after thinking or respond directly. You can use the enable_thinking parameter to control whether to enable thinking mode:

    • true: Enables thinking mode.

    • false (default): Disables thinking mode.

  • Models with a `thinking` suffix, such as qwen3-vl-235b-a22b-thinking, are thinking-only models. They always think before responding, and you cannot disable this feature.

Important
  • Model configuration: In general conversation scenarios that do not involve agent tool calls, we recommend that you do not set a System Message to maintain optimal performance. You can pass instructions such as model role settings and output format requirements through the User Message.

  • Prioritize streaming output: When thinking mode is enabled, both streaming and non-streaming output are supported. To avoid timeouts due to long response content, we recommend that you prioritize using streaming output.

  • Limit thinking length: Deep thinking models sometimes output lengthy reasoning processes. You can use the thinking_budget parameter to limit the length of the thinking process. If the number of tokens generated during the model's thinking process exceeds the thinking_budget, the reasoning content is truncated, and the model immediately starts to generate the final response. The default value of thinking_budget is the model's maximum chain-of-thought length. For more information, see Model list.

OpenAI compatible

The enable_thinking parameter is not a standard OpenAI parameter. If you use the OpenAI Python SDK, you can pass it through extra_body.

Python

import os
from openai import OpenAI

client = OpenAI(
    # The API keys for the Singapore and China (Beijing) regions are different. To obtain an API key, see https://www.alibabacloud.com/help/en/model-studio/get-api-key
    api_key=os.getenv("DASHSCOPE_API_KEY"),
    # The following base_url is for the Singapore region. If you use a model in the China (Beijing) region, replace the base_url with https://dashscope.aliyuncs.com/compatible-mode/v1
    base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1"
)

reasoning_content = ""  # Define the complete thinking process
answer_content = ""     # Define the complete response
is_answering = False   # Determine whether to end the thinking process and start responding
enable_thinking = True
# Create a chat completion request
completion = client.chat.completions.create(
    model="qwen3-vl-plus",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://img.alicdn.com/imgextra/i1/O1CN01gDEY8M1W114Hi3XcN_!!6000000002727-0-tps-1024-406.jpg"
                    },
                },
                {"type": "text", "text": "How do I solve this problem?"},
            ],
        },
    ],
    stream=True,
    # The enable_thinking parameter enables the thinking process, and the thinking_budget parameter sets the maximum number of tokens for the reasoning process.
    # For qwen3-vl-plus and qwen3-vl-flash, thinking can be enabled or disabled with enable_thinking. For models with the 'thinking' suffix, such as qwen3-vl-235b-a22b-thinking, enable_thinking can only be set to true. This does not apply to other Qwen-VL models.
    extra_body={
        'enable_thinking': enable_thinking,
        "thinking_budget": 81920},

    # Uncomment the following lines to return token usage in the last chunk
    # stream_options={
    #     "include_usage": True
    # }
)

if enable_thinking:
    print("\n" + "=" * 20 + "Thinking Process" + "=" * 20 + "\n")

for chunk in completion:
    # If chunk.choices is empty, print usage
    if not chunk.choices:
        print("\nUsage:")
        print(chunk.usage)
    else:
        delta = chunk.choices[0].delta
        # Print the thinking process
        if hasattr(delta, 'reasoning_content') and delta.reasoning_content is not None:
            print(delta.reasoning_content, end='', flush=True)
            reasoning_content += delta.reasoning_content
        else:
            # Start responding
            if delta.content != "" and is_answering is False:
                print("\n" + "=" * 20 + "Complete Response" + "=" * 20 + "\n")
                is_answering = True
            # Print the response process
            print(delta.content, end='', flush=True)
            answer_content += delta.content

# print("=" * 20 + "Complete Thinking Process" + "=" * 20 + "\n")
# print(reasoning_content)
# print("=" * 20 + "Complete Response" + "=" * 20 + "\n")
# print(answer_content)

Node.js

import OpenAI from "openai";

// Initialize the OpenAI client
const openai = new OpenAI({
  // The API keys for the Singapore and China (Beijing) regions are different. To obtain an API key, see https://www.alibabacloud.com/help/en/model-studio/get-api-key
  // If you have not configured the environment variable, replace the following line with your Model Studio API key: apiKey: "sk-xxx"
  apiKey: process.env.DASHSCOPE_API_KEY,
  // The following base_url is for the Singapore region. If you use a model in the China (Beijing) region, replace the base_url with https://dashscope.aliyuncs.com/compatible-mode/v1
  baseURL: "https://dashscope-intl.aliyuncs.com/compatible-mode/v1"
});

let reasoningContent = '';
let answerContent = '';
let isAnswering = false;
let enableThinking = true;

let messages = [
    {
        role: "user",
        content: [
        { type: "image_url", image_url: { "url": "https://img.alicdn.com/imgextra/i1/O1CN01gDEY8M1W114Hi3XcN_!!6000000002727-0-tps-1024-406.jpg" } },
        { type: "text", text: "Solve this problem" },
    ]
}]

async function main() {
    try {
        const stream = await openai.chat.completions.create({
            model: 'qwen3-vl-plus',
            messages: messages,
            stream: true,
          // Note: In the Node.js SDK, non-standard parameters such as enable_thinking are passed as top-level properties and do not need to be wrapped in extra_body.
          enable_thinking: enableThinking,
          thinking_budget: 81920

        });

        if (enableThinking){console.log('\n' + '='.repeat(20) + 'Thinking Process' + '='.repeat(20) + '\n');}

        for await (const chunk of stream) {
            if (!chunk.choices?.length) {
                console.log('\nUsage:');
                console.log(chunk.usage);
                continue;
            }

            const delta = chunk.choices[0].delta;

            // Process the thinking process
            if (delta.reasoning_content) {
                process.stdout.write(delta.reasoning_content);
                reasoningContent += delta.reasoning_content;
            }
            // Process the formal response
            else if (delta.content) {
                if (!isAnswering) {
                    console.log('\n' + '='.repeat(20) + 'Complete Response' + '='.repeat(20) + '\n');
                    isAnswering = true;
                }
                process.stdout.write(delta.content);
                answerContent += delta.content;
            }
        }
    } catch (error) {
        console.error('Error:', error);
    }
}

main();

curl

# ======= Important =======
# The following base_url is for the Singapore region. If you use a model in the China (Beijing) region, replace the base_url with https://dashscope.aliyuncs.com/compatible-mode/v1/chat/completions
# The API keys for the Singapore and China (Beijing) regions are different. To obtain an API key, see https://www.alibabacloud.com/help/en/model-studio/get-api-key
# === Delete this comment before execution ===

curl --location 'https://dashscope-intl.aliyuncs.com/compatible-mode/v1/chat/completions' \
--header "Authorization: Bearer $DASHSCOPE_API_KEY" \
--header 'Content-Type: application/json' \
--data '{
    "model": "qwen3-vl-plus",
    "messages": [
    {
      "role": "user",
      "content": [
        {
          "type": "image_url",
          "image_url": {
            "url": "https://img.alicdn.com/imgextra/i1/O1CN01gDEY8M1W114Hi3XcN_!!6000000002727-0-tps-1024-406.jpg"
          }
        },
        {
          "type": "text",
          "text": "Please solve this problem"
        }
      ]
    }
  ],
    "stream":true,
    "stream_options":{"include_usage":true},
    "enable_thinking": true,
    "thinking_budget": 81920
}'

DashScope

Python

import os
import dashscope
from dashscope import MultiModalConversation

# The following base_url is for the Singapore region. If you use a model in the China (Beijing) region, replace the base_url with https://dashscope.aliyuncs.com/api/v1
dashscope.base_http_api_url = "https://dashscope-intl.aliyuncs.com/api/v1"
enable_thinking = True
messages = [
    {
        "role": "user",
        "content": [
            {"image": "https://img.alicdn.com/imgextra/i1/O1CN01gDEY8M1W114Hi3XcN_!!6000000002727-0-tps-1024-406.jpg"},
            {"text": "Solve this problem?"}
        ]
    }
]

response = MultiModalConversation.call(
    # If you have not configured the environment variable, replace the following line with your Model Studio API key: api_key="sk-xxx",
    # The API keys for the Singapore and China (Beijing) regions are different. To obtain an API key, see https://www.alibabacloud.com/help/en/model-studio/get-api-key
    api_key=os.getenv('DASHSCOPE_API_KEY'),
    model="qwen3-vl-plus",  
    messages=messages,
    stream=True,
    # The enable_thinking parameter enables the thinking process.
    # For qwen3-vl-plus and qwen3-vl-flash, thinking can be enabled or disabled with enable_thinking. For models with the 'thinking' suffix, such as qwen3-vl-235b-a22b-thinking, enable_thinking can only be set to true. This does not apply to other Qwen-VL models.
    enable_thinking=enable_thinking,
    # The thinking_budget parameter sets the maximum number of tokens for the reasoning process.
    thinking_budget=81920,

)

# Define the complete thinking process
reasoning_content = ""
# Define the complete response
answer_content = ""
# Determine whether to end the thinking process and start responding
is_answering = False

if enable_thinking:
    print("=" * 20 + "Thinking Process" + "=" * 20)

for chunk in response:
    message = chunk.output.choices[0].message
    reasoning_content_chunk = message.get("reasoning_content", None)
    # If both the thinking process and the response are empty, skip this chunk
    if not message.content and not reasoning_content_chunk:
        continue
    # If the model is still thinking
    if reasoning_content_chunk and not message.content:
        print(reasoning_content_chunk, end="")
        reasoning_content += reasoning_content_chunk
    # If the model is responding
    elif message.content:
        if not is_answering:
            print("\n" + "=" * 20 + "Complete Response" + "=" * 20)
            is_answering = True
        print(message.content[0]["text"], end="")
        answer_content += message.content[0]["text"]

# To print the complete thinking process and complete response, uncomment and run the following code
# print("=" * 20 + "Complete Thinking Process" + "=" * 20 + "\n")
# print(f"{reasoning_content}")
# print("=" * 20 + "Complete Response" + "=" * 20 + "\n")
# print(f"{answer_content}")

Java

// DashScope SDK version >= 2.21.10
import java.util.*;

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import com.alibaba.dashscope.common.Role;
import com.alibaba.dashscope.exception.ApiException;
import com.alibaba.dashscope.exception.NoApiKeyException;
import io.reactivex.Flowable;

import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversation;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationParam;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationResult;
import com.alibaba.dashscope.common.MultiModalMessage;
import com.alibaba.dashscope.exception.UploadFileException;
import com.alibaba.dashscope.exception.InputRequiredException;
import java.lang.System;
import com.alibaba.dashscope.utils.Constants;

public class Main {

    static {Constants.baseHttpApiUrl="https://dashscope-intl.aliyuncs.com/api/v1";}

    private static final Logger logger = LoggerFactory.getLogger(Main.class);
    private static StringBuilder reasoningContent = new StringBuilder();
    private static StringBuilder finalContent = new StringBuilder();
    private static boolean isFirstPrint = true;

    private static void handleGenerationResult(MultiModalConversationResult message) {
        String re = message.getOutput().getChoices().get(0).getMessage().getReasoningContent();
        String reasoning = Objects.isNull(re)?"":re; // Default value

        List<Map<String, Object>> content = message.getOutput().getChoices().get(0).getMessage().getContent();
        if (!reasoning.isEmpty()) {
            reasoningContent.append(reasoning);
            if (isFirstPrint) {
                System.out.println("====================Thinking Process====================");
                isFirstPrint = false;
            }
            System.out.print(reasoning);
        }

        if (Objects.nonNull(content) && !content.isEmpty()) {
            Object text = content.get(0).get("text");
            finalContent.append(content.get(0).get("text"));
            if (!isFirstPrint) {
                System.out.println("\n====================Complete Response====================");
                isFirstPrint = true;
            }
            System.out.print(text);
        }
    }
    public static MultiModalConversationParam buildMultiModalConversationParam(MultiModalMessage Msg)  {
        return MultiModalConversationParam.builder()
                // If you have not configured the environment variable, replace the following line with your Model Studio API key: .apiKey("sk-xxx")
                // The API keys for the Singapore and China (Beijing) regions are different. To obtain an API key, see https://www.alibabacloud.com/help/en/model-studio/get-api-key
                .apiKey(System.getenv("DASHSCOPE_API_KEY"))
                .model("qwen3-vl-plus")
                .messages(Arrays.asList(Msg))
                .enableThinking(true)
                .thinkingBudget(81920)
                .incrementalOutput(true)
                .build();
    }

    public static void streamCallWithMessage(MultiModalConversation conv, MultiModalMessage Msg)
            throws NoApiKeyException, ApiException, InputRequiredException, UploadFileException {
        MultiModalConversationParam param = buildMultiModalConversationParam(Msg);
        Flowable<MultiModalConversationResult> result = conv.streamCall(param);
        result.blockingForEach(message -> {
            handleGenerationResult(message);
        });
    }
    public static void main(String[] args) {
        try {
            MultiModalConversation conv = new MultiModalConversation();
            MultiModalMessage userMsg = MultiModalMessage.builder()
                    .role(Role.USER.getValue())
                    .content(Arrays.asList(Collections.singletonMap("image", "https://img.alicdn.com/imgextra/i1/O1CN01gDEY8M1W114Hi3XcN_!!6000000002727-0-tps-1024-406.jpg"),
                            Collections.singletonMap("text", "Please solve this problem")))
                    .build();
            streamCallWithMessage(conv, userMsg);
//             Print the final result
//            if (reasoningContent.length() > 0) {
//                System.out.println("\n====================Complete Response====================");
//                System.out.println(finalContent.toString());
//            }
        } catch (ApiException | NoApiKeyException | UploadFileException | InputRequiredException e) {
            logger.error("An exception occurred: {}", e.getMessage());
        }
        System.exit(0);
    }
}

curl

# ======= Important =======
# The API keys for the Singapore and China (Beijing) regions are different. To obtain an API key, see https://www.alibabacloud.com/help/en/model-studio/get-api-key
# The following base_url is for the Singapore region. If you use a model in the China (Beijing) region, replace the base_url with https://dashscope.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation
# === Delete this comment before execution ===

curl -X POST https://dashscope-intl.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation \
-H "Authorization: Bearer $DASHSCOPE_API_KEY" \
-H 'Content-Type: application/json' \
-H 'X-DashScope-SSE: enable' \
-d '{
    "model": "qwen3-vl-plus",
    "input":{
        "messages":[
            {
                "role": "user",
                "content": [
                    {"image": "https://img.alicdn.com/imgextra/i1/O1CN01gDEY8M1W114Hi3XcN_!!6000000002727-0-tps-1024-406.jpg"},
                    {"text": "Please solve this problem"}
                ]
            }
        ]
    },
    "parameters":{
        "enable_thinking": true,
        "incremental_output": true,
        "thinking_budget": 81920
    }
}'

Multiple image input

The Qwen-VL model supports passing multiple images in a single request, which can be used for tasks such as product comparison and multi-page document processing. To do this, you can include multiple image objects in the content array of the user message.

Important

The number of images is limited by the model's total token limit for text and images. The total number of tokens for all images and text must be less than the model's maximum input.

OpenAI compatible

Python

import os
from openai import OpenAI

client = OpenAI(
    # The API keys for the Singapore and China (Beijing) regions are different. To obtain an API key, see https://www.alibabacloud.com/help/en/model-studio/get-api-key
    api_key=os.getenv("DASHSCOPE_API_KEY"),
    # The following base_url is for the Singapore region. If you use a model in the China (Beijing) region, replace the base_url with https://dashscope.aliyuncs.com/compatible-mode/v1
    base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
)

completion = client.chat.completions.create(
    model="qwen3-vl-plus",  #  This example uses qwen3-vl-plus. You can replace the model name as needed. For a list of models, see https://www.alibabacloud.com/help/model-studio/getting-started/models
    messages=[
        {"role": "user","content": [
            {"type": "image_url","image_url": {"url": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241022/emyrja/dog_and_girl.jpeg"},},
            {"type": "image_url","image_url": {"url": "https://dashscope.oss-cn-beijing.aliyuncs.com/images/tiger.png"},},
            {"type": "text", "text": "What do these images depict?"},
            ],
        }
    ],
)

print(completion.choices[0].message.content)

Response

Image 1 shows a scene of a woman and a Labrador retriever interacting on a beach. The woman is wearing a plaid shirt and sitting on the sand, shaking hands with the dog. The background is the ocean waves and the sky, and the whole picture is filled with a warm and pleasant atmosphere.

Image 2 shows a scene of a tiger walking in a forest. The tiger's coat is orange with black stripes, and it is stepping forward. The surroundings are dense trees and vegetation, and the ground is covered with fallen leaves. The whole picture gives a feeling of wild nature.

Node.js

import OpenAI from "openai";

const openai = new OpenAI(
    {
        // The API keys for the Singapore and China (Beijing) regions are different. To obtain an API key, see https://www.alibabacloud.com/help/en/model-studio/get-api-key
        // If you have not configured the environment variable, replace the following line with your Model Studio API key: apiKey: "sk-xxx"
        apiKey: process.env.DASHSCOPE_API_KEY,
        // The following base_url is for the Singapore region. If you use a model in the China (Beijing) region, replace the base_url with https://dashscope.aliyuncs.com/compatible-mode/v1
        baseURL: "https://dashscope-intl.aliyuncs.com/compatible-mode/v1"
    }
);

async function main() {
    const response = await openai.chat.completions.create({
        model: "qwen3-vl-plus",  // This example uses qwen3-vl-plus. You can replace the model name as needed. For a list of models, see https://www.alibabacloud.com/help/model-studio/getting-started/models
        messages: [
          {role: "user",content: [
            {type: "image_url",image_url: {"url": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241022/emyrja/dog_and_girl.jpeg"}},
            {type: "image_url",image_url: {"url": "https://dashscope.oss-cn-beijing.aliyuncs.com/images/tiger.png"}},
            {type: "text", text: "What do these images depict?" },
        ]}]
    });
    console.log(response.choices[0].message.content);
}

main()

Response

In the first image, a person and a dog are interacting on a beach. The person is wearing a plaid shirt, and the dog is wearing a collar. They seem to be shaking hands or giving a high-five.

In the second image, a tiger is walking in a forest. The tiger's coat is orange with black stripes, and the background is green trees and vegetation.

curl

# ======= Important =======
# The API keys for the Singapore and China (Beijing) regions are different. To obtain an API key, see https://www.alibabacloud.com/help/en/model-studio/get-api-key
# The following base_url is for the Singapore region. If you use a model in the China (Beijing) region, replace the base_url with https://dashscope.aliyuncs.com/compatible-mode/v1/chat/completions
# === Delete this comment before execution ===

curl -X POST https://dashscope-intl.aliyuncs.com/compatible-mode/v1/chat/completions \
-H "Authorization: Bearer $DASHSCOPE_API_KEY" \
-H 'Content-Type: application/json' \
-d '{
  "model": "qwen3-vl-plus",
  "messages": [
    {
      "role": "user",
      "content": [
        {
          "type": "image_url",
          "image_url": {
            "url": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241022/emyrja/dog_and_girl.jpeg"
          }
        },
        {
          "type": "image_url",
          "image_url": {
            "url": "https://dashscope.oss-cn-beijing.aliyuncs.com/images/tiger.png"
          }
        },
        {
          "type": "text",
          "text": "What do these images depict?"
        }
      ]
    }
  ]
}'

Response

{
  "choices": [
    {
      "message": {
        "content": "Image 1 shows a scene of a woman and a Labrador retriever interacting on a beach. The woman is wearing a plaid shirt and sitting on the sand, shaking hands with the dog. The background is a sea view and a sunset sky, and the whole picture looks very warm and harmonious.\n\nImage 2 shows a scene of a tiger walking in a forest. The tiger's coat is orange with black stripes, and it is stepping forward. The surroundings are dense trees and vegetation, and the ground is covered with fallen leaves. The whole picture is full of natural wildness and vitality.",
        "role": "assistant"
      },
      "finish_reason": "stop",
      "index": 0,
      "logprobs": null
    }
  ],
  "object": "chat.completion",
  "usage": {
    "prompt_tokens": 2497,
    "completion_tokens": 109,
    "total_tokens": 2606
  },
  "created": 1725948561,
  "system_fingerprint": null,
  "model": "qwen3-vl-plus",
  "id": "chatcmpl-0fd66f46-b09e-9164-a84f-3ebbbedbac15"
}

DashScope

Python

import os
import dashscope

# The following base_url is for the Singapore region. If you use a model in the China (Beijing) region, replace the base_url with https://dashscope.aliyuncs.com/api/v1
dashscope.base_http_api_url = 'https://dashscope-intl.aliyuncs.com/api/v1'

messages = [
    {
        "role": "user",
        "content": [
            {"image": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241022/emyrja/dog_and_girl.jpeg"},
            {"image": "https://dashscope.oss-cn-beijing.aliyuncs.com/images/tiger.png"},
            {"text": "What do these images depict?"}
        ]
    }
]

response = dashscope.MultiModalConversation.call(
    # The API keys for the Singapore and China (Beijing) regions are different. To obtain an API key, see https://www.alibabacloud.com/help/en/model-studio/get-api-key
    # If you have not configured the environment variable, replace the following line with your Model Studio API key: api_key="sk-xxx"
    api_key=os.getenv('DASHSCOPE_API_KEY'),
    model='qwen3-vl-plus', #  This example uses qwen3-vl-plus. You can replace the model name as needed. For a list of models, see https://www.alibabacloud.com/help/model-studio/getting-started/models
    messages=messages
)

print(response.output.choices[0].message.content[0]["text"])

Response

These images show some animals and natural scenes. In the first image, a person and a dog are interacting on a beach. The second image is of a tiger walking in a forest.

Java

import java.util.Arrays;
import java.util.Collections;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversation;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationParam;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationResult;
import com.alibaba.dashscope.common.MultiModalMessage;
import com.alibaba.dashscope.common.Role;
import com.alibaba.dashscope.exception.ApiException;
import com.alibaba.dashscope.exception.NoApiKeyException;
import com.alibaba.dashscope.exception.UploadFileException;
import com.alibaba.dashscope.utils.Constants;

public class Main {
    static {
        // The following base_url is for the Singapore region. If you use a model in the China (Beijing) region, replace the base_url with https://dashscope.aliyuncs.com/api/v1
        Constants.baseHttpApiUrl="https://dashscope-intl.aliyuncs.com/api/v1";
    }
    public static void simpleMultiModalConversationCall()
            throws ApiException, NoApiKeyException, UploadFileException {
        MultiModalConversation conv = new MultiModalConversation();
        MultiModalMessage userMessage = MultiModalMessage.builder().role(Role.USER.getValue())
                .content(Arrays.asList(
                        Collections.singletonMap("image", "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241022/emyrja/dog_and_girl.jpeg"),
                        Collections.singletonMap("image", "https://dashscope.oss-cn-beijing.aliyuncs.com/images/tiger.png"),
                        Collections.singletonMap("text", "What do these images depict?"))).build();
        MultiModalConversationParam param = MultiModalConversationParam.builder()
                // The API keys for the Singapore and China (Beijing) regions are different. To obtain an API key, see https://www.alibabacloud.com/help/en/model-studio/get-api-key
                .apiKey(System.getenv("DASHSCOPE_API_KEY"))
                .model("qwen3-vl-plus")  //  This example uses qwen3-vl-plus. You can replace the model name as needed. For a list of models, see https://www.alibabacloud.com/help/model-studio/getting-started/models
                .messages(Arrays.asList(userMessage))
                .build();
        MultiModalConversationResult result = conv.call(param);
        System.out.println(result.getOutput().getChoices().get(0).getMessage().getContent().get(0).get("text"));
    }
    public static void main(String[] args) {
        try {
            simpleMultiModalConversationCall();
        } catch (ApiException | NoApiKeyException | UploadFileException e) {
            System.out.println(e.getMessage());
        }
        System.exit(0);
    }
}

Response

These images show some animals and natural scenes.

1. First image: A woman and a dog are interacting on a beach. The woman is wearing a plaid shirt and sitting on the sand, and the dog is wearing a collar and extending its paw to shake hands with the woman.
2. Second image: A tiger is walking in a forest. The tiger's coat is orange with black stripes, and the background is trees and leaves.

curl

# ======= Important =======
# The following base_url is for the Singapore region. If you use a model in the China (Beijing) region, replace the base_url with https://dashscope.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation
# The API keys for the Singapore and China (Beijing) regions are different. To obtain an API key, see https://www.alibabacloud.com/help/en/model-studio/get-api-key
# === Delete this comment before execution ===

curl --location 'https://dashscope-intl.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation' \
--header "Authorization: Bearer $DASHSCOPE_API_KEY" \
--header 'Content-Type: application/json' \
--data '{
    "model": "qwen3-vl-plus",
    "input":{
        "messages":[
            {
                "role": "user",
                "content": [
                    {"image": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241022/emyrja/dog_and_girl.jpeg"},
                    {"image": "https://dashscope.oss-cn-beijing.aliyuncs.com/images/tiger.png"},
                    {"text": "What do these images show?"}
                ]
            }
        ]
    }
}'

Response

{
  "output": {
    "choices": [
      {
        "finish_reason": "stop",
        "message": {
          "role": "assistant",
          "content": [
            {
              "text": "These images show some animals and natural scenes. In the first image, a person and a dog are interacting on a beach. The second image is of a tiger walking in a forest."
            }
          ]
        }
      }
    ]
  },
  "usage": {
    "output_tokens": 81,
    "input_tokens": 1277,
    "image_tokens": 2497
  },
  "request_id": "ccf845a3-dc33-9cda-b581-20fe7dc23f70"
}

Video understanding

The Qwen-VL model supports understanding video content. You can provide the video as an image list (extracted video frames) or as a video file.

We recommend that you use the latest or a recent snapshot version of the model for better performance in understanding video files.

Video file

Video frame extraction

The Qwen-VL model analyzes content by extracting a sequence of frames from a video. The frame extraction frequency determines the level of detail in the model's analysis. Different SDKs have different frame extraction frequencies:

  • Using the DashScope SDK:

    You can control the frame extraction interval using the fps parameter, which means one frame is extracted every 1/fps seconds. The valid range is (0.1, 10), and the default value is 2.0. We recommend a higher fps for high-speed motion scenes and a lower fps for static or long videos.

  • Using the OpenAI-compatible SDK: Frames are extracted at a fixed rate of one frame every 0.5 seconds, and this rate cannot be customized.

The following is an example of code for understanding an online video that is specified by a URL. For more information, see How to pass local files.

OpenAI compatible

When you directly input a video file to the Qwen-VL model using the OpenAI SDK or HTTP method, you must set the "type" parameter in the user message to "video_url".

Python

import os
from openai import OpenAI

client = OpenAI(
    # The API keys for the Singapore and China (Beijing) regions are different. To obtain an API key, see https://www.alibabacloud.com/help/en/model-studio/get-api-key
    # If you have not configured the environment variable, replace the following line with your Model Studio API key: api_key="sk-xxx",
    api_key=os.getenv("DASHSCOPE_API_KEY"),
    # The following base_url is for the Singapore region. If you use a model in the China (Beijing) region, replace the base_url with https://dashscope.aliyuncs.com/compatible-mode/v1
    base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
)
completion = client.chat.completions.create(
    model="qwen3-vl-plus",
    messages=[
        {"role": "user","content": [{
            # When passing a video file directly, set the value of type to video_url
            # When using the OpenAI SDK, video files are sampled at a default rate of one frame every 0.5 seconds, which cannot be modified. To customize the frame rate, use the DashScope SDK.
            "type": "video_url",            
            "video_url": {"url": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241115/cqqkru/1.mp4"}},
            {"type": "text","text": "What is the content of this video?"}]
         }]
)
print(completion.choices[0].message.content)

Node.js

import OpenAI from "openai";

const openai = new OpenAI(
    {
        // The API keys for the Singapore and China (Beijing) regions are different. To obtain an API key, see https://www.alibabacloud.com/help/en/model-studio/get-api-key
        // If you have not configured the environment variable, replace the following line with your Model Studio API key: apiKey: "sk-xxx"
        apiKey: process.env.DASHSCOPE_API_KEY,
        // The following base_url is for the Singapore region. If you use a model in the China (Beijing) region, replace the base_url with https://dashscope.aliyuncs.com/compatible-mode/v1
        baseURL: "https://dashscope-intl.aliyuncs.com/compatible-mode/v1"
    }
);

async function main() {
    const response = await openai.chat.completions.create({
        model: "qwen3-vl-plus",
        messages: [
        {role: "user",content: [
            // When passing a video file directly, set the value of type to video_url
            // When using the OpenAI SDK, video files are sampled at a default rate of one frame every 0.5 seconds, which cannot be modified. To customize the frame rate, use the DashScope SDK.
            {type: "video_url", video_url: {"url": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241115/cqqkru/1.mp4"}},
            {type: "text", text: "What is the content of this video?" },
        ]}]
    });
    console.log(response.choices[0].message.content);
}

main()

curl

# ======= Important =======
# The following base_url is for the Singapore region. If you use a model in the China (Beijing) region, replace the base_url with https://dashscope.aliyuncs.com/compatible-mode/v1/chat/completions
# === Delete this comment before execution ===

curl -X POST https://dashscope-intl.aliyuncs.com/compatible-mode/v1/chat/completions \
-H "Authorization: Bearer $DASHSCOPE_API_KEY" \
-H 'Content-Type: application/json' \
-d '{
    "model": "qwen3-vl-plus",
    "messages": [
    {"role": "user","content": [{"type": "video_url","video_url": {"url": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241115/cqqkru/1.mp4"}},
    {"type": "text","text": "What is the content of this video?"}]}]
}'

DashScope

Python

import dashscope
import os
# The following base_url is for the Singapore region. If you use a model in the China (Beijing) region, replace the base_url with https://dashscope.aliyuncs.com/api/v1
dashscope.base_http_api_url = 'https://dashscope-intl.aliyuncs.com/api/v1'
messages = [
    {"role": "user",
        "content": [
            # The fps parameter can control the video frame extraction frequency, meaning one frame is extracted every 1/fps seconds. For complete usage, see https://www.alibabacloud.com/help/en/model-studio/use-qwen-by-calling-api?#2ed5ee7377fum
            {"video": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241115/cqqkru/1.mp4","fps":2},
            {"text": "What is the content of this video?"}
        ]
    }
]

response = dashscope.MultiModalConversation.call(
    # The API keys for the Singapore and China (Beijing) regions are different. To obtain an API key, see https://www.alibabacloud.com/help/en/model-studio/get-api-key
    # If you have not configured the environment variable, replace the following line with your Model Studio API key: api_key ="sk-xxx"
    api_key=os.getenv('DASHSCOPE_API_KEY'),
    model='qwen3-vl-plus',
    messages=messages
)

print(response.output.choices[0].message.content[0]["text"])

Java

import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversation;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationParam;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationResult;
import com.alibaba.dashscope.common.MultiModalMessage;
import com.alibaba.dashscope.common.Role;
import com.alibaba.dashscope.exception.ApiException;
import com.alibaba.dashscope.exception.NoApiKeyException;
import com.alibaba.dashscope.exception.UploadFileException;
import com.alibaba.dashscope.utils.Constants;

public class Main {
   static {
            // The following base_url is for the Singapore region. If you use a model in the China (Beijing) region, replace the base_url with https://dashscope.aliyuncs.com/api/v1
            Constants.baseHttpApiUrl="https://dashscope-intl.aliyuncs.com/api/v1";
        }
    public static void simpleMultiModalConversationCall()
            throws ApiException, NoApiKeyException, UploadFileException {
        MultiModalConversation conv = new MultiModalConversation();
        // The fps parameter controls the video frame extraction frequency, meaning one frame is extracted every 1/fps seconds. For complete usage, see https://www.alibabacloud.com/help/en/model-studio/use-qwen-by-calling-api?#2ed5ee7377fum
        Map<String, Object> params = new HashMap<>();
        params.put("video", "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241115/cqqkru/1.mp4");
        params.put("fps", 2);
        MultiModalMessage userMessage = MultiModalMessage.builder().role(Role.USER.getValue())
                .content(Arrays.asList(
                        params,
                        Collections.singletonMap("text", "What is the content of this video?"))).build();
        MultiModalConversationParam param = MultiModalConversationParam.builder()
                // If you use a model in the China (Beijing) region, you must use an API key from the China (Beijing) region. To obtain one, see https://bailian.console.alibabacloud.com/?tab=model#/api-key
                // If you have not configured the environment variable, replace the following line with your Model Studio API key: .apiKey("sk-xxx")
                .apiKey(System.getenv("DASHSCOPE_API_KEY"))
                .model("qwen3-vl-plus")
                .messages(Arrays.asList(userMessage))
                .build();
        MultiModalConversationResult result = conv.call(param);
        System.out.println(result.getOutput().getChoices().get(0).getMessage().getContent().get(0).get("text"));
    }
    public static void main(String[] args) {
        try {
            simpleMultiModalConversationCall();
        } catch (ApiException | NoApiKeyException | UploadFileException e) {
            System.out.println(e.getMessage());
        }
        System.exit(0);
    }
}

curl

# ======= Important =======
# The following base_url is for the Singapore region. If you use a model in the China (Beijing) region, replace the base_url with https://dashscope.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation
# The API keys for the Singapore and China (Beijing) regions are different. To obtain an API key, see https://www.alibabacloud.com/help/en/model-studio/get-api-key
# === Delete this comment before execution ===

curl -X POST https://dashscope-intl.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation \
-H "Authorization: Bearer $DASHSCOPE_API_KEY" \
-H 'Content-Type: application/json' \
-d '{
    "model": "qwen3-vl-plus",
    "input":{
        "messages":[
            {"role": "user","content": [{"video": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241115/cqqkru/1.mp4","fps":2},
            {"text": "What is the content of this video?"}]}]}
}'

Image list

Image list quantity limits

  • qwen3-vl-plus series: A minimum of 4 images and a maximum of 2,000 images.

  • qwen3-vl-flash series, Qwen2.5-VL, and QVQ series models: A minimum of 4 images and a maximum of 512 images.

  • Other models: A minimum of 4 images and a maximum of 80 images.
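
As an illustration only, a simple client-side check against these limits might look like the following. The helper is not part of any SDK, and the numbers simply restate the list above.

# Illustrative pre-flight check that an image list stays within the limits above.
MIN_IMAGES = 4
MAX_IMAGES_BY_PREFIX = {
    "qwen3-vl-plus": 2000,
    "qwen3-vl-flash": 512,
    "qwen2.5-vl": 512,
    "qvq": 512,
}
DEFAULT_MAX_IMAGES = 80  # other models

def check_image_list(model, images):
    limit = DEFAULT_MAX_IMAGES
    for prefix, value in MAX_IMAGES_BY_PREFIX.items():
        if model.startswith(prefix):
            limit = value
            break
    if not (MIN_IMAGES <= len(images) <= limit):
        raise ValueError(f"{model} accepts {MIN_IMAGES} to {limit} images, got {len(images)}")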

Video frame extraction

When a video is passed as a list of images (pre-extracted video frames), you can use the fps parameter to inform the model of the time interval between video frames. This helps the model better understand the sequence, duration, and dynamic changes of events.

  • DashScope SDK:

    This SDK supports the fps parameter, which indicates that the video frames were extracted from the original video every 1/fps seconds. This parameter is supported by Qwen2.5-VL and Qwen3-VL models.

  • OpenAI compatible SDK:

    This SDK does not support the fps parameter. The model assumes that the video frames are extracted at a default frequency of one frame every 0.5 seconds.
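
If you prepare the frame list yourself, a minimal sketch of extracting frames from a local video at roughly 2 frames per second is shown below. OpenCV (cv2) is used here only as an illustration and is not required by the SDKs; any frame-extraction tool produces equivalent input.

# Illustrative only: pre-extract frames from a local video at ~2 frames per second,
# so the resulting image list can be passed with "fps": 2 (DashScope SDK).
# OpenCV (cv2) is an assumed, optional dependency.
import cv2

def extract_frames(video_path, target_fps=2):
    cap = cv2.VideoCapture(video_path)
    source_fps = cap.get(cv2.CAP_PROP_FPS) or 25  # fall back if the metadata is missing
    step = max(int(round(source_fps / target_fps)), 1)
    frame_paths = []
    index = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if index % step == 0:
            out_path = f"frame_{len(frame_paths):04d}.jpg"
            cv2.imwrite(out_path, frame)
            frame_paths.append(out_path)
        index += 1
    cap.release()
    return frame_paths

# The returned files can then be passed in the "video" field of the user message,
# together with "fps": 2, as shown in the DashScope examples below.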

The following is an example of code for understanding online video frames that are specified by a URL. For more information, see How to pass local files.

OpenAI compatible

When you input a video as a list of images to the Qwen-VL model using the OpenAI SDK or HTTP method, you must set the "type" parameter in the user message to "video".

Python

import os
from openai import OpenAI

client = OpenAI(
    # The API keys for the Singapore and China (Beijing) regions are different. To obtain an API key, see https://www.alibabacloud.com/help/en/model-studio/get-api-key
    # If you have not configured the environment variable, replace the following line with your Model Studio API key: api_key="sk-xxx",
    api_key=os.getenv("DASHSCOPE_API_KEY"),
    # The following base_url is for the Singapore region. If you use a model in the China (Beijing) region, replace the base_url with https://dashscope.aliyuncs.com/compatible-mode/v1
    base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
)
completion = client.chat.completions.create(
    model="qwen3-vl-plus", # This example uses qwen3-vl-plus. You can replace the model name as needed. For a list of models, see https://www.alibabacloud.com/help/model-studio/getting-started/models
    messages=[{"role": "user","content": [
        # When passing an image list, the "type" parameter in the user message is "video".
        # When using the OpenAI SDK, the image list is assumed to be extracted from the video at a default interval of 0.5 seconds, which cannot be modified. To customize the frame rate, use the DashScope SDK.
        {"type": "video","video": ["https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241108/xzsgiz/football1.jpg",
                           "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241108/tdescd/football2.jpg",
                           "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241108/zefdja/football3.jpg",
                           "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241108/aedbqh/football4.jpg"]},
        {"type": "text","text": "Describe the specific process of this video"},
    ]}]
)
print(completion.choices[0].message.content)

Node.js

// Make sure you have specified "type": "module" in package.json before.
import OpenAI from "openai";

const openai = new OpenAI({
    // The API keys for the Singapore and China (Beijing) regions are different. To obtain an API key, see https://www.alibabacloud.com/help/en/model-studio/get-api-key
    // If you have not configured the environment variable, replace the following line with your Model Studio API key: apiKey: "sk-xxx",
    apiKey: process.env.DASHSCOPE_API_KEY,
    // The following base_url is for the Singapore region. If you use a model in the China (Beijing) region, replace the base_url with https://dashscope.aliyuncs.com/compatible-mode/v1
    baseURL: "https://dashscope-intl.aliyuncs.com/compatible-mode/v1"
});

async function main() {
    const response = await openai.chat.completions.create({
        model: "qwen3-vl-plus", // This example uses qwen3-vl-plus. You can replace the model name as needed. For a list of models, see https://www.alibabacloud.com/help/model-studio/getting-started/models
        messages: [{
            role: "user",
            content: [
                {
                    // When passing an image list, the "type" parameter in the user message is "video".
                    // When using the OpenAI SDK, the image list is assumed to be extracted from the video at a default interval of 0.5 seconds, which cannot be modified. To customize the frame rate, use the DashScope SDK.
                    type: "video",
                    video: [
                        "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241108/xzsgiz/football1.jpg",
                        "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241108/tdescd/football2.jpg",
                        "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241108/zefdja/football3.jpg",
                        "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241108/aedbqh/football4.jpg"
                    ]
                },
                {
                    type: "text",
                    text: "Describe the specific process of this video"
                }
            ]
        }]
    });
    console.log(response.choices[0].message.content);
}

main();

curl

# ======= Important =======
# The API keys for the Singapore and China (Beijing) regions are different. To obtain an API key, see https://www.alibabacloud.com/help/en/model-studio/get-api-key
# The following base_url is for the Singapore region. If you use a model in the China (Beijing) region, replace the base_url with https://dashscope.aliyuncs.com/compatible-mode/v1/chat/completions
# === Delete this comment before execution ===

curl -X POST https://dashscope-intl.aliyuncs.com/compatible-mode/v1/chat/completions \
-H "Authorization: Bearer $DASHSCOPE_API_KEY" \
-H 'Content-Type: application/json' \
-d '{
    "model": "qwen3-vl-plus",
    "messages": [{"role": "user",
                "content": [{"type": "video",
                "video": ["https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241108/xzsgiz/football1.jpg",
                           "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241108/tdescd/football2.jpg",
                           "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241108/zefdja/football3.jpg",
                           "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241108/aedbqh/football4.jpg"]},
                {"type": "text",
                "text": "Describe the specific process of this video"}]}]
}'

DashScope

Python

import os
import dashscope

# The following base_url is for the Singapore region. If you use a model in the China (Beijing) region, replace the base_url with https://dashscope.aliyuncs.com/api/v1
dashscope.base_http_api_url = 'https://dashscope-intl.aliyuncs.com/api/v1'
messages = [{"role": "user",
             "content": [
                  # If the model belongs to the Qwen2.5-VL or Qwen3-VL series and an image list is passed, you can set the fps parameter to indicate that the image list is extracted from the original video every 1/fps seconds.
                 {"video":["https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241108/xzsgiz/football1.jpg",
                           "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241108/tdescd/football2.jpg",
                           "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241108/zefdja/football3.jpg",
                           "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241108/aedbqh/football4.jpg"],
                   "fps":2},
                 {"text": "Describe the specific process of this video"}]}]
response = dashscope.MultiModalConversation.call(
    # The API keys for the Singapore and China (Beijing) regions are different. To obtain an API key, see https://www.alibabacloud.com/help/en/model-studio/get-api-key
    # If you have not configured the environment variable, replace the following line with your Model Studio API key: api_key="sk-xxx",
    api_key=os.getenv("DASHSCOPE_API_KEY"),
    model='qwen3-vl-plus',  # This example uses qwen3-vl-plus. You can replace the model name as needed. For a list of models, see https://www.alibabacloud.com/help/model-studio/getting-started/models
    messages=messages
)
print(response.output.choices[0].message.content[0]["text"])

Java

// DashScope SDK version must be 2.18.3 or later
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversation;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationParam;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationResult;
import com.alibaba.dashscope.common.MultiModalMessage;
import com.alibaba.dashscope.common.Role;
import com.alibaba.dashscope.exception.ApiException;
import com.alibaba.dashscope.exception.NoApiKeyException;
import com.alibaba.dashscope.exception.UploadFileException;
import com.alibaba.dashscope.utils.Constants;

public class Main {
    static {
        // The following base_url is for the Singapore region. If you use a model in the China (Beijing) region, replace the base_url with https://dashscope.aliyuncs.com/api/v1
        Constants.baseHttpApiUrl="https://dashscope-intl.aliyuncs.com/api/v1";
    }
    private static final String MODEL_NAME = "qwen3-vl-plus";  // This example uses qwen3-vl-plus. You can replace the model name as needed. For a list of models, see https://www.alibabacloud.com/help/model-studio/getting-started/models
    public static void videoImageListSample() throws ApiException, NoApiKeyException, UploadFileException {
        MultiModalConversation conv = new MultiModalConversation();
        //  If the model belongs to the Qwen2.5-VL or Qwen3-VL series and an image list is passed, you can set the fps parameter to indicate that the image list is extracted from the original video every 1/fps seconds.
        Map<String, Object> params = new HashMap<>();
        params.put("video", Arrays.asList("https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241108/xzsgiz/football1.jpg",
                        "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241108/tdescd/football2.jpg",
                        "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241108/zefdja/football3.jpg",
                        "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241108/aedbqh/football4.jpg"));
        params.put("fps", 2);
        MultiModalMessage userMessage = MultiModalMessage.builder()
                .role(Role.USER.getValue())
                .content(Arrays.asList(
                        params,
                        Collections.singletonMap("text", "Describe the specific process of this video")))
                .build();
        MultiModalConversationParam param = MultiModalConversationParam.builder()
                // The API keys for the Singapore and China (Beijing) regions are different. To obtain an API key, see https://www.alibabacloud.com/help/en/model-studio/get-api-key
                // If you have not configured the environment variable, replace the following line with your Model Studio API key: .apiKey("sk-xxx")
                .apiKey(System.getenv("DASHSCOPE_API_KEY"))
                .model(MODEL_NAME)
                .messages(Arrays.asList(userMessage)).build();
        MultiModalConversationResult result = conv.call(param);
        System.out.print(result.getOutput().getChoices().get(0).getMessage().getContent().get(0).get("text"));
    }
    public static void main(String[] args) {
        try {
            videoImageListSample();
        } catch (ApiException | NoApiKeyException | UploadFileException e) {
            System.out.println(e.getMessage());
        }
        System.exit(0);
    }
}

curl

# ======= Important =======
# The following base_url is for the Singapore region. If you use a model in the China (Beijing) region, replace the base_url with https://dashscope.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation
# The API keys for the Singapore and China (Beijing) regions are different. To obtain an API key, see https://www.alibabacloud.com/help/en/model-studio/get-api-key
# === Delete this comment before execution ===

curl -X POST https://dashscope-intl.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation \
-H "Authorization: Bearer $DASHSCOPE_API_KEY" \
-H 'Content-Type: application/json' \
-d '{
  "model": "qwen3-vl-plus",
  "input": {
    "messages": [
      {
        "role": "user",
        "content": [
          {
            "video": [
              "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241108/xzsgiz/football1.jpg",
              "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241108/tdescd/football2.jpg",
              "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241108/zefdja/football3.jpg",
              "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241108/aedbqh/football4.jpg"
            ],
            "fps":2
                 
          },
          {
            "text": "Describe the specific process of this video"
          }
        ]
      }
    ]
  }
}'

Pass local files (Base64 encoding or file path)

Qwen-VL provides two ways to upload local files: Base64 encoding and direct file path upload. You can choose the upload method based on the file size and SDK type. For specific recommendations, see How to choose a file upload method. Both methods must meet the file requirements described in Image limits.

Base64 encoding upload

You can convert the file to a Base64 encoded string and then pass it to the model. This is applicable for OpenAI and DashScope SDKs and HTTP methods.

Steps to pass a Base64 encoded string (using an image as an example)

  1. File encoding: Convert the local image to a Base64 encoding.

    Example code for converting an image to Base64 encoding

    #  Encoding function: Converts a local file to a Base64 encoded string
    import base64
    def encode_image(image_path):
        with open(image_path, "rb") as image_file:
            return base64.b64encode(image_file.read()).decode("utf-8")
    
    # Replace xxxx/eagle.png with the absolute path of your local image
    base64_image = encode_image("xxx/eagle.png")
  2. Construct a Data URL: The format is as follows: data:[MIME_type];base64,{base64_image}.

    1. Replace MIME_type with the actual media type, ensuring it matches the MIME Type value in the Supported image formats table (such as image/jpeg or image/png).

    2. base64_image is the Base64 string generated in the previous step.

  3. Call the model: Pass the Data URL through the image or image_url parameter and call the model, as shown in the combined sketch below.
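
Putting the three steps together, a minimal end-to-end sketch (assuming a local PNG file and the OpenAI-compatible endpoint shown elsewhere in this topic) might look like the following:

import base64
import os
from openai import OpenAI

# Step 1: convert the local image to a Base64 encoded string
def encode_image(image_path):
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode("utf-8")

# Replace xxx/eagle.png with the absolute path of your local image
base64_image = encode_image("xxx/eagle.png")

# Step 2: construct the Data URL; the MIME type must match the actual image format
data_url = f"data:image/png;base64,{base64_image}"

# Step 3: pass the Data URL through the image_url parameter and call the model
client = OpenAI(
    api_key=os.getenv("DASHSCOPE_API_KEY"),
    # Singapore region endpoint. For China (Beijing), use https://dashscope.aliyuncs.com/compatible-mode/v1
    base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
)
completion = client.chat.completions.create(
    model="qwen3-vl-plus",
    messages=[{"role": "user", "content": [
        {"type": "image_url", "image_url": {"url": data_url}},
        {"type": "text", "text": "What is depicted in the image?"},
    ]}],
)
print(completion.choices[0].message.content)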

File path upload

You can directly pass the local file path to the model. This method is supported only by the DashScope Python and Java SDKs, not by DashScope HTTP or OpenAI compatible methods.

You can refer to the table below to specify the file path based on your programming language and operating system.

Specify the file path (using an image as an example)

  • Linux or macOS (Python SDK or Java SDK): file://{absolute path of the file}. Example: file:///home/images/test.png

  • Windows (Python SDK): file://{absolute path of the file}. Example: file://D:/images/test.png

  • Windows (Java SDK): file:///{absolute path of the file}. Example: file:///D:/images/test.png
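
For reference, a minimal helper that builds the path string for the DashScope Python SDK according to the rules above (the function name is illustrative, not part of the SDK) could look like this:

# Illustrative helper: build the file path string expected by the DashScope Python SDK.
# Per the list above, the Python SDK uses file://{absolute path of the file} on all systems.
def to_file_url(absolute_path: str) -> str:
    return f"file://{absolute_path}"

# to_file_url("/home/images/test.png") -> "file:///home/images/test.png"   (Linux or macOS)
# to_file_url("D:/images/test.png")    -> "file://D:/images/test.png"      (Windows)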

Image

Pass by file path

Python

import os
import dashscope

# The following base_url is for the Singapore region. If you use a model in the China (Beijing) region, replace the base_url with https://dashscope.aliyuncs.com/api/v1
dashscope.base_http_api_url = 'https://dashscope-intl.aliyuncs.com/api/v1'

# Replace xxx/eagle.png with the absolute path of your local image
local_path = "xxx/eagle.png"
image_path = f"file://{local_path}"
messages = [
                {'role':'user',
                'content': [{'image': image_path},
                            {'text': 'What is depicted in the image?'}]}]
response = dashscope.MultiModalConversation.call(
    # The API keys for the Singapore and China (Beijing) regions are different. To obtain an API key, see https://www.alibabacloud.com/help/en/model-studio/get-api-key
    # If you have not configured the environment variable, replace the following line with your Model Studio API key: api_key="sk-xxx"
    api_key=os.getenv('DASHSCOPE_API_KEY'),
    model='qwen3-vl-plus',  # This example uses qwen3-vl-plus. You can replace the model name as needed. For a list of models, see https://www.alibabacloud.com/help/model-studio/getting-started/models
    messages=messages)
print(response.output.choices[0].message.content[0]["text"])

Java

import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversation;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationParam;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationResult;
import com.alibaba.dashscope.common.MultiModalMessage;
import com.alibaba.dashscope.common.Role;
import com.alibaba.dashscope.exception.ApiException;
import com.alibaba.dashscope.exception.NoApiKeyException;
import com.alibaba.dashscope.exception.UploadFileException;
import com.alibaba.dashscope.utils.Constants;

public class Main {

    static {
        // The following base_url is for the Singapore region. If you use a model in the China (Beijing) region, replace the base_url with https://dashscope.aliyuncs.com/api/v1
        Constants.baseHttpApiUrl="https://dashscope-intl.aliyuncs.com/api/v1";
    }
    
    public static void callWithLocalFile(String localPath)
            throws ApiException, NoApiKeyException, UploadFileException {
        String filePath = "file://"+localPath;
        MultiModalConversation conv = new MultiModalConversation();
        MultiModalMessage userMessage = MultiModalMessage.builder().role(Role.USER.getValue())
                .content(Arrays.asList(new HashMap<String, Object>(){{put("image", filePath);}},
                        new HashMap<String, Object>(){{put("text", "What is depicted in the image?");}})).build();
        MultiModalConversationParam param = MultiModalConversationParam.builder()
                // The API keys for the Singapore and China (Beijing) regions are different. To obtain an API key, see https://www.alibabacloud.com/help/en/model-studio/get-api-key
                // If you have not configured the environment variable, replace the following line with your Model Studio API key: .apiKey("sk-xxx")
                .apiKey(System.getenv("DASHSCOPE_API_KEY"))
                .model("qwen3-vl-plus")  // This example uses qwen3-vl-plus. You can replace the model name as needed. For a list of models, see https://www.alibabacloud.com/help/model-studio/getting-started/models
                .messages(Arrays.asList(userMessage))
                .build();
        MultiModalConversationResult result = conv.call(param);
        System.out.println(result.getOutput().getChoices().get(0).getMessage().getContent().get(0).get("text"));}

    public static void main(String[] args) {
        try {
            // Replace xxx/eagle.png with the absolute path of your local image
            callWithLocalFile("xxx/eagle.png");
        } catch (ApiException | NoApiKeyException | UploadFileException e) {
            System.out.println(e.getMessage());
        }
        System.exit(0);
    }
}

Pass by Base64 encoding

OpenAI compatible

Python

from openai import OpenAI
import os
import base64


#  Encoding function: Converts a local file to a Base64 encoded string
def encode_image(image_path):
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode("utf-8")

# Replace xxxx/eagle.png with the absolute path of your local image
base64_image = encode_image("xxx/eagle.png")
client = OpenAI(
    # The API keys for the Singapore and China (Beijing) regions are different. To obtain an API key, see https://www.alibabacloud.com/help/en/model-studio/get-api-key
    # If you have not configured the environment variable, replace the following line with your Model Studio API key: api_key="sk-xxx"
    api_key=os.getenv('DASHSCOPE_API_KEY'),
    # The following base_url is for the Singapore region. If you use a model in the China (Beijing) region, replace the base_url with https://dashscope.aliyuncs.com/compatible-mode/v1
    base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
)
completion = client.chat.completions.create(
    model="qwen3-vl-plus", # This example uses qwen3-vl-plus. You can replace the model name as needed. For a list of models, see https://www.alibabacloud.com/help/model-studio/getting-started/models
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    # Pass Base64 image data. Note that the image format (i.e., image/{format}) must match the Content Type in the list of supported images. "f" is a string formatting method.
                    # PNG image:  f"data:image/png;base64,{base64_image}"
                    # JPEG image: f"data:image/jpeg;base64,{base64_image}"
                    # WEBP image: f"data:image/webp;base64,{base64_image}"
                    "image_url": {"url": f"data:image/png;base64,{base64_image}"},
                },
                {"type": "text", "text": "What is depicted in the image?"},
            ],
        }
    ],
)
print(completion.choices[0].message.content)

Node.js

import OpenAI from "openai";
import { readFileSync } from 'fs';


const openai = new OpenAI(
    {
        // The API keys for the Singapore and China (Beijing) regions are different. To obtain an API key, see https://www.alibabacloud.com/help/en/model-studio/get-api-key
        // If you have not configured the environment variable, replace the following line with your Model Studio API key: apiKey: "sk-xxx"
        apiKey: process.env.DASHSCOPE_API_KEY,
        // The following base_url is for the Singapore region. If you use a model in the China (Beijing) region, replace the base_url with https://dashscope.aliyuncs.com/compatible-mode/v1
        baseURL: "https://dashscope-intl.aliyuncs.com/compatible-mode/v1"
    }
);

const encodeImage = (imagePath) => {
    const imageFile = readFileSync(imagePath);
    return imageFile.toString('base64');
  };
// Replace xxx/eagle.png with the absolute path of your local image
const base64Image = encodeImage("xxx/eagle.png")
async function main() {
    const completion = await openai.chat.completions.create({
        model: "qwen3-vl-plus",  // This example uses qwen3-vl-plus. You can replace the model name as needed. For a list of models, see https://www.alibabacloud.com/help/model-studio/getting-started/models
        messages: [
            {"role": "user",
            "content": [{"type": "image_url",
                            // Note that when passing Base64, the image format (i.e., image/{format}) must match the Content Type in the list of supported images.
                           // PNG image:  data:image/png;base64,${base64Image}
                          // JPEG image: data:image/jpeg;base64,${base64Image}
                         // WEBP image: data:image/webp;base64,${base64Image}
                        "image_url": {"url": `data:image/png;base64,${base64Image}`},},
                        {"type": "text", "text": "What is depicted in the image?"}]}]
    });
    console.log(completion.choices[0].message.content);
} 

main();

curl

  • For more information about a method to convert a file to a Base64 encoded string, see Example code.

  • For display purposes, the Base64 encoded string "data:image/jpg;base64,/9j/4AAQSkZJRgABAQAAAQABAAD/2wBDAA..." in the code is truncated. In actual use, you must pass the complete encoded string.

# ======= Important =======
# The API keys for the Singapore and China (Beijing) regions are different. To obtain an API key, see https://www.alibabacloud.com/help/en/model-studio/get-api-key
# The following base_url is for the Singapore region. If you use a model in the China (Beijing) region, replace the base_url with https://dashscope.aliyuncs.com/compatible-mode/v1/chat/completions
# === Delete this comment before execution ===

curl --location 'https://dashscope-intl.aliyuncs.com/compatible-mode/v1/chat/completions' \
--header "Authorization: Bearer $DASHSCOPE_API_KEY" \
--header 'Content-Type: application/json' \
--data '{
  "model": "qwen3-vl-plus",
  "messages": [
  {
    "role": "user",
    "content": [
      {"type": "image_url", "image_url": {"url": "data:image/jpg;base64,/9j/4AAQSkZJRgABAQAAAQABAAD/2wBDAA"}},
      {"type": "text", "text": "What is depicted in the image?"}
    ]
  }]
}'

DashScope

Python

import base64
import os
import dashscope

# The following base_url is for the Singapore region. If you use a model in the China (Beijing) region, replace the base_url with https://dashscope.aliyuncs.com/api/v1
dashscope.base_http_api_url = 'https://dashscope-intl.aliyuncs.com/api/v1'

#  Encoding function: Converts a local file to a Base64 encoded string
def encode_image(image_path):
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode("utf-8")


# Replace xxxx/eagle.png with the absolute path of your local image
base64_image = encode_image("xxxx/eagle.png")

messages = [
    {
        "role": "user",
        "content": [
            # Note that when passing Base64, the image format (i.e., image/{format}) must match the Content Type in the list of supported images. "f" is a string formatting method.
            # PNG image:  f"data:image/png;base64,{base64_image}"
            # JPEG image: f"data:image/jpeg;base64,{base64_image}"
            # WEBP image: f"data:image/webp;base64,{base64_image}"
            {"image": f"data:image/png;base64,{base64_image}"},
            {"text": "What is depicted in the image?"},
        ],
    },
]

response = dashscope.MultiModalConversation.call(
    # The API keys for the Singapore and China (Beijing) regions are different. To obtain an API key, see https://www.alibabacloud.com/help/en/model-studio/get-api-key
    # If you have not configured the environment variable, replace the following line with your Model Studio API key: api_key="sk-xxx"
    api_key=os.getenv("DASHSCOPE_API_KEY"),
    model="qwen3-vl-plus",  # This example uses qwen3-vl-plus. You can replace the model name as needed. For a list of models, see https://www.alibabacloud.com/help/model-studio/getting-started/models
    messages=messages,
)
print(response.output.choices[0].message.content[0]["text"])

Java

import java.io.IOException;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.Base64;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

import com.alibaba.dashscope.aigc.multimodalconversation.*;
import com.alibaba.dashscope.common.MultiModalMessage;
import com.alibaba.dashscope.common.Role;
import com.alibaba.dashscope.exception.ApiException;
import com.alibaba.dashscope.exception.NoApiKeyException;
import com.alibaba.dashscope.exception.UploadFileException;
import com.alibaba.dashscope.utils.Constants;

public class Main {

    static {
        // The following base_url is for the Singapore region. If you use a model in the China (Beijing) region, replace the base_url with https://dashscope.aliyuncs.com/api/v1
        Constants.baseHttpApiUrl="https://dashscope-intl.aliyuncs.com/api/v1";
    }

    private static String encodeImageToBase64(String imagePath) throws IOException {
        Path path = Paths.get(imagePath);
        byte[] imageBytes = Files.readAllBytes(path);
        return Base64.getEncoder().encodeToString(imageBytes);
    }

    public static void callWithLocalFile(String localPath) throws ApiException, NoApiKeyException, UploadFileException, IOException {

        String base64Image = encodeImageToBase64(localPath); // Base64 encoding

        MultiModalConversation conv = new MultiModalConversation();
        MultiModalMessage userMessage = MultiModalMessage.builder().role(Role.USER.getValue())
                .content(Arrays.asList(
                        new HashMap<String, Object>() {{ put("image", "data:image/png;base64," + base64Image); }},
                        new HashMap<String, Object>() {{ put("text", "What is depicted in the image?"); }}
                )).build();

        MultiModalConversationParam param = MultiModalConversationParam.builder()
                // The API keys for the Singapore and China (Beijing) regions are different. To obtain an API key, see https://www.alibabacloud.com/help/en/model-studio/get-api-key
                .apiKey(System.getenv("DASHSCOPE_API_KEY"))
                .model("qwen3-vl-plus")
                .messages(Arrays.asList(userMessage))
                .build();

        MultiModalConversationResult result = conv.call(param);
        System.out.println(result.getOutput().getChoices().get(0).getMessage().getContent().get(0).get("text"));
    }

    public static void main(String[] args) {
        try {
            // Replace xxx/eagle.png with the absolute path of your local image
            callWithLocalFile("xxx/eagle.png");
        } catch (ApiException | NoApiKeyException | UploadFileException | IOException e) {
            System.out.println(e.getMessage());
        }
        System.exit(0);
    }
}

curl

  • For more information about a method to convert a file to a Base64 encoded string, see Example code.

  • For display purposes, the Base64 encoded string "data:image/jpg;base64,/9j/4AAQSkZJRgABAQAAAQABAAD/2wBDAA..." in the code is truncated. In actual use, you must pass the complete encoded string.

# ======= Important =======
# The following base_url is for the Singapore region. If you use a model in the China (Beijing) region, replace the base_url with https://dashscope.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation
# The API keys for the Singapore and China (Beijing) regions are different. To obtain an API key, see https://www.alibabacloud.com/help/en/model-studio/get-api-key
# === Delete this comment before execution ===

curl -X POST https://dashscope-intl.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation \
-H "Authorization: Bearer $DASHSCOPE_API_KEY" \
-H 'Content-Type: application/json' \
-d '{
    "model": "qwen3-vl-plus",
    "input":{
        "messages":[
            {
             "role": "user",
             "content": [
               {"image": "data:image/png;base64,/9j/4AAQSkZJRgABAQAAAQABAAD/2wBDAA..."},
               {"text": "What is depicted in the image?"}
                ]
            }
        ]
    }
}'

Video file

This example uses a locally saved test.mp4 file.

Pass by file path

Python

import os
import dashscope

# The following base_url is for the Singapore region. If you use a model in the China (Beijing) region, replace the base_url with https://dashscope.aliyuncs.com/api/v1
dashscope.base_http_api_url = 'https://dashscope-intl.aliyuncs.com/api/v1'

# Replace xxxx/test.mp4 with the absolute path of your local video
local_path = "xxx/test.mp4"
video_path = f"file://{local_path}"
messages = [
                {'role':'user',
                # The fps parameter controls the number of frames extracted from the video, meaning one frame is extracted every 1/fps seconds.
                'content': [{'video': video_path,"fps":2},
                            {'text': 'What scene does this video depict?'}]}]
response = dashscope.MultiModalConversation.call(
    # The API keys for the Singapore and China (Beijing) regions are different. To obtain an API key, see https://www.alibabacloud.com/help/en/model-studio/get-api-key
    # If you have not configured the environment variable, replace the following line with your Model Studio API key: api_key="sk-xxx"
    api_key=os.getenv('DASHSCOPE_API_KEY'),
    model='qwen3-vl-plus',  
    messages=messages)
print(response.output.choices[0].message.content[0]["text"])

Java

import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversation;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationParam;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationResult;
import com.alibaba.dashscope.common.MultiModalMessage;
import com.alibaba.dashscope.common.Role;
import com.alibaba.dashscope.exception.ApiException;
import com.alibaba.dashscope.exception.NoApiKeyException;
import com.alibaba.dashscope.exception.UploadFileException;
import com.alibaba.dashscope.utils.Constants;

public class Main {

    static {
        // The following base_url is for the Singapore region. If you use a model in the China (Beijing) region, replace the base_url with https://dashscope.aliyuncs.com/api/v1
        Constants.baseHttpApiUrl="https://dashscope-intl.aliyuncs.com/api/v1";
    }
    
    public static void callWithLocalFile(String localPath)
            throws ApiException, NoApiKeyException, UploadFileException {
        String filePath = "file://"+localPath;
        MultiModalConversation conv = new MultiModalConversation();
        MultiModalMessage userMessage = MultiModalMessage.builder().role(Role.USER.getValue())
                .content(Arrays.asList(new HashMap<String, Object>()
                                       {{
                                           put("video", filePath);// The fps parameter controls the number of frames extracted from the video, meaning one frame is extracted every 1/fps seconds.
                                           put("fps", 2);
                                       }}, 
                        new HashMap<String, Object>(){{put("text", "What scene does this video depict?");}})).build();
        MultiModalConversationParam param = MultiModalConversationParam.builder()
                // The API keys for the Singapore and China (Beijing) regions are different. To obtain an API key, see https://www.alibabacloud.com/help/en/model-studio/get-api-key
                // If you have not configured the environment variable, replace the following line with your Model Studio API key: .apiKey("sk-xxx")
                .apiKey(System.getenv("DASHSCOPE_API_KEY"))
                .model("qwen3-vl-plus")  
                .messages(Arrays.asList(userMessage))
                .build();
        MultiModalConversationResult result = conv.call(param);
        System.out.println(result.getOutput().getChoices().get(0).getMessage().getContent().get(0).get("text"));}

    public static void main(String[] args) {
        try {
            // Replace xxxx/test.mp4 with the absolute path of your local video
            callWithLocalFile("xxx/test.mp4");
        } catch (ApiException | NoApiKeyException | UploadFileException e) {
            System.out.println(e.getMessage());
        }
        System.exit(0);
    }
}

Pass by Base64 encoding

OpenAI compatible

Python

from openai import OpenAI
import os
import base64


# Encoding function: Converts a local file to a Base64 encoded string
def encode_video(video_path):
    with open(video_path, "rb") as video_file:
        return base64.b64encode(video_file.read()).decode("utf-8")

# Replace xxxx/test.mp4 with the absolute path of your local video
base64_video = encode_video("xxx/test.mp4")
client = OpenAI(
    # The API keys for the Singapore and China (Beijing) regions are different. To obtain an API key, see https://www.alibabacloud.com/help/en/model-studio/get-api-key
    # If you have not configured the environment variable, replace the following line with your Model Studio API key: api_key="sk-xxx"
    api_key=os.getenv('DASHSCOPE_API_KEY'),
    # The following base_url is for the Singapore region. If you use a model in the China (Beijing) region, replace the base_url with https://dashscope.aliyuncs.com/compatible-mode/v1
    base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
)
completion = client.chat.completions.create(
    model="qwen3-vl-plus",  
    messages=[
        {
            "role": "user",
            "content": [
                {
                    # When passing a video file directly, set the value of type to video_url
                    "type": "video_url",
                    "video_url": {"url": f"data:video/mp4;base64,{base64_video}"},
                },
                {"type": "text", "text": "What scene does this video depict?"},
            ],
        }
    ],
)
print(completion.choices[0].message.content)

Node.js

import OpenAI from "openai";
import { readFileSync } from 'fs';

const openai = new OpenAI(
    {
        // The API keys for the Singapore and China (Beijing) regions are different. To obtain an API key, see https://www.alibabacloud.com/help/en/model-studio/get-api-key
        // If you have not configured the environment variable, replace the following line with your Model Studio API key: apiKey: "sk-xxx"
        apiKey: process.env.DASHSCOPE_API_KEY,
        // The following base_url is for the Singapore region. If you use a model in the China (Beijing) region, replace the base_url with https://dashscope.aliyuncs.com/compatible-mode/v1
        baseURL: "https://dashscope-intl.aliyuncs.com/compatible-mode/v1"
    }
);

const encodeVideo = (videoPath) => {
    const videoFile = readFileSync(videoPath);
    return videoFile.toString('base64');
  };
// Replace xxxx/test.mp4 with the absolute path of your local video
const base64Video = encodeVideo("xxx/test.mp4")
async function main() {
    const completion = await openai.chat.completions.create({
        model: "qwen3-vl-plus",  
        messages: [
            {"role": "user",
             "content": [{
                 // When passing a video file directly, set the value of type to video_url
                "type": "video_url", 
                "video_url": {"url": `data:video/mp4;base64,${base64Video}`}},
                 {"type": "text", "text": "What scene does this video depict?"}]}]
    });
    console.log(completion.choices[0].message.content);
}

main();

curl

  • For more information about a method to convert a file to a Base64 encoded string, see Example code.

  • For display purposes, the Base64 encoded string "data:video/mp4;base64,/9j/4AAQSkZJRgABAQAAAQABAAD/2wBDAA..." in the code is truncated. In actual use, you must pass the complete encoded string.

# ======= Important =======
# The API keys for the Singapore and China (Beijing) regions are different. To obtain an API key, see https://www.alibabacloud.com/help/en/model-studio/get-api-key
# The following base_url is for the Singapore region. If you use a model in the China (Beijing) region, replace the base_url with https://dashscope.aliyuncs.com/compatible-mode/v1/chat/completions
# === Delete this comment before execution ===

curl --location 'https://dashscope-intl.aliyuncs.com/compatible-mode/v1/chat/completions' \
--header "Authorization: Bearer $DASHSCOPE_API_KEY" \
--header 'Content-Type: application/json' \
--data '{
  "model": "qwen3-vl-plus",
  "messages": [
  {
    "role": "user",
    "content": [
      {"type": "video_url", "video_url": {"url": "data:video/mp4;base64,/9j/4AAQSkZJRgABAQAAAQABAAD/2wBDAA..."}},
      {"type": "text", "text": "What is depicted in the image?"}
    ]
  }]
}'

DashScope

Python

import base64
import os
import dashscope

# The following base_url is for the Singapore region. If you use a model in the China (Beijing) region, replace the base_url with https://dashscope.aliyuncs.com/api/v1
dashscope.base_http_api_url = 'https://dashscope-intl.aliyuncs.com/api/v1'

# Encoding function: Converts a local file to a Base64 encoded string
def encode_video(video_path):
    with open(video_path, "rb") as video_file:
        return base64.b64encode(video_file.read()).decode("utf-8")

# Replace xxxx/test.mp4 with the absolute path of your local video
base64_video = encode_video("xxxx/test.mp4")

messages = [{'role':'user',
                # The fps parameter controls the number of frames extracted from the video, meaning one frame is extracted every 1/fps seconds.
             'content': [{'video': f"data:video/mp4;base64,{base64_video}","fps":2},
                            {'text': 'What scene does this video depict?'}]}]
response = dashscope.MultiModalConversation.call(
    # The API keys for the Singapore and China (Beijing) regions are different. To obtain an API key, see https://www.alibabacloud.com/help/en/model-studio/get-api-key
    # If you have not configured the environment variable, replace the following line with your Model Studio API key: api_key="sk-xxx"
    api_key=os.getenv('DASHSCOPE_API_KEY'),
    model='qwen3-vl-plus',
    messages=messages)

print(response.output.choices[0].message.content[0]["text"])

Java

import java.io.IOException;
import java.util.*;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

import com.alibaba.dashscope.aigc.multimodalconversation.*;
import com.alibaba.dashscope.common.MultiModalMessage;
import com.alibaba.dashscope.common.Role;
import com.alibaba.dashscope.exception.ApiException;
import com.alibaba.dashscope.exception.NoApiKeyException;
import com.alibaba.dashscope.exception.UploadFileException;
import com.alibaba.dashscope.utils.Constants;

public class Main {

    static {
        // The following base_url is for the Singapore region. If you use a model in the China (Beijing) region, replace the base_url with https://dashscope.aliyuncs.com/api/v1
        Constants.baseHttpApiUrl="https://dashscope-intl.aliyuncs.com/api/v1";
    }
    
    private static String encodeVideoToBase64(String videoPath) throws IOException {
        Path path = Paths.get(videoPath);
        byte[] videoBytes = Files.readAllBytes(path);
        return Base64.getEncoder().encodeToString(videoBytes);
    }

    public static void callWithLocalFile(String localPath)
            throws ApiException, NoApiKeyException, UploadFileException, IOException {

        String base64Video = encodeVideoToBase64(localPath); // Base64 encoding

        MultiModalConversation conv = new MultiModalConversation();
        MultiModalMessage userMessage = MultiModalMessage.builder().role(Role.USER.getValue())
                .content(Arrays.asList(new HashMap<String, Object>()
                                       {{
                                           put("video", "data:video/mp4;base64," + base64Video);// The fps parameter controls the number of frames extracted from the video, meaning one frame is extracted every 1/fps seconds.
                                           put("fps", 2);
                                       }},
                        new HashMap<String, Object>(){{put("text", "What scene does this video depict?");}})).build();

        MultiModalConversationParam param = MultiModalConversationParam.builder()
                // The API keys for the Singapore and China (Beijing) regions are different. To obtain an API key, see https://www.alibabacloud.com/help/en/model-studio/get-api-key
                // If you have not configured the environment variable, replace the following line with your Model Studio API key: .apiKey("sk-xxx")
                .apiKey(System.getenv("DASHSCOPE_API_KEY"))
                .model("qwen3-vl-plus")
                .messages(Arrays.asList(userMessage))
                .build();

        MultiModalConversationResult result = conv.call(param);
        System.out.println(result.getOutput().getChoices().get(0).getMessage().getContent().get(0).get("text"));
    }

    public static void main(String[] args) {
        try {
            // Replace xxx/test.mp4 with the absolute path of your local video
            callWithLocalFile("xxx/test.mp4");
        } catch (ApiException | NoApiKeyException | UploadFileException | IOException e) {
            System.out.println(e.getMessage());
        }
        System.exit(0);
    }
}

curl

  • For more information about a method to convert a file to a Base64 encoded string, see Example code.

  • For display purposes, the Base64 encoded string "data:video/mp4;base64,/9j/4AAQSkZJRgABAQAAAQABAAD/2wBDAA..." in the code is truncated. In actual use, you must pass the complete encoded string.

# ======= Important =======
# The following base_url is for the Singapore region. If you use a model in the China (Beijing) region, replace the base_url with https://dashscope.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation
# The API keys for the Singapore and China (Beijing) regions are different. To obtain an API key, see https://www.alibabacloud.com/help/en/model-studio/get-api-key
# === Delete this comment before execution ===

curl -X POST https://dashscope-intl.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation \
-H "Authorization: Bearer $DASHSCOPE_API_KEY" \
-H 'Content-Type: application/json' \
-d '{
    "model": "qwen3-vl-plus",
    "input":{
        "messages":[
            {
             "role": "user",
             "content": [
               {"video": "data:video/mp4;base64,/9j/4AAQSkZJRgABAQAAAQABAAD/2wBDAA..."},
               {"text": "What is depicted in the image?"}
                ]
            }
        ]
    }
}'

Image list

This example uses locally saved files: football1.jpg, football2.jpg, football3.jpg, and football4.jpg.

Pass by file path

Python

import os
import dashscope

# The following base_url is for the Singapore region. If you use a model in the China (Beijing) region, replace the base_url with https://dashscope.aliyuncs.com/api/v1
dashscope.base_http_api_url = 'https://dashscope-intl.aliyuncs.com/api/v1'

local_path1 = "football1.jpg"
local_path2 = "football2.jpg"
local_path3 = "football3.jpg"
local_path4 = "football4.jpg"

image_path1 = f"file://{local_path1}"
image_path2 = f"file://{local_path2}"
image_path3 = f"file://{local_path3}"
image_path4 = f"file://{local_path4}"

messages = [{'role':'user',
                # If the model belongs to the Qwen2.5-VL series and an image list is passed, you can set the fps parameter to indicate that the image list is extracted from the original video every 1/fps seconds. This setting is not effective for other models.
             'content': [{'video': [image_path1,image_path2,image_path3,image_path4],"fps":2},
                         {'text': 'What scene does this video depict?'}]}]
response = dashscope.MultiModalConversation.call(
    # The API keys for the Singapore and China (Beijing) regions are different. To obtain an API key, see https://www.alibabacloud.com/help/en/model-studio/get-api-key
    # If you have not configured the environment variable, replace the following line with your Model Studio API key: api_key="sk-xxx"
    api_key=os.getenv('DASHSCOPE_API_KEY'),
    model='qwen3-vl-plus',  # This example uses qwen3-vl-plus. You can replace the model name as needed. For a list of models, see https://www.alibabacloud.com/help/model-studio/getting-started/models
    messages=messages)

print(response.output.choices[0].message.content[0]["text"])

Java

// The DashScope SDK version must be 2.18.3 or later.
import java.util.Arrays;
import java.util.Map;
import java.util.HashMap;
import java.util.Collections;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversation;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationParam;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationResult;
import com.alibaba.dashscope.common.MultiModalMessage;
import com.alibaba.dashscope.common.Role;
import com.alibaba.dashscope.exception.ApiException;
import com.alibaba.dashscope.exception.NoApiKeyException;
import com.alibaba.dashscope.exception.UploadFileException;
import com.alibaba.dashscope.utils.Constants;

public class Main {

    static {
        // The following base_url is for the Singapore region. If you use a model in the Beijing region, replace the base_url with https://dashscope.aliyuncs.com/api/v1.
        Constants.baseHttpApiUrl="https://dashscope-intl.aliyuncs.com/api/v1";
    }
    
    private static final String MODEL_NAME = "qwen3-vl-plus";  // In this example, qwen3-vl-plus is used. You can replace it with another model name as needed. For a list of models, see https://www.alibabacloud.com/help/model-studio/getting-started/models
    public static void videoImageListSample(String localPath1, String localPath2, String localPath3, String localPath4)
            throws ApiException, NoApiKeyException, UploadFileException {
        MultiModalConversation conv = new MultiModalConversation();
        String filePath1 = "file://" + localPath1;
        String filePath2 = "file://" + localPath2;
        String filePath3 = "file://" + localPath3;
        String filePath4 = "file://" + localPath4;
        Map<String, Object> params = new HashMap<>();
        params.put("video", Arrays.asList(filePath1,filePath2,filePath3,filePath4));
        // For models in the Qwen2.5-VL series, if you provide an image list, you can set the fps parameter. This parameter indicates that the images are extracted from the original video at an interval of 1/fps seconds. The setting has no effect on other models.
        params.put("fps", 2);
        MultiModalMessage userMessage = MultiModalMessage.builder()
                .role(Role.USER.getValue())
                .content(Arrays.asList(params,
                        Collections.singletonMap("text", "Describe the process shown in this video")))
                .build();
        MultiModalConversationParam param = MultiModalConversationParam.builder()
                // API keys are different for the Singapore and Beijing regions. To obtain an API key, visit https://www.alibabacloud.com/help/en/model-studio/get-api-key
                // If you have not configured the environment variable, replace the following line with your Model Studio API key: .apiKey("sk-xxx")
                .apiKey(System.getenv("DASHSCOPE_API_KEY"))
                .model(MODEL_NAME)
                .messages(Arrays.asList(userMessage)).build();
        MultiModalConversationResult result = conv.call(param);
        System.out.print(result.getOutput().getChoices().get(0).getMessage().getContent().get(0).get("text"));
    }
    public static void main(String[] args) {
        try {
            videoImageListSample(
                    "xxx/football1.jpg",
                    "xxx/football2.jpg",
                    "xxx/football3.jpg",
                    "xxx/football4.jpg");
        } catch (ApiException | NoApiKeyException | UploadFileException e) {
            System.out.println(e.getMessage());
        }
        System.exit(0);
    }
}

Pass by Base64 encoding

OpenAI compatible

Python

import os
from openai import OpenAI
import base64

# Encoding function: Converts a local file to a Base64 encoded string
def encode_image(image_path):
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode("utf-8")

base64_image1 = encode_image("football1.jpg")
base64_image2 = encode_image("football2.jpg")
base64_image3 = encode_image("football3.jpg")
base64_image4 = encode_image("football4.jpg")
client = OpenAI(
    # The API keys for the Singapore and China (Beijing) regions are different. To obtain an API key, see https://www.alibabacloud.com/help/en/model-studio/get-api-key
    # If you have not configured the environment variable, replace the following line with your Model Studio API key: api_key="sk-xxx",
    api_key=os.getenv("DASHSCOPE_API_KEY"),
    # The following base_url is for the Singapore region. If you use a model in the China (Beijing) region, replace the base_url with https://dashscope.aliyuncs.com/compatible-mode/v1
    base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
)
completion = client.chat.completions.create(
    model="qwen3-vl-plus",  # This example uses qwen3-vl-plus. You can replace the model name as needed. For a list of models, see https://www.alibabacloud.com/help/model-studio/getting-started/models
    messages=[  
    {"role": "user","content": [
        {"type": "video","video": [
            f"data:image/jpeg;base64,{base64_image1}",
            f"data:image/jpeg;base64,{base64_image2}",
            f"data:image/jpeg;base64,{base64_image3}",
            f"data:image/jpeg;base64,{base64_image4}",]},
        {"type": "text","text": "Describe the specific process of this video"},
    ]}]
)
print(completion.choices[0].message.content)

Node.js

import OpenAI from "openai";
import { readFileSync } from 'fs';

const openai = new OpenAI(
    {
        // The API keys for the Singapore and China (Beijing) regions are different. To obtain an API key, see https://www.alibabacloud.com/help/en/model-studio/get-api-key
        // If you have not configured the environment variable, replace the following line with your Model Studio API key: apiKey: "sk-xxx"
        apiKey: process.env.DASHSCOPE_API_KEY,
        // The following base_url is for the Singapore region. If you use a model in the China (Beijing) region, replace the base_url with https://dashscope.aliyuncs.com/compatible-mode/v1
        baseURL: "https://dashscope-intl.aliyuncs.com/compatible-mode/v1"
    }
);

const encodeImage = (imagePath) => {
    const imageFile = readFileSync(imagePath);
    return imageFile.toString('base64');
  };
  
const base64Image1 = encodeImage("football1.jpg")
const base64Image2 = encodeImage("football2.jpg")
const base64Image3 = encodeImage("football3.jpg")
const base64Image4 = encodeImage("football4.jpg")
async function main() {
    const completion = await openai.chat.completions.create({
        model: "qwen3-vl-plus",  // This example uses qwen3-vl-plus. You can replace the model name as needed. For a list of models, see https://www.alibabacloud.com/help/model-studio/getting-started/models
        messages: [
            {"role": "user",
             "content": [{"type": "video",
                            // Note that when passing Base64, the image format (i.e., image/{format}) must match the Content Type in the list of supported images.
                           // PNG image:  data:image/png;base64,${base64Image}
                          // JPEG image: data:image/jpeg;base64,${base64Image}
                         // WEBP image: data:image/webp;base64,${base64Image}
                        "video": [
                            `data:image/jpeg;base64,${base64Image1}`,
                            `data:image/jpeg;base64,${base64Image2}`,
                            `data:image/jpeg;base64,${base64Image3}`,
                            `data:image/jpeg;base64,${base64Image4}`]},
                        {"type": "text", "text": "What scene does this video depict?"}]}]
    });
    console.log(completion.choices[0].message.content);
}

main();

curl

  • For more information about a method to convert a file to a Base64 encoded string, see Example code.

  • For display purposes, the Base64 encoded string "data:image/jpg;base64,/9j/4AAQSkZJRgABAQAAAQABAAD/2wBDAA..." in the code is truncated. In actual use, you must pass the complete encoded string.

# ======= Important =======
# The following base_url is for the Singapore region. If you use a model in the China (Beijing) region, replace the base_url with https://dashscope.aliyuncs.com/compatible-mode/v1/chat/completions
# The API keys for the Singapore and China (Beijing) regions are different. To obtain an API key, see https://www.alibabacloud.com/help/en/model-studio/get-api-key
# === Delete this comment before execution ===

curl -X POST https://dashscope-intl.aliyuncs.com/compatible-mode/v1/chat/completions \
-H "Authorization: Bearer $DASHSCOPE_API_KEY" \
-H 'Content-Type: application/json' \
-d '{
    "model": "qwen3-vl-plus",
    "messages": [{"role": "user",
                "content": [{"type": "video",
                "video": [
                          "data:image/jpeg;base64,/9j/4AAQSkZJRgABAQAAAQABAAD/2wBDAA...",
                          "data:image/jpeg;base64,nEpp6jpnP57MoWSyOWwrkXMJhHRCWYeFYb...",
                          "data:image/jpeg;base64,JHWQnJPc40GwQ7zERAtRMK6iIhnWw4080s...",
                          "data:image/jpeg;base64,adB6QOU5HP7dAYBBOg/Fb7KIptlbyEOu58..."
                          ]},
                {"type": "text",
                "text": "Describe the specific process of this video"}]}]
}'

DashScope

Python

import base64
import os
import dashscope

# The following base_url is for the Singapore region. If you use a model in the China (Beijing) region, replace the base_url with https://dashscope.aliyuncs.com/api/v1
dashscope.base_http_api_url = 'https://dashscope-intl.aliyuncs.com/api/v1'

#  Encoding function: Converts a local file to a Base64 encoded string
def encode_image(image_path):
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode("utf-8")

base64_image1 = encode_image("football1.jpg")
base64_image2 = encode_image("football2.jpg")
base64_image3 = encode_image("football3.jpg")
base64_image4 = encode_image("football4.jpg")


messages = [{'role':'user',
            'content': [
                    {'video':
                         [f"data:image/png;base64,{base64_image1}",
                          f"data:image/png;base64,{base64_image2}",
                          f"data:image/png;base64,{base64_image3}",
                          f"data:image/png;base64,{base64_image4}"
                         ]
                    },
                    {'text': 'Please describe the specific process of this video?'}]}]
response = dashscope.MultiModalConversation.call(
    # The API keys for the Singapore and China (Beijing) regions are different. To obtain an API key, see https://www.alibabacloud.com/help/en/model-studio/get-api-key
    api_key=os.getenv("DASHSCOPE_API_KEY"),
    model='qwen3-vl-plus',  # This example uses qwen3-vl-plus. You can replace the model name as needed. For a list of models, see https://www.alibabacloud.com/help/model-studio/getting-started/models
    messages=messages)

print(response.output.choices[0].message.content[0]["text"])

Java

import java.io.IOException;
import java.util.*;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

import com.alibaba.dashscope.aigc.multimodalconversation.*;
import com.alibaba.dashscope.common.MultiModalMessage;
import com.alibaba.dashscope.common.Role;
import com.alibaba.dashscope.exception.ApiException;
import com.alibaba.dashscope.exception.NoApiKeyException;
import com.alibaba.dashscope.exception.UploadFileException;
import com.alibaba.dashscope.utils.Constants;

public class Main {

    static {
        // The following base_url is for the Singapore region. If you use a model in the China (Beijing) region, replace the base_url with https://dashscope.aliyuncs.com/api/v1
        Constants.baseHttpApiUrl="https://dashscope-intl.aliyuncs.com/api/v1";
    }

    private static String encodeImageToBase64(String imagePath) throws IOException {
        Path path = Paths.get(imagePath);
        byte[] imageBytes = Files.readAllBytes(path);
        return Base64.getEncoder().encodeToString(imageBytes);
    }

    public static void videoImageListSample(String localPath1,String localPath2,String localPath3,String localPath4)
            throws ApiException, NoApiKeyException, UploadFileException, IOException {

        String base64Image1 = encodeImageToBase64(localPath1); // Base64 encoding
        String base64Image2 = encodeImageToBase64(localPath2);
        String base64Image3 = encodeImageToBase64(localPath3);
        String base64Image4 = encodeImageToBase64(localPath4);

        MultiModalConversation conv = new MultiModalConversation();
        Map<String, Object> params = new HashMap<>();
        params.put("video", Arrays.asList(
                        "data:image/jpeg;base64," + base64Image1,
                        "data:image/jpeg;base64," + base64Image2,
                        "data:image/jpeg;base64," + base64Image3,
                        "data:image/jpeg;base64," + base64Image4));
        // If the model belongs to the Qwen2.5-VL series and an image list is passed, you can set the fps parameter to indicate that the image list is extracted from the original video every 1/fps seconds. This setting is not effective for other models.
        params.put("fps", 2);
        MultiModalMessage userMessage = MultiModalMessage.builder()
                .role(Role.USER.getValue())
                .content(Arrays.asList(params,
                        Collections.singletonMap("text", "Describe the specific process of this video")))
                .build();

        MultiModalConversationParam param = MultiModalConversationParam.builder()
                // The API keys for the Singapore and China (Beijing) regions are different. To obtain an API key, see https://www.alibabacloud.com/help/en/model-studio/get-api-key
                .apiKey(System.getenv("DASHSCOPE_API_KEY"))
                .model("qwen3-vl-plus")
                .messages(Arrays.asList(userMessage))
                .build();

        MultiModalConversationResult result = conv.call(param);
        System.out.println(result.getOutput().getChoices().get(0).getMessage().getContent().get(0).get("text"));
    }

    public static void main(String[] args) {
        try {
            // Replace xxx/football1.jpg and the other paths with the absolute paths of your local images
            videoImageListSample(
                    "xxx/football1.jpg",
                    "xxx/football2.jpg",
                    "xxx/football3.jpg",
                    "xxx/football4.jpg"
            );
        } catch (ApiException | NoApiKeyException | UploadFileException | IOException e) {
            System.out.println(e.getMessage());
        }
        System.exit(0);
    }
}

curl

  • For more information about a method to convert a file to a Base64 encoded string, see Example code.

  • For display purposes, the Base64 encoded string "data:image/jpg;base64,/9j/4AAQSkZJRgABAQAAAQABAAD/2wBDAA..." in the code is truncated. In actual use, you must pass the complete encoded string.

# ======= Important =======
# The following base_url is for the Singapore region. If you use a model in the China (Beijing) region, replace the base_url with https://dashscope.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation
# The API keys for the Singapore and China (Beijing) regions are different. To obtain an API key, see https://www.alibabacloud.com/help/en/model-studio/get-api-key
# === Delete this comment before execution ===

curl -X POST https://dashscope-intl.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation \
-H "Authorization: Bearer $DASHSCOPE_API_KEY" \
-H 'Content-Type: application/json' \
-d '{
  "model": "qwen3-vl-plus",
  "input": {
    "messages": [
      {
        "role": "user",
        "content": [
          {
            "video": [
                      "data:image/jpeg;base64,/9j/4AAQSkZJRgABAQAAAQABAAD/2wBDAA...",
                      "data:image/jpeg;base64,nEpp6jpnP57MoWSyOWwrkXMJhHRCWYeFYb...",
                      "data:image/jpeg;base64,JHWQnJPc40GwQ7zERAtRMK6iIhnWw4080s...",
                      "data:image/jpeg;base64,adB6QOU5HP7dAYBBOg/Fb7KIptlbyEOu58..."
            ],
            "fps":2     
          },
          {
            "text": "Describe the specific process of this video"
          }
        ]
      }
    ]
  }
}'

Process high-resolution images

The Qwen-VL API has a limit on the number of visual tokens for a single image after encoding. With default configurations, high-resolution images are compressed, which may result in a loss of detail and affect understanding accuracy. You can enable vl_high_resolution_images or adjust max_pixels to increase the number of visual tokens, which preserves more image details and improves understanding.

Different models have different pixels per visual token, token limits, and pixel limits. The specific parameters are as follows:

| Model | Pixels per token | vl_high_resolution_images | max_pixels | Token limit | Pixel limit |
|---|---|---|---|---|---|
| Qwen3-VL series models | 32 × 32 | true | max_pixels is invalid | 16384 tokens | 16777216 (16384 × 32 × 32) |
| | | false (default) | Customizable, with a maximum value of 16777216 | The larger of 2560 tokens and max_pixels/32/32 | 2621440 or max_pixels |
| qwen-vl-max, qwen-vl-max-latest, qwen-vl-max-0813, qwen-vl-plus, qwen-vl-plus-latest, and qwen-vl-plus-0815 models | 32 × 32 | true | max_pixels is invalid | 16384 tokens | 16777216 (16384 × 32 × 32) |
| | | false (default) | Customizable, with a maximum value of 16777216 | The larger of 1280 tokens and max_pixels/32/32 | 1310720 or max_pixels |
| QVQ series and other Qwen2.5-VL models | 28 × 28 | true | max_pixels is invalid | 16384 tokens | 12845056 (16384 × 28 × 28) |
| | | false (default) | Customizable, with a maximum value of 12845056 | The larger of 1280 tokens and max_pixels/28/28 | 1003520 or max_pixels |

If the pixel limit is exceeded, the total pixels of the image are scaled down to this limit.

  • When vl_high_resolution_images=true, the API uses a fixed resolution strategy and ignores the max_pixels setting. This is suitable for recognizing fine text, small objects, or rich details in images.

  • When vl_high_resolution_images=false, the actual resolution is determined by both max_pixels and the default limit. The model uses the maximum of the two calculated results.

    • For high processing speed or cost-sensitive scenarios, you can use the default value of max_pixels or set it to a smaller value.

    • When some detail is important and a lower processing speed is acceptable, you can increase the value of max_pixels appropriately.

OpenAI compatible

vl_high_resolution_images is not a standard OpenAI parameter. If you use the OpenAI Python SDK, you can pass it through extra_body.

Python

import os
import time
from openai import OpenAI

client = OpenAI(
    # The API keys for the Singapore and China (Beijing) regions are different. To obtain an API key, see https://www.alibabacloud.com/help/en/model-studio/get-api-key
    api_key=os.getenv("DASHSCOPE_API_KEY"),
    # The following base_url is for the Singapore region. If you use a model in the China (Beijing) region, replace the base_url with https://dashscope.aliyuncs.com/compatible-mode/v1
    base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
)

completion = client.chat.completions.create(
    model="qwen3-vl-plus",
    messages=[
        {"role": "user","content": [
            {"type": "image_url","image_url": {"url": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20250212/earbrt/vcg_VCG211286867973_RF.jpg"},
            # max_pixels represents the maximum pixel threshold for the input image. It is invalid when vl_high_resolution_images=True, but customizable when vl_high_resolution_images=False. The maximum value varies by model.
            # "max_pixels": 16384 * 32 * 32
            },
           {"type": "text", "text": "What festival atmosphere does this picture show?"},
            ],
        }
    ],
    extra_body={"vl_high_resolution_images":True}

)
print(f"Model output: {completion.choices[0].message.content}")
print(f"Total input tokens: {completion.usage.prompt_tokens}")

Node.js

import OpenAI from "openai";

const openai = new OpenAI(
    {
        // The API keys for the Singapore and China (Beijing) regions are different. To obtain an API key, see https://www.alibabacloud.com/help/en/model-studio/get-api-key
        // If you have not configured the environment variable, replace the following line with your Model Studio API key: apiKey: "sk-xxx"
        apiKey: process.env.DASHSCOPE_API_KEY,
        // The following base_url is for the Singapore region. If you use a model in the China (Beijing) region, replace the base_url with https://dashscope.aliyuncs.com/compatible-mode/v1
        baseURL: "https://dashscope-intl.aliyuncs.com/compatible-mode/v1"
    }
);

const response = await openai.chat.completions.create({
        model: "qwen3-vl-plus",
        messages: [
        {role: "user",content: [
            {type: "image_url",
            image_url: {"url": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20250212/earbrt/vcg_VCG211286867973_RF.jpg"},
            // max_pixels represents the maximum pixel threshold for the input image. It is not effective when vl_high_resolution_images=True, but customizable when vl_high_resolution_images=False. The maximum value varies by model.
            // "max_pixels": 2560 * 32 * 32
            },
            {type: "text", text: "What festival atmosphere does this picture show?" },
        ]}],
        vl_high_resolution_images:true
    })


console.log("Model output:",response.choices[0].message.content);
console.log("Total input tokens",response.usage.prompt_tokens);

curl

# ======= Important =======
# The API keys for the Singapore and China (Beijing) regions are different. To obtain an API key, see https://www.alibabacloud.com/help/en/model-studio/get-api-key
# The following base_url is for the Singapore region. If you use a model in the China (Beijing) region, replace the base_url with https://dashscope.aliyuncs.com/compatible-mode/v1/chat/completions
# === Delete this comment before execution ===

curl -X POST https://dashscope-intl.aliyuncs.com/compatible-mode/v1/chat/completions \
-H "Authorization: Bearer $DASHSCOPE_API_KEY" \
-H 'Content-Type: application/json' \
-d '{
  "model": "qwen3-vl-plus",
  "messages": [
    {
      "role": "user",
      "content": [
        {
          "type": "image_url",
          "image_url": {
            "url": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20250212/earbrt/vcg_VCG211286867973_RF.jpg"
          }
        },
        {
          "type": "text",
          "text": "What festival atmosphere does this picture show?"
        }
      ]
    }
  ],
  "vl_high_resolution_images":true
}'

DashScope

Python

import os
import time

import dashscope

# The following base_url is for the Singapore region. If you use a model in the China (Beijing) region, replace the base_url with https://dashscope.aliyuncs.com/api/v1
dashscope.base_http_api_url = 'https://dashscope-intl.aliyuncs.com/api/v1'

messages = [
    {
        "role": "user",
        "content": [
            {"image": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20250212/earbrt/vcg_VCG211286867973_RF.jpg",
            # max_pixels represents the maximum pixel threshold for the input image. It is invalid when vl_high_resolution_images=True, but customizable when vl_high_resolution_images=False. The maximum value varies by model.
            # "max_pixels": 16384 * 32 * 32
            },
            {"text": "What festival atmosphere does this picture show?"}
        ]
    }
]

response = dashscope.MultiModalConversation.call(
        # If you have not configured the environment variable, replace the following line with your Model Studio API key: api_key="sk-xxx"
        # The API keys for the Singapore and China (Beijing) regions are different. To obtain an API key, see https://www.alibabacloud.com/help/en/model-studio/get-api-key
        api_key=os.getenv('DASHSCOPE_API_KEY'),
        model='qwen3-vl-plus',
        messages=messages,
        vl_high_resolution_images=True
    )
    
print("Model output",response.output.choices[0].message.content[0]["text"])
print("Total input tokens:",response.usage.input_tokens)

Java

import java.util.Arrays;
import java.util.Collections;
import java.util.Map;
import java.util.HashMap;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversation;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationParam;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationResult;
import com.alibaba.dashscope.common.MultiModalMessage;
import com.alibaba.dashscope.common.Role;
import com.alibaba.dashscope.exception.ApiException;
import com.alibaba.dashscope.exception.NoApiKeyException;
import com.alibaba.dashscope.exception.UploadFileException;
import com.alibaba.dashscope.utils.Constants;

public class Main {

    static {
        // The following base_url is for the Singapore region. If you use a model in the China (Beijing) region, replace the base_url with https://dashscope.aliyuncs.com/api/v1
        Constants.baseHttpApiUrl="https://dashscope-intl.aliyuncs.com/api/v1";
    }
    
    public static void simpleMultiModalConversationCall()
            throws ApiException, NoApiKeyException, UploadFileException {
        MultiModalConversation conv = new MultiModalConversation();
        Map<String, Object> map = new HashMap<>();
        map.put("image", "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20250212/earbrt/vcg_VCG211286867973_RF.jpg");
        // max_pixels represents the maximum pixel threshold for the input image. It is invalid when vl_high_resolution_images=True, but customizable when vl_high_resolution_images=False. The maximum value varies by model.
        // map.put("max_pixels", 2621440);
        MultiModalMessage userMessage = MultiModalMessage.builder().role(Role.USER.getValue())
                .content(Arrays.asList(
                        map,
                        Collections.singletonMap("text", "What festival atmosphere does this picture show?"))).build();
        MultiModalConversationParam param = MultiModalConversationParam.builder()
                // If you have not configured the environment variable, replace the following line with your Model Studio API key: .apiKey("sk-xxx")
                .apiKey(System.getenv("DASHSCOPE_API_KEY"))
                .model("qwen3-vl-plus")
                .message(userMessage)
                .vlHighResolutionImages(true)
                .build();
        MultiModalConversationResult result = conv.call(param);
        System.out.println(result.getOutput().getChoices().get(0).getMessage().getContent().get(0).get("text"));
        System.out.println(result.getUsage().getInputTokens());
    }

    public static void main(String[] args) {
        try {
            simpleMultiModalConversationCall();
        } catch (ApiException | NoApiKeyException | UploadFileException e) {
            System.out.println(e.getMessage());
        }
        System.exit(0);
    }
}

curl

# ======= Important =======
# The API keys for the Singapore and China (Beijing) regions are different. To obtain an API key, see https://www.alibabacloud.com/help/en/model-studio/get-api-key
# The following base_url is for the Singapore region. If you use a model in the China (Beijing) region, replace the base_url with https://dashscope.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation
# === Delete this comment before execution ===

curl -X POST https://dashscope-intl.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation \
-H "Authorization: Bearer $DASHSCOPE_API_KEY" \
-H 'Content-Type: application/json' \
-d '{
    "model": "qwen3-vl-plus",
    "input":{
        "messages":[
            {
             "role": "user",
             "content": [
               {"image": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20250212/earbrt/vcg_VCG211286867973_RF.jpg"},
               {"text": "What festival atmosphere does this picture show?"}
                ]
            }
        ]
    },
    "parameters": {
        "vl_high_resolution_images": true
    }
}'

More usage

Limits

Input file limits

Image limits

  • Image resolution:

    • Minimum size: The width and height of the image must both be greater than 10 pixels.

    • Aspect ratio: The ratio of the long side to the short side of the image cannot exceed 200:1.

    • Pixel limit:

      • We recommend that you keep the image resolution within 8K (7680 × 4320). Images that exceed this resolution may cause API call timeouts because of large file sizes and long network transmission times.

      • Automatic scaling mechanism: The model automatically scales the input image before processing. Therefore, providing ultra-high-resolution images does not improve recognition accuracy but increases the risk of call failure. We recommend that you scale the image to a reasonable size on the client side in advance.

  • Supported image formats

    • For resolutions below 4K (3840 × 2160), the supported image formats are as follows:

      | Image format | Common extensions | MIME Type |
      |---|---|---|
      | BMP | .bmp | image/bmp |
      | JPEG | .jpe, .jpeg, .jpg | image/jpeg |
      | PNG | .png | image/png |
      | TIFF | .tif, .tiff | image/tiff |
      | WEBP | .webp | image/webp |
      | HEIC | .heic | image/heic |

    • For resolutions between 4K (3840 × 2160) and 8K (7680 × 4320), only the JPEG, JPG, and PNG formats are supported.

  • Image size:

    • When passed as a public URL or local path: The size of a single image cannot exceed 10 MB.

    • When passed as a Base64 encoding: The encoded string cannot exceed 10 MB.

    For more information about how to compress the file size, see How to compress an image or video to the required size.
  • Number of supported images: When you pass multiple images, the number of images is limited by the model's maximum input. The total number of tokens for all images and text must be less than the model's maximum input.

    For example, if you use the qwen3-vl-plus model in thinking mode, the maximum input is 258,048 tokens. If the input text is converted to 100 tokens and the image is converted to 2,560 tokens (for more information about how to calculate image tokens, see Billing and throttling), you can pass a maximum of (258048 - 100) / 2560 = 100 images.
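
The following is a minimal sketch, assuming Pillow is installed, that checks a local image against the limits above and repeats this example calculation. The file name is a placeholder; the 258,048-token limit and 2,560 tokens per image are the example figures from this section.

    # pip install Pillow
    from PIL import Image

    def check_image(path):
        with Image.open(path) as img:
            width, height = img.size
        # Width and height must both be greater than 10 pixels
        assert min(width, height) > 10, "image is too small"
        # The ratio of the long side to the short side must not exceed 200:1
        assert max(width, height) / min(width, height) <= 200, "aspect ratio exceeds 200:1"
        # Keep the resolution within 8K (7680 x 4320) to reduce the risk of call timeouts
        if width * height > 7680 * 4320:
            print("Warning: consider scaling this image down before uploading")
        return width, height

    print(check_image("football1.jpg"))  # placeholder local image

    # Example figures from this section: 258,048-token input limit (qwen3-vl-plus in
    # thinking mode), 100 text tokens, 2,560 tokens per image
    max_images = (258048 - 100) // 2560
    print(max_images)  # 100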

Video limits

  • Video size:

    • When passed as a public URL:

      • Qwen3-VL series, qwen-vl-max (including qwen-vl-max-latest, qwen-vl-max-2025-04-08, and all subsequent versions): Cannot exceed 2 GB.

      • qwen-vl-plus series, other qwen-vl-max models, Qwen2.5-VL open source series, and QVQ series models: Cannot exceed 1 GB.

      • Other models cannot exceed 150 MB.

    • When passed as a Base64 encoding: The encoded string must be less than 10 MB.

    • When passed as a local file path: The video itself cannot exceed 100 MB.

    For more information about how to compress the file size, see How to compress an image or video to the required size.
  • Video duration:

    • qwen3-vl-plus series: 2 seconds to 1 hour.

    • qwen3-vl-flash series, Qwen3-VL open source series, qwen-vl-max (including qwen-vl-max-latest, qwen-vl-max-2025-04-08, and all subsequent versions): 2 seconds to 20 minutes.

    • qwen-vl-plus series, other qwen-vl-max models, Qwen2.5-VL open source series, and QVQ series models: 2 seconds to 10 minutes.

    • Other models: 2 seconds to 40 seconds.

  • Video format: MP4, AVI, MKV, MOV, FLV, WMV, and more.

  • Video dimensions: No specific limit. The model adjusts the video to about 600,000 pixels before processing. Larger video files do not result in better understanding.

  • Audio understanding: The model does not support understanding the audio of video files.

File passing methods

  • Public URL: You can provide a publicly accessible file address that supports the HTTP or HTTPS protocol. For optimal stability and performance, you can upload the file to OSS and obtain a public URL.

    Important

    To ensure that the model can successfully download the file, the HTTP response headers returned for the provided public URL must include Content-Length (file size) and Content-Type (media type, such as image/jpeg). Missing or incorrect fields cause the file download to fail. A quick way to check these headers is sketched after this list.

  • Pass by Base64 encoding: You can convert the file to a Base64 encoded string and then pass it.

  • Pass by local file path (DashScope SDK only): You can pass the path of the local file.

For recommendations on file passing methods, see How to choose a file upload method?
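
As a quick check of the header requirement above, the following sketch sends a HEAD request with the requests library and prints the Content-Length and Content-Type of a public URL. The URL is a placeholder, and some servers answer HEAD requests differently from GET, so treat this only as a first check.

# pip install requests
import requests

url = "https://example.com/your-image.jpg"  # placeholder public URL

# HEAD request: fetch only the response headers, not the file body
resp = requests.head(url, allow_redirects=True, timeout=10)
resp.raise_for_status()

print("Content-Length:", resp.headers.get("Content-Length"))
print("Content-Type:", resp.headers.get("Content-Type"))
if not resp.headers.get("Content-Length") or not resp.headers.get("Content-Type"):
    print("Warning: missing headers; the model may fail to download this file")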

Going live

  • Image/video pre-processing: Qwen-VL has size limits for input files. For more information about how to compress files, see Image or video compression methods.

  • Process text files: Qwen-VL supports processing files only in image format and cannot directly process text files. However, you can use the following alternative solutions:

    • Convert the text file to an image format. We recommend that you use an image processing library, such as Python's pdf2image, to convert the file page by page into multiple high-quality images, and then pass them to the model using the multiple image input method. A minimal sketch is shown at the end of this section.

    • Qwen-Long supports processing text files and can be used to parse file content.

  • Fault tolerance and stability

    • Timeout handling: In non-streaming calls, if the model does not finish outputting within 180 seconds, a timeout error is usually triggered. To improve the user experience, the generated content is returned in the response body after a timeout. If the response header contains x-dashscope-partialresponse: true, it indicates that this response triggered a timeout. You can use the partial mode feature, which is supported by some models, to add the generated content to the messages array and send the request again. This allows the large model to continue generating content. For more information, see Continue writing based on incomplete output.

    • Retry mechanism: You can design a reasonable API call retry logic, such as exponential backoff, to handle network fluctuations or temporary service unavailability.
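
The sketch below illustrates the pdf2image approach mentioned above: it converts a PDF page by page into JPEG images that can then be passed as multiple image inputs. The file names are placeholders, and pdf2image also requires the poppler utilities to be installed.

# pip install pdf2image  (also requires the poppler utilities)
from pdf2image import convert_from_path

# Convert each page of the PDF into a PIL image; 200 DPI is usually enough for text
pages = convert_from_path("report.pdf", dpi=200)

image_paths = []
for i, page in enumerate(pages, start=1):
    path = f"report_page_{i}.jpg"
    page.convert("RGB").save(path, "JPEG", quality=90)
    image_paths.append(path)

print(image_paths)  # pass these files to the model as multiple image inputs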

Billing and throttling

  • Billing: The total cost is calculated based on the total number of input and output tokens. For more information about input and output prices, see Model list.

    • Token composition: Input tokens consist of text tokens and tokens that are converted from images or videos. Output tokens are the text generated by the model. In thinking mode, the model's thinking process is also counted as output tokens. If the thinking process is not output in thinking mode, billing is based on the non-thinking mode price.

    • Calculate image and video tokens: You can use the following code to calculate the token consumption of an image or video. The estimated result is for reference only. The actual usage is subject to the API response.

      Calculate image and video tokens

      Image

      Calculation formula: Image Token = h_bar * w_bar / token_pixels + 2

      • h_bar, w_bar: The length and width of the scaled image. The model pre-processes the image before processing and scales it down to a specific pixel limit. The pixel limit is related to the values of the max_pixels and vl_high_resolution_images parameters. For more information, see Process high-resolution images.

      • token_pixels: The pixel value corresponding to each visual token, which varies for different models:

        • Qwen3-VL, qwen-vl-max, qwen-vl-max-latest, qwen-vl-max-2025-08-13, qwen-vl-plus, qwen-vl-plus-latest, and qwen-vl-plus-2025-08-15: Each token corresponds to 32 × 32 pixels.

        • QVQ and other Qwen2.5-VL models: Each token corresponds to 28 × 28 pixels.

      The following code demonstrates the approximate image scaling logic within the model, which can be used to estimate the token count of an image. The actual billing is subject to the API response.

      import math
      # Use the following command to install the Pillow library: pip install Pillow
      from PIL import Image
      
      def token_calculate(image_path, max_pixels, vl_high_resolution_images):
          # Open the specified image file
          image = Image.open(image_path)
      
          # Get the original dimensions of the image
          height = image.height
          width = image.width
      
          # Adjust the width and height to be multiples of 32 or 28, depending on the model
          h_bar = round(height / 32) * 32
          w_bar = round(width / 32) * 32
      
          # Lower limit for image tokens: 4 tokens
          min_pixels = 4 * 32 * 32
          # If vl_high_resolution_images is set to True, the upper limit for input image tokens is 16384, and the corresponding maximum pixel value is 16384 * 32 * 32 or 16384 * 28 * 28. Otherwise, it is the value set for max_pixels.
          if vl_high_resolution_images:
              max_pixels = 16384 * 32 * 32
          else:
              max_pixels = max_pixels
      
          # Scale the image so that the total number of pixels is within the range [min_pixels, max_pixels]
          if h_bar * w_bar > max_pixels:
              # Calculate the scaling factor beta so that the total pixels of the scaled image do not exceed max_pixels
              beta = math.sqrt((height * width) / max_pixels)
              # Recalculate the adjusted width and height
              h_bar = math.floor(height / beta / 32) * 32
              w_bar = math.floor(width / beta / 32) * 32
          elif h_bar * w_bar < min_pixels:
              # Calculate the scaling factor beta so that the total pixels of the scaled image are not less than min_pixels
              beta = math.sqrt(min_pixels / (height * width))
              # Recalculate the adjusted height
              h_bar = math.ceil(height * beta / 32) * 32
              w_bar = math.ceil(width * beta / 32) * 32
          return h_bar, w_bar
      
      if __name__ == "__main__":
          # Replace xxx/test.jpg with the path to your local image
          h_bar, w_bar = token_calculate("xxx/test.jpg", vl_high_resolution_images=False, max_pixels=16384 * 32 * 32)
          print(f"The scaled image dimensions are: height {h_bar}, width {w_bar}")
          # The system automatically adds <vision_bos> and <vision_eos> visual markers (1 token each).
          # This example uses the 32 x 32 pixels-per-token factor; for QVQ and Qwen2.5-VL models, replace 32 with 28 throughout this script.
          token = int((h_bar * w_bar) / (32 * 32)) + 2
          print(f"The number of tokens for the image is {token}")

      Video

      When the model processes a video file, it first extracts frames and then calculates the total number of tokens for all video frames. Because this calculation process is complex, you can use the following code to estimate the total token consumption of a video by passing the video path:

      # Install before use: pip install opencv-python
      import math
      import os
      import logging
      import cv2
      
      logger = logging.getLogger(__name__)
      
      FRAME_FACTOR = 2
      
      # For Qwen3-VL, qwen-vl-max-0813, qwen-vl-plus-0815, and qwen-vl-plus-0710 models, the image scaling factor is 32
      IMAGE_FACTOR = 32
      
      #  For other models, the image scaling factor is 28
      # IMAGE_FACTOR = 28
      
      # Maximum aspect ratio for video frames
      MAX_RATIO = 200
      # Lower pixel limit for video frames
      VIDEO_MIN_PIXELS = 4 * 32 * 32
      # Upper pixel limit for video frames. For the Qwen3-VL-Plus model, VIDEO_MAX_PIXELS is 640 * 32 * 32. For other models, it is 768 * 32 * 32.
      VIDEO_MAX_PIXELS = 640 * 32 * 32
      
      # If the user does not pass the FPS parameter, the default value is used
      FPS = 2.0
      # Minimum number of frames to extract
      FPS_MIN_FRAMES = 4
      # Maximum number of frames to extract. For the Qwen3-VL-Plus model, set FPS_MAX_FRAMES to 2000. For Qwen3-VL-Flash and Qwen2.5-VL models, set it to 512. For other models, set it to 80.
      FPS_MAX_FRAMES = 2000
      
      # Maximum pixel value for video input. For the Qwen3-VL-Plus model, set VIDEO_TOTAL_PIXELS to 131072 * 32 * 32. For other models, set it to 65536 * 32 * 32.
      VIDEO_TOTAL_PIXELS = int(float(os.environ.get('VIDEO_MAX_PIXELS', 131072 * 32 * 32)))
      
      def round_by_factor(number: int, factor: int) -> int:
          """Returns the integer closest to 'number' that is divisible by 'factor'."""
          return round(number / factor) * factor
      
      def ceil_by_factor(number: int, factor: int) -> int:
          """Returns the smallest integer greater than or equal to 'number' that is divisible by 'factor'."""
          return math.ceil(number / factor) * factor
      
      def floor_by_factor(number: int, factor: int) -> int:
          """Returns the largest integer less than or equal to 'number' that is divisible by 'factor'."""
          return math.floor(number / factor) * factor
      
      def smart_nframes(ele,total_frames,video_fps):
          """Calculates the number of video frames to extract.
      
          Args:
              ele (dict): A dictionary containing video configuration.
                  - fps: fps is used to control the number of input frames extracted by the model.
              total_frames (int): The original total number of frames in the video.
              video_fps (int | float): The original frame rate of the video.
      
          Raises:
              An error is reported if nframes is not within the interval [FRAME_FACTOR, total_frames].
      
          Returns:
              The number of video frames for model input.
          """
          assert not ("fps" in ele and "nframes" in ele), "Only accept either `fps` or `nframes`"
          fps = ele.get("fps", FPS)
          min_frames = ceil_by_factor(ele.get("min_frames", FPS_MIN_FRAMES), FRAME_FACTOR)
          max_frames = floor_by_factor(ele.get("max_frames", min(FPS_MAX_FRAMES, total_frames)), FRAME_FACTOR)
          duration = total_frames / video_fps if video_fps != 0 else 0
          if duration-int(duration)>(1/fps):
              total_frames = math.ceil(duration * video_fps)
          else:
              total_frames = math.ceil(int(duration)*video_fps)
          nframes = total_frames / video_fps * fps
          if nframes > total_frames:
              logger.warning(f"smart_nframes: nframes[{nframes}] > total_frames[{total_frames}]")
          nframes = int(min(min(max(nframes, min_frames), max_frames), total_frames))
          if not (FRAME_FACTOR <= nframes and nframes <= total_frames):
              raise ValueError(f"nframes should in interval [{FRAME_FACTOR}, {total_frames}], but got {nframes}.")
      
          return nframes
      
      def get_video(video_path):
          # Get video information
          cap = cv2.VideoCapture(video_path)
      
          frame_width = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
          # Get video height
          frame_height = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
          total_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
      
          video_fps = cap.get(cv2.CAP_PROP_FPS)
          return frame_height,frame_width,total_frames,video_fps
      
      def smart_resize(ele,path,factor = IMAGE_FACTOR):
          # Get the width and height of the original video
          height, width, total_frames, video_fps = get_video(path)
          # Lower token limit for video frames
          min_pixels = VIDEO_MIN_PIXELS
          total_pixels = VIDEO_TOTAL_PIXELS
          # Number of extracted video frames
          nframes = smart_nframes(ele, total_frames, video_fps)
          max_pixels = max(min(VIDEO_MAX_PIXELS, total_pixels / nframes * FRAME_FACTOR),int(min_pixels * 1.05))
      
          # The aspect ratio of the video should not exceed 200:1 or 1:200
          if max(height, width) / min(height, width) > MAX_RATIO:
              raise ValueError(
                  f"absolute aspect ratio must be smaller than {MAX_RATIO}, got {max(height, width) / min(height, width)}"
              )
      
          h_bar = max(factor, round_by_factor(height, factor))
          w_bar = max(factor, round_by_factor(width, factor))
          if h_bar * w_bar > max_pixels:
              beta = math.sqrt((height * width) / max_pixels)
              h_bar = floor_by_factor(height / beta, factor)
              w_bar = floor_by_factor(width / beta, factor)
          elif h_bar * w_bar < min_pixels:
              beta = math.sqrt(min_pixels / (height * width))
              h_bar = ceil_by_factor(height * beta, factor)
              w_bar = ceil_by_factor(width * beta, factor)
          return h_bar, w_bar
      
      
      def token_calculate(video_path, fps):
          # Pass the video path and fps frame extraction parameter
          messages = [{"content": [{"video": video_path, "fps":fps}]}]
          vision_infos = extract_vision_info(messages)[0]
      
          resized_height, resized_width=smart_resize(vision_infos,video_path)
      
          height, width, total_frames,video_fps = get_video(video_path)
          num_frames = smart_nframes(vision_infos,total_frames,video_fps)
          print(f"Original video dimensions: {height}*{width}, input model dimensions: {resized_height}*{resized_width}, total video frames: {total_frames}, total frames extracted when fps is {fps}: {num_frames}",end=",")
          video_token = int(math.ceil(num_frames / 2) * resized_height / 32 * resized_width / 32)
          video_token += 2 # The system will automatically add <|vision_bos|> and <|vision_eos|> visual markers (1 token each)
          return video_token
      
      def extract_vision_info(conversations):
          vision_infos = []
          if isinstance(conversations[0], dict):
              conversations = [conversations]
          for conversation in conversations:
              for message in conversation:
                  if isinstance(message["content"], list):
                      for ele in message["content"]:
                          if (
                              "image" in ele
                              or "image_url" in ele
                              or "video" in ele
                              or ele.get("type","") in ("image", "image_url", "video")
                          ):
                              vision_infos.append(ele)
          return vision_infos
      
      
      video_token = token_calculate("xxx/test.mp4", 1)
      print("Video tokens:", video_token)
  • View bills: You can view your bills or top up your account on the Expenses and Costs page in the Alibaba Cloud Management Console.

  • Throttling: For more information about the throttling conditions of the Qwen-VL model, see Throttling.

  • Free quota (Singapore region only): A free quota of 1 million tokens is provided for the Qwen-VL model, valid for 90 days from the date of activating Model Studio or model application approval.

API reference

For more information about the input and output parameters of the Qwen-VL model, see Qwen.

FAQ

How to choose a file upload method?

We recommend that you choose the most suitable upload method based on a combination of the SDK type, file size, and network stability.

| File type | File specifications | DashScope SDK (Python, Java) | OpenAI compatible / DashScope HTTP |
|---|---|---|---|
| Image | Greater than 7 MB and less than 10 MB | Pass local path | Only public URLs are supported. We recommend using Alibaba Cloud Object Storage Service. |
| Image | Less than 7 MB | Pass local path | Base64 encoding |
| Video | Greater than 100 MB | Only public URLs are supported. We recommend using Alibaba Cloud Object Storage Service. | Only public URLs are supported. We recommend using Alibaba Cloud Object Storage Service. |
| Video | Greater than 7 MB and less than 100 MB | Pass local path | Only public URLs are supported. We recommend using Alibaba Cloud Object Storage Service. |
| Video | Less than 7 MB | Pass local path | Base64 encoding |

Base64 encoding increases the data volume. The original file size should be less than 7 MB.
Using Base64 or a local path can help avoid server-side download timeouts and improve stability.
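
To check whether a file still fits within the limit after encoding, you can measure the Base64 size directly. This is a minimal sketch; the file name is a placeholder, and Base64 output is roughly 4/3 of the original size.

import base64
import os

path = "football1.jpg"  # placeholder local file
with open(path, "rb") as f:
    encoded_len = len(base64.b64encode(f.read()))
print(f"original: {os.path.getsize(path)} bytes, Base64: {encoded_len} bytes")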

How to compress an image or video to the required size?

Qwen-VL has size limits for input files. You can compress them using the following methods.

Image compression methods

  • Online tools: You can use online tools such as CompressJPEG or TinyPng for compression.

  • Local software: You can use software such as Photoshop and adjust the quality when you export.

  • Code implementation:

    # pip install pillow
    
    from PIL import Image
    def compress_image(input_path, output_path, quality=85):
        with Image.open(input_path) as img:
            img.save(output_path, "JPEG", optimize=True, quality=quality)
    
    # Pass a local image
    compress_image("/xxx/before-large.jpeg","/xxx/after-min.jpeg")

Video compression methods

  • Online tools: You can use online tools such as FreeConvert.

  • Local software: You can use software such as HandBrake.

  • Code implementation: You can use the FFmpeg tool. For more usage information, see the FFmpeg official website.

    # Basic conversion command
    # -i, function: input file path, common value example: input.mp4
    # -vcodec, function: video encoder, common values include libx264 (recommended for general use), libx265 (higher compression rate)
    # -crf, function: controls video quality, value range: [18-28], the smaller the value, the higher the quality and the larger the file size.
    # -preset, function: controls the balance between encoding speed and compression efficiency. Common values include slow, fast, faster
    # -y, function: overwrite existing file (no value needed)
    # output.mp4, function: output file path
    
    ffmpeg -i input.mp4 -vcodec libx264 -crf 28 -preset slow output.mp4

After the model outputs object detection results, how can I draw the detection boxes on the original image?

After the Qwen-VL model outputs object detection results, you can refer to the following code to draw the detection boxes and their label information on the original image.

  • Qwen2.5-VL: The returned coordinates are absolute values relative to the top-left corner of the scaled image, in pixels. You can refer to the qwen2_5_vl_2d.py code to draw the detection boxes.

  • Qwen3-VL: The returned coordinates are relative coordinates that are normalized to the range [0, 999]. You can refer to the code in qwen3_vl_2d.py (for 2D localization) or qwen3_vl_3d.zip (for 3D localization) to draw the detection boxes.
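
If you do not want to use the reference scripts, the following is a minimal sketch for the Qwen3-VL case. It assumes you have already parsed the model output into a list of {"label": ..., "bbox": [x1, y1, x2, y2]} entries with coordinates normalized to [0, 999], maps them back to pixels on the original image, and draws the boxes with Pillow. The detection values shown are hypothetical.

# pip install Pillow
from PIL import Image, ImageDraw

def draw_boxes(image_path, detections, output_path="annotated.jpg"):
    img = Image.open(image_path).convert("RGB")
    draw = ImageDraw.Draw(img)
    w, h = img.size
    for det in detections:
        x1, y1, x2, y2 = det["bbox"]
        # Map coordinates normalized to [0, 999] back to pixel coordinates
        box = (x1 / 999 * w, y1 / 999 * h, x2 / 999 * w, y2 / 999 * h)
        draw.rectangle(box, outline="red", width=3)
        draw.text((box[0], max(box[1] - 12, 0)), det.get("label", ""), fill="red")
    img.save(output_path)

# Hypothetical detections parsed from a Qwen3-VL response
draw_boxes("test.jpg", [{"label": "dog", "bbox": [120, 300, 480, 900]}])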

Error codes

If a call fails, see Error messages for troubleshooting.