
Alibaba Cloud Model Studio: Image and video understanding

Last Updated: Mar 26, 2026

A visual understanding model generates responses from one or more images or videos, performing tasks such as image captioning, visual question answering, and object localization.

Supported regions: China (Beijing), China (Hong Kong), Germany (Frankfurt), Singapore, and US (Virginia). Each region has its own API key and endpoint.

Try it online: Go to the Alibaba Cloud Model Studio console, select a region in the top-right corner, and navigate to the Vision page.

Quick start

Prerequisites

You must have obtained an API key and set it as the DASHSCOPE_API_KEY environment variable. API keys are region-specific. If you call the models through an SDK, install the latest OpenAI or DashScope SDK.

The following examples show how to call a model to describe image content. For information about passing local files and image limits, see Pass a local file and Image limits.

OpenAI compatible

Python

from openai import OpenAI
import os

client = OpenAI(
    # If you have not set the environment variable, replace the following line with your Model Studio API key: api_key="sk-xxx",
    # API keys are region-specific. To get an API key, visit: https://www.alibabacloud.com/help/en/model-studio/get-api-key
    api_key=os.getenv("DASHSCOPE_API_KEY"),
    # Configurations are region-specific. Modify the base_url accordingly.
    base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1"
)

completion = client.chat.completions.create(
    model="qwen3.5-plus",  # This example uses the qwen3.5-plus model. You can replace it as needed. For a list of models, see: https://www.alibabacloud.com/help/model-studio/getting-started/models
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241022/emyrja/dog_and_girl.jpeg"
                    },
                },
                {"type": "text", "text": "What is depicted in the image?"},
            ],
        },
    ],
)
print(completion.choices[0].message.content)

Response

This is a photo taken on a beach. In the photo, a person and a dog are sitting on the sand, with the sea and sky in the background. The person and dog appear to be interacting, with the dog's front paw resting on the person's hand. Sunlight is coming from the right side of the frame, adding a warm atmosphere to the scene.

Node.js

import OpenAI from "openai";

const openai = new OpenAI({
  // API keys are region-specific. To get an API key, visit: https://www.alibabacloud.com/help/en/model-studio/get-api-key
  // If you have not set the environment variable, replace the following line with your Model Studio API key: apiKey: "sk-xxx"
  apiKey: process.env.DASHSCOPE_API_KEY,
  // Configurations are region-specific. Modify the baseURL accordingly.
  baseURL: "https://dashscope-intl.aliyuncs.com/compatible-mode/v1"
});

async function main() {
  const response = await openai.chat.completions.create({
    model: "qwen3.5-plus",   // This example uses the qwen3.5-plus model. You can replace it as needed. For a list of models, see: https://www.alibabacloud.com/help/model-studio/getting-started/models 
    messages: [
      {
        role: "user",
        content: [{
            type: "image_url",
            image_url: {
              "url": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241022/emyrja/dog_and_girl.jpeg"
            }
          },
          {
            type: "text",
            text: "What is depicted in the image?"
          }
        ]
      }
    ]
  });
  console.log(response.choices[0].message.content);
}
main()

Response

This is a photo taken on a beach. In the photo, a person and a dog are sitting on the sand, with the sea and sky in the background. The person and dog appear to be interacting, with the dog's front paw resting on the person's hand. Sunlight is coming from the right side of the frame, adding a warm atmosphere to the scene.

curl

# ======= Important =======
# Configurations are region-specific. Modify the URL accordingly.
# API keys are region-specific. To get an API key, visit: https://www.alibabacloud.com/help/en/model-studio/get-api-key
# === Delete this comment before you run the command ===

curl --location 'https://dashscope-intl.aliyuncs.com/compatible-mode/v1/chat/completions' \
--header "Authorization: Bearer $DASHSCOPE_API_KEY" \
--header 'Content-Type: application/json' \
--data '{
  "model": "qwen3.5-plus",
  "messages": [
    {"role": "user",
     "content": [
        {"type": "image_url", "image_url": {"url": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241022/emyrja/dog_and_girl.jpeg"}},
        {"type": "text", "text": "What is depicted in the image?"}
    ]
  }]
}'

Response

{
  "choices": [
    {
      "message": {
        "content": "This is a photo taken on a beach. In the photo, a person and a dog are sitting on the sand, with the sea and sky in the background. The person and dog appear to be interacting, with the dog's front paw resting on the person's hand. Sunlight is coming from the right side of the frame, adding a warm atmosphere to the scene.",
        "role": "assistant"
      },
      "finish_reason": "stop",
      "index": 0,
      "logprobs": null
    }
  ],
  "object": "chat.completion",
  "usage": {
    "prompt_tokens": 1270,
    "completion_tokens": 54,
    "total_tokens": 1324
  },
  "created": 1725948561,
  "system_fingerprint": null,
  "model": "qwen3.5-plus",
  "id": "chatcmpl-0fd66f46-b09e-9164-a84f-3ebbbedbac15"
}

DashScope

Python

import os
import dashscope

# Configurations are region-specific. Modify the base_http_api_url accordingly.
dashscope.base_http_api_url = 'https://dashscope-intl.aliyuncs.com/api/v1'

messages = [
{
    "role": "user",
    "content": [
    {"image": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241022/emyrja/dog_and_girl.jpeg"},
    {"text": "What is depicted in the image?"}]
}]

response = dashscope.MultiModalConversation.call(
    # API keys are region-specific. To get an API key, visit: https://www.alibabacloud.com/help/en/model-studio/get-api-key
    # If you have not set the environment variable, replace the following line with your Model Studio API key: api_key="sk-xxx"
    api_key=os.getenv('DASHSCOPE_API_KEY'),
    model='qwen3.5-plus',   # This example uses the qwen3.5-plus model. You can replace it as needed. For a list of models, see: https://www.alibabacloud.com/help/model-studio/getting-started/models
    messages=messages
)

print(response.output.choices[0].message.content[0]["text"])

Response

This is a photo taken on a beach. The photo shows a woman and a dog. The woman is sitting on the sand, smiling, and interacting with the dog. The dog is wearing a collar and appears to be shaking hands with the woman. The sea and sky are in the background, and the sunlight shining on them creates a warm atmosphere.

Java

import java.util.Arrays;
import java.util.Collections;

import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversation;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationParam;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationResult;
import com.alibaba.dashscope.common.MultiModalMessage;
import com.alibaba.dashscope.common.Role;
import com.alibaba.dashscope.exception.ApiException;
import com.alibaba.dashscope.exception.NoApiKeyException;
import com.alibaba.dashscope.exception.UploadFileException;
import com.alibaba.dashscope.utils.Constants;

public class Main {
    
    // Configurations are region-specific. Modify the Constants.baseHttpApiUrl accordingly.
    static {
        Constants.baseHttpApiUrl="https://dashscope-intl.aliyuncs.com/api/v1";
    }
    
    public static void simpleMultiModalConversationCall()
            throws ApiException, NoApiKeyException, UploadFileException {
        MultiModalConversation conv = new MultiModalConversation(); 
        MultiModalMessage userMessage = MultiModalMessage.builder().role(Role.USER.getValue())
                .content(Arrays.asList(
                        Collections.singletonMap("image", "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241022/emyrja/dog_and_girl.jpeg"),
                        Collections.singletonMap("text", "What is depicted in the image?"))).build();
        MultiModalConversationParam param = MultiModalConversationParam.builder()
                // API keys are region-specific. To get an API key, visit: https://www.alibabacloud.com/help/en/model-studio/get-api-key
                // If you have not set the environment variable, replace the following line with your Model Studio API key: .apiKey("sk-xxx")
                .apiKey(System.getenv("DASHSCOPE_API_KEY"))
                .model("qwen3.5-plus")  //  This example uses the qwen3.5-plus model. You can replace it as needed. For a list of models, see: https://www.alibabacloud.com/help/model-studio/getting-started/models
                .messages(Arrays.asList(userMessage))
                .build();
        MultiModalConversationResult result = conv.call(param);
        System.out.println(result.getOutput().getChoices().get(0).getMessage().getContent().get(0).get("text"));
    }
    public static void main(String[] args) {
        try {
            simpleMultiModalConversationCall();
        } catch (ApiException | NoApiKeyException | UploadFileException e) {
            System.out.println(e.getMessage());
        }
        System.exit(0);
    }
}

Response

This is a photo taken on a beach. The photo shows a person in a plaid shirt and a dog with a collar. The person and the dog are sitting face-to-face, seemingly interacting. The sea and sky are in the background, and the sunlight shining on them creates a warm atmosphere.

curl

# ======= Important =======
# Configurations are region-specific. Modify the URL accordingly.
# API keys are region-specific. To get an API key, visit: https://www.alibabacloud.com/help/en/model-studio/get-api-key
# === Delete this comment before you run the command ===

curl -X POST https://dashscope-intl.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation \
-H "Authorization: Bearer $DASHSCOPE_API_KEY" \
-H 'Content-Type: application/json' \
-d '{
    "model": "qwen3.5-plus",
    "input":{
        "messages":[
            {
                "role": "user",
                "content": [
                    {"image": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241022/emyrja/dog_and_girl.jpeg"},
                    {"text": "What is depicted in the image?"}
                ]
            }
        ]
    }
}'

Response

{
  "output": {
    "choices": [
      {
        "finish_reason": "stop",
        "message": {
          "role": "assistant",
          "content": [
            {
              "text": "This is a photo taken on a beach. In the photo, a person in a plaid shirt and a dog with a collar are sitting on the sand with the sea and sky in the background. Sunlight from the right side of the frame adds a warm atmosphere to the scene."
            }
          ]
        }
      }
    ]
  },
  "usage": {
    "output_tokens": 55,
    "input_tokens": 1271,
    "image_tokens": 1247
  },
  "request_id": "ccf845a3-dc33-9cda-b581-20fe7dc23f70"
}

Model selection

  • Recommended: the Qwen3.5 series. This latest generation of visual understanding models excels in tasks such as multimodal reasoning, 2D/3D image understanding, complex document parsing, visual programming, video understanding, and building multimodal agents. It is supported in the Chinese mainland and Singapore.

    • qwen3.5-plus: The most powerful visual understanding model in the Qwen3.5 series. It is the top recommendation.

    • qwen3.5-flash: A faster and more cost-effective option with an excellent balance between performance and cost, ideal for latency-sensitive scenarios.

    • qwen3.5-397b-a17b, qwen3.5-122b-a10b, qwen3.5-27b, and qwen3.5-35b-a3b: The open-source models in the Qwen3.5 series.

  • The Qwen3-VL series is also suitable for tasks that require high-precision object recognition and localization (including 3D localization), agent tool calling, document and webpage parsing, complex problem-solving, and long video understanding. The models in this series are compared below:

    • qwen3-vl-plus: The most powerful model in the Qwen3-VL series.

    • qwen3-vl-flash: A faster and more cost-effective option with an excellent balance between performance and cost, ideal for latency-sensitive scenarios.

  • The Qwen2.5-VL series is suitable for general-purpose tasks such as simple image captioning and short video summarization. The models in this series are compared below:

    • qwen-vl-max (part of Qwen2.5-VL): The highest-performing model in the Qwen2.5-VL series.

    • qwen-vl-plus (part of Qwen2.5-VL): A faster model that offers a good balance between performance and cost.

For information about model names, context, pricing, and snapshot versions, see the Model list. For concurrency limits, see Rate limiting.

Model feature comparison

  • Qwen3.5 series

    • Deep thinking: Supported.

    • Tool calling: Supported.

    • Context cache: Supported in the stable versions of qwen3.5-plus and qwen3.5-flash. Explicit cache only.

    • Structured output: Supported when deep thinking is disabled.

    • Languages (33): Chinese, Japanese, Korean, Indonesian, Vietnamese, Thai, English, French, German, Russian, Portuguese, Spanish, Italian, Swedish, Danish, Czech, Norwegian, Dutch, Finnish, Turkish, Polish, Swahili, Romanian, Serbian, Greek, Kazakh, Uzbek, Cebuano, Arabic, Urdu, Persian, Hindi/Devanagari, and Hebrew.

  • Qwen3-VL series

    • Deep thinking: Supported.

    • Tool calling: Supported.

    • Context cache: Supported in the stable versions of qwen3-vl-plus and qwen3-vl-flash.

    • Structured output: Supported when deep thinking is disabled.

    • Languages (33): same as the Qwen3.5 series.

  • Qwen2.5-VL series

    • Deep thinking: Not supported.

    • Tool calling: Not supported.

    • Context cache: Supported in the stable versions of qwen-vl-max and qwen-vl-plus.

    • Structured output: Supported in the stable and latest versions of qwen-vl-max and qwen-vl-plus.

    • Languages (11): Chinese, English, Japanese, Korean, Arabic, Vietnamese, French, German, Italian, Spanish, and Russian.

Capabilities

Visual question answering (VQA)

Describes and classifies image content, such as identifying people, places, animals, and plants.


If the sun is glaring, what item from this image should I use?

When the sun is glaring, you should use the pink sunglasses from the image. Sunglasses effectively block strong light, reduce UV damage to your eyes, and help protect your vision while improving visual comfort in bright sunlight.

Creative writing

Generates vivid text descriptions from image or video content for creative scenarios like story writing, copywriting, and short video scripts.


Based on the image, write an engaging social media post for me.

Of course. This image has a rich Mid-Autumn Festival atmosphere, with a clever mix of classic and modern elements. Based on the key elements in the image, I've prepared a few social media post ideas in different styles for you to choose from.

Poetic and aesthetic style

Tonight, the moon shines bright for all to see, as autumn thoughts drift to thee. Chang'e's flight and the Jade Rabbit's potion—the romance of ages is rekindled tonight. May this moon light your way home and carry my deepest wishes. Happy Mid-Autumn Festival!

Warm and heartfelt style

A full moon for a full reunion on this gentle Mid-Autumn night. Watching fireworks, gazing at the moon, savoring a mooncake, and wishing you well. May all our hopes and dreams come true. Wishing everyone a happy Mid-Autumn Festival and a joyful family reunion!

OCR and information extraction

Recognizes text and formulas in images or extracts information from documents such as receipts, certificates, and forms, with support for formatted text output. The Qwen3-VL series supports 33 languages, and the Qwen2.5-VL series supports 11. For the full lists, see Model feature comparison.


Extract the following fields from the image: ['Invoice Code', 'Invoice Number', 'Destination', 'Fuel Surcharge', 'Fare', 'Travel Date', 'Departure Time', 'Train Number', 'Seat Number']. Output the result in JSON format.

{
  "Invoice Code": "221021325353",
  "Invoice Number": "10283819",
  "Destination": "开发区",
  "Fuel Surcharge": "2.0",
  "Fare": "8.00<全>",
  "Travel Date": "2013-06-29",
  "Departure Time": "流水",
  "Train Number": "040",
  "Seat Number": "371"
}
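Before using extracted fields downstream, it is worth checking that the model returned parseable JSON containing every requested field. The following is a minimal sketch under illustrative assumptions: the sample `model_output` string and its values are made up, and real responses may use different key names or wrap the JSON in a code fence.

```python
import json

REQUESTED_FIELDS = [
    "Invoice Code", "Invoice Number", "Destination", "Fuel Surcharge",
    "Fare", "Travel Date", "Departure Time", "Train Number", "Seat Number",
]

# Hypothetical model output for illustration only.
model_output = (
    '{"Invoice Code": "221021325353", "Invoice Number": "10283819", '
    '"Destination": "开发区", "Fuel Surcharge": "2.0", "Fare": "8.00", '
    '"Travel Date": "2013-06-29", "Departure Time": "流水", '
    '"Train Number": "040", "Seat Number": "371"}'
)

def validate_extraction(text, fields):
    """Parse the JSON and fail loudly if any requested field is missing."""
    data = json.loads(text)
    missing = [f for f in fields if f not in data]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    return data

record = validate_extraction(model_output, REQUESTED_FIELDS)
print(record["Travel Date"])
```

Failing fast on missing fields is usually preferable to silently passing partial records to downstream systems.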

Multi-disciplinary problem solving

Solves problems in images from subjects like mathematics, physics, and chemistry, making it suitable for K-12, university, and adult education.


Solve the math problem in the image step by step.


Visual programming

Generates HTML, CSS, and JS code from visual inputs like design mockups, website screenshots, and videos.


Create a webpage using HTML and CSS based on my sketch. The main color theme should be black.


Webpage preview

Object localization

Supports both 2D and 3D localization to determine object orientation, perspective changes, and occlusion relationships. 3D localization is a new capability of the Qwen3-VL model.

For the Qwen2.5-VL model, object localization is most robust within a resolution range of 480x480 to 2560x2560 pixels. Outside this range, detection accuracy may decrease, with occasional bounding box drift.
To draw the localization results on the original image, see FAQ.

2D localization


  • Returns bounding box coordinates: Detects all food items in an image and returns their bounding box (bbox) coordinates in JSON format.

  • Returns center point coordinates: Locates all food items in an image as points and returns their coordinates in XML format.
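Localization results are returned as text and must be parsed before use. A minimal sketch follows, under stated assumptions: the sample response, the bbox_2d key name, and the [x1, y1, x2, y2] pixel-coordinate order are illustrative, and real output depends on the model and prompt (models may also wrap the JSON in a Markdown code fence, which the parser strips).

```python
import json
import re

# Hypothetical localization response for illustration; real output depends on
# the model and prompt. Coordinates are assumed to be absolute pixels in
# [x1, y1, x2, y2] order, wrapped in a Markdown code fence.
FENCE = "`" * 3
raw_response = (
    FENCE + "json\n"
    '[{"bbox_2d": [135, 89, 412, 360], "label": "mooncake"},\n'
    ' {"bbox_2d": [420, 102, 598, 344], "label": "teacup"}]\n'
    + FENCE
)

def parse_bboxes(text):
    """Strip an optional Markdown code fence, then parse the JSON list."""
    match = re.search(r"`{3}(?:json)?\s*(.*?)\s*`{3}", text, re.DOTALL)
    payload = match.group(1) if match else text
    return json.loads(payload)

for box in parse_bboxes(raw_response):
    x1, y1, x2, y2 = box["bbox_2d"]
    print(f'{box["label"]}: top-left ({x1}, {y1}), bottom-right ({x2}, {y2})')
```

Once parsed, the coordinate lists can be passed to an image library to draw the boxes on the original image, as described in the FAQ.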

2D localization visualization


3D localization


Detects cars in an image and predicts their 3D positions. JSON output: [{"bbox_3d": [x_center, y_center, z_center, x_size, y_size, z_size, roll, pitch, yaw], "label": "category"}].
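The nine numbers in bbox_3d are positional, which is easy to misread downstream. A small sketch that maps them to named fields (the sample detection values are made up for illustration):

```python
# Hypothetical 3D detection for illustration; the nine bbox_3d values follow
# the order documented above:
# [x_center, y_center, z_center, x_size, y_size, z_size, roll, pitch, yaw].
detection = {"bbox_3d": [1.2, 0.4, 8.5, 1.8, 1.5, 4.2, 0.0, 0.0, 1.57],
             "label": "car"}

BBOX_3D_FIELDS = ("x_center", "y_center", "z_center",
                  "x_size", "y_size", "z_size",
                  "roll", "pitch", "yaw")

# Zip the positional values with their field names for readable access.
box = dict(zip(BBOX_3D_FIELDS, detection["bbox_3d"]))
print(detection["label"], box["z_center"], box["yaw"])
```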

3D localization visualization


Document parsing

Parses image-based documents, such as scans or image-based PDFs, into QwenVL HTML or QwenVL Markdown format. These formats accurately capture the recognized text and the position information of elements such as images and tables. The Qwen3-VL model adds the ability to parse into Markdown format.

Recommended prompts: qwenvl html (to parse into HTML format) or qwenvl markdown (to parse into Markdown format).


qwenvl markdown.


Result visualization

Video understanding

Analyzes video content to locate specific events and retrieve their timestamps, or to generate summaries of key time periods.

Describe the series of actions the person performs in the video. Output the result in JSON format with start_time, end_time, and event fields. Use the HH:mm:ss format for the timestamp.

{
  "events": [
    {
      "start_time": "00:00:00",
      "end_time": "00:00:05",
      "event": "The person walks to a table holding a cardboard box and places it on the table."
    },
    {
      "start_time": "00:00:05",
      "end_time": "00:00:15",
      "event": "The person picks up a scanner and aims it at the label on the box to scan it."
    },
    {
      "start_time": "00:00:15",
      "end_time": "00:00:21",
      "event": "The person puts the scanner back in its place and then picks up a pen to write in a notebook."
    }
  ]
}
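Timestamps in the HH:mm:ss format are straightforward to post-process. For example, a short sketch that computes each event's duration from JSON output of the shape shown above:

```python
import json

# Events JSON in the shape requested by the prompt above.
response_json = """
{"events": [
  {"start_time": "00:00:00", "end_time": "00:00:05",
   "event": "The person walks to a table holding a cardboard box and places it on the table."},
  {"start_time": "00:00:05", "end_time": "00:00:15",
   "event": "The person picks up a scanner and aims it at the label on the box to scan it."},
  {"start_time": "00:00:15", "end_time": "00:00:21",
   "event": "The person puts the scanner back in its place and then picks up a pen to write in a notebook."}
]}
"""

def to_seconds(ts):
    """Convert an HH:mm:ss timestamp to a total number of seconds."""
    h, m, s = (int(part) for part in ts.split(":"))
    return h * 3600 + m * 60 + s

events = json.loads(response_json)["events"]
durations = [to_seconds(e["end_time"]) - to_seconds(e["start_time"]) for e in events]
for e, d in zip(events, durations):
    print(f"{d:2d}s  {e['event']}")
```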

Core capabilities

Enable or disable thinking mode

  • The qwen3.5, qwen3-vl-plus, and qwen3-vl-flash series are hybrid models that can respond either directly or after a reasoning process. Use the enable_thinking parameter to enable or disable thinking mode:

    • true: Enables thinking mode. This is the default for the qwen3.5 series.

    • false: Disables thinking mode. This is the default for the qwen3-vl-plus and qwen3-vl-flash series.

  • Models with a thinking suffix, such as qwen3-vl-235b-a22b-thinking, are dedicated reasoning models. They always use a reasoning process before responding, and this behavior cannot be disabled.

Important
  • Model configuration: For optimal performance in general conversational scenarios that do not involve agent tool calls, do not set the System Message. Instead, pass instructions such as model role definitions and output format requirements in the User Message.

  • Prioritize streaming output: When thinking mode is enabled, both streaming and non-streaming output are supported. To prevent timeouts from long responses, we recommend using streaming output.

  • Limit reasoning length: Dedicated reasoning models can sometimes produce a verbose reasoning process. The thinking_budget parameter limits the length of this process. If the number of tokens generated during the reasoning process exceeds the thinking_budget, the reasoning is truncated, and the model immediately starts generating the final response. The default value for thinking_budget is the model's maximum chain-of-thought length. For more information, see the model list.

OpenAI compatibility

The enable_thinking parameter is not a standard OpenAI parameter. When you use the OpenAI Python SDK, pass it in extra_body.

Python

import os
from openai import OpenAI

client = OpenAI(
    # API Keys vary by region. Get your API Key at: https://www.alibabacloud.com/help/en/model-studio/get-api-key
    api_key=os.getenv("DASHSCOPE_API_KEY"),
    # This endpoint URL varies by region. Adjust it for your region.
    base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1"
)

reasoning_content = ""  # Stores the full reasoning process
answer_content = ""     # Stores the full final response
is_answering = False   # Tracks if the final response has started
enable_thinking = True
# Create a chat completion request.
completion = client.chat.completions.create(
    model="qwen3.5-plus",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://img.alicdn.com/imgextra/i1/O1CN01gDEY8M1W114Hi3XcN_!!6000000002727-0-tps-1024-406.jpg"
                    },
                },
                {"type": "text", "text": "How do I solve this problem?"},
            ],
        },
    ],
    stream=True,
    # The enable_thinking parameter controls the reasoning process for hybrid models (qwen3.5, qwen3-vl-plus, and qwen3-vl-flash). For dedicated reasoning models (e.g., with a 'thinking' suffix), it is always enabled.
    # The thinking_budget parameter sets the maximum token length for this process.
    extra_body={
        'enable_thinking': enable_thinking,
        "thinking_budget": 81920},

    # Uncomment the following to return token usage in the last chunk.
    # stream_options={
    #     "include_usage": True
    # }
)

if enable_thinking:
    print("\n" + "=" * 20 + "Reasoning Process" + "=" * 20 + "\n")

for chunk in completion:
    # If chunk.choices is empty, print the usage.
    if not chunk.choices:
        print("\nUsage:")
        print(chunk.usage)
    else:
        delta = chunk.choices[0].delta
        # Print the reasoning process.
        if hasattr(delta, 'reasoning_content') and delta.reasoning_content is not None:
            print(delta.reasoning_content, end='', flush=True)
            reasoning_content += delta.reasoning_content
        else:
            # Start printing the final response.
            if delta.content != "" and is_answering is False:
                print("\n" + "=" * 20 + "Final Response" + "=" * 20 + "\n")
                is_answering = True
            # Print the incoming response content.
            print(delta.content, end='', flush=True)
            answer_content += delta.content

# print("=" * 20 + "Full Reasoning Process" + "=" * 20 + "\n")
# print(reasoning_content)
# print("=" * 20 + "Final Response" + "=" * 20 + "\n")
# print(answer_content)
Node.js

import OpenAI from "openai";

// Initialize the OpenAI client.
const openai = new OpenAI({
  // API Keys vary by region. Get your API Key at: https://www.alibabacloud.com/help/en/model-studio/get-api-key
  // If you have not configured an environment variable, replace the following line with your Model Studio API Key: apiKey: "sk-xxx"
  apiKey: process.env.DASHSCOPE_API_KEY,
  // This endpoint URL varies by region. Adjust it for your region.
  baseURL: "https://dashscope-intl.aliyuncs.com/compatible-mode/v1"
});

let reasoningContent = '';
let answerContent = '';
let isAnswering = false;
let enableThinking = true;

let messages = [
    {
        role: "user",
        content: [
        { type: "image_url", image_url: { "url": "https://img.alicdn.com/imgextra/i1/O1CN01gDEY8M1W114Hi3XcN_!!6000000002727-0-tps-1024-406.jpg" } },
        { type: "text", text: "Solve this problem" },
    ]
}]

async function main() {
    try {
        const stream = await openai.chat.completions.create({
            model: 'qwen3.5-plus',
            messages: messages,
            stream: true,
          // Note: In the Node.js SDK, non-standard parameters like enable_thinking are passed as top-level properties and not within extra_body.
          enable_thinking: enableThinking,
          thinking_budget: 81920

        });

        if (enableThinking){console.log('\n' + '='.repeat(20) + 'Reasoning Process' + '='.repeat(20) + '\n');}

        for await (const chunk of stream) {
            if (!chunk.choices?.length) {
                console.log('\nUsage:');
                console.log(chunk.usage);
                continue;
            }

            const delta = chunk.choices[0].delta;

            // Handle the reasoning process.
            if (delta.reasoning_content) {
                process.stdout.write(delta.reasoning_content);
                reasoningContent += delta.reasoning_content;
            }
            // Handle the final response.
            else if (delta.content) {
                if (!isAnswering) {
                    console.log('\n' + '='.repeat(20) + 'Final Response' + '='.repeat(20) + '\n');
                    isAnswering = true;
                }
                process.stdout.write(delta.content);
                answerContent += delta.content;
            }
        }
    } catch (error) {
        console.error('Error:', error);
    }
}

main();
curl

# ======= Important =======
# Configurations are region-specific. Modify the URL accordingly.
# API keys are region-specific. To get an API key, visit: https://www.alibabacloud.com/help/en/model-studio/get-api-key
# === Delete this comment before you run the command ===

curl --location 'https://dashscope-intl.aliyuncs.com/compatible-mode/v1/chat/completions' \
--header "Authorization: Bearer $DASHSCOPE_API_KEY" \
--header 'Content-Type: application/json' \
--data '{
    "model": "qwen3.5-plus",
    "messages": [
    {
      "role": "user",
      "content": [
        {
          "type": "image_url",
          "image_url": {
            "url": "https://img.alicdn.com/imgextra/i1/O1CN01gDEY8M1W114Hi3XcN_!!6000000002727-0-tps-1024-406.jpg"
          }
        },
        {
          "type": "text",
          "text": "Please solve this problem"
        }
      ]
    }
  ],
    "stream":true,
    "stream_options":{"include_usage":true},
    "enable_thinking": true,
    "thinking_budget": 81920
}'

DashScope

Python

import os
import dashscope
from dashscope import MultiModalConversation

# This endpoint URL varies by region. Adjust it for your region.
dashscope.base_http_api_url = "https://dashscope-intl.aliyuncs.com/api/v1"

enable_thinking=True

messages = [
    {
        "role": "user",
        "content": [
            {"image": "https://img.alicdn.com/imgextra/i1/O1CN01gDEY8M1W114Hi3XcN_!!6000000002727-0-tps-1024-406.jpg"},
            {"text": "How do I solve this problem?"}
        ]
    }
]

response = MultiModalConversation.call(
    # If you have not configured an environment variable, replace the following line with your Model Studio API Key: api_key="sk-xxx",
    # API Keys vary by region. Get your API Key at: https://www.alibabacloud.com/help/en/model-studio/get-api-key
    api_key=os.getenv('DASHSCOPE_API_KEY'),
    model="qwen3.5-plus",  
    messages=messages,
    stream=True,
    # The enable_thinking parameter controls the reasoning process for hybrid models (qwen3.5, qwen3-vl-plus, and qwen3-vl-flash). For dedicated reasoning models (e.g., with a 'thinking' suffix), it is always enabled.
    enable_thinking=enable_thinking,
    # The thinking_budget parameter sets the maximum number of tokens for the reasoning process.
    thinking_budget=81920,

)

# Stores the full reasoning process
reasoning_content = ""
# Stores the full final response
answer_content = ""
# Tracks if the final response has started
is_answering = False

if enable_thinking:
    print("=" * 20 + "Reasoning Process" + "=" * 20)

for chunk in response:
    # Ignore empty chunks.
    message = chunk.output.choices[0].message
    reasoning_content_chunk = message.get("reasoning_content", None)
    if (chunk.output.choices[0].message.content == [] and
        reasoning_content_chunk == ""):
        pass
    else:
        # If the chunk contains reasoning content.
        if reasoning_content_chunk is not None and chunk.output.choices[0].message.content == []:
            print(chunk.output.choices[0].message.reasoning_content, end="")
            reasoning_content += chunk.output.choices[0].message.reasoning_content
        # If the chunk contains response content.
        elif chunk.output.choices[0].message.content != []:
            if not is_answering:
                print("\n" + "=" * 20 + "Final Response" + "=" * 20)
                is_answering = True
            print(chunk.output.choices[0].message.content[0]["text"], end="")
            answer_content += chunk.output.choices[0].message.content[0]["text"]

# To print the full reasoning process and final response, uncomment the following lines.
# print("=" * 20 + "Full Reasoning Process" + "=" * 20 + "\n")
# print(f"{reasoning_content}")
# print("=" * 20 + "Final Response" + "=" * 20 + "\n")
# print(f"{answer_content}")
Java

// Requires DashScope SDK v2.21.10 or later.
import java.util.*;

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import com.alibaba.dashscope.common.Role;
import com.alibaba.dashscope.exception.ApiException;
import com.alibaba.dashscope.exception.NoApiKeyException;
import io.reactivex.Flowable;

import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversation;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationParam;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationResult;
import com.alibaba.dashscope.common.MultiModalMessage;
import com.alibaba.dashscope.exception.UploadFileException;
import com.alibaba.dashscope.exception.InputRequiredException;
import java.lang.System;
import com.alibaba.dashscope.utils.Constants;

public class Main {
    // This endpoint URL varies by region. Adjust it for your region.
    static {Constants.baseHttpApiUrl="https://dashscope-intl.aliyuncs.com/api/v1";}

    private static final Logger logger = LoggerFactory.getLogger(Main.class);
    private static StringBuilder reasoningContent = new StringBuilder();
    private static StringBuilder finalContent = new StringBuilder();
    private static boolean isFirstPrint = true;

    private static void handleGenerationResult(MultiModalConversationResult message) {
        String re = message.getOutput().getChoices().get(0).getMessage().getReasoningContent();
        String reasoning = Objects.isNull(re) ? "" : re; // Default value

        List<Map<String, Object>> content = message.getOutput().getChoices().get(0).getMessage().getContent();
        if (!reasoning.isEmpty()) {
            reasoningContent.append(reasoning);
            if (isFirstPrint) {
                System.out.println("====================Reasoning Process====================");
                isFirstPrint = false;
            }
            System.out.print(reasoning);
        }

        if (Objects.nonNull(content) && !content.isEmpty()) {
            Object text = content.get(0).get("text");
            finalContent.append(text);
            if (!isFirstPrint) {
                System.out.println("\n====================Final Response====================");
                isFirstPrint = true;
            }
            System.out.print(text);
        }
    }
    public static MultiModalConversationParam buildMultiModalConversationParam(MultiModalMessage msg) {
        return MultiModalConversationParam.builder()
                // If you have not configured an environment variable, replace the following line with your Model Studio API Key: .apiKey("sk-xxx")
                // API Keys vary by region. Get your API Key at: https://www.alibabacloud.com/help/en/model-studio/get-api-key
                .apiKey(System.getenv("DASHSCOPE_API_KEY"))
                .model("qwen3.5-plus")
                .messages(Arrays.asList(msg))
                .enableThinking(true)
                .thinkingBudget(81920)
                .incrementalOutput(true)
                .build();
    }

    public static void streamCallWithMessage(MultiModalConversation conv, MultiModalMessage msg)
            throws NoApiKeyException, ApiException, InputRequiredException, UploadFileException {
        MultiModalConversationParam param = buildMultiModalConversationParam(msg);
        Flowable<MultiModalConversationResult> result = conv.streamCall(param);
        result.blockingForEach(Main::handleGenerationResult);
    }
    public static void main(String[] args) {
        try {
            MultiModalConversation conv = new MultiModalConversation();
            MultiModalMessage userMsg = MultiModalMessage.builder()
                    .role(Role.USER.getValue())
                    .content(Arrays.asList(Collections.singletonMap("image", "https://img.alicdn.com/imgextra/i1/O1CN01gDEY8M1W114Hi3XcN_!!6000000002727-0-tps-1024-406.jpg"),
                            Collections.singletonMap("text", "Please solve this problem")))
                    .build();
            streamCallWithMessage(conv, userMsg);
//             To print the full response at the end, uncomment the following lines.
//            if (reasoningContent.length() > 0) {
//                System.out.println("\n====================Final Response====================");
//                System.out.println(finalContent.toString());
//            }
        } catch (ApiException | NoApiKeyException | UploadFileException | InputRequiredException e) {
            logger.error("An exception occurred: {}", e.getMessage());
        }
        System.exit(0);
    }
}

curl

# ======= Important =======
# API keys are region-specific. Get an API key: https://www.alibabacloud.com/help/en/model-studio/get-api-key
# The endpoint varies by region. Modify the URL accordingly.
# === Delete this comment before execution ===

curl -X POST https://dashscope-intl.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation \
-H "Authorization: Bearer $DASHSCOPE_API_KEY" \
-H 'Content-Type: application/json' \
-H 'X-DashScope-SSE: enable' \
-d '{
    "model": "qwen3.5-plus",
    "input":{
        "messages":[
            {
                "role": "user",
                "content": [
                    {"image": "https://img.alicdn.com/imgextra/i1/O1CN01gDEY8M1W114Hi3XcN_!!6000000002727-0-tps-1024-406.jpg"},
                    {"text": "Please solve this problem"}
                ]
            }
        ]
    },
    "parameters":{
        "enable_thinking": true,
        "incremental_output": true,
        "thinking_budget": 81920
    }
}'

Multiple image input

Visual understanding models can process multiple images in a single request for tasks like product comparison and multi-page document processing. To implement this, include multiple image objects in the content array of the user message.

Important

The model's token limit restricts the number of images per request. The combined token count for all images and text must not exceed the model's maximum input limit.
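When assembling such a request, the content array can be built programmatically from a list of URLs. A minimal Python sketch of the OpenAI-compatible message shape (the helper name and URLs are illustrative, not part of the API):

```python
def build_image_message(image_urls, question):
    """Build a user message with one image_url object per image,
    followed by the text question, per the OpenAI-compatible schema."""
    content = [{"type": "image_url", "image_url": {"url": u}} for u in image_urls]
    content.append({"type": "text", "text": question})
    return {"role": "user", "content": content}

msg = build_image_message(
    ["https://example.com/a.jpg", "https://example.com/b.jpg"],
    "What do these images depict?",
)
# msg can then be passed as one element of the messages list.
```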

OpenAI compatible

Python

import os
from openai import OpenAI

client = OpenAI(
    # API keys are specific to each Region. To get an API key, visit: https://www.alibabacloud.com/help/en/model-studio/get-api-key
    api_key=os.getenv("DASHSCOPE_API_KEY"),
    # Configuration is Region-specific. Modify the base_url accordingly.
    base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
)

completion = client.chat.completions.create(
    model="qwen3.5-plus",  # This example uses the qwen3.5-plus model. You can replace it as needed. For a list of models, see: https://www.alibabacloud.com/help/model-studio/getting-started/models
    messages=[
        {"role": "user","content": [
            {"type": "image_url","image_url": {"url": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241022/emyrja/dog_and_girl.jpeg"},},
            {"type": "image_url","image_url": {"url": "https://dashscope.oss-cn-beijing.aliyuncs.com/images/tiger.png"},},
            {"type": "text", "text": "What do these images depict?"},
            ],
        }
    ],
)

print(completion.choices[0].message.content)

Response

Image 1 shows a woman and a Labrador retriever interacting on a beach. The woman, wearing a plaid shirt, is sitting on the sand and shaking the dog's paw. The background features ocean waves and the sky, creating a warm and pleasant atmosphere.

Image 2 shows a tiger walking in a forest. The tiger's coat is orange with black stripes, and it is stepping forward. The surroundings are dense with trees and vegetation, and the ground is covered with fallen leaves, giving the scene a wild, natural feel.

Node.js

import OpenAI from "openai";

const openai = new OpenAI(
    {
        // API keys are specific to each Region. To get an API key, visit: https://www.alibabacloud.com/help/en/model-studio/get-api-key
        // If you have not configured an environment variable, replace the following line with your Model Studio API key: apiKey: "sk-xxx"
        apiKey: process.env.DASHSCOPE_API_KEY,
        // Configuration is Region-specific. Modify the baseURL accordingly.
        baseURL: "https://dashscope-intl.aliyuncs.com/compatible-mode/v1"
    }
);

async function main() {
    const response = await openai.chat.completions.create({
        model: "qwen3.5-plus",  // This example uses the qwen3.5-plus model. You can replace it as needed. For a list of models, see: https://www.alibabacloud.com/help/en/model-studio/getting-started/models
        messages: [
          {role: "user",content: [
            {type: "image_url",image_url: {"url": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241022/emyrja/dog_and_girl.jpeg"}},
            {type: "image_url",image_url: {"url": "https://dashscope.oss-cn-beijing.aliyuncs.com/images/tiger.png"}},
            {type: "text", text: "What do these images depict?" },
        ]}]
    });
    console.log(response.choices[0].message.content);
}

main()

Response

The first image shows a person and a dog interacting on a beach. The person is wearing a plaid shirt, and the dog is wearing a collar. They appear to be shaking hands or giving a high-five.

The second image shows a tiger walking in a forest. The tiger's coat is orange with black stripes, and the background is filled with green trees and vegetation.

curl

# ======= Important =======
# API keys are specific to each Region. To get an API key, visit: https://www.alibabacloud.com/help/en/model-studio/get-api-key
# Configuration is Region-specific. Modify the URL accordingly.
# === Delete this comment before execution ===

curl -X POST https://dashscope-intl.aliyuncs.com/compatible-mode/v1/chat/completions \
-H "Authorization: Bearer $DASHSCOPE_API_KEY" \
-H 'Content-Type: application/json' \
-d '{
  "model": "qwen3.5-plus",
  "messages": [
    {
      "role": "user",
      "content": [
        {
          "type": "image_url",
          "image_url": {
            "url": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241022/emyrja/dog_and_girl.jpeg"
          }
        },
        {
          "type": "image_url",
          "image_url": {
            "url": "https://dashscope.oss-cn-beijing.aliyuncs.com/images/tiger.png"
          }
        },
        {
          "type": "text",
          "text": "What do these images depict?"
        }
      ]
    }
  ]
}'

Response

{
  "choices": [
    {
      "message": {
        "content": "Image 1 shows a woman and a Labrador retriever interacting on a beach. The woman is wearing a plaid shirt and sitting on the sand, shaking the dog's paw. The background features the ocean and a sunset sky, creating a very warm and peaceful atmosphere.\n\nImage 2 shows a tiger walking in a forest. The tiger's coat is orange with black stripes as it walks forward. The surroundings are dense with trees and vegetation, with fallen leaves on the ground. The scene conveys a sense of wildness and vitality.",
        "role": "assistant"
      },
      "finish_reason": "stop",
      "index": 0,
      "logprobs": null
    }
  ],
  "object": "chat.completion",
  "usage": {
    "prompt_tokens": 2497,
    "completion_tokens": 109,
    "total_tokens": 2606
  },
  "created": 1725948561,
  "system_fingerprint": null,
  "model": "qwen3.5-plus",
  "id": "chatcmpl-0fd66f46-b09e-9164-a84f-3ebbbedbac15"
}

DashScope

Python

import os
import dashscope

# Configuration is Region-specific. Modify the base_http_api_url accordingly.
dashscope.base_http_api_url = 'https://dashscope-intl.aliyuncs.com/api/v1'

messages = [
    {
        "role": "user",
        "content": [
            {"image": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241022/emyrja/dog_and_girl.jpeg"},
            {"image": "https://dashscope.oss-cn-beijing.aliyuncs.com/images/tiger.png"},
            {"text": "What do these images depict?"}
        ]
    }
]

response = dashscope.MultiModalConversation.call(
    # API keys are specific to each Region. To get an API key, visit: https://www.alibabacloud.com/help/en/model-studio/get-api-key
    # If you have not configured an environment variable, replace the following line with your Model Studio API key: api_key="sk-xxx"
    api_key=os.getenv('DASHSCOPE_API_KEY'),
    model='qwen3.5-plus', #  This example uses the qwen3.5-plus model. You can replace it as needed. For a list of models, see: https://www.alibabacloud.com/help/en/model-studio/getting-started/models
    messages=messages
)

print(response.output.choices[0].message.content[0]["text"])

Response

The images show animals in natural scenes. The first image shows a person and a dog on a beach, and the second shows a tiger in a forest.

Java

import java.util.Arrays;
import java.util.Collections;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversation;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationParam;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationResult;
import com.alibaba.dashscope.common.MultiModalMessage;
import com.alibaba.dashscope.common.Role;
import com.alibaba.dashscope.exception.ApiException;
import com.alibaba.dashscope.exception.NoApiKeyException;
import com.alibaba.dashscope.exception.UploadFileException;
import com.alibaba.dashscope.utils.Constants;

public class Main {
    static {
        // Configuration is Region-specific. Modify Constants.baseHttpApiUrl accordingly.
        Constants.baseHttpApiUrl = "https://dashscope-intl.aliyuncs.com/api/v1";
    }
    public static void simpleMultiModalConversationCall()
            throws ApiException, NoApiKeyException, UploadFileException {
        MultiModalConversation conv = new MultiModalConversation();
        MultiModalMessage userMessage = MultiModalMessage.builder().role(Role.USER.getValue())
                .content(Arrays.asList(
                        Collections.singletonMap("image", "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241022/emyrja/dog_and_girl.jpeg"),
                        Collections.singletonMap("image", "https://dashscope.oss-cn-beijing.aliyuncs.com/images/tiger.png"),
                        Collections.singletonMap("text", "What do these images depict?"))).build();
        MultiModalConversationParam param = MultiModalConversationParam.builder()
                // API keys are specific to each Region. To get an API key, visit: https://www.alibabacloud.com/help/en/model-studio/get-api-key
                .apiKey(System.getenv("DASHSCOPE_API_KEY"))
                .model("qwen3.5-plus")  // This example uses the qwen3.5-plus model. You can replace it as needed. For a list of models, see: https://www.alibabacloud.com/help/en/model-studio/getting-started/models
                .messages(Arrays.asList(userMessage))
                .build();
        MultiModalConversationResult result = conv.call(param);
        System.out.println(result.getOutput().getChoices().get(0).getMessage().getContent().get(0).get("text"));
    }
    public static void main(String[] args) {
        try {
            simpleMultiModalConversationCall();
        } catch (ApiException | NoApiKeyException | UploadFileException e) {
            System.out.println(e.getMessage());
        }
        System.exit(0);
    }
}

Response

These images show animals in natural scenes.

1. First image: A woman and a dog interact on a beach. The woman is wearing a plaid shirt and is seated on the sand, while the dog, wearing a collar, extends a paw to shake her hand.
2. Second image: A tiger walks through a forest. Its coat is orange with black stripes, and the background consists of trees and leaves.

curl

# ======= Important =======
# Configuration is Region-specific. Modify the URL accordingly.
# API keys are specific to each Region. To get an API key, visit: https://www.alibabacloud.com/help/en/model-studio/get-api-key
# === Delete this comment before execution ===

curl --location 'https://dashscope-intl.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation' \
--header "Authorization: Bearer $DASHSCOPE_API_KEY" \
--header 'Content-Type: application/json' \
--data '{
    "model": "qwen3.5-plus",
    "input":{
        "messages":[
            {
                "role": "user",
                "content": [
                    {"image": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241022/emyrja/dog_and_girl.jpeg"},
                    {"image": "https://dashscope.oss-cn-beijing.aliyuncs.com/images/tiger.png"},
                    {"text": "What do these images depict?"}
                ]
            }
        ]
    }
}'

Response

{
  "output": {
    "choices": [
      {
        "finish_reason": "stop",
        "message": {
          "role": "assistant",
          "content": [
            {
              "text": "The images show animals in natural scenes. The first image shows a person and a dog on a beach, and the second shows a tiger in a forest."
            }
          ]
        }
      }
    ]
  },
  "usage": {
    "output_tokens": 81,
    "input_tokens": 1277,
    "image_tokens": 2497
  },
  "request_id": "ccf845a3-dc33-9cda-b581-20fe7dc23f70"
}

Video understanding

The visual understanding model can analyze video content provided as either a video file or an image list (a sequence of video frames). The following code examples demonstrate how to analyze an online video or an image list specified by URLs. For limitations on videos and image lists, see the Video limits section.

For optimal performance, we recommend using the latest version or a recent snapshot of the model to analyze video files.

Video file

The visual understanding model analyzes content by extracting a sequence of frames from the video. Use the following two parameters to control the frame extraction policy:

  • fps: Controls the frame extraction frequency: the model extracts one frame every 1/fps seconds. The value range is [0.1, 10], and the default is 2.0.

    • High-speed motion scenes: Set a higher fps value to capture more details.

    • Static or long videos: Set a lower fps value to improve processing efficiency.

  • max_frames: Specifies the maximum number of frames to extract from the video. If the fps setting would produce more frames than this limit, the system samples frames evenly to stay within it. This parameter is available only when you use the DashScope SDK.
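The combined effect of the two parameters can be sketched numerically. `planned_frame_count` below is a hypothetical helper mirroring the policy described above, not an SDK function:

```python
def planned_frame_count(duration_s, fps, max_frames):
    """Frames requested by fps, capped by max_frames (the service
    resamples evenly when the cap is hit)."""
    requested = int(duration_s * fps)
    return min(requested, max_frames)

# A 60-second clip at fps=2 requests 120 frames; with max_frames=80
# the service samples 80 frames evenly instead.
print(planned_frame_count(60, 2, 80))  # 80
```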

OpenAI compatible

When passing a video file directly using the OpenAI SDK or an HTTP request, set the "type" parameter in the user message to "video_url".

Python

import os
from openai import OpenAI

client = OpenAI(
    # API keys are region-specific. Get an API key: https://www.alibabacloud.com/help/en/model-studio/get-api-key
    # If the environment variable is not configured, replace the following line with your Model Studio API key: api_key="sk-xxx",
    api_key=os.getenv("DASHSCOPE_API_KEY"),
    # The endpoint varies by region. Modify the base_url based on your actual region.
    base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
)
completion = client.chat.completions.create(
    model="qwen3.5-plus",
    messages=[
        {
            "role": "user",
            "content": [
                # When passing a video file directly, set the "type" parameter to "video_url".
                {
                    "type": "video_url",
                    "video_url": {
                        "url": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241115/cqqkru/1.mp4"
                    },
                    "fps": 2
                },
                {
                    "type": "text",
                    "text": "What is the content of this video?"
                }
            ]
        }
    ]
)

print(completion.choices[0].message.content)

Node.js

import OpenAI from "openai";

const openai = new OpenAI(
    {
        // API keys are region-specific. Get an API key: https://www.alibabacloud.com/help/en/model-studio/get-api-key
        // If the environment variable is not configured, replace the following line with your Model Studio API key: apiKey: "sk-xxx"
        apiKey: process.env.DASHSCOPE_API_KEY,
        // The endpoint varies by region. Modify the baseURL based on your actual region.
        baseURL: "https://dashscope-intl.aliyuncs.com/compatible-mode/v1"
    }
);

async function main() {
    const response = await openai.chat.completions.create({
        model: "qwen3.5-plus",
        messages: [
            {
                role: "user",
                content: [
                    // When passing a video file directly, set the "type" parameter to "video_url".
                    {
                        type: "video_url",
                        video_url: {
                            "url": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241115/cqqkru/1.mp4"
                        },
                        "fps": 2
                    },
                    {
                        type: "text",
                        text: "What is the content of this video?"
                    }
                ]
            }
        ]
    });

    console.log(response.choices[0].message.content);
}

main();

Curl

# ======= Important =======
# API keys are region-specific. Get an API key: https://www.alibabacloud.com/help/en/model-studio/get-api-key
# The endpoint varies by region. Modify the URL based on your actual region.
# === Delete this comment before running the command. ===

curl -X POST https://dashscope-intl.aliyuncs.com/compatible-mode/v1/chat/completions \
  -H "Authorization: Bearer $DASHSCOPE_API_KEY" \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "qwen3.5-plus",
    "messages": [
      {
        "role": "user",
        "content": [
          {
            "type": "video_url",
            "video_url": {
              "url": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241115/cqqkru/1.mp4"
            },
            "fps":2
          },
          {
            "type": "text",
            "text": "What is the content of this video?"
          }
        ]
      }
    ]
  }'

DashScope

Python

import dashscope
import os

# The endpoint varies by region. Modify the base_http_api_url based on your actual region.
dashscope.base_http_api_url = 'https://dashscope-intl.aliyuncs.com/api/v1'
messages = [
    {"role": "user",
        "content": [
            # The fps parameter controls the video frame extraction frequency, which means one frame is extracted every 1/fps seconds. For more details, see: https://www.alibabacloud.com/help/en/model-studio/use-qwen-by-calling-api?#2ed5ee7377fum
            {"video": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241115/cqqkru/1.mp4","fps":2},
            {"text": "What is the content of this video?"}
        ]
    }
]

response = dashscope.MultiModalConversation.call(
    # API keys are region-specific. Get an API key: https://www.alibabacloud.com/help/en/model-studio/get-api-key
    # If the environment variable is not configured, replace the following line with your Model Studio API key: api_key ="sk-xxx"
    api_key=os.getenv('DASHSCOPE_API_KEY'),
    model='qwen3.5-plus',
    messages=messages
)

print(response.output.choices[0].message.content[0]["text"])

Java

import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversation;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationParam;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationResult;
import com.alibaba.dashscope.common.MultiModalMessage;
import com.alibaba.dashscope.common.Role;
import com.alibaba.dashscope.exception.ApiException;
import com.alibaba.dashscope.exception.NoApiKeyException;
import com.alibaba.dashscope.exception.UploadFileException;
import com.alibaba.dashscope.utils.Constants;

public class Main {
    static {
        // The endpoint varies by region. Modify Constants.baseHttpApiUrl based on your actual region.
        Constants.baseHttpApiUrl = "https://dashscope-intl.aliyuncs.com/api/v1";
    }
    public static void simpleMultiModalConversationCall()
            throws ApiException, NoApiKeyException, UploadFileException {
        MultiModalConversation conv = new MultiModalConversation();
        // The fps parameter controls the video frame extraction frequency, which means one frame is extracted every 1/fps seconds. For more details, see: https://www.alibabacloud.com/help/en/model-studio/use-qwen-by-calling-api?#2ed5ee7377fum
        Map<String, Object> params = new HashMap<>();
        params.put("video", "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241115/cqqkru/1.mp4");
        params.put("fps", 2);
        MultiModalMessage userMessage = MultiModalMessage.builder().role(Role.USER.getValue())
                .content(Arrays.asList(
                        params,
                        Collections.singletonMap("text", "What is the content of this video?"))).build();
        MultiModalConversationParam param = MultiModalConversationParam.builder()
                // If you use a model in the China (Beijing) region, you must use an API key from that region. Get an API key: https://www.alibabacloud.com/help/en/model-studio/get-api-key
                // If the environment variable is not configured, replace the following line with your Model Studio API key: .apiKey("sk-xxx")
                .apiKey(System.getenv("DASHSCOPE_API_KEY"))
                .model("qwen3.5-plus")
                .messages(Arrays.asList(userMessage))
                .build();
        MultiModalConversationResult result = conv.call(param);
        System.out.println(result.getOutput().getChoices().get(0).getMessage().getContent().get(0).get("text"));
    }
    public static void main(String[] args) {
        try {
            simpleMultiModalConversationCall();
        } catch (ApiException | NoApiKeyException | UploadFileException e) {
            System.out.println(e.getMessage());
        }
        System.exit(0);
    }
}

Curl

# ======= Important =======
# The endpoint varies by region. Modify the URL based on your actual region.
# API keys are region-specific. Get an API key: https://www.alibabacloud.com/help/en/model-studio/get-api-key
# === Delete this comment before running the command. ===

curl -X POST https://dashscope-intl.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation \
-H "Authorization: Bearer $DASHSCOPE_API_KEY" \
-H 'Content-Type: application/json' \
-d '{
    "model": "qwen3.5-plus",
    "input":{
        "messages":[
            {"role": "user","content": [{"video": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241115/cqqkru/1.mp4","fps":2},
            {"text": "What is the content of this video?"}]}]}
}'

Image list

When a video is provided as an image list of pre-extracted video frames, use the fps parameter to specify the frame extraction rate. This tells the model that one frame was extracted from the original video every 1/fps seconds, which helps it more accurately understand the sequence, duration, and dynamic changes of events. This parameter is supported by the Qwen3.5, Qwen3-VL, and Qwen2.5-VL models.
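Concretely, the model infers a timestamp of i/fps seconds for the i-th frame in the list. `implied_timestamps` is an illustrative helper, not part of the SDK:

```python
def implied_timestamps(n_frames, fps):
    """Timestamps (in seconds) the model infers for a pre-extracted
    frame list, given one frame every 1/fps seconds."""
    return [round(i / fps, 3) for i in range(n_frames)]

# Four frames at fps=2 are interpreted as spanning 0 to 1.5 seconds.
print(implied_timestamps(4, 2))  # [0.0, 0.5, 1.0, 1.5]
```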

OpenAI compatible

When passing a video as an image list using the OpenAI SDK or an HTTP request, set the "type" parameter in the user message to "video".

Python

import os
from openai import OpenAI

client = OpenAI(
    # API keys are region-specific. Get an API key: https://www.alibabacloud.com/help/en/model-studio/get-api-key
    # If the environment variable is not configured, replace the following line with your Model Studio API key: api_key="sk-xxx",
    api_key=os.getenv("DASHSCOPE_API_KEY"),
    # The endpoint varies by region. Modify the base_url based on your actual region.
    base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
)

completion = client.chat.completions.create(
    model="qwen3.5-plus", # This example uses qwen3.5-plus. You can replace it as needed. For a list of available models, see: https://www.alibabacloud.com/help/en/model-studio/models
    messages=[{"role": "user","content": [
        # When passing an image list, set the "type" parameter to "video".
         {"type": "video","video": [
         "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241108/xzsgiz/football1.jpg",
         "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241108/tdescd/football2.jpg",
         "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241108/zefdja/football3.jpg",
         "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241108/aedbqh/football4.jpg"],
         "fps":2},
         {"type": "text","text": "Describe the specific process in this video."},
    ]}]
)

print(completion.choices[0].message.content)

Node.js

// Make sure you have specified "type": "module" in your package.json.
import OpenAI from "openai";

const openai = new OpenAI({
    // API keys are region-specific. Get an API key: https://www.alibabacloud.com/help/en/model-studio/get-api-key
    // If the environment variable is not configured, replace the following line with your Model Studio API key: apiKey: "sk-xxx",
    apiKey: process.env.DASHSCOPE_API_KEY,
    // The endpoint varies by region. Modify the baseURL based on your actual region.
    baseURL: "https://dashscope-intl.aliyuncs.com/compatible-mode/v1"
});

async function main() {
    const response = await openai.chat.completions.create({
        model: "qwen3.5-plus",  // This example uses qwen3.5-plus. You can replace it as needed. For a list of available models, see: https://www.alibabacloud.com/help/en/model-studio/models
        messages: [{
            role: "user",
            content: [
                {
                    // When passing an image list, set the "type" parameter to "video".
                    type: "video",
                    video: [
                        "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241108/xzsgiz/football1.jpg",
                        "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241108/tdescd/football2.jpg",
                        "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241108/zefdja/football3.jpg",
                        "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241108/aedbqh/football4.jpg"
                    ],
                    fps: 2
                },
                {
                    type: "text",
                    text: "Describe the specific process in this video."
                }
            ]
        }]
    });
    console.log(response.choices[0].message.content);
}

main();

Curl

# ======= Important =======
# API keys are region-specific. Get an API key: https://www.alibabacloud.com/help/en/model-studio/get-api-key
# The endpoint varies by region. Modify the URL based on your actual region.
# === Delete this comment before running the command. ===

curl -X POST https://dashscope-intl.aliyuncs.com/compatible-mode/v1/chat/completions \
-H "Authorization: Bearer $DASHSCOPE_API_KEY" \
-H 'Content-Type: application/json' \
-d '{
    "model": "qwen3.5-plus",
    "messages": [{"role": "user","content": [{"type": "video","video": [
                  "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241108/xzsgiz/football1.jpg",
                  "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241108/tdescd/football2.jpg",
                  "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241108/zefdja/football3.jpg",
                  "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241108/aedbqh/football4.jpg"],
                  "fps":2},
                {"type": "text","text": "Describe the specific process in this video."}]}]
}'

DashScope

Python

import os
import dashscope

# The endpoint varies by region. Modify the base_http_api_url based on your actual region.
dashscope.base_http_api_url = 'https://dashscope-intl.aliyuncs.com/api/v1'
messages = [{"role": "user",
             "content": [
                  # When passing an image list, the fps parameter is supported by the Qwen3.5, Qwen3-VL, and Qwen2.5-VL series models.
                 {"video":["https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241108/xzsgiz/football1.jpg",
                           "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241108/tdescd/football2.jpg",
                           "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241108/zefdja/football3.jpg",
                           "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241108/aedbqh/football4.jpg"],
                   "fps":2},
                 {"text": "Describe the specific process in this video."}]}]
response = dashscope.MultiModalConversation.call(
    # API keys are region-specific. Get an API key: https://www.alibabacloud.com/help/en/model-studio/get-api-key
    # If the environment variable is not configured, replace the following line with your Model Studio API key: api_key="sk-xxx",
    api_key=os.getenv("DASHSCOPE_API_KEY"),
    model='qwen3.5-plus',  # This example uses qwen3.5-plus. You can replace it as needed. For a list of available models, see: https://www.alibabacloud.com/help/en/model-studio/getting-started/models
    messages=messages
)
print(response.output.choices[0].message.content[0]["text"])

Java

// DashScope SDK version 2.21.10 or later is required.
import java.util.Arrays;
import java.util.Collections;
import java.util.Map;
import java.util.HashMap;

import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversation;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationParam;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationResult;
import com.alibaba.dashscope.common.MultiModalMessage;
import com.alibaba.dashscope.common.Role;
import com.alibaba.dashscope.exception.ApiException;
import com.alibaba.dashscope.exception.NoApiKeyException;
import com.alibaba.dashscope.exception.UploadFileException;
import com.alibaba.dashscope.utils.Constants;

public class Main {
    static {
        // The endpoint varies by region. Modify Constants.baseHttpApiUrl based on your actual region.
        Constants.baseHttpApiUrl="https://dashscope-intl.aliyuncs.com/api/v1";
    }
    private static final String MODEL_NAME = "qwen3.5-plus";  // This example uses qwen3.5-plus. You can replace it as needed. For a list of available models, see: https://www.alibabacloud.com/help/en/model-studio/getting-started/models
    public static void videoImageListSample() throws ApiException, NoApiKeyException, UploadFileException {
        MultiModalConversation conv = new MultiModalConversation();
        // When passing an image list, the fps parameter is supported by the Qwen3.5, Qwen3-VL, and Qwen2.5-VL series models.
        Map<String, Object> params = new HashMap<>();
        params.put("video", Arrays.asList("https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241108/xzsgiz/football1.jpg",
                        "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241108/tdescd/football2.jpg",
                        "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241108/zefdja/football3.jpg",
                        "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241108/aedbqh/football4.jpg"));
        params.put("fps", 2);
        MultiModalMessage userMessage = MultiModalMessage.builder()
                .role(Role.USER.getValue())
                .content(Arrays.asList(
                        params,
                        Collections.singletonMap("text", "Describe the specific process in this video.")))
                .build();
        MultiModalConversationParam param = MultiModalConversationParam.builder()
                // API keys are region-specific. Get an API key: https://www.alibabacloud.com/help/en/model-studio/get-api-key
                // If the environment variable is not configured, replace the following line with your Model Studio API key: .apiKey("sk-xxx")
                .apiKey(System.getenv("DASHSCOPE_API_KEY"))
                .model(MODEL_NAME)
                .messages(Arrays.asList(userMessage)).build();
        MultiModalConversationResult result = conv.call(param);
        System.out.print(result.getOutput().getChoices().get(0).getMessage().getContent().get(0).get("text"));
    }
    public static void main(String[] args) {
        try {
            videoImageListSample();
        } catch (ApiException | NoApiKeyException | UploadFileException e) {
            System.out.println(e.getMessage());
        }
        System.exit(0);
    }
}

Curl

# ======= Important =======
# The endpoint varies by region. Modify the URL based on your actual region.
# API keys are region-specific. Get an API key: https://www.alibabacloud.com/help/en/model-studio/get-api-key
# === Delete this comment before running the command. ===

curl -X POST https://dashscope-intl.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation \
-H "Authorization: Bearer $DASHSCOPE_API_KEY" \
-H 'Content-Type: application/json' \
-d '{
  "model": "qwen3.5-plus",
  "input": {
    "messages": [
      {
        "role": "user",
        "content": [
          {
            "video": [
              "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241108/xzsgiz/football1.jpg",
              "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241108/tdescd/football2.jpg",
              "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241108/zefdja/football3.jpg",
              "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241108/aedbqh/football4.jpg"
            ],
            "fps":2
                 
          },
          {
            "text": "Describe the specific process in this video."
          }
        ]
      }
    ]
  }
}'

Pass local files (Base64 or file path)

Visual understanding models accept local files through two methods: Base64 encoding and file paths. Choose a method based on the file size and SDK type. For recommendations, see How to choose a file upload method. Both methods must meet the file requirements described in Image limits.

Base64 encoding

Convert a file to a Base64-encoded string and pass it to the model. This method works with the DashScope SDK, OpenAI-compatible requests, and standard HTTP requests.

Passing a Base64-encoded string (image example)

  1. Convert the local image file to a Base64-encoded string.

    Example: Convert an image to a Base64-encoded string

    # Function to convert a local file to a base64-encoded string
    import base64
    def encode_image(image_path):
        with open(image_path, "rb") as image_file:
            return base64.b64encode(image_file.read()).decode("utf-8")
    
    # Replace "xxx/eagle.png" with the absolute path to your local image.
    base64_image = encode_image("xxx/eagle.png")
  2. Construct a data URL in the following format: data:{MIME_type};base64,{base64_image}.

    1. Replace {MIME_type} with the media type of the image. The value must match an entry in the MIME Type column of the Supported image formats table, such as image/jpeg or image/png.

    2. Replace {base64_image} with the Base64-encoded string from the previous step.

  3. Call the model by passing the data URL to the image or image_url parameter.
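
The three steps above can be combined into a small helper. This is a sketch, not part of the SDK: the standard-library `mimetypes` module guesses the media type from the file extension, so confirm the guessed type appears in the Supported image formats table before calling the model.

```python
import base64
import mimetypes


def to_data_url(image_path):
    # Step 1: read the file and Base64-encode its bytes.
    with open(image_path, "rb") as image_file:
        encoded = base64.b64encode(image_file.read()).decode("utf-8")
    # Step 2: guess the MIME type from the extension (for example, image/png).
    mime_type, _ = mimetypes.guess_type(image_path)
    # Step 3: assemble the data URL to pass to the image or image_url parameter.
    return f"data:{mime_type};base64,{encoded}"
```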

File path

You can pass the local file path directly to the model. This method is supported only by the DashScope Python and Java SDKs and is not available for DashScope HTTP or OpenAI-compatible requests.

Use the following table to specify the file path for your programming language and operating system.

File path examples

  • Linux or macOS (Python SDK or Java SDK): file://{absolute_path_to_file}

    Example: file:///home/images/test.png

  • Windows (Python SDK): file://{absolute_path_to_file}

    Example: file://D:/images/test.png

  • Windows (Java SDK): file:///{absolute_path_to_file}

    Example: file:///D:/images/test.png
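
As a sketch of the table above, a hypothetical helper for the DashScope Python SDK can build the URI by prefixing file:// to an absolute path. This function is not part of the SDK; on Windows the drive letter follows the prefix directly, as in file://D:/images/test.png.

```python
import os


def to_file_uri(local_path):
    # Resolve to an absolute path and normalize separators to forward slashes.
    absolute = os.path.abspath(local_path).replace(os.sep, "/")
    # Prepend the file:// scheme; an absolute Linux/macOS path starts with /,
    # so the result has three slashes, such as file:///home/images/test.png.
    return "file://" + absolute
```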

Image

File path

Python

import os
import dashscope

# Configurations are region-specific. Update the endpoint accordingly.
dashscope.base_http_api_url = 'https://dashscope-intl.aliyuncs.com/api/v1'

# Replace "xxx/eagle.png" with the absolute path to your local image.
local_path = "xxx/eagle.png"
image_path = f"file://{local_path}"
messages = [
                {'role':'user',
                'content': [{'image': image_path},
                            {'text': 'What does the image depict?'}]}]
response = dashscope.MultiModalConversation.call(
    # API keys are region-specific. To get an API key, visit: https://www.alibabacloud.com/help/en/model-studio/get-api-key
    # If not using an environment variable, provide your Model Studio API key directly: api_key="sk-xxx"
    api_key=os.getenv('DASHSCOPE_API_KEY'),
    model='qwen3.5-plus',  # This example uses the qwen3.5-plus model, which you can replace as needed. For the model list, see: https://www.alibabacloud.com/help/model-studio/getting-started/models
    messages=messages)
print(response.output.choices[0].message.content[0]["text"])

Java

import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversation;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationParam;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationResult;
import com.alibaba.dashscope.common.MultiModalMessage;
import com.alibaba.dashscope.common.Role;
import com.alibaba.dashscope.exception.ApiException;
import com.alibaba.dashscope.exception.NoApiKeyException;
import com.alibaba.dashscope.exception.UploadFileException;
import com.alibaba.dashscope.utils.Constants;

public class Main {

    static {
        // Configurations are region-specific. Update the endpoint accordingly.
        Constants.baseHttpApiUrl="https://dashscope-intl.aliyuncs.com/api/v1";
    }
    
    public static void callWithLocalFile(String localPath)
            throws ApiException, NoApiKeyException, UploadFileException {
        String filePath = "file://"+localPath;
        MultiModalConversation conv = new MultiModalConversation();
        MultiModalMessage userMessage = MultiModalMessage.builder().role(Role.USER.getValue())
                .content(Arrays.asList(new HashMap<String, Object>(){{put("image", filePath);}},
                        new HashMap<String, Object>(){{put("text", "What does the image depict?");}})).build();
        MultiModalConversationParam param = MultiModalConversationParam.builder()
                // API keys are region-specific. To get an API key, visit: https://www.alibabacloud.com/help/en/model-studio/get-api-key
                // If not using an environment variable, provide your Model Studio API key directly: .apiKey("sk-xxx")
                .apiKey(System.getenv("DASHSCOPE_API_KEY"))
                .model("qwen3.5-plus")  // This example uses the qwen3.5-plus model, which you can replace as needed. For the model list, see: https://www.alibabacloud.com/help/model-studio/getting-started/models
                .messages(Arrays.asList(userMessage))
                .build();
        MultiModalConversationResult result = conv.call(param);
        System.out.println(result.getOutput().getChoices().get(0).getMessage().getContent().get(0).get("text"));
    }

    public static void main(String[] args) {
        try {
            // Replace "xxx/eagle.png" with the absolute path to your local image.
            callWithLocalFile("xxx/eagle.png");
        } catch (ApiException | NoApiKeyException | UploadFileException e) {
            System.out.println(e.getMessage());
        }
        System.exit(0);
    }
}

Base64 encoding

OpenAI compatible

Python

from openai import OpenAI
import os
import base64


# Converts a local file to a Base64-encoded string.
def encode_image(image_path):
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode("utf-8")

# Replace "xxx/eagle.png" with the absolute path to your local image.
base64_image = encode_image("xxx/eagle.png")
client = OpenAI(
    # API keys are region-specific. To get an API key, visit: https://www.alibabacloud.com/help/en/model-studio/get-api-key
    # If not using an environment variable, provide your Model Studio API key directly: api_key="sk-xxx"
    api_key=os.getenv('DASHSCOPE_API_KEY'),
    # Configurations are region-specific. Update the endpoint accordingly.
    base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
)
completion = client.chat.completions.create(
    model="qwen3.5-plus", # This example uses the qwen3.5-plus model, which you can replace as needed. For the model list, see: https://www.alibabacloud.com/help/model-studio/getting-started/models
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    # Note: When providing Base64-encoded data, the image format in the data URI (e.g., image/{format}) must be a supported content type.
                    # PNG image:  f"data:image/png;base64,{base64_image}"
                    # JPEG image: f"data:image/jpeg;base64,{base64_image}"
                    # WEBP image: f"data:image/webp;base64,{base64_image}"
                    "image_url": {"url": f"data:image/png;base64,{base64_image}"},
                },
                {"type": "text", "text": "What does the image depict?"},
            ],
        }
    ],
)
print(completion.choices[0].message.content)

Node.js

import OpenAI from "openai";
import { readFileSync } from 'fs';


const openai = new OpenAI(
    {
        // API keys are region-specific. To get an API key, visit: https://www.alibabacloud.com/help/en/model-studio/get-api-key
        // If not using an environment variable, provide your Model Studio API key directly: apiKey: "sk-xxx"
        apiKey: process.env.DASHSCOPE_API_KEY,
        // Configurations are region-specific. Update the endpoint accordingly.
        baseURL: "https://dashscope-intl.aliyuncs.com/compatible-mode/v1"
    }
);

const encodeImage = (imagePath) => {
    const imageFile = readFileSync(imagePath);
    return imageFile.toString('base64');
  };
// Replace "xxx/eagle.png" with the absolute path to your local image.
const base64Image = encodeImage("xxx/eagle.png")
async function main() {
    const completion = await openai.chat.completions.create({
        model: "qwen3.5-plus",  // This example uses the qwen3.5-plus model, which you can replace as needed. For the model list, see: https://www.alibabacloud.com/help/model-studio/getting-started/models
        messages: [
            {"role": "user",
            "content": [{"type": "image_url",
                            // Note: When providing Base64-encoded data, the image format in the data URI (e.g., image/{format}) must be a supported content type.
                           // PNG image:  data:image/png;base64,${base64Image}
                          // JPEG image: data:image/jpeg;base64,${base64Image}
                         // WEBP image: data:image/webp;base64,${base64Image}
                        "image_url": {"url": `data:image/png;base64,${base64Image}`},},
                        {"type": "text", "text": "What does the image depict?"}]}]
    });
    console.log(completion.choices[0].message.content);
} 

main();

curl

  • For an example of converting a file to a Base64-encoded string, see the example code.

  • For display purposes, the Base64-encoded string "data:image/jpeg;base64,/9j/4AAQSkZJRgABAQAAAQABAAD/2wBDAA..." in the code is truncated. You must pass the complete encoded string in your request.
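
As a sketch of how the complete encoded string can be produced, the request body can be assembled in the shell before running the command. The one-byte stand-in file below is for illustration only; replace it with your real image and match the MIME type in the data URL to its actual format.

```shell
# Create a stand-in file for illustration only; use your real image instead.
printf 'x' > eagle.png
# Base64-encode the file; tr strips newlines so the string is a single line.
B64=$(base64 eagle.png | tr -d '\n')
# Write the request body containing the full encoded string.
cat > payload.json <<EOF
{
  "model": "qwen3.5-plus",
  "messages": [{
    "role": "user",
    "content": [
      {"type": "image_url", "image_url": {"url": "data:image/png;base64,${B64}"}},
      {"type": "text", "text": "What does the image depict?"}
    ]
  }]
}
EOF
```

The body can then be sent with curl using --data @payload.json against the same endpoint as below.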

# ======= Important =======
# API keys are region-specific. To get an API key, visit: https://www.alibabacloud.com/help/en/model-studio/get-api-key
# Configurations are region-specific. Update the URL accordingly.
# === Delete this comment before you run the command ===

curl --location 'https://dashscope-intl.aliyuncs.com/compatible-mode/v1/chat/completions' \
--header "Authorization: Bearer $DASHSCOPE_API_KEY" \
--header 'Content-Type: application/json' \
--data '{
  "model": "qwen3.5-plus",
  "messages": [
  {
    "role": "user",
    "content": [
      {"type": "image_url", "image_url": {"url": "data:image/jpg;base64,/9j/4AAQSkZJRgABAQAAAQABAAD/2wBDAA"}},
      {"type": "text", "text": "What does the image depict?"}
    ]
  }]
}'

DashScope

Python

import base64
import os
import dashscope

# Configurations are region-specific. Update the endpoint accordingly.
dashscope.base_http_api_url = 'https://dashscope-intl.aliyuncs.com/api/v1'

# Converts a local file to a Base64-encoded string.
def encode_image(image_path):
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode("utf-8")


# Replace "xxx/eagle.png" with the absolute path to your local image.
base64_image = encode_image("xxx/eagle.png")

messages = [
    {
        "role": "user",
        "content": [
            # Note: When providing Base64-encoded data, the image format in the data URI (e.g., image/{format}) must be a supported content type.
            # PNG image:  f"data:image/png;base64,{base64_image}"
            # JPEG image: f"data:image/jpeg;base64,{base64_image}"
            # WEBP image: f"data:image/webp;base64,{base64_image}"
            {"image": f"data:image/png;base64,{base64_image}"},
            {"text": "What does the image depict?"},
        ],
    },
]

response = dashscope.MultiModalConversation.call(
    # API keys are region-specific. To get an API key, visit: https://www.alibabacloud.com/help/en/model-studio/get-api-key
    # If not using an environment variable, provide your Model Studio API key directly: api_key="sk-xxx"
    api_key=os.getenv("DASHSCOPE_API_KEY"),
    model="qwen3.5-plus",  # This example uses the qwen3.5-plus model, which you can replace as needed. For the model list, see: https://www.alibabacloud.com/help/model-studio/getting-started/models
    messages=messages,
)
print(response.output.choices[0].message.content[0]["text"])

Java

import java.io.IOException;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.Base64;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

import com.alibaba.dashscope.aigc.multimodalconversation.*;
import com.alibaba.dashscope.common.MultiModalMessage;
import com.alibaba.dashscope.common.Role;
import com.alibaba.dashscope.exception.ApiException;
import com.alibaba.dashscope.exception.NoApiKeyException;
import com.alibaba.dashscope.exception.UploadFileException;
import com.alibaba.dashscope.utils.Constants;

public class Main {

    static {
        // Configurations are region-specific. Update the endpoint accordingly.
        Constants.baseHttpApiUrl="https://dashscope-intl.aliyuncs.com/api/v1";
    }

    private static String encodeImageToBase64(String imagePath) throws IOException {
        Path path = Paths.get(imagePath);
        byte[] imageBytes = Files.readAllBytes(path);
        return Base64.getEncoder().encodeToString(imageBytes);
    }

    public static void callWithLocalFile(String localPath) throws ApiException, NoApiKeyException, UploadFileException, IOException {

        String base64Image = encodeImageToBase64(localPath); // Base64 encoding

        MultiModalConversation conv = new MultiModalConversation();
        MultiModalMessage userMessage = MultiModalMessage.builder().role(Role.USER.getValue())
                .content(Arrays.asList(
                        new HashMap<String, Object>() {{ put("image", "data:image/png;base64," + base64Image); }},
                        new HashMap<String, Object>() {{ put("text", "What does the image depict?"); }}
                )).build();

        MultiModalConversationParam param = MultiModalConversationParam.builder()
                // API keys are region-specific. To get an API key, visit: https://www.alibabacloud.com/help/en/model-studio/get-api-key
                .apiKey(System.getenv("DASHSCOPE_API_KEY"))
                .model("qwen3.5-plus")
                .messages(Arrays.asList(userMessage))
                .build();

        MultiModalConversationResult result = conv.call(param);
        System.out.println(result.getOutput().getChoices().get(0).getMessage().getContent().get(0).get("text"));
    }

    public static void main(String[] args) {
        try {
            // Replace "xxx/eagle.png" with the absolute path to your local image.
            callWithLocalFile("xxx/eagle.png");
        } catch (ApiException | NoApiKeyException | UploadFileException | IOException e) {
            System.out.println(e.getMessage());
        }
        System.exit(0);
    }
}

curl

  • For an example of converting a file to a Base64-encoded string, see the example code.

  • For display purposes, the Base64-encoded string "data:image/jpeg;base64,/9j/4AAQSkZJRgABAQAAAQABAAD/2wBDAA..." in the code is truncated. You must pass the complete encoded string in your request.

# ======= Important =======
# Configurations are region-specific. Update the URL accordingly.
# API keys are region-specific. To get an API key, visit: https://www.alibabacloud.com/help/en/model-studio/get-api-key
# === Delete this comment before you run the command ===

curl -X POST https://dashscope-intl.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation \
-H "Authorization: Bearer $DASHSCOPE_API_KEY" \
-H 'Content-Type: application/json' \
-d '{
    "model": "qwen3.5-plus",
    "input":{
        "messages":[
            {
             "role": "user",
             "content": [
               {"image": "data:image/jpeg;base64,/9j/4AAQSkZJRgABAQAAAQABAAD/2wBDAA..."},
               {"text": "What does the image depict?"}
                ]
            }
        ]
    }
}'

Video file

This example uses the local file test.mp4.

File path

Python

import os
import dashscope

# Endpoints vary by region. Modify the endpoint accordingly.
dashscope.base_http_api_url = 'https://dashscope-intl.aliyuncs.com/api/v1'

# Replace "xxx/test.mp4" with the absolute path to your local video file.
local_path = "xxx/test.mp4"
video_path = f"file://{local_path}"
messages = [
                {'role':'user',
                # The fps parameter controls the frame extraction rate, extracting one frame every 1/fps seconds.
                'content': [{'video': video_path, 'fps': 2},
                            {'text': 'What does this video depict?'}]}]
response = dashscope.MultiModalConversation.call(
    # API keys are specific to each region. Get an API key at: https://www.alibabacloud.com/help/en/model-studio/get-api-key
    # If you are not using an environment variable, replace the following line with your Model Studio API key: api_key="sk-xxx"
    api_key=os.getenv('DASHSCOPE_API_KEY'),
    model='qwen3.5-plus',  
    messages=messages)
print(response.output.choices[0].message.content[0]["text"])

Java

import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversation;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationParam;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationResult;
import com.alibaba.dashscope.common.MultiModalMessage;
import com.alibaba.dashscope.common.Role;
import com.alibaba.dashscope.exception.ApiException;
import com.alibaba.dashscope.exception.NoApiKeyException;
import com.alibaba.dashscope.exception.UploadFileException;
import com.alibaba.dashscope.utils.Constants;

public class Main {

    static {
        // Endpoints vary by region. Modify the Constants.baseHttpApiUrl accordingly.
        Constants.baseHttpApiUrl="https://dashscope-intl.aliyuncs.com/api/v1";
    }
    
    public static void callWithLocalFile(String localPath)
            throws ApiException, NoApiKeyException, UploadFileException {
        String filePath = "file://"+localPath;
        MultiModalConversation conv = new MultiModalConversation();
        MultiModalMessage userMessage = MultiModalMessage.builder().role(Role.USER.getValue())
                .content(Arrays.asList(new HashMap<String, Object>()
                                       {{
                                           put("video", filePath);// The fps parameter controls the frame extraction rate, extracting one frame every 1/fps seconds.
                                           put("fps", 2);
                                       }}, 
                        new HashMap<String, Object>(){{put("text", "What does this video depict?");}})).build();
        MultiModalConversationParam param = MultiModalConversationParam.builder()
                // API keys are specific to each region. Get an API key at: https://www.alibabacloud.com/help/en/model-studio/get-api-key
                // If you are not using an environment variable, replace the following line with your Model Studio API key: .apiKey("sk-xxx")
                .apiKey(System.getenv("DASHSCOPE_API_KEY"))
                .model("qwen3.5-plus")  
                .messages(Arrays.asList(userMessage))
                .build();
        MultiModalConversationResult result = conv.call(param);
        System.out.println(result.getOutput().getChoices().get(0).getMessage().getContent().get(0).get("text"));
    }

    public static void main(String[] args) {
        try {
            // Replace "xxx/test.mp4" with the absolute path to your local video file.
            callWithLocalFile("xxx/test.mp4");
        } catch (ApiException | NoApiKeyException | UploadFileException e) {
            System.out.println(e.getMessage());
        }
        System.exit(0);
    }
}

Base64 encoding

OpenAI compatible

Python

from openai import OpenAI
import os
import base64


# Converts a local file to a Base64-encoded string.
def encode_video(video_path):
    with open(video_path, "rb") as video_file:
        return base64.b64encode(video_file.read()).decode("utf-8")

# Replace "xxx/test.mp4" with the absolute path to your local video file.
base64_video = encode_video("xxx/test.mp4")
client = OpenAI(
    # API keys are specific to each region. Get an API key at: https://www.alibabacloud.com/help/en/model-studio/get-api-key
    # If you are not using an environment variable, replace the following line with your Model Studio API key: api_key="sk-xxx"
    api_key=os.getenv('DASHSCOPE_API_KEY'),
    # Endpoints vary by region. Modify the endpoint accordingly.
    base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
)
completion = client.chat.completions.create(
    model="qwen3.5-plus",  
    messages=[
        {
            "role": "user",
            "content": [
                {
                    # When passing a video file as Base64 data, set the type to "video_url".
                    "type": "video_url",
                    "video_url": {"url": f"data:video/mp4;base64,{base64_video}"},
                    "fps":2
                },
                {"type": "text", "text": "What does this video depict?"},
            ],
        }
    ],
)
print(completion.choices[0].message.content)

Node.js

import OpenAI from "openai";
import { readFileSync } from 'fs';

const openai = new OpenAI(
    {
        // API keys are specific to each region. Get an API key at: https://www.alibabacloud.com/help/en/model-studio/get-api-key
        // If you are not using an environment variable, replace the following line with your Model Studio API key: apiKey: "sk-xxx"
        apiKey: process.env.DASHSCOPE_API_KEY,
        // Endpoints vary by region. Modify the endpoint accordingly.
        baseURL: "https://dashscope-intl.aliyuncs.com/compatible-mode/v1"
    }
);

const encodeVideo = (videoPath) => {
    const videoFile = readFileSync(videoPath);
    return videoFile.toString('base64');
  };
// Replace "xxx/test.mp4" with the absolute path to your local video file.
const base64Video = encodeVideo("xxx/test.mp4")
async function main() {
    const completion = await openai.chat.completions.create({
        model: "qwen3.5-plus", 
        messages: [
            {"role": "user",
             "content": [{
                 // When passing a video file as Base64 data, set the type to "video_url".
                "type": "video_url", 
                "video_url": {"url": `data:video/mp4;base64,${base64Video}`},
                "fps":2},
                 {"type": "text", "text": "What does this video depict?"}]}]
    });
    console.log(completion.choices[0].message.content);
}

main();

Curl

  • For an example of converting a file to a Base64-encoded string, see the example code.

  • For display purposes, the Base64-encoded string "data:video/mp4;base64,/9j/4AAQSkZJRgABAQAAAQABAAD/2wBDAA..." in the code is truncated. You must pass the complete encoded string in your request.

# ======= IMPORTANT =======
# API keys are specific to each region. Get an API key at: https://www.alibabacloud.com/help/en/model-studio/get-api-key
# Endpoints vary by region. Modify the endpoint accordingly.
# === Delete this comment before running the command ===

curl --location 'https://dashscope-intl.aliyuncs.com/compatible-mode/v1/chat/completions' \
--header "Authorization: Bearer $DASHSCOPE_API_KEY" \
--header 'Content-Type: application/json' \
--data '{
  "model": "qwen3.5-plus",
  "messages": [
  {
    "role": "user",
    "content": [
      {"type": "video_url", "video_url": {"url": "data:video/mp4;base64,/9j/4AAQSkZJRgABAQAAAQABAAD/2wBDAA..."},"fps":2},
      {"type": "text", "text": "What does this video depict?"}
    ]
  }]
}'

DashScope

Python

import base64
import os
import dashscope

# Endpoints vary by region. Modify the endpoint accordingly.
dashscope.base_http_api_url = 'https://dashscope-intl.aliyuncs.com/api/v1'

# Converts a local file to a Base64-encoded string.
def encode_video(video_path):
    with open(video_path, "rb") as video_file:
        return base64.b64encode(video_file.read()).decode("utf-8")

# Replace "xxx/test.mp4" with the absolute path to your local video file.
base64_video = encode_video("xxx/test.mp4")

messages = [{'role':'user',
                # The fps parameter controls the frame extraction rate, extracting one frame every 1/fps seconds.
             'content': [{'video': f"data:video/mp4;base64,{base64_video}", 'fps': 2},
                            {'text': 'What does this video depict?'}]}]
response = dashscope.MultiModalConversation.call(
    # API keys are specific to each region. Get an API key at: https://www.alibabacloud.com/help/en/model-studio/get-api-key
    # If you are not using an environment variable, replace the following line with your Model Studio API key: api_key="sk-xxx"
    api_key=os.getenv('DASHSCOPE_API_KEY'),
    model='qwen3.5-plus',
    messages=messages)

print(response.output.choices[0].message.content[0]["text"])

Java

import java.io.IOException;
import java.util.*;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

import com.alibaba.dashscope.aigc.multimodalconversation.*;
import com.alibaba.dashscope.common.MultiModalMessage;
import com.alibaba.dashscope.common.Role;
import com.alibaba.dashscope.exception.ApiException;
import com.alibaba.dashscope.exception.NoApiKeyException;
import com.alibaba.dashscope.exception.UploadFileException;
import com.alibaba.dashscope.utils.Constants;

public class Main {

    static {
        // Endpoints vary by region. Modify the Constants.baseHttpApiUrl accordingly.
        Constants.baseHttpApiUrl="https://dashscope-intl.aliyuncs.com/api/v1";
    }
    
    private static String encodeVideoToBase64(String videoPath) throws IOException {
        Path path = Paths.get(videoPath);
        byte[] videoBytes = Files.readAllBytes(path);
        return Base64.getEncoder().encodeToString(videoBytes);
    }

    public static void callWithLocalFile(String localPath)
            throws ApiException, NoApiKeyException, UploadFileException, IOException {

        String base64Video = encodeVideoToBase64(localPath);

        MultiModalConversation conv = new MultiModalConversation();
        MultiModalMessage userMessage = MultiModalMessage.builder().role(Role.USER.getValue())
                .content(Arrays.asList(new HashMap<String, Object>()
                                       {{
                                           put("video", "data:video/mp4;base64," + base64Video);// The fps parameter controls the frame extraction rate, extracting one frame every 1/fps seconds.
                                           put("fps", 2);
                                       }},
                        new HashMap<String, Object>(){{put("text", "What does this video depict?");}})).build();

        MultiModalConversationParam param = MultiModalConversationParam.builder()
                // API keys are specific to each region. Get an API key at: https://www.alibabacloud.com/help/en/model-studio/get-api-key
                // If you are not using an environment variable, replace the following line with your Model Studio API key: .apiKey("sk-xxx")
                .apiKey(System.getenv("DASHSCOPE_API_KEY"))
                .model("qwen3.5-plus")
                .messages(Arrays.asList(userMessage))
                .build();

        MultiModalConversationResult result = conv.call(param);
        System.out.println(result.getOutput().getChoices().get(0).getMessage().getContent().get(0).get("text"));
    }

    public static void main(String[] args) {
        try {
            // Replace "xxx/test.mp4" with the absolute path to your local video file.
            callWithLocalFile("xxx/test.mp4");
        } catch (ApiException | NoApiKeyException | UploadFileException | IOException e) {
            System.out.println(e.getMessage());
        }
        System.exit(0);
    }
}

Curl

  • See the sample code for an example of how to Base64-encode a file.

  • For display purposes, the Base64-encoded string "data:video/mp4;base64,/9j/4AAQSkZJRgABAQAAAQABAAD/2wBDAA..." in the code is truncated. You must pass the complete encoded string in your request.
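
The encoding step the notes above refer to can be sketched in Python as follows. This is a minimal illustration; the MIME type in the prefix must match your actual file format, and `test.mp4` is a placeholder path.

```python
import base64

def video_data_url(path: str) -> str:
    """Base64-encode a local video and wrap it in a data URL for the 'video' field."""
    with open(path, "rb") as f:
        encoded = base64.b64encode(f.read()).decode("utf-8")
    # The MIME type must match the actual file format (video/mp4 here).
    return "data:video/mp4;base64," + encoded
```

Calling `video_data_url("test.mp4")` produces the complete string that the truncated example in the curl request stands in for.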

# ======= IMPORTANT =======
# Endpoints vary by region. Modify the endpoint accordingly.
# API keys are specific to each region. Get an API key at: https://www.alibabacloud.com/help/en/model-studio/get-api-key
# === Delete this comment before running the command ===

curl -X POST https://dashscope-intl.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation \
-H "Authorization: Bearer $DASHSCOPE_API_KEY" \
-H 'Content-Type: application/json' \
-d '{
    "model": "qwen3.5-plus",
    "input":{
        "messages":[
            {
             "role": "user",
             "content": [
               {"video": "data:video/mp4;base64,/9j/4AAQSkZJRgABAQAAAQABAAD/2wBDAA..."},
               {"text": "What does this video depict?"}
                ]
            }
        ]
    }
}'

Image list

This example uses the local files football1.jpg, football2.jpg, football3.jpg, and football4.jpg.

File path

Python

import os
import dashscope

# Configurations vary by region. Modify the endpoint accordingly.
dashscope.base_http_api_url = 'https://dashscope-intl.aliyuncs.com/api/v1'

# Replace the following with the absolute paths to your local image files.
local_path1 = "football1.jpg"
local_path2 = "football2.jpg"
local_path3 = "football3.jpg"
local_path4 = "football4.jpg"

image_path1 = f"file://{local_path1}"
image_path2 = f"file://{local_path2}"
image_path3 = f"file://{local_path3}"
image_path4 = f"file://{local_path4}"

messages = [{'role': 'user',
             # For image lists, the fps parameter is available with the Qwen3.5, Qwen3-VL, and Qwen2.5-VL series models.
             'content': [{'video': [image_path1, image_path2, image_path3, image_path4], 'fps': 2},
                         {'text': 'What does this video depict?'}]}]
response = dashscope.MultiModalConversation.call(
    # API keys vary by region. Get an API key at: https://www.alibabacloud.com/help/en/model-studio/get-api-key
    # If you have not set an environment variable, replace the following line with your Model Studio API key: api_key="sk-xxx"
    api_key=os.getenv('DASHSCOPE_API_KEY'),
    model='qwen3.5-plus',  # This example uses the qwen3.5-plus model. You can replace it as needed. For a model list, see: https://www.alibabacloud.com/help/en/model-studio/getting-started/models
    messages=messages)

print(response.output.choices[0].message.content[0]["text"])

Java

// DashScope SDK version 2.21.10 or later is required.
import java.util.Arrays;
import java.util.Map;
import java.util.Collections;
import java.util.HashMap;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversation;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationParam;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationResult;
import com.alibaba.dashscope.common.MultiModalMessage;
import com.alibaba.dashscope.common.Role;
import com.alibaba.dashscope.exception.ApiException;
import com.alibaba.dashscope.exception.NoApiKeyException;
import com.alibaba.dashscope.exception.UploadFileException;
import com.alibaba.dashscope.utils.Constants;

public class Main {

    static {
        // Configurations vary by region. Modify the endpoint accordingly.
        Constants.baseHttpApiUrl="https://dashscope-intl.aliyuncs.com/api/v1";
    }
    
    private static final String MODEL_NAME = "qwen3.5-plus";  // This example uses the qwen3.5-plus model. You can replace it as needed. For a model list, see: https://www.alibabacloud.com/help/en/model-studio/getting-started/models
    public static void videoImageListSample(String localPath1, String localPath2, String localPath3, String localPath4)
            throws ApiException, NoApiKeyException, UploadFileException {
        MultiModalConversation conv = new MultiModalConversation();
        String filePath1 = "file://" + localPath1;
        String filePath2 = "file://" + localPath2;
        String filePath3 = "file://" + localPath3;
        String filePath4 = "file://" + localPath4;
        Map<String, Object> params = new HashMap<>();
        params.put("video", Arrays.asList(filePath1,filePath2,filePath3,filePath4));
        // For image lists, the fps parameter is available with the Qwen3.5, Qwen3-VL, and Qwen2.5-VL series models.
        params.put("fps", 2);
        MultiModalMessage userMessage = MultiModalMessage.builder()
                .role(Role.USER.getValue())
                .content(Arrays.asList(params,
                        Collections.singletonMap("text", "Describe the process shown in this video.")))
                .build();
        MultiModalConversationParam param = MultiModalConversationParam.builder()
                // API keys vary by region. Get an API key at: https://www.alibabacloud.com/help/en/model-studio/get-api-key
                // If you have not set an environment variable, replace the following line with your Model Studio API key: .apiKey("sk-xxx")
                .apiKey(System.getenv("DASHSCOPE_API_KEY"))
                .model(MODEL_NAME)
                .messages(Arrays.asList(userMessage)).build();
        MultiModalConversationResult result = conv.call(param);
        System.out.print(result.getOutput().getChoices().get(0).getMessage().getContent().get(0).get("text"));
    }
    public static void main(String[] args) {
        try {
            videoImageListSample(
                    "xxx/football1.jpg",
                    "xxx/football2.jpg",
                    "xxx/football3.jpg",
                    "xxx/football4.jpg");
        } catch (ApiException | NoApiKeyException | UploadFileException e) {
            System.out.println(e.getMessage());
        }
        System.exit(0);
    }
}

Base64 encoding

OpenAI compatible

Python

import os
from openai import OpenAI
import base64

# Converts a local file to a Base64-encoded string.
def encode_image(image_path):
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode("utf-8")

base64_image1 = encode_image("football1.jpg")
base64_image2 = encode_image("football2.jpg")
base64_image3 = encode_image("football3.jpg")
base64_image4 = encode_image("football4.jpg")
client = OpenAI(
    # API keys vary by region. Get an API key at: https://www.alibabacloud.com/help/en/model-studio/get-api-key
    # If you have not set an environment variable, replace the following line with your Model Studio API key: api_key="sk-xxx",
    api_key=os.getenv("DASHSCOPE_API_KEY"),
    # Configurations vary by region. Modify the endpoint accordingly.
    base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
)
completion = client.chat.completions.create(
    model="qwen3.5-plus",  # This example uses the qwen3.5-plus model. You can replace it as needed. For a model list, see: https://www.alibabacloud.com/help/en/model-studio/getting-started/models
    messages=[  
    {"role": "user","content": [
        {"type": "video","video": [
            f"data:image/jpeg;base64,{base64_image1}",
            f"data:image/jpeg;base64,{base64_image2}",
            f"data:image/jpeg;base64,{base64_image3}",
            f"data:image/jpeg;base64,{base64_image4}",]},
        {"type": "text","text": "Describe the process shown in this video."},
    ]}]
)
print(completion.choices[0].message.content)

Node.js

import OpenAI from "openai";
import { readFileSync } from 'fs';

const openai = new OpenAI(
    {
        // API keys vary by region. Get an API key at: https://www.alibabacloud.com/help/en/model-studio/get-api-key
        // If you have not set an environment variable, replace the following line with your Model Studio API key: apiKey: "sk-xxx"
        apiKey: process.env.DASHSCOPE_API_KEY,
        // Configurations vary by region. Modify the baseURL accordingly.
        baseURL: "https://dashscope-intl.aliyuncs.com/compatible-mode/v1"
    }
);

const encodeImage = (imagePath) => {
    const imageFile = readFileSync(imagePath);
    return imageFile.toString('base64');
};

const base64Image1 = encodeImage("football1.jpg")
const base64Image2 = encodeImage("football2.jpg")
const base64Image3 = encodeImage("football3.jpg")
const base64Image4 = encodeImage("football4.jpg")
async function main() {
    const completion = await openai.chat.completions.create({
        model: "qwen3.5-plus",  // This example uses the qwen3.5-plus model. You can replace it as needed. For a model list, see: https://www.alibabacloud.com/help/en/model-studio/getting-started/models
        messages: [
            {"role": "user",
             "content": [{"type": "video",
                        "video": [
                            `data:image/jpeg;base64,${base64Image1}`,
                            `data:image/jpeg;base64,${base64Image2}`,
                            `data:image/jpeg;base64,${base64Image3}`,
                            `data:image/jpeg;base64,${base64Image4}`]},
                        {"type": "text", "text": "What does this video depict?"}]}]
    });
    console.log(completion.choices[0].message.content);
}

main();

Curl

  • See the sample code for an example of converting a file to a Base64-encoded string.

  • For readability, the Base64-encoded string in the code ("data:image/jpeg;base64,/9j/4AAQSkZJRgABAQAAAQABAAD/2wBDAA...") is truncated. You must use the complete string in your request.

# ======= Important =======
# Configurations vary by region. Modify the URL accordingly.
# API keys vary by region. Get an API key at: https://www.alibabacloud.com/help/en/model-studio/get-api-key
# === Delete this comment before you run the command ===

curl -X POST https://dashscope-intl.aliyuncs.com/compatible-mode/v1/chat/completions \
-H "Authorization: Bearer $DASHSCOPE_API_KEY" \
-H 'Content-Type: application/json' \
-d '{
    "model": "qwen3.5-plus",
    "messages": [{"role": "user",
                "content": [{"type": "video",
                "video": [
                          "data:image/jpeg;base64,/9j/4AAQSkZJRgABAQAAAQABAAD/2wBDAA...",
                          "data:image/jpeg;base64,nEpp6jpnP57MoWSyOWwrkXMJhHRCWYeFYb...",
                          "data:image/jpeg;base64,JHWQnJPc40GwQ7zERAtRMK6iIhnWw4080s...",
                          "data:image/jpeg;base64,adB6QOU5HP7dAYBBOg/Fb7KIptlbyEOu58..."
                          ]},
                {"type": "text",
                "text": "Describe the process shown in this video."}]}]
}'

DashScope

Python

import base64
import os
import dashscope

# Configurations vary by region. Modify the endpoint accordingly.
dashscope.base_http_api_url = 'https://dashscope-intl.aliyuncs.com/api/v1'

# Converts a local file to a Base64-encoded string.
def encode_image(image_path):
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode("utf-8")

base64_image1 = encode_image("football1.jpg")
base64_image2 = encode_image("football2.jpg")
base64_image3 = encode_image("football3.jpg")
base64_image4 = encode_image("football4.jpg")


messages = [{'role':'user',
            'content': [
                    {'video':
                         [f"data:image/jpeg;base64,{base64_image1}",
                          f"data:image/jpeg;base64,{base64_image2}",
                          f"data:image/jpeg;base64,{base64_image3}",
                          f"data:image/jpeg;base64,{base64_image4}"
                         ]
                    },
                    {'text': 'Describe the process shown in this video.'}]}]
response = dashscope.MultiModalConversation.call(
    # API keys vary by region. Get an API key at: https://www.alibabacloud.com/help/en/model-studio/get-api-key
    api_key=os.getenv("DASHSCOPE_API_KEY"),
    model='qwen3.5-plus',  # This example uses the qwen3.5-plus model. You can replace it as needed. For a model list, see: https://www.alibabacloud.com/help/en/model-studio/getting-started/models
    messages=messages)

print(response.output.choices[0].message.content[0]["text"])

Java

import java.io.IOException;
import java.util.*;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

import com.alibaba.dashscope.aigc.multimodalconversation.*;
import com.alibaba.dashscope.common.MultiModalMessage;
import com.alibaba.dashscope.common.Role;
import com.alibaba.dashscope.exception.ApiException;
import com.alibaba.dashscope.exception.NoApiKeyException;
import com.alibaba.dashscope.exception.UploadFileException;
import com.alibaba.dashscope.utils.Constants;

public class Main {

    static {
        // Configurations vary by region. Modify the endpoint accordingly.
        Constants.baseHttpApiUrl="https://dashscope-intl.aliyuncs.com/api/v1";
    }

    private static String encodeImageToBase64(String imagePath) throws IOException {
        Path path = Paths.get(imagePath);
        byte[] imageBytes = Files.readAllBytes(path);
        return Base64.getEncoder().encodeToString(imageBytes);
    }

    public static void videoImageListSample(String localPath1,String localPath2,String localPath3,String localPath4)
            throws ApiException, NoApiKeyException, UploadFileException, IOException {

        String base64Image1 = encodeImageToBase64(localPath1);
        String base64Image2 = encodeImageToBase64(localPath2);
        String base64Image3 = encodeImageToBase64(localPath3);
        String base64Image4 = encodeImageToBase64(localPath4);

        MultiModalConversation conv = new MultiModalConversation();
        Map<String, Object> params = new HashMap<>();
        params.put("video", Arrays.asList(
                        "data:image/jpeg;base64," + base64Image1,
                        "data:image/jpeg;base64," + base64Image2,
                        "data:image/jpeg;base64," + base64Image3,
                        "data:image/jpeg;base64," + base64Image4));
        // For image lists, the fps parameter is available with the Qwen3.5, Qwen3-VL, and Qwen2.5-VL series models.
        params.put("fps", 2);
        MultiModalMessage userMessage = MultiModalMessage.builder()
                .role(Role.USER.getValue())
                .content(Arrays.asList(params,
                        Collections.singletonMap("text", "Describe the process shown in this video.")))
                .build();

        MultiModalConversationParam param = MultiModalConversationParam.builder()
                // API keys vary by region. Get an API key at: https://www.alibabacloud.com/help/en/model-studio/get-api-key
                .apiKey(System.getenv("DASHSCOPE_API_KEY"))
                .model("qwen3.5-plus")
                .messages(Arrays.asList(userMessage))
                .build();

        MultiModalConversationResult result = conv.call(param);
        System.out.println(result.getOutput().getChoices().get(0).getMessage().getContent().get(0).get("text"));
    }

    public static void main(String[] args) {
        try {
            // Replace placeholders like "xxx/football1.jpg" with the absolute paths to your local image files.
            videoImageListSample(
                    "xxx/football1.jpg",
                    "xxx/football2.jpg",
                    "xxx/football3.jpg",
                    "xxx/football4.jpg"
            );
        } catch (ApiException | NoApiKeyException | UploadFileException | IOException e) {
            System.out.println(e.getMessage());
        }
        System.exit(0);
    }
}

Curl

  • See the sample code for an example of converting a file to a Base64-encoded string.

  • For readability, the Base64-encoded string in the code ("data:image/jpeg;base64,/9j/4AAQSkZJRgABAQAAAQABAAD/2wBDAA...") is truncated. You must use the complete string in your request.

# ======= Important =======
# Configurations vary by region. Modify the URL accordingly.
# API keys vary by region. Get an API key at: https://www.alibabacloud.com/help/en/model-studio/get-api-key
# === Delete this comment before you run the command ===

curl -X POST https://dashscope-intl.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation \
-H "Authorization: Bearer $DASHSCOPE_API_KEY" \
-H 'Content-Type: application/json' \
-d '{
  "model": "qwen3.5-plus",
  "input": {
    "messages": [
      {
        "role": "user",
        "content": [
          {
            "video": [
                      "data:image/jpeg;base64,/9j/4AAQSkZJRgABAQAAAQABAAD/2wBDAA...",
                      "data:image/jpeg;base64,nEpp6jpnP57MoWSyOWwrkXMJhHRCWYeFYb...",
                      "data:image/jpeg;base64,JHWQnJPc40GwQ7zERAtRMK6iIhnWw4080s...",
                      "data:image/jpeg;base64,adB6QOU5HP7dAYBBOg/Fb7KIptlbyEOu58..."
            ],
            "fps":2     
          },
          {
            "text": "Describe the process shown in this video."
          }
        ]
      }
    ]
  }
}'

Handling high-resolution images

The visual understanding model API limits the number of visual tokens for each encoded image. By default, high-resolution images are compressed, which can cause detail loss and reduce accuracy. You can enable vl_high_resolution_images or adjust max_pixels to increase the visual token count, which preserves more image detail and improves understanding.

Pixels per token, token limits, and pixel limits by model

If an input image's pixel count exceeds the model's pixel limit, the image is scaled down to the limit.

| Model | Pixels per token | vl_high_resolution_images | max_pixels | Token limit | Pixel limit |
| --- | --- | --- | --- | --- | --- |
| Qwen3.5 and Qwen3-VL series models | 32*32 | true | max_pixels is ignored. | 16384 tokens | 16777216 (which is 16384*32*32) |
| | | false (default) | Customizable. Defaults to 2621440, with a maximum of 16777216. | Determined by max_pixels, calculated as max_pixels/32/32. | max_pixels |
| qwen-vl-max, qwen-vl-max-latest, qwen-vl-max-2025-08-13, qwen-vl-plus, qwen-vl-plus-latest, and qwen-vl-plus-2025-08-15 models | 32*32 | true | max_pixels is ignored. | 16384 tokens | 16777216 (which is 16384*32*32) |
| | | false (default) | Customizable. Defaults to 1310720, with a maximum of 16777216. | Determined by max_pixels, calculated as max_pixels/32/32. | max_pixels |
| Other qwen-vl-max and qwen-vl-plus models, the Qwen2.5-VL open-source series, and QVQ series models | 28*28 | true | max_pixels is ignored. | 16384 tokens | 12845056 (which is 16384*28*28) |
| | | false (default) | Customizable. Defaults to 1003520, with a maximum of 12845056. | Determined by max_pixels, calculated as max_pixels/28/28. | max_pixels |

  • When vl_high_resolution_images=true, the API uses a fixed resolution policy and ignores the max_pixels setting. This is ideal for tasks that require recognizing fine text, small objects, or rich details in images.

  • When vl_high_resolution_images=false, the max_pixels parameter determines the final pixel limit.

    • For applications that require high processing speed or are cost-sensitive, use the default value for max_pixels or set it to a smaller value.

    • If you need to preserve more detail and can accept a lower processing speed, increase the value of max_pixels.
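
As an illustration of the arithmetic in the table above, the following sketch estimates the visual token count for a model that uses 32*32 pixels per token (the Qwen3.5 and Qwen3-VL series) when vl_high_resolution_images=false. The default cap is taken from the table; actual tokenization may differ slightly because the model also snaps dimensions to patch boundaries.

```python
PIXELS_PER_TOKEN = 32 * 32  # Qwen3.5 / Qwen3-VL series (see table above)

def visual_token_estimate(width: int, height: int, max_pixels: int = 2621440) -> int:
    """Rough token estimate when vl_high_resolution_images=false: the image is
    capped at max_pixels, then each 32x32 block of pixels costs one token."""
    pixels = min(width * height, max_pixels)
    return pixels // PIXELS_PER_TOKEN

# A 4096x3072 image exceeds the default cap of 2621440 pixels,
# so the estimate is 2621440 // 1024 = 2560 tokens.
print(visual_token_estimate(4096, 3072))  # 2560
```

Raising max_pixels to its 16777216 maximum lifts the same image to roughly 12288 tokens, which is why higher limits cost more and run slower.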

OpenAI compatible

The vl_high_resolution_images parameter is not a standard OpenAI parameter and is passed differently depending on the SDK:

  • Python SDK: Must be passed in the extra_body dictionary.

  • Node.js SDK: Can be passed directly as a top-level parameter.

Python

import os
from openai import OpenAI

client = OpenAI(
    # API keys vary by region. To get an API key, see: https://www.alibabacloud.com/help/en/model-studio/get-api-key
    api_key=os.getenv("DASHSCOPE_API_KEY"),
    # Configurations vary by region.
    base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
)

completion = client.chat.completions.create(
    model="qwen3.5-plus",
    messages=[
        {"role": "user","content": [
            {"type": "image_url","image_url": {"url": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20250212/earbrt/vcg_VCG211286867973_RF.jpg"},
            # max_pixels specifies the maximum pixel threshold for the input image. It is ignored when vl_high_resolution_images=True.
            # When vl_high_resolution_images=False, this value is customizable, and its maximum value varies by model.
            # "max_pixels": 16384 * 32 * 32
            },
           {"type": "text", "text": "What holiday is depicted in this image?"},
            ],
        }
    ],
    extra_body={"vl_high_resolution_images": True}
)
print(f"model output: {completion.choices[0].message.content}")
print(f"total input tokens: {completion.usage.prompt_tokens}")

Node.js

import OpenAI from "openai";

const openai = new OpenAI(
    {
        // API keys vary by region. To get an API key, see: https://www.alibabacloud.com/help/en/model-studio/get-api-key
        // If the environment variable is not set, replace the following line with your Model Studio API key: apiKey: "sk-xxx"
        apiKey: process.env.DASHSCOPE_API_KEY,
        // Configurations vary by region.
        baseURL: "https://dashscope-intl.aliyuncs.com/compatible-mode/v1"
    }
);

const response = await openai.chat.completions.create({
        model: "qwen3.5-plus",
        messages: [
        {role: "user",content: [
            {type: "image_url",
            image_url: {"url": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20250212/earbrt/vcg_VCG211286867973_RF.jpg"},
            // max_pixels specifies the maximum pixel threshold for the input image. It is ignored when vl_high_resolution_images=True.
            // When vl_high_resolution_images=False, this value is customizable, and its maximum value varies by model.
            // "max_pixels": 2560 * 32 * 32
            },
            {type: "text", text: "What holiday is depicted in this image?" },
        ]}],
        vl_high_resolution_images:true
    })


console.log("model output:",response.choices[0].message.content);
console.log("total input tokens:",response.usage.prompt_tokens);

Curl

# ======= Important Note =======
# API keys vary by region. To get an API key, see: https://www.alibabacloud.com/help/en/model-studio/get-api-key
# Configurations vary by region.
# === Remove this comment before execution ===

curl -X POST https://dashscope-intl.aliyuncs.com/compatible-mode/v1/chat/completions \
-H "Authorization: Bearer $DASHSCOPE_API_KEY" \
-H 'Content-Type: application/json' \
-d '{
  "model": "qwen3.5-plus",
  "messages": [
    {
      "role": "user",
      "content": [
        {
          "type": "image_url",
          "image_url": {
            "url": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20250212/earbrt/vcg_VCG211286867973_RF.jpg"
          }
        },
        {
          "type": "text",
          "text": "What holiday is depicted in this image?"
        }
      ]
    }
  ],
  "vl_high_resolution_images":true
}'

DashScope

Python

import os

import dashscope

# Configurations vary by region.
dashscope.base_http_api_url = 'https://dashscope-intl.aliyuncs.com/api/v1'

messages = [
    {
        "role": "user",
        "content": [
            {"image": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20250212/earbrt/vcg_VCG211286867973_RF.jpg",
            # max_pixels specifies the maximum pixel threshold for the input image. It is ignored when vl_high_resolution_images=True.
            # When vl_high_resolution_images=False, this value is customizable, and its maximum value varies by model.
            # "max_pixels": 16384 * 32 * 32
            },
            {"text": "What holiday is depicted in this image?"}
        ]
    }
]

response = dashscope.MultiModalConversation.call(
        # If the environment variable is not set, replace the following line with your Model Studio API key: api_key="sk-xxx"
        # API keys vary by region. To get an API key, see: https://www.alibabacloud.com/help/en/model-studio/get-api-key
        api_key=os.getenv('DASHSCOPE_API_KEY'),
        model='qwen3.5-plus',
        messages=messages,
        vl_high_resolution_images=True
    )
    
print("model output:",response.output.choices[0].message.content[0]["text"])
print("total input tokens:",response.usage.input_tokens)

Java

import java.util.Arrays;
import java.util.Collections;
import java.util.Map;
import java.util.HashMap;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversation;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationParam;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationResult;
import com.alibaba.dashscope.common.MultiModalMessage;
import com.alibaba.dashscope.common.Role;
import com.alibaba.dashscope.exception.ApiException;
import com.alibaba.dashscope.exception.NoApiKeyException;
import com.alibaba.dashscope.exception.UploadFileException;
import com.alibaba.dashscope.utils.Constants;

public class Main {

    static {
        // Configurations vary by region.
        Constants.baseHttpApiUrl="https://dashscope-intl.aliyuncs.com/api/v1";
    }
    
    public static void simpleMultiModalConversationCall()
            throws ApiException, NoApiKeyException, UploadFileException {
        MultiModalConversation conv = new MultiModalConversation();
        Map<String, Object> map = new HashMap<>();
        map.put("image", "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20250212/earbrt/vcg_VCG211286867973_RF.jpg");
        // max_pixels specifies the maximum pixel threshold for the input image. It is ignored when vl_high_resolution_images=True.
        // When vl_high_resolution_images=False, this value is customizable, and its maximum value varies by model.
        // map.put("max_pixels", 2621440); 
        MultiModalMessage userMessage = MultiModalMessage.builder().role(Role.USER.getValue())
                .content(Arrays.asList(
                        map,
                        Collections.singletonMap("text", "What holiday is depicted in this image?"))).build();
        MultiModalConversationParam param = MultiModalConversationParam.builder()
                // If the environment variable is not set, replace the following line with your Model Studio API key: .apiKey("sk-xxx")
                .apiKey(System.getenv("DASHSCOPE_API_KEY"))
                .model("qwen3.5-plus")
                .messages(Arrays.asList(userMessage))
                .vlHighResolutionImages(true)
                .build();
        MultiModalConversationResult result = conv.call(param);
        System.out.println(result.getOutput().getChoices().get(0).getMessage().getContent().get(0).get("text"));
        System.out.println(result.getUsage().getInputTokens());
    }

    public static void main(String[] args) {
        try {
            simpleMultiModalConversationCall();
        } catch (ApiException | NoApiKeyException | UploadFileException e) {
            System.out.println(e.getMessage());
        }
        System.exit(0);
    }
}

Curl

# ======= Important Note =======
# API keys vary by region. To get an API key, see: https://www.alibabacloud.com/help/en/model-studio/get-api-key
# Configurations vary by region.
# === Remove this comment before execution ===

curl -X POST https://dashscope-intl.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation \
-H "Authorization: Bearer $DASHSCOPE_API_KEY" \
-H 'Content-Type: application/json' \
-d '{
    "model": "qwen3.5-plus",
    "input":{
        "messages":[
            {
             "role": "user",
             "content": [
               {"image": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20250212/earbrt/vcg_VCG211286867973_RF.jpg"},
               {"text": "What holiday is depicted in this image?"}
                ]
            }
        ]
    },
    "parameters": {
        "vl_high_resolution_images": true
    }
}'

Next steps

Limitations

Input file limits

Image limits

  • Image resolution:

    • Minimum dimensions: The width and height of the image must both be greater than 10 pixels.

    • Aspect ratio: The ratio of the long side to the short side of the image must not exceed 200:1.

    • Pixel limit:

      • Keep the image resolution within 8K (7680x4320). Images exceeding this resolution risk API call timeouts due to large file sizes or long network transmission times.

      • Automatic scaling: The model can adjust the image size using the max_pixels and min_pixels parameters. Providing an ultra-high-resolution image does not improve recognition accuracy but increases the risk of call failures. For best results, scale the image to a reasonable size on the client before sending it.

  • Supported image formats

    • For images with a resolution below 4K (3840x2160), the following image formats are supported:

      | Image format | Common extensions | MIME type |
      | --- | --- | --- |
      | BMP | .bmp | image/bmp |
      | JPEG | .jpe, .jpeg, .jpg | image/jpeg |
      | PNG | .png | image/png |
      | TIFF | .tif, .tiff | image/tiff |
      | WEBP | .webp | image/webp |
      | HEIC | .heic | image/heic |

    • For images with a resolution between 4K (3840x2160) and 8K (7680x4320), only the JPEG, JPG, and PNG formats are supported.

  • Image size:

    • When passed as a public URL: A single image cannot exceed 20 MB for the Qwen3.5 series or 10 MB for other models.

    • When passed as a local file path: A single image cannot exceed 10 MB.

    • When passed as a Base64-encoded string: The encoded string cannot exceed 10 MB.

    To compress the file size, see How to compress an image or video to the required size.
  • Image count limit: The maximum number of images allowed depends on the input method.

    • When passed as a public URL or local file path: Up to 256 images.

    • When passed as Base64-encoded strings: Up to 250 images.
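
The client-side scaling recommended above can be sketched as follows. This only computes target dimensions that keep the aspect ratio while staying under a pixel budget; the actual resizing would be done with an image library such as Pillow, and the 8K default here follows the recommendation above.

```python
import math

def fit_within_pixel_limit(width: int, height: int, limit: int = 7680 * 4320) -> tuple[int, int]:
    """Compute downscaled dimensions that preserve the aspect ratio while
    keeping the total pixel count at or below the given limit."""
    pixels = width * height
    if pixels <= limit:
        return width, height
    scale = math.sqrt(limit / pixels)
    return max(1, int(width * scale)), max(1, int(height * scale))
```

For example, `fit_within_pixel_limit(10000, 8000)` returns dimensions under the 8K budget with the original 5:4 ratio, which you would then pass to your image library's resize call before uploading.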

Video limits

  • When you submit a video as a list of images, the following image count limits apply:

    • qwen3.5 series: A minimum of 4 images and a maximum of 8,000 images.

    • qwen3-vl-plus series, qwen3-vl-flash series, qwen3-vl-235b-a22b-thinking, and qwen3-vl-235b-a22b-instruct: A minimum of 4 images and a maximum of 2,000 images.

    • Other Qwen3-VL open-source models, Qwen2.5-VL models (both commercial and open-source versions), and QVQ series models: A minimum of 4 images and a maximum of 512 images.

    • Other models: A minimum of 4 images and a maximum of 80 images.

  • When you submit a video as a single file:

    • Video size:

      • When passed as a public URL:

        • qwen3.5 series, Qwen3-VL series, and qwen-vl-max (including qwen-vl-max-latest, qwen-vl-max-2025-04-08, and all subsequent versions): Cannot exceed 2 GB.

        • qwen-vl-plus series, other qwen-vl-max models, Qwen2.5-VL open-source series, and QVQ series models: Cannot exceed 1 GB.

        • Other models: Cannot exceed 150 MB.

      • When passed as a Base64-encoded string: The encoded string must be less than 10 MB.

      • When passed as a local file path: The video file cannot exceed 100 MB.

      To compress the file size, see How to compress an image or video to the required size.
    • Video duration:

      • qwen3.5 series: 2 seconds to 2 hours.

      • qwen3-vl-plus series, qwen3-vl-flash series, qwen3-vl-235b-a22b-thinking, and qwen3-vl-235b-a22b-instruct: 2 seconds to 1 hour.

      • Other Qwen3-VL open-source series and qwen-vl-max (including qwen-vl-max-latest, qwen-vl-max-2025-04-08, and subsequent versions): 2 seconds to 20 minutes.

      • qwen-vl-plus series, other qwen-vl-max models, Qwen2.5-VL open-source series, and QVQ series models: 2 seconds to 10 minutes.

      • Other models: 2 seconds to 40 seconds.

    • Video format: Supported formats include MP4, AVI, MKV, MOV, FLV, and WMV.

    • Video dimensions: No specific limit. The model automatically adjusts video dimensions using the max_pixels and min_pixels parameters. Larger video dimensions do not improve understanding.

    • Audio understanding: The model does not process audio from video files.
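A local video can be screened against the format and size limits above before upload. The following is a minimal, stdlib-only sketch (the `preflight_video` helper is hypothetical) that assumes the 1 GB public-URL cap of the qwen-vl-plus series; checking duration additionally requires a decoder such as ffprobe or OpenCV:

```python
import os

# Limits taken from this section for the qwen-vl-plus series;
# other model series have different caps (see the lists above).
ALLOWED_EXT = {".mp4", ".avi", ".mkv", ".mov", ".flv", ".wmv"}
MAX_BYTES = 1 * 1024 ** 3  # 1 GB when the video is passed as a public URL

def preflight_video(video_path):
    """Check format and size before upload; raise ValueError on violation."""
    ext = os.path.splitext(video_path)[1].lower()
    if ext not in ALLOWED_EXT:
        raise ValueError(f"Unsupported format: {ext}")
    size = os.path.getsize(video_path)
    if size > MAX_BYTES:
        raise ValueError(f"Video is {size} bytes, exceeds {MAX_BYTES}")
    return size
```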

File input methods

  • Public URL: A publicly accessible URL using the HTTP or HTTPS protocol. For optimal stability and performance, upload the file to OSS.

    Important

    To ensure the model can download the file, the public URL's response header must include Content-Length (file size) and Content-Type (media type, such as image/jpeg). If either field is missing or incorrect, the file download fails.

  • Base64-encoded string: The file content, provided as a Base64-encoded string.

  • Local file path (DashScope SDK only): The local path to the file.

For recommendations on which file input method to choose, see How to choose a file upload method?
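To confirm that a public URL exposes the required Content-Length and Content-Type headers, you can send a HEAD request before calling the model. The helpers below (`check_public_url` and `validate_headers` are illustrative names, not SDK functions) use only the Python standard library:

```python
from urllib.request import Request, urlopen

def validate_headers(headers):
    """Check the two header fields the model needs to download the file."""
    length = headers.get("Content-Length")
    ctype = headers.get("Content-Type")
    if not length:
        raise ValueError("Missing Content-Length header")
    if not ctype or not ctype.startswith(("image/", "video/")):
        raise ValueError(f"Unexpected Content-Type: {ctype!r}")
    return int(length), ctype

def check_public_url(url):
    """Send a HEAD request and validate the response headers."""
    req = Request(url, method="HEAD")
    with urlopen(req, timeout=10) as resp:
        return validate_headers(dict(resp.headers))
```

Note that some servers do not answer HEAD requests; in that case, fall back to a GET request and read only the headers.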

Production use

  • Image and video preprocessing: The visual understanding model has size limits for input files. To compress files, see Image or video compression methods.

  • Processing text files: The visual understanding model accepts only images and videos and cannot process text files directly. You can use the following alternatives:

    • Use a conversion library, such as Python's pdf2image, to convert the file into high-quality images page by page, and then pass them to the model using the multi-image input method.

    • Use Qwen-Long, which can process text files and parse their content.

  • Fault tolerance and stability

    • Timeout handling: In a non-streaming call, a timeout occurs if the model fails to generate a complete output within 180 seconds. When this happens, the response body contains any content generated before the timeout, and the response header contains x-dashscope-partialresponse: true. You can use the partial mode feature, available on some models, to append the generated content to the messages array and resend the request. This lets the model continue generating content from where it left off. For details, see Continue writing based on incomplete output.

    • Retry mechanism: Design a robust API retry strategy, such as exponential backoff, to handle network fluctuations or transient service unavailability.
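The retry strategy above can be sketched as a generic wrapper with exponential backoff and jitter. The function below is illustrative, not an SDK utility; narrow the `retriable` tuple to the throttling and timeout exception types raised by your client library:

```python
import random
import time

def call_with_retry(fn, max_attempts=5, base_delay=1.0, retriable=(Exception,)):
    """Retry fn() with exponential backoff plus jitter.

    Re-raises the last exception once max_attempts is exhausted.
    """
    for attempt in range(max_attempts):
        try:
            return fn()
        except retriable:
            if attempt == max_attempts - 1:
                raise
            # Wait base_delay * 2^attempt seconds, plus random jitter.
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, base_delay))
```

For example, `call_with_retry(lambda: client.chat.completions.create(...))` retries a transient failure up to five times.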

Billing and rate limiting

  • Billing: The total cost is calculated based on the total number of input and output tokens. For pricing details, see the Model list.

    • Token composition: Input tokens consist of text tokens and tokens converted from images or videos. Output tokens are the text generated by the model. In thinking mode, the model's reasoning process also counts as output tokens. If the reasoning process is not output in thinking mode, the price for non-thinking mode applies.

    • Calculate image and video tokens: You can use the following code to estimate the token consumption for an image or video. This is an estimate for reference only; the actual usage is determined by the API response.

      Calculate image and video tokens

      Image

      Formula: Image tokens = h_bar * w_bar / token_pixels + 2

      • h_bar, w_bar: The height and width of the scaled image. The model preprocesses an image by scaling it to a specific pixel limit. This limit depends on the values of the max_pixels and vl_high_resolution_images parameters. For more information, see Process high-resolution images.

      • token_pixels: The number of pixels per visual token. This value varies by model:

        • Qwen3.5, Qwen3-VL, qwen-vl-max, qwen-vl-max-latest, qwen-vl-max-2025-08-13, qwen-vl-plus, qwen-vl-plus-latest, qwen-vl-plus-2025-08-15: Each token corresponds to 32x32 pixels.

        • QVQ and other Qwen2.5-VL models: Each token corresponds to 28x28 pixels.

      The following code demonstrates the model's approximate image scaling logic. You can use it to estimate the token count for an image. The actual charges are based on the token usage returned in the API response.

      import math
      # Use the following command to install the Pillow library: pip install Pillow
      from PIL import Image
      
      def token_calculate(image_path, max_pixels, vl_high_resolution_images):
          # Open the specified image file.
          image = Image.open(image_path)
      
          # Get the original dimensions of the image.
          height = image.height
          width = image.width
      
          # Adjust the height and width to the nearest multiple of 32 (use 28 instead for QVQ and Qwen2.5-VL models).
          h_bar = round(height / 32) * 32
          w_bar = round(width / 32) * 32
      
          # Minimum token count for an image: 4 tokens.
          min_pixels = 4 * 32 * 32
          # If vl_high_resolution_images is True, the upper limit for input image tokens is 16,386, with a corresponding maximum pixel value of 16384 * 32 * 32 or 16384 * 28 * 28. Otherwise, the value of max_pixels is used.
          if vl_high_resolution_images:
              max_pixels = 16384 * 32 * 32
          # Otherwise, keep the max_pixels value that was passed in.
      
          # Scale the image so that the total number of pixels is within the range of [min_pixels, max_pixels].
          if h_bar * w_bar > max_pixels:
              # Calculate the scaling factor beta so that the total pixels of the scaled image do not exceed max_pixels.
              beta = math.sqrt((height * width) / max_pixels)
              # Recalculate the adjusted height and width.
              h_bar = math.floor(height / beta / 32) * 32
              w_bar = math.floor(width / beta / 32) * 32
          elif h_bar * w_bar < min_pixels:
              # Calculate the scaling factor beta so that the total pixels of the scaled image are not less than min_pixels.
              beta = math.sqrt(min_pixels / (height * width))
              # Recalculate the adjusted height and width.
              h_bar = math.ceil(height * beta / 32) * 32
              w_bar = math.ceil(width * beta / 32) * 32
          return h_bar, w_bar
      
      if __name__ == "__main__":
          # Replace xxx/test.jpg with the path to your local image.
          h_bar, w_bar = token_calculate("xxx/test.jpg", max_pixels=16384*32*32, vl_high_resolution_images=False)
          print(f"Scaled image dimensions: height {h_bar}, width {w_bar}")
          # The system automatically adds the <|vision_bos|> and <|vision_eos|> visual markers (1 token each).
          token = int((h_bar * w_bar) / (32 * 32)) + 2
          print(f"Image token count: {token}")

      Video

      • Video file:

        To process a video file, the model first extracts frames and then calculates the total tokens for all of them. Because this calculation is complex, you can use the following code to estimate a video's total token consumption by providing its path:

        # Before running, install opencv-python: pip install opencv-python
        import math
        import os
        import logging
        import cv2
        
        logger = logging.getLogger(__name__)
        
        FRAME_FACTOR = 2
        
        # For Qwen3-VL, qwen-vl-max-0813, qwen-vl-plus-0815, and qwen-vl-plus-0710 models, the image scaling factor is 32.
        IMAGE_FACTOR = 32
        
        # For other models, the image scaling factor is 28.
        # IMAGE_FACTOR = 28
        
        # Maximum aspect ratio for video frames.
        MAX_RATIO = 200
        # Minimum pixels for video frames.
        VIDEO_MIN_PIXELS = 4 * 32 * 32
        # Maximum pixels for video frames. For the Qwen3-VL-Plus model, VIDEO_MAX_PIXELS is 640 * 32 * 32. For other models, it is 768 * 32 * 32.
        VIDEO_MAX_PIXELS = 640 * 32 * 32
        
        # If the FPS parameter is not provided, a default value is used.
        FPS = 2.0
        # Minimum number of extracted frames.
        FPS_MIN_FRAMES = 4
        # Maximum number of extracted frames. Set FPS_MAX_FRAMES to 2000 for the Qwen3-VL-Plus and Qwen3-VL-Flash models, 512 for other Qwen3-VL open-source, Qwen2.5-VL, and QVQ models, and 80 for other models.
        FPS_MAX_FRAMES = 2000
        
        # Maximum pixel value for video input. Set VIDEO_TOTAL_PIXELS to 131072 * 32 * 32 for the Qwen3-VL-Plus model and 65536 * 32 * 32 for other models.
        VIDEO_TOTAL_PIXELS = int(float(os.environ.get('VIDEO_MAX_PIXELS', 131072 * 32 * 32)))
        
        def round_by_factor(number: int, factor: int) -> int:
            """Returns the integer closest to 'number' that is divisible by 'factor'."""
            return round(number / factor) * factor
        
        def ceil_by_factor(number: int, factor: int) -> int:
            """Returns the smallest integer greater than or equal to 'number' that is divisible by 'factor'."""
            return math.ceil(number / factor) * factor
        
        def floor_by_factor(number: int, factor: int) -> int:
            """Returns the largest integer less than or equal to 'number' that is divisible by 'factor'."""
            return math.floor(number / factor) * factor
        
        def extract_vision_info(conversations):
            vision_infos = []
            if isinstance(conversations[0], dict):
                conversations = [conversations]
            for conversation in conversations:
                for message in conversation:
                    if isinstance(message["content"], list):
                        for ele in message["content"]:
                            if (
                                "image" in ele
                                or "image_url" in ele
                                or "video" in ele
                                or ele.get("type","") in ("image", "image_url", "video")
                            ):
                                vision_infos.append(ele)
            return vision_infos
        
        def smart_nframes(ele, total_frames, video_fps):
            """Calculates the number of video frames to extract.
        
            Args:
                ele (dict): A dictionary containing the video configuration.
                    - fps: Controls the number of frames extracted for the model input.
                total_frames (int): The original total number of frames in the video.
                video_fps (int | float): The original frame rate of the video.
        
            Raises:
                ValueError: Raised if nframes is not within the interval [FRAME_FACTOR, total_frames].
        
            Returns:
                The number of video frames to use for model input.
            """
            assert not ("fps" in ele and "nframes" in ele), "Only accept either `fps` or `nframes`"
            fps = ele.get("fps", FPS)
            min_frames = ceil_by_factor(ele.get("min_frames", FPS_MIN_FRAMES), FRAME_FACTOR)
            max_frames = floor_by_factor(ele.get("max_frames", min(FPS_MAX_FRAMES, total_frames)), FRAME_FACTOR)
            duration = total_frames / video_fps if video_fps != 0 else 0
            if duration - int(duration) > (1 / fps):
                total_frames = math.ceil(duration * video_fps)
            else:
                total_frames = math.ceil(int(duration) * video_fps)
            nframes = total_frames / video_fps * fps
            if nframes > total_frames:
                logger.warning(f"smart_nframes: nframes[{nframes}] > total_frames[{total_frames}]")
            nframes = int(min(min(max(nframes, min_frames), max_frames), total_frames))
            if not (FRAME_FACTOR <= nframes and nframes <= total_frames):
                raise ValueError(f"nframes should be in the interval [{FRAME_FACTOR}, {total_frames}], but got {nframes}.")
        
            return nframes
        
        def get_video(video_path):
            # Read the basic video properties with OpenCV.
            cap = cv2.VideoCapture(video_path)
            frame_width = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
            frame_height = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
            total_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
            video_fps = cap.get(cv2.CAP_PROP_FPS)
            # Release the capture handle after reading the metadata.
            cap.release()
            return frame_height, frame_width, total_frames, video_fps
        
        def smart_resize(ele, path, factor=IMAGE_FACTOR):
            # Get the original width and height of the video.
            height, width, total_frames, video_fps = get_video(path)
            # Minimum pixels for video frames.
            min_pixels = VIDEO_MIN_PIXELS
            total_pixels = VIDEO_TOTAL_PIXELS
            # Number of extracted video frames.
            nframes = smart_nframes(ele, total_frames, video_fps)
            max_pixels = max(min(VIDEO_MAX_PIXELS, total_pixels / nframes * FRAME_FACTOR), int(min_pixels * 1.05))
        
            # The video's aspect ratio must not exceed 200:1 or 1:200.
            if max(height, width) / min(height, width) > MAX_RATIO:
                raise ValueError(
                    f"The absolute aspect ratio must be smaller than {MAX_RATIO}, but got {max(height, width) / min(height, width)}"
                )
        
            h_bar = max(factor, round_by_factor(height, factor))
            w_bar = max(factor, round_by_factor(width, factor))
            if h_bar * w_bar > max_pixels:
                beta = math.sqrt((height * width) / max_pixels)
                h_bar = floor_by_factor(height / beta, factor)
                w_bar = floor_by_factor(width / beta, factor)
            elif h_bar * w_bar < min_pixels:
                beta = math.sqrt(min_pixels / (height * width))
                h_bar = ceil_by_factor(height * beta, factor)
                w_bar = ceil_by_factor(width * beta, factor)
            return h_bar, w_bar
        
        
        def token_calculate(video_path, fps):
            # Pass the video path and the fps parameter for frame extraction.
            messages = [{"content": [{"video": video_path, "fps": fps}]}]
            vision_infos = extract_vision_info(messages)[0]
        
            resized_height, resized_width = smart_resize(vision_infos, video_path)
        
            height, width, total_frames, video_fps = get_video(video_path)
            num_frames = smart_nframes(vision_infos, total_frames, video_fps)
            print(f"Original video dimensions: {height}*{width}, Model input dimensions: {resized_height}*{resized_width}, Total video frames: {total_frames}, Total frames extracted at {fps} fps: {num_frames}", end=", ")
            video_token = int(math.ceil(num_frames / 2) * resized_height / 32 * resized_width / 32)
            video_token += 2   # The system automatically adds the <|vision_bos|> and <|vision_eos|> visual markers (1 token each).
            return video_token
        
        
        video_token = token_calculate("xxx/test.mp4", 1)
        print("Video tokens:", video_token)
      • Image list:

        If you provide a video as a list of images, the model assumes that frame extraction has already been performed. Use the following code to calculate the token consumption for the image list by providing the path to one of the frame images and the total number of frames:

        # Before running, install Pillow: pip install Pillow
        import math
        import os
        import logging
        from typing import Tuple
        from PIL import Image
        
        logger = logging.getLogger(__name__)
        
        # ==================== Constant Definitions ====================
        FRAME_FACTOR = 2
        # For Qwen3-VL, qwen-vl-max-0813, qwen-vl-plus-0815, and qwen-vl-plus-0710 models, the scaling factor is 32.
        IMAGE_FACTOR = 32
        
        # For other models, the scaling factor is 28.
        # IMAGE_FACTOR = 28
        
        # Constants related to token calculation
        TOKEN_DIVISOR = 32  # Divisor for token calculation.
        VISION_SPECIAL_TOKENS = 2  # <|vision_bos|> and <|vision_eos|> markers.
        
        # Maximum aspect ratio for video frames.
        MAX_RATIO = 200
        # Minimum pixels for video frames.
        VIDEO_MIN_PIXELS = 4 * 32 * 32
        # Maximum pixels for video frames. For the Qwen3-VL-Plus model, VIDEO_MAX_PIXELS is 640 * 32 * 32. For other models, it is 768 * 32 * 32.
        VIDEO_MAX_PIXELS = 640 * 32 * 32
        
        # Maximum pixel value for video input. Set VIDEO_TOTAL_PIXELS to 131072 * 32 * 32 for the Qwen3-VL-Plus model and 65536 * 32 * 32 for other models.
        VIDEO_TOTAL_PIXELS = int(float(os.environ.get('VIDEO_MAX_PIXELS', 131072 * 32 * 32)))
        
        def round_by_factor(number: int, factor: int) -> int:
            """Returns the integer closest to 'number' that is divisible by 'factor'."""
            return round(number / factor) * factor
        
        def ceil_by_factor(number: int, factor: int) -> int:
            """Returns the smallest integer greater than or equal to 'number' that is divisible by 'factor'."""
            return math.ceil(number / factor) * factor
        
        def floor_by_factor(number: int, factor: int) -> int:
            """Returns the largest integer less than or equal to 'number' that is divisible by 'factor'."""
            return math.floor(number / factor) * factor
        
        
        def get_image_size(image_path: str) -> Tuple[int, int]:
            if not os.path.exists(image_path):
                raise FileNotFoundError(f"Image file not found: {image_path}")
        
            try:
                with Image.open(image_path) as image:
                    height = image.height
                    width = image.width
                    return height, width
            except Exception as e:
                raise ValueError(f"Cannot read image file {image_path}: {str(e)}")
        
        def smart_resize(height: int, width: int, nframes: int, factor: int = IMAGE_FACTOR) -> Tuple[int, int]:
            """
            Calculates the dimensions of the image after scaling.
        
            Args:
                height: The original image height.
                width: The original image width.
                nframes: The number of video frames.
                factor: The scaling factor. Default is IMAGE_FACTOR.
        
            Returns:
                A tuple (resized_height, resized_width) with the scaled height and width.
        
            Raises:
                ValueError: If the aspect ratio exceeds the limit.
            """
            # Minimum pixels for video frames.
            min_pixels = VIDEO_MIN_PIXELS
            total_pixels = VIDEO_TOTAL_PIXELS
            # Number of extracted video frames.
            max_pixels = max(min(VIDEO_MAX_PIXELS, total_pixels / nframes * FRAME_FACTOR), int(min_pixels * 1.05))
        
            # The image aspect ratio must be less than 200:1.
            aspect_ratio = max(height, width) / min(height, width)
            if aspect_ratio > MAX_RATIO:
                raise ValueError(
                    f"Image aspect ratio must not exceed {MAX_RATIO}:1, but the current ratio is {aspect_ratio:.2f}:1"
                )
        
            h_bar = max(factor, round_by_factor(height, factor))
            w_bar = max(factor, round_by_factor(width, factor))
            if h_bar * w_bar > max_pixels:
                beta = math.sqrt((height * width) / max_pixels)
                h_bar = floor_by_factor(height / beta, factor)
                w_bar = floor_by_factor(width / beta, factor)
            elif h_bar * w_bar < min_pixels:
                beta = math.sqrt(min_pixels / (height * width))
                h_bar = ceil_by_factor(height * beta, factor)
                w_bar = ceil_by_factor(width * beta, factor)
            return h_bar, w_bar
        
        
        def calculate_video_tokens(image_path: str, nframes: int = 1, factor: int = IMAGE_FACTOR, verbose: bool = True) -> int:
            """
            Calculates the token consumption for a list of images.
        
            Args:
                image_path: The path to a video frame file.
                nframes: The number of video frames.
                factor: The scaling factor. Default is IMAGE_FACTOR.
                verbose: Specifies whether to print detailed information.
        
            Returns:
                The number of tokens consumed.
        
            Raises:
                FileNotFoundError: If the file does not exist.
                ValueError: If the file format is invalid or the aspect ratio exceeds the limit.
            """
            # Get the original image dimensions (read only once).
            height, width = get_image_size(image_path)
        
            # Calculate the scaled dimensions.
            resized_height, resized_width = smart_resize(height, width, nframes, factor)
        
            # Calculate the number of tokens.
            # Formula: ceil(nframes / 2) * (height / TOKEN_DIVISOR) * (width / TOKEN_DIVISOR) + VISION_SPECIAL_TOKENS
            video_token = int(
                math.ceil(nframes / 2) *
                (resized_height / TOKEN_DIVISOR) *
                (resized_width / TOKEN_DIVISOR)
            )
            # Add tokens for visual markers (<|vision_bos|> and <|vision_eos|>).
            video_token += VISION_SPECIAL_TOKENS
        
            if verbose:
                print(f"Original video frame dimensions: {height}×{width}, Model input dimensions: {resized_height}×{resized_width}, ", end="")
        
            return video_token
        
        if __name__ == "__main__":
            try:
                # The first parameter is the path to one of the frame images.
                video_token = calculate_video_tokens("xxx/test.jpg", nframes=30)
                print(f"Video tokens: {video_token}\n")
            except Exception as e:
                print(f"Error: {str(e)}\n")
  • View bills: You can view your bills or add funds to your account on the Expenses and Costs page in the Alibaba Cloud console.

  • Rate limiting: For the rate limits for visual understanding models, see Rate limits.

  • Free quota (Singapore only): Visual understanding models come with a free quota of 1 million tokens. The quota is valid for 90 days after you activate Model Studio or your model request is approved.

API reference

For the input and output parameters of the visual understanding model, see Qwen.

FAQ

File upload method

Choose the upload method based on the SDK type, file size, and network stability.

File type | File size | DashScope SDK (Python, Java) | OpenAI compatible / DashScope HTTP
Image | 7 MB to 10 MB | Provide the local path | Use a public URL (OSS recommended)
Image | Less than 7 MB | Provide the local path | Use Base64 encoding
Video | Greater than 100 MB | Use a public URL (OSS recommended) | Use a public URL (OSS recommended)
Video | 7 MB to 100 MB | Provide the local path | Use a public URL (OSS recommended)
Video | Less than 7 MB | Provide the local path | Use Base64 encoding

Note:

  • Because Base64 encoding increases the file size, the original file must be smaller than 7 MB.

  • Using Base64 encoding or a local path prevents server-side download timeouts and improves stability.

  • For public URLs, we recommend using Object Storage Service (OSS).
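The decision logic in this table can be expressed as a small helper. The function below is an illustrative sketch of the recommendations above, not an SDK utility:

```python
def choose_upload_method(file_type, size_mb, sdk="openai"):
    """Return the recommended input method per the table above.

    file_type: "image" or "video".
    sdk: "dashscope" (Python/Java SDK) or "openai" (OpenAI compatible / DashScope HTTP).
    """
    if sdk == "dashscope":
        # The DashScope SDK accepts local paths, except for videos over 100 MB.
        if file_type == "video" and size_mb > 100:
            return "public URL (OSS recommended)"
        return "local file path"
    # OpenAI compatible / DashScope HTTP: Base64 for small files, URL otherwise.
    if size_mb < 7:
        return "Base64 encoding"
    return "public URL (OSS recommended)"
```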

Image and video compression

Input files for a visual understanding model have size limits. Use the following methods to compress them.

Image compression methods

  • Online tools: Use services like CompressJPEG.

  • Local software: Adjust the export quality with tools like Photoshop.

  • Code implementation:

    # pip install pillow
    
    from PIL import Image
    def compress_image(input_path, output_path, quality=85):
        with Image.open(input_path) as img:
            img.save(output_path, "JPEG", optimize=True, quality=quality)
    
    # Pass a local image.
    compress_image("/xxx/before-large.jpeg","/xxx/after-min.jpeg")

Video compression methods

  • Online tools: Use services like FreeConvert.

  • Local software: Use tools like HandBrake.

  • Code implementation: Use the FFmpeg tool. For more information, see the FFmpeg official website.

    # Basic conversion command
    # -i: Specifies the input file path. Example: input.mp4
    # -vcodec: Specifies the video encoder. Common values include libx264 (recommended for general use) and libx265 (higher compression rate).
    # -crf: Controls the video quality. The value range is 18 to 28. A lower value results in higher quality and a larger file size.
    # -preset: Controls the balance between encoding speed and compression efficiency. Common values include slow, fast, and faster.
    # -y: Overwrites an existing file (no value required).
    # output.mp4: Specifies the output file path.
    
    ffmpeg -i input.mp4 -vcodec libx264 -crf 28 -preset slow output.mp4

Drawing bounding boxes

After the visual understanding model returns object localization results, you can draw the bounding boxes and their labels on the original image. The coordinate format and reference code depend on the model series:

  • Qwen2.5-VL: Returns absolute pixel coordinates, relative to the top-left corner of the scaled image. To draw the bounding boxes, see the code in qwen2_5_vl_2d.py.

  • Qwen3-VL: Returns relative coordinates normalized to the [0, 999] range. To draw the bounding boxes, see the code in qwen3_vl_2d.py (for 2D localization) or qwen3_vl_3d.zip (for 3D localization).
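As a minimal alternative to the full scripts, the sketch below draws Qwen3-VL 2D boxes with Pillow. It assumes the [0, 999] normalization described above and maps coordinates back to pixels by dividing by 1000; `draw_qwen3vl_boxes` is a hypothetical helper, so confirm the exact convention against qwen3_vl_2d.py:

```python
from PIL import Image, ImageDraw  # pip install Pillow

def draw_qwen3vl_boxes(image_path, boxes, out_path):
    """Draw Qwen3-VL 2D boxes, given as (label, [x1, y1, x2, y2]) pairs in the
    normalized [0, 999] range, onto the original image and save the result."""
    img = Image.open(image_path).convert("RGB")
    w, h = img.size
    draw = ImageDraw.Draw(img)
    for label, (x1, y1, x2, y2) in boxes:
        # Map normalized coordinates back to pixel coordinates.
        px1, py1 = x1 * w / 1000, y1 * h / 1000
        px2, py2 = x2 * w / 1000, y2 * h / 1000
        draw.rectangle([px1, py1, px2, py2], outline="red", width=3)
        draw.text((px1, max(0, py1 - 12)), label, fill="red")
    img.save(out_path)
    return img.size
```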

Error codes

If the model call fails and returns an error message, see Error messages for resolution.