
Alibaba Cloud Model Studio: Model monitoring

Last Updated: Feb 06, 2026

The monitoring feature enables you to:

  • View call records

  • Monitor metrics and create alerts for metrics such as token latency, call duration, requests per minute (RPM), tokens per minute (TPM), and failure rate

  • Track token consumption

Supported models

  • Monitoring (basic and advanced) and alerting: This feature supports all models in the Model list.

  • Logging:

    Beijing

    • qwen3-max, qwen3-max-2025-09-23, qwen3-max-2026-01-23, qwen3-max-preview

    • qwen-max, qwen-max-0919, qwen-max-2025-01-25, qwen-max-latest

    • qwen-plus, qwen-plus-2025-04-28, qwen-plus-2025-07-14, qwen-plus-2025-07-28, qwen-plus-2025-09-11, qwen-plus-2025-12-01, qwen-plus-latest

    • qwen-flash, qwen-flash-2025-07-28

    • qwen-turbo, qwen-turbo-2025-07-15, qwen-turbo-2025-04-28, qwen-turbo-latest

    • deepseek-v3.1, deepseek-v3.2, deepseek-v3.2-exp

    • qwen3-235b-a22b, qwen3-235b-a22b-instruct-2507, qwen3-235b-a22b-thinking-2507, qwen3-30b-a3b, qwen3-30b-a3b-instruct-2507, qwen3-30b-a3b-thinking-2507, qwen3-next-80b-a3b-instruct, qwen3-next-80b-a3b-thinking

    • qwen3-coder-480b-a35b-instruct, qwen3-coder-flash, qwen3-coder-flash-2025-07-28, qwen3-coder-plus, qwen3-coder-plus-2025-07-22, qwen3-coder-plus-2025-09-23

    Singapore

    • qwen3-max, qwen3-max-2025-09-23, qwen3-max-2026-01-23, qwen3-max-preview

    • qwen-max, qwen-max-2025-01-25, qwen-max-latest

    • qwen-plus, qwen-plus-2025-04-28, qwen-plus-2025-07-14, qwen-plus-2025-07-28, qwen-plus-2025-09-11, qwen-plus-2025-12-01, qwen-plus-latest

    • qwen-flash, qwen-flash-2025-07-28

    • qwen-turbo, qwen-turbo-2025-04-28, qwen-turbo-latest

    • qwen3-235b-a22b, qwen3-235b-a22b-instruct-2507, qwen3-235b-a22b-thinking-2507, qwen3-30b-a3b, qwen3-30b-a3b-instruct-2507, qwen3-30b-a3b-thinking-2507, qwen3-next-80b-a3b-instruct, qwen3-next-80b-a3b-thinking

    • qwen3-coder-480b-a35b-instruct, qwen3-coder-flash, qwen3-coder-flash-2025-07-28, qwen3-coder-plus, qwen3-coder-plus-2025-07-22, qwen3-coder-plus-2025-09-23

Monitor model operations

The system automatically collects model call data from all workspaces under your Alibaba Cloud account. When a direct or indirect model call occurs, the system automatically collects and syncs the relevant data to the Monitoring (Singapore), Monitoring (Virginia), or Monitoring (Beijing) page.

Records on the page are generated based on the "Model + Workspace" dimension. New models are automatically added to the page after the first data synchronization is complete. The latency for standard monitoring is typically several hours. If you require minute-level data insights, use Advanced Monitoring.

Members of the default workspace can view model call data for all workspaces. Members of a sub-workspace can view data only for the current workspace and cannot switch to view data from other workspaces.

After you find the target model in the list, click Monitor in the Actions column to view the following four types of monitoring metrics:

  • Security: Detects content violations in conversations, such as Content Moderation error count.

  • Cost: Evaluates the cost-effectiveness of the model, such as Average usage per request.

  • Performance: Monitors changes in model performance, such as Call duration and Time to first token.

  • Error: Assesses the stability of the model, such as Failure count and Failure rate.

You can create an alert based on the metrics described above to promptly detect and handle anomalies.

Call statistics

On this tab, you can view Security, Cost, and Error metrics, such as the number of calls and failures. You can filter the data by API key, inference type, and time range.

  • Rate limit error count: Failed calls caused by a 429 status code.

  • Content Moderation error count: Failed calls whose input or output contains sensitive or high-risk content, such as pornography, political content, or advertisements, and is blocked by the Content Moderation service.

Performance metrics

On this tab, you can view performance-related metrics such as RPM, TPM, call duration, and time to first token.

View token consumption

Operations such as adjusting model parameters and system prompts can affect a model's token consumption. To help with cost analysis and fine-grained management, monitoring provides the following cost monitoring features:

  • Aggregation: Aggregates the historical token consumption of models by workspace. You can further filter by time range and API key.

  • Tracking: Records the token consumption for each model call.

  • Alerting: You can set a token consumption threshold. The system sends an alert immediately if a model's consumption becomes abnormal.

View historical token consumption of a model

View the token consumption of a specific call

This feature is currently available only in the China (Beijing) region for certain models.
  1. Log on with an Alibaba Cloud account (or a RAM user with sufficient permissions). On the Monitoring (Beijing) page of the target workspace, click Monitoring Configurations in the upper-right corner and follow the instructions to enable audit logs and inference logs.

    After you enable logs, the system starts recording the input and output of every model call in the workspace. There is a latency of several minutes from when a call occurs to when the log is recorded.
  2. Find the target model in the Model Monitoring list, and click Logs in the Actions column.

  3. The Logs tab displays the real-time inference call records for the model, and the Usage field shows the token consumption for this call.

Create an alert for abnormal consumption

View conversation history (model logs)

Important

This feature is currently available only in the China (Beijing) region for certain models.

Monitoring enables you to view every conversation with a model, including input, output, and duration. It is a key tool for troubleshooting and content auditing.

Step 1: Turn on logging

Log on with an Alibaba Cloud account (or a RAM user with sufficient permissions), go to the Monitoring (Beijing) page of the target workspace, click Monitoring Configurations in the upper-right corner, and follow the instructions to enable audit logs and inference logs.

After you enable logs, the system starts recording the input and output of every model call in the workspace. There is a latency of several minutes from when a call occurs to when the log is recorded.

To stop recording, disable inference logs in the model observability configuration.

Step 2: View conversation history

  1. In the Models list, find the target model and click Logs in the Actions column.

  2. The Logs tab displays the real-time inference call records for the model, where the Request and Response fields correspond to the input and output for each call, respectively.

Set up proactive alerts

Important

This feature is currently available only in the Singapore and China (Beijing) regions.

Silent failures in models, such as timeouts or sudden increases in token consumption, are difficult to detect with traditional application logs. Model monitoring lets you set alerts for monitoring metrics such as cost, failure rate, and response latency. The system sends an alert immediately if a metric becomes abnormal.

Step 1: Turn on advanced monitoring

  1. Log on with an Alibaba Cloud account (or a RAM user with sufficient permissions), and on the Monitoring (Singapore or Beijing) page of the target workspace, click Monitoring Configurations in the upper-right corner.

  2. In the Advanced Monitoring area, turn on Performance and usage metrics monitoring.

Step 2: Create an alert rule

  1. On the Alerts (Singapore or Beijing) page, click Create Alert Rule in the upper-right corner.

  2. In the dialog box, select a model and a monitoring template, and then click Create. If the specified monitoring metrics, such as call statistics or performance metrics, become abnormal, the system notifies your team.

    • Notification method: Supported methods include text message, email, phone call, DingTalk group robot, WeCom Robot, and Webhook.

    • Alert level: Includes General, Warning, Error, and Urgent. These levels cannot be customized. The notification channel for each alert level is as follows:

      • Urgent: Phone call, text message, email

      • Error: Text message, email

      • Warning: Text message, email

      • General: Email

Integrate with Grafana and self-built applications

The monitoring metric data from monitoring is stored in your private Prometheus instance and supports the standard Prometheus HTTP API. You can use this API to integrate with Grafana or your custom applications for visual analytics.

Step 1: Get the data source HTTP API address

  1. Ensure that you have enabled Advanced Monitoring.

  2. On the Monitoring (Singapore), Monitoring (Virginia), or Monitoring (Beijing) page, click Monitoring Configurations in the upper-right corner. Click View Details to the right of the CloudMonitor Prometheus instance.

  3. On the Settings page, copy the HTTP API address that corresponds to your client's network environment (public network or VPC access).


Step 2: Integrate with Grafana or a self-built application

Connect to a custom application

The following example shows how to retrieve monitoring data using the Prometheus HTTP API. For more information about API usage, see the Prometheus HTTP API documentation.

  • Example 1: Query the token consumption (query=model_usage) for all models in all workspaces under your Alibaba Cloud account within a specified time range (all day on November 20, 2025, UTC), with a step size of step=60s.

    Example request:

    GET {HTTP API}/api/v1/query_range?query=model_usage&start=2025-11-20T00:00:00Z&end=2025-11-20T23:59:59Z&step=60s
    
    Accept: application/json
    Content-Type: application/json
    Authorization: Basic base64Encode(AccessKey:AccessKeySecret)
    • query: The value of query can be replaced with any metric name from the Monitoring metrics list below.

      View monitoring metrics

      | Type | Metric name | Description |
      | --- | --- | --- |
      | Number of calls | model_call_count | Total number of model calls |
      | Call duration | model_call_duration_total | Total duration of model calls |
      | | model_call_duration | Average duration of model calls |
      | | model_call_duration_p50 | p50 latency of model calls |
      | | model_call_duration_p99 | p99 latency of model calls |
      | | model_first_token_duration_total | Total time to first token |
      | | model_first_token_duration | Average time to first token |
      | | model_first_token_duration_p50 | p50 time to first token |
      | | model_first_token_duration_p99 | p99 time to first token |
      | Time per non-first token | model_generation_duration_per_token_total | Total time per non-first token |
      | | model_generation_duration_per_token | Average time per non-first token |
      | | model_generation_duration_per_token_p50 | p50 time per non-first token |
      | | model_generation_duration_per_token_p99 | p99 time per non-first token |
      | Usage | model_usage | Total model usage |

    • HTTP API: Replace {HTTP API} with the HTTP API address obtained in Step 1.

    • Authorization: Concatenate your Alibaba Cloud account's AccessKey ID and AccessKey secret as AccessKey:AccessKeySecret, Base64-encode the resulting string, and provide it in the format Basic <encoded-string>.

      Example value: Basic TFRBSTV3OWlid0U4XXXXU0xb1dZMFVodmRsNw==

      Note: The AccessKey ID and AccessKey secret must belong to the same Alibaba Cloud account as the Prometheus instance from Step 1.
  • Example 2: Building on Example 1, add filters to retrieve token consumption only for a specific model (model=qwen-plus) in a specific workspace (workspace_id=llm-nymssti2mzww****).

    Example request:

    GET {HTTP API}/api/v1/query_range?query=model_usage{workspace_id="llm-nymssti2mzww****",model="qwen-plus"}&start=2025-11-20T00:00:00Z&end=2025-11-20T23:59:59Z&step=60s
    
    Accept: application/json
    Content-Type: application/json
    Authorization: Basic base64Encode(AccessKey:AccessKeySecret)
    • query: Enclose multiple filter conditions in {} and separate them with commas. For example: {workspace_id="value1",model="value2"}. The following filter conditions (LabelKey) are supported.

      View supported filter conditions

      • user_id: The ID of the Alibaba Cloud account. For a RAM user, this is the UID. For instructions, see How to obtain it.

      • apikey_id: The ID of the API key. This is not the API key itself. You can obtain this ID from the Key Management (Singapore | US | Beijing) page. Note: a value of -1 indicates that the call was made from the Model Studio console, not through an API.

      • workspace_id: The ID of the workspace. For instructions, see How to get.

      • model: The model name.

      • protocol: The protocol type. Valid values:

        • HTTP: Non-streaming HTTP.

        • SSE: Streaming HTTP.

        • WS: The WebSocket protocol.

      • sub_protocol: The sub-protocol. Valid values:

        • DEFAULT: A synchronous call.

        • ASYNC: An asynchronous call. This is common for image generation models. For more information, see Text-to-image generation.

      • status_code: The HTTP status code. This LabelKey is supported only by the model_call_count metric.

      • error_code: The error code. This LabelKey is supported only by the model_call_count metric.

      • usage_type: The usage type. This LabelKey is supported only by the model_usage metric. Valid values: total_tokens, input_tokens, output_tokens, cache_tokens, image_tokens, audio_tokens, video_tokens, image_count, audio_count, video_count, duration, characters, audio_tts, times.
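The two examples above can be reproduced in a few lines of Python. The sketch below is not part of any official SDK: build_selector renders the {key="value"} filter syntax from Example 2, and build_query_range_request assembles the query_range URL and Basic Authorization header from Example 1. The endpoint and AccessKey values are placeholders that you must replace with your own.

```python
import base64
import urllib.parse
import urllib.request

def build_selector(metric, **labels):
    """Render a selector such as model_usage{model="qwen-plus"}."""
    if not labels:
        return metric
    filters = ",".join(f'{key}="{value}"' for key, value in labels.items())
    return f"{metric}{{{filters}}}"

def build_query_range_request(http_api, access_key_id, access_key_secret,
                              query, start, end, step="60s"):
    """Assemble a query_range request with Basic authentication. The
    AccessKey pair must belong to the same Alibaba Cloud account as the
    Prometheus instance from Step 1."""
    params = urllib.parse.urlencode(
        {"query": query, "start": start, "end": end, "step": step})
    token = base64.b64encode(
        f"{access_key_id}:{access_key_secret}".encode()).decode()
    return urllib.request.Request(
        f"{http_api}/api/v1/query_range?{params}",
        headers={"Accept": "application/json",
                 "Authorization": f"Basic {token}"})

# Placeholder endpoint and credentials; replace them, then send the
# request with urllib.request.urlopen(req).
query = build_selector("model_usage",
                       workspace_id="llm-nymssti2mzww****", model="qwen-plus")
print(query)  # model_usage{workspace_id="llm-nymssti2mzww****",model="qwen-plus"}
req = build_query_range_request("https://your-prometheus-endpoint", "AK", "SK",
                                query, "2025-11-20T00:00:00Z",
                                "2025-11-20T23:59:59Z")
```

Note that urlencode percent-escapes the braces and colons for you, so the selector string can be passed as-is in the query parameter.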

Connect to Grafana

Add a model monitoring data source in Grafana (self-hosted or Alibaba Cloud Grafana service). This topic uses Grafana 10.x (English version) as an example. The steps for other versions are similar. For more information, see the official Grafana documentation.

  1. Add the data source:

    1. Log on to Grafana with an administrator account. Click the menu icon in the upper-left corner of the page and choose Administration > Data sources. Click + Add new data source, and select Prometheus as the data source type.

    2. On the Settings tab, configure the data source information:

      • Name: Enter a custom name.

      • Prometheus server URL: Enter the HTTP API address from Step 1.

      • Auth: Enable Basic auth, and set User (your Alibaba Cloud account's AccessKey) and Password (your Alibaba Cloud account's AccessKey secret).

        The AccessKey and AccessKey secret must belong to the same Alibaba Cloud account as the Prometheus instance from Step 1.


    3. Click Save & Test at the bottom of the tab.

  2. Query metrics:

    1. Click the menu icon in the upper-left corner of the Grafana page and, in the navigation pane on the left, click Dashboards.

    2. Click New > New dashboard on the right side of the Dashboards page to create a new dashboard.

    3. Click + Add visualization and select the data source that you just created.

    4. On the Edit Panel page, click the Query tab. In query A, select the metric name in the Metric field and add conditions in the Label filters field as needed. For example, to query the model token consumption metric model_usage:


      In this example, the metric name (model_usage) can be replaced with any metric name from the Monitoring metrics list in the preceding section (Connect to a custom application).

      You can add label filters to refine the query. The supported filter conditions (LabelKey values) are the same as those listed in the preceding section (Connect to a custom application).

    5. Click Run queries.

      If data is successfully rendered in the chart, the configuration is successful. Otherwise, check the following: 1) The HTTP API address, AccessKey, and AccessKey secret are correct. 2) The Prometheus instance from Step 1 contains monitoring data.
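When a panel stays empty, it can also help to inspect the raw API response before blaming credentials. The sketch below parses a Prometheus query_range response; the JSON envelope (status/data/result) follows the standard Prometheus HTTP API format, but the sample payload and its values are fabricated for illustration.

```python
import json

def extract_series(response_body):
    """Flatten a Prometheus query_range response into
    {metric labels: [(timestamp, value), ...]}."""
    payload = json.loads(response_body)
    if payload.get("status") != "success":
        raise RuntimeError(f"query failed: {payload.get('error')}")
    series = {}
    for item in payload["data"]["result"]:
        labels = str(item["metric"])
        series[labels] = [(ts, float(v)) for ts, v in item["values"]]
    return series

# Fabricated sample in the standard query_range response format.
sample = json.dumps({
    "status": "success",
    "data": {"resultType": "matrix", "result": [{
        "metric": {"__name__": "model_usage", "model": "qwen-plus"},
        "values": [[1763596800, "120"], [1763596860, "95"]],
    }]},
})
for labels, points in extract_series(sample).items():
    print(labels, points)
```

If extract_series returns data for your real endpoint but Grafana shows nothing, the problem is in the panel configuration rather than the data source.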

Comparison of monitoring modes

Monitoring provides two modes: Basic Monitoring and Advanced Monitoring.

Basic Monitoring: This service is automatically enabled when you activate Model Studio and cannot be disabled.

Advanced Monitoring: This feature must be manually enabled by an Alibaba Cloud account (or a RAM user with sufficient permissions) on the Monitoring (Singapore), Monitoring (Virginia), or Monitoring (Beijing) page of the target workspace. You can also disable this feature. Only call data generated after this feature is enabled is recorded.

| Comparison item | Basic Monitoring (default) | Advanced Monitoring (manual activation required) |
| --- | --- | --- |
| Data latency | Hourly | Minute-level |
| View call statistics | Supported | Supported |
| View failed calls (details) | Not supported | Supported |
| View performance metrics | Supported | Supported |
| Applicable scope | All workspaces in the Alibaba Cloud account | Effective only in the workspace where it is enabled |
| Billing | Free | Paid |

Quotas and limits

  • Data retention period: Data for both basic and advanced monitoring is retained for 30 days by default. To query earlier usage information, use the Expenses and Costs page.

  • Alert template limit: You can create a maximum of 100 alert templates in each workspace.

  • API limit: You can query model monitoring metrics only through the Prometheus HTTP API.

    • Workaround: To obtain the token consumption for a single call through the API, read the usage field in the response of each model call. An example of this field's structure is as follows (for more information, see the Qwen API reference):

      {
        "prompt_tokens": 3019,
        "completion_tokens": 104,
        "total_tokens": 3123,
        "prompt_tokens_details": {
          "cached_tokens": 2048
        }
      }
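As a sketch, the usage field above can be processed like this. The field names follow the example response shown; summarize_usage is a hypothetical helper, not part of any SDK, and the cached/uncached split assumes cached_tokens counts a subset of prompt_tokens.

```python
# Usage field from the example response above.
usage = {
    "prompt_tokens": 3019,
    "completion_tokens": 104,
    "total_tokens": 3123,
    "prompt_tokens_details": {"cached_tokens": 2048},
}

def summarize_usage(usage):
    """Split a call's token usage into cached input, uncached input,
    and output parts for per-call cost tracking."""
    cached = usage.get("prompt_tokens_details", {}).get("cached_tokens", 0)
    return {
        "input_uncached": usage["prompt_tokens"] - cached,
        "input_cached": cached,
        "output": usage["completion_tokens"],
        "total": usage["total_tokens"],
    }

print(summarize_usage(usage))
# {'input_uncached': 971, 'input_cached': 2048, 'output': 104, 'total': 3123}
```

Accumulating these per-call summaries in your own application is a workable substitute until the monitoring API exposes per-call data directly.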

Billing description

FAQ

Why can't I find the number of calls and token consumption in monitoring after I call a model?

To troubleshoot the issue, perform the following steps:

  1. Data latency: Check the data synchronization latency. The latency for basic monitoring is several hours, and for advanced monitoring, it is several minutes.

  2. Workspace: If you are in a sub-workspace, you can view the data only within that space. Switch to the default workspace to view all data.

What are the possible reasons for a timeout when calling a model?

The following are common reasons:

  • Excessive output length: The model generates too much content, which causes the total time to exceed the client's waiting limit. You can use the streaming output method to obtain the first token faster.

  • Network issues: Check whether the network connectivity between the client and the Alibaba Cloud service is stable.

How do I configure permissions for a RAM user to enable advanced monitoring?

Procedure:

  1. Grant the RAM user the AliyunBailianFullAccess global management permission.

  2. Grant the RAM user the ModelMonitoring-FullAccess (or Administrator) page permissions to allow the RAM user to perform write operations on the model monitoring page.

  3. Grant the AliyunCloudMonitorFullAccess system policy to the RAM user.

  4. Create a custom policy that allows the RAM user to create service-linked roles, and grant it to the RAM user:

    1. Log on to the RAM console. In the navigation pane on the left, choose Permissions > Policies, and then click Create Policy.

    2. Click JSON. Paste the following content into the policy editor and click OK.

      {
          "Version": "1",
          "Statement": [
              {
                  "Action": "ram:CreateServiceLinkedRole",
                  "Resource": "*",
                  "Effect": "Allow"
              }
          ]
      }
    3. Enter the policy name CreateServiceLinkedRole and click OK.

    4. In the navigation pane on the left, choose Identities > Users. Find the RAM user to which you want to grant permissions and click Add Permissions in the Actions column.

    5. From the Policies list, select the policy that you just created (CreateServiceLinkedRole), and then click Grant permissions. The RAM user now has the permission to create a service-linked role.

  5. After you configure all the preceding permissions, return to the Monitoring (Singapore), Monitoring (Virginia), or Monitoring (Beijing) page, and use your RAM user to enable Advanced Monitoring.

How do I configure permissions for a RAM user to enable inference logs?

Procedure:

  1. Grant the RAM user the AliyunBailianFullAccess global management permission.

  2. Grant the RAM user the ModelMonitoring-FullAccess (or Administrator) page permission, to allow them to perform write operations on the monitoring page.

  3. For the RAM user, grant the AliyunLogFullAccess system policy.

  4. Create a custom policy that allows the RAM user to create service-linked roles, and grant it to the RAM user:

    1. Log on to the RAM console. In the navigation pane on the left, choose Permissions > Policies and then click Create Policy.

    2. Click JSON, paste the following content into the policy editor, and click OK.

      {
          "Version": "1",
          "Statement": [
              {
                  "Action": "ram:CreateServiceLinkedRole",
                  "Resource": "*",
                  "Effect": "Allow"
              }
          ]
      }
    3. Enter CreateServiceLinkedRole as the policy name and click OK.

    4. In the navigation pane on the left, choose Identities > Users. Find the RAM user to which you want to grant permissions and click Add Permissions in the Actions column.

    5. From the Policies list, select the policy that you just created (CreateServiceLinkedRole), and then click Grant permissions. The RAM user now has the permission to create a service-linked role.

  5. After you configure all the preceding permissions, return to the Monitoring (Beijing) page, and use the RAM user to enable Logs.

Appendix

Glossary

  • Real-time Inference: Includes all direct and indirect calls to a model in the following scenarios:

    • API calls through the DashScope SDK or OpenAI-compatible interfaces

    • Playground

    • Model Studio applications in test or published states, including agents, workflows, agent orchestration applications, and nodes that involve model calls, such as LLM nodes, intent classification nodes, and agent group nodes

    • Assistant API calls

    • Application calls

  • Batch Inference: Processes large-scale data offline for scenarios that do not require real-time responses, using the API described in OpenAI compatible - Batch.