
Platform For AI: Service monitoring

Last Updated: Mar 11, 2026

You can view EAS service metrics on the Monitoring page to understand how the service is invoked and how it is running.

View service monitoring information

  1. Log on to the PAI console. Select a region on the top of the page. Then, select the desired workspace and click Elastic Algorithm Service (EAS).

  2. Click the target service name to enter the details page. Switch to the Monitoring tab.

  3. View service monitoring information.

    Switch Dashboards

    Dashboards are divided into service and instance dimensions. Switch between them as follows:


    • Service: Service dimension. Default service monitoring dashboard name format is Service-<service_name>, where <service_name> is the EAS service name.

    • Instance: Instance dimension, divided into single instance and multiple instances.

      • Single Instance: Displays monitoring data for a single instance. Switch between different instances to view their data.


      • Multiple Instance: Displays monitoring data for multiple instances. Select multiple instances to compare and view their data.


    Switch Time Range

    Click the time picker icon on the right side of the Monitoring area to switch the time range displayed on the dashboard.

    Important

    Minute-level monitoring metrics are retained for a maximum of 1 month. Second-level monitoring metrics are retained for a maximum of 1 hour.

    Important

    LLM-related monitoring items are displayed only when the service tags contain "ServiceEngineType": "vllm" or "ServiceEngineType": "sglang".

Monitoring metrics

Service Monitoring Dashboard (Minute-Level)

Monitor the following metrics on this dashboard:

Metric

Description

QPS

Requests per second for the service. Requests with different return codes are calculated separately. For services with multiple instances, this metric is the sum across all instances. The 1d offset series shows QPS at the same time on the previous day, for day-over-day comparison.

Response

Total responses received by the service within the selected time range. Responses with different return codes are calculated separately. For services with multiple instances, this metric is the sum across all instances.

RT

Request response time.

  • Avg: Average response time of all requests at that time.

  • TPXX: Maximum response time of the top XX percent of requests after sorting all request times from lowest to highest at that time.

    For example, TP5 indicates maximum response time of the top 5% of requests. TP100 indicates maximum response time of all requests.

    For services with multiple instances, TP100 indicates maximum request response time across all instances. Other TPXX values are the average of TPXX across all instances. For example, TP5 indicates the average of TP5 across all instances.
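Under the definition above, a TPXX value can be sketched as follows. This is an illustrative helper, not EAS's actual implementation; the sample response times are made up.

```python
def tpxx(response_times, xx):
    """Maximum response time among the fastest XX percent of requests.

    Per the definition above: sort all request times from lowest to
    highest, take the top XX percent, and return the largest value.
    """
    if not response_times:
        return None
    times = sorted(response_times)
    # Number of requests in the fastest XX percent (at least one).
    k = max(1, int(len(times) * xx / 100))
    return times[k - 1]

rts = [12, 15, 18, 20, 25, 30, 45, 60, 80, 200]  # milliseconds
print(tpxx(rts, 50))   # 25 (max of the fastest 50% of requests)
print(tpxx(rts, 100))  # 200 (TP100 is the overall maximum)
```

For a multi-instance service, TP100 would be the maximum of the per-instance TP100 values, while other TPXX values are averaged across instances.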

Daily Invoke

Daily service calls. Calls with different return codes are calculated separately. For services with multiple instances, this metric is the sum across all instances.

More Metrics (CPU | Memory | GPU | Network | Resources)

Metric

Description

CPU

CPU

Average CPU usage of the service at that time. Unit: CPU cores. For services with multiple instances, this metric is the average across all instances.

CPU Utilization

Average CPU utilization of the service at that time. Calculation: Average CPU usage ÷ Maximum available CPU cores. For services with multiple instances, this metric is the average across all instances.

CPU Total

Total available CPU cores for the service at that time. Calculation: Available CPU cores per single instance × Number of service instances.
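The CPU calculations above can be checked with a short sketch. All numbers here are illustrative, not taken from a real service.

```python
# Illustrative values, not taken from a real service.
instances = 4
cores_per_instance = 8  # available CPU cores per instance
avg_usage_per_instance = [2.0, 3.5, 1.5, 3.0]  # cores in use, per instance

# CPU: average usage across all instances.
cpu = sum(avg_usage_per_instance) / instances  # 2.5 cores

# CPU Total: available cores per instance x number of instances.
cpu_total = cores_per_instance * instances  # 32 cores

# CPU Utilization: average CPU usage / maximum available CPU cores.
cpu_utilization = cpu / cores_per_instance  # 0.3125, i.e. 31.25%

print(cpu, cpu_total, cpu_utilization)
```

The same usage-over-capacity pattern applies to Memory Utilization (Memory RSS divided by Memory Total).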

Memory

Memory

Average memory usage of the service at that time. For services with multiple instances, this metric is the average across all instances.

  • RSS: Resident physical memory size.

  • Cache: Cache size.

  • Total: Maximum available physical memory size for a single instance.

Memory Utilization

Average memory utilization of the service at that time. Calculation: Memory RSS ÷ Memory Total. For services with multiple instances, this metric is the average across all instances.

GPU

GPU Utilization

For GPU-enabled services, average GPU utilization at that time. For services with multiple instances, this metric is the average across all instances.

GPU Memory

For GPU-enabled services, GPU memory usage at that time. For services with multiple instances, this metric is the average across all instances.

GPU Total

For GPU-enabled services, total number of GPUs at that time. For services with multiple instances, this metric is the sum of GPUs across all instances.

GPU Memory Utilization

For GPU-enabled services, GPU memory utilization at that time. For services with multiple instances, this metric is the average across all instances.

Network

Traffic

Traffic received and sent by the service, in bits per second. For services with multiple instances, this metric is the average across all instances.

Where:

  • In: Traffic received.

  • Out: Traffic sent.

TCP Connections

Number of TCP connections.

Resources

Replicas

Number of service instances in different states at that time: Total, Pending, Available.

Replicas By Resource

Number of service instances by resource type at that time: Total, Dedicated (dedicated resources), Public (public resources).

Single Instance Monitoring Dashboard (Minute-Level)

Monitor the following metrics on this dashboard:

Metric

Description

QPS

Requests per second received by this instance. Requests with different return codes are calculated separately.

RT

Response time of requests for this instance.

Response

Total responses received by this instance within the selected time range. Responses with different return codes are calculated separately.

More Metrics (CPU | Memory | GPU | Network | Resources)

Metric

Description

CPU

CPU

CPU usage of this instance, in CPU cores.

CPU Utilization

Average CPU utilization of this instance at that time. Calculation: Average CPU usage ÷ Maximum available CPU cores.

Memory

Memory

Memory usage of this instance.

  • RSS: Resident physical memory size.

  • Cache: Cache size.

  • Total: Maximum available physical memory size for a single instance.

Memory Utilization

Average memory utilization of this instance at that time. Calculation: Memory RSS ÷ Memory Total.

GPU

GPU Utilization

GPU utilization of this instance.

GPU Memory

GPU memory usage of this instance.

GPU Memory Utilization

GPU memory utilization of this instance.

Network

Traffic

Traffic received and sent by this instance, in bits per second.

Where:

  • In: Traffic received.

  • Out: Traffic sent.

TCP Connections

Number of TCP connections.

Multiple Instance Monitoring Dashboard

Minute-level and second-level monitoring metrics are detailed below.

  • Minute-Level

    Metric

    Description

    Instance QPS

    Requests per second for each instance. Requests with different return codes are calculated separately.

    Instance RT

    Average response time for each instance.

    Instance CPU

    CPU usage for each instance, in CPU cores.

    Instance Memory -- RSS

    Resident physical memory size for each instance.

    Instance Memory -- Cache

    Cache size for each instance.

    Instance GPU

    GPU utilization for each instance.

    Instance GPU Memory

    GPU memory usage for each instance.

    Instance TCP Connections

    Number of TCP connections for each instance.

  • Second-Level

    Important

    Data is collected at 5-second granularity. Only the last 1 hour of data is retained.

    Metric

    Description

    Instance QPS Fine

    Requests per second received by each instance. Requests with different return codes are calculated separately.

    Instance RT Fine

    Average response time for requests received by each instance.

GPU Monitoring Dashboard

Monitor the following GPU metrics at service and instance levels. Service-level metrics represent the average across all instances.

Metric

Description

GPU Utilization

GPU utilization of the service at that time.

GPU Memory

GPU memory usage and total GPU memory of the service at that time.

  • Used: GPU memory usage at that time.

  • Total: Total GPU memory at that time.

Memory Copy Utilization

GPU memory copy utilization of the service at that time.

GPU Memory Utilization

GPU memory utilization of the service at that time. Calculation: Memory usage ÷ Total memory.

PCIe

PCIe (Peripheral Component Interconnect Express) rate of the service at that time, measured by DCGM. PCIe is a high-speed serial computer expansion bus standard.

  • PCIe Transmit: PCIe transmission rate at that time.

  • PCIe Receive: PCIe reception rate at that time.

Memory Bandwidth

GPU memory bandwidth metric of the service at that time.

SM Utilization and Occupancy

SM (Streaming Multiprocessor) related metrics of the service at that time. SMs are core components of a GPU, responsible for executing and scheduling parallel computing tasks.

  • SM Utilization: SM utilization at that time.

  • SM Occupancy: Ratio of active warps resident on the SM at that time.

Graphics Engine Utilization

GPU graphics engine utilization of the service at that time.

Pipe Active Ratio

Activity rate of the GPU compute pipelines of the service at that time.

  • Pipe Fp32 Active Ratio: FP32 pipeline activity rate at that time.

  • Pipe Fp16 Active Ratio: FP16 pipeline activity rate at that time.

  • Pipe Tensor Active Ratio: Tensor pipeline activity rate at that time.

Tflops Usage

Tflops (tera floating-point operations per second) throughput of the GPU compute pipelines of the service at that time.

  • FP32 Tflops Used: FP32 pipeline Tflops throughput at that time.

  • FP16 Tflops Used: FP16 pipeline Tflops throughput at that time.

  • Tensor Tflops Used: Tensor pipeline Tflops throughput at that time.

DRAM Active Ratio

Activity rate of the GPU device memory interface sending or receiving data at that time.

SM Clock

SM clock frequency of the service at that time.

GPU Temperature

GPU temperature related metrics of the service at that time.

  • GPU Temperature: GPU temperature at that time.

  • GPU Slowdown Temperature: GPU throttling temperature threshold at that time. When the GPU temperature reaches this value, the GPU automatically reduces its operating frequency to prevent overheating.

  • GPU Shutdown Temperature: GPU shutdown temperature threshold at that time. When the GPU temperature reaches this value, the system forces the GPU device to shut down. This prevents hardware damage or more severe system failures due to GPU overheating.

Power Usage

GPU power consumption of the service at that time.

The following are GPU health status and anomaly information metrics:

Metric

Description

GPU Health Count

Number of healthy GPU cards for the service at that time.

GPU Lost Card Num

Number of lost GPU cards for the service at that time.

ECC Error Count

Number of ECC errors for the service at that time. ECC (Error Correction Code) detects and corrects errors during GPU memory data transmission or storage.

  • Volatile SBE ECC Error: Number of single-bit volatile ECC errors for the service at that time.

  • Volatile DBE ECC Error: Number of double-bit volatile ECC errors for the service at that time.

  • Aggregate SBE ECC Error: Number of single-bit persistent ECC errors for the service at that time.

  • Aggregate DBE ECC Error: Number of double-bit persistent ECC errors for the service at that time.

  • Uncorrectable ECC Error: Number of uncorrectable ECC errors for the service at that time.

NVSwitch Error Count

Number of NVSwitch errors for the service at that time. NVSwitch provides high-bandwidth, low-latency communication channels for high-speed communication between multiple GPUs.

  • NVSwitch Fatal Error: Number of fatal NVSwitch errors for the service at that time.

  • NVSwitch Non-Fatal Error: Number of non-fatal NVSwitch errors for the service at that time.

Xid Error Count

Number of Xid errors for the service at that time. Xid errors are error codes reported by the GPU driver. They indicate issues encountered by the GPU during operation. These errors are typically recorded in system logs (such as Linux dmesg or Windows Event Viewer) and represented as Xid codes.

  • Xid Error: Number of non-fatal Xid errors for the service at that time.

  • Fatal Xid Error: Number of fatal Xid errors for the service at that time.

Kernel Error Count

Number of non-Xid errors for the service at that time. Non-Xid errors refer to other types of errors reported in kernel logs, excluding Xid errors.

Driver Hang

Number of GPU driver hangs for the service at that time.

Remap Status

Row-remapping status of the service's GPUs at that time, reported when a GPU attempts to remap faulty GPU memory rows.

VLLM Monitoring Dashboard

If the service has multiple instances, throughput-related metrics are summed across instances, and latency-related metrics are averaged across instances.

Metric

Description

Requests Status

Total requests for the service at that time.

  • Running: Number of requests running on the GPU at that time.

  • Waiting: Number of requests waiting for processing at that time.

  • Swapped: Number of requests swapped to the CPU at that time.

Token Throughput

Input and output token throughput for all requests of the service at that time.

  • TPS_IN: Input tokens per second at that time.

  • TPS_OUT: Output tokens per second at that time.

Request Completion Status

Completion status statistics for all requests of the service at that time.

  • preemptions: Requests preempted.

  • stop: Requests successfully completed due to natural termination (the model output a stop token, such as <EOS>).

  • length: Requests reached the maximum output token length.

  • abort: Requests forcibly terminated.

Time To First Token

Time to first token latency for all requests of the service at that time (time from receiving a request to generating the first token).

  • Avg: Average time to first token latency for all requests at that time.

  • TPXX: Percentile values for time to first token latency for all requests at that time.

Time Per Output Token

Time per output token latency for all requests of the service at that time (average time required for each output token after the first token is generated).

  • Avg: Average time per token latency for all requests at that time.

  • TPXX: Percentile values for time per token latency for all requests at that time.

E2E Request Latency

End-to-end latency for all requests of the service at that time (time from receiving a request to returning all tokens).

  • Avg: Average end-to-end latency for all requests at that time.

  • TPXX: Percentile values for end-to-end latency for all requests at that time.
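For a streamed response, the three latency metrics above are roughly related: end-to-end latency is approximately the time to first token plus the time per output token for each remaining token. The sketch below uses made-up timings and ignores queueing and network overhead.

```python
# Hypothetical per-request timings, in seconds (illustrative only).
ttft = 0.8   # time to first token
tpot = 0.05  # average time per output token after the first
output_tokens = 101

# Back-of-envelope end-to-end latency: TTFT plus TPOT for each
# of the remaining (output_tokens - 1) tokens.
e2e = ttft + tpot * (output_tokens - 1)
print(e2e)  # 5.8
```

This rule of thumb helps attribute a high E2E Request Latency to either a slow prefill (high TTFT) or a slow decode (high TPOT).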

Queue Time

Queue waiting latency for all requests of the service at that time (time requests wait in queue for engine processing).

  • Avg: Average queue waiting latency for all requests at that time.

  • TPXX: Percentile values for queue waiting latency for all requests at that time.

Inference Time

Inference latency for all requests of the service at that time (time requests are processed by the engine).

  • Avg: Average inference latency for all requests at that time.

  • TPXX: Percentile values for inference latency for all requests at that time.

Prefill Time

Prefill stage latency for all requests of the service at that time (time the engine processes request input tokens).

  • Avg: Average prefill latency for all requests at that time.

  • TPXX: Percentile values for prefill latency for all requests at that time.

Decode Time

Decode stage latency for all requests of the service at that time (time the engine generates output tokens).

  • Avg: Average decode latency for all requests at that time.

  • TPXX: Percentile values for decode latency for all requests at that time.

Input Token Length

Number of input tokens processed by the service at that time.

  • Avg: Average input token length for all requests at that time.

  • TPXX: Percentile values for input token length for all requests at that time.

Output Token Length

Number of output tokens generated by the service at that time.

  • Avg: Average output token length for all requests at that time.

  • TPXX: Percentile values for output token length for all requests at that time.

Request Parameters (params_n & max_tokens)

Parameter N and parameter max_tokens for all requests of the service at that time.

  • Params_n: Average value of parameter N for all requests at that time.

  • Params_max_tokens: Average value of parameter max_tokens for all requests at that time.

GPU KV Cache Usage

Average GPU KV cache utilization of the service at that time.

CPU KV Cache Usage

Average CPU KV cache utilization of the service at that time.

Prefix Cache Hit Rate

Average prefix cache hit rate for all requests of the service at that time.

  • GPU: Average GPU prefix cache hit rate for all requests at that time.

  • CPU: Average CPU prefix cache hit rate for all requests at that time.

HTTP Requests by Endpoint

Number of requests for the service at that time, grouped by request method, path, and response status code.

HTTP Request Latency

Average latency for different request paths of the service at that time.

Speculative Decoding Throughput

Speculative decoding count for the service at that time. For services with multiple instances, this metric is the average across all instances.

  • Drafts: Number of draft generations at that time.

  • Draft Tokens: Number of draft tokens proposed at that time.

  • Accepted Tokens: Number of draft tokens accepted at that time.

  • Emitted Tokens: Number of tokens emitted at that time.

Speculative Decoding Efficiency

Speculative decoding performance of the service at that time.

  • Draft Acceptance Rate: Average ratio of draft tokens accepted at that time.

  • Efficiency: Average efficiency of speculative decoding at that time.
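These two ratios can be illustrated with made-up counters for one interval. The exact efficiency formula varies across engine versions, so the one below is only an assumption for illustration.

```python
# Illustrative speculative-decoding counters for one interval.
drafts = 100            # number of draft generations
draft_tokens = 400      # draft tokens proposed
accepted_tokens = 280   # draft tokens accepted by the target model
emitted_tokens = 380    # tokens actually emitted to the output

# Draft Acceptance Rate: fraction of proposed draft tokens accepted.
acceptance_rate = accepted_tokens / draft_tokens  # 0.7

# Efficiency (assumed formula): emitted tokens relative to the
# theoretical maximum of one bonus token per draft on top of the
# proposed draft tokens.
efficiency = emitted_tokens / (draft_tokens + drafts)  # 0.76

print(acceptance_rate, efficiency)
```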

Token Acceptance by Position

Number of draft tokens accepted at different generation positions for the service at that time. For services with multiple instances, this metric is the average across all instances.

SGLang Monitoring Dashboard

If the service has multiple instances, throughput-related metrics are summed across instances, and latency-related metrics are averaged across instances.

Metric

Description

Requests Num

Total requests for the service at that time.

  • Running: Number of requests running on the GPU at that time.

  • Waiting: Number of requests waiting for processing at that time.

Token Throughput

Input and output token throughput for all requests of the service at that time.

  • TPS_IN: Input tokens per second at that time.

  • TPS_OUT: Output tokens per second at that time.

Time To First Token

Time to first token latency for all requests of the service at that time. Time to first token latency is the time from receiving a request to generating the first token.

  • Avg: Average time to first token latency for all requests at that time.

  • TPXX: Percentile values for time to first token latency for all requests at that time.

Time Per Output Token

Time per output token latency for all requests of the service at that time. Time per token latency is the average time required for each subsequent output token after the first token is generated.

  • Avg: Average time per token latency for all requests at that time.

  • TPXX: Percentile values for time per token latency for all requests at that time.

E2E Request Latency

End-to-end latency for all requests of the service at that time. End-to-end latency is the time from receiving a request to returning all tokens.

  • Avg: Average end-to-end latency for all requests at that time.

  • TPXX: Percentile values for end-to-end latency for all requests at that time.

Cache Hit Rate

Average prefix cache hit rate for all requests of the service at that time.

Used Tokens Num

Number of KV cache tokens used by the service at that time. For services with multiple instances, this metric is the average across all instances.

Token Usage

Average KV cache token utilization of the service at that time. For services with multiple instances, this metric is the average across all instances.

FAQ

Q: LLM Monitoring Dashboard Missing from Monitoring Page

Problem Description: After deploying a model using EAS custom deployment, the monitoring page only shows general Service and GPU monitoring, and LLM monitoring is missing.

Root Cause: The service configuration lacks the key tag ServiceEngineType. This tag explicitly declares the backend inference engine type.


Note

Other parameters provided by Model Gallery deployment do not affect LLM monitoring, except for the ServiceEngineType tag.

Solution: Update the service configuration and add the ServiceEngineType tag, setting its value to the inference engine in use (only vllm and sglang are supported).
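A hedged illustration of what the tag might look like in the service configuration JSON, assuming it is set through the service's labels field. The service name and surrounding fields are placeholders; only the ServiceEngineType entry matters here.

```json
{
  "metadata": {
    "name": "my_llm_service"
  },
  "labels": {
    "ServiceEngineType": "vllm"
  }
}
```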

Q: Why do /metrics 200 log entries appear frequently?

After the ServiceEngineType tag is correctly configured and takes effect, the EAS backend periodically calls the inference deployment framework's /metrics API operation. This occurs approximately every 10-15 seconds, including the collection interval and polling across all pods. This API operation provides real-time framework metrics in Prometheus format, which the frontend uses to render LLM monitoring data.
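To make the mechanism concrete, the sketch below parses Prometheus-format text such as a /metrics endpoint returns. The sample payload and metric names follow the vLLM convention but are assumptions here; real names depend on the engine version. A real poll would fetch the text over HTTP instead of using an inline string.

```python
# Sample Prometheus exposition-format payload (illustrative only).
sample = """\
# HELP vllm:num_requests_running Number of requests currently running.
# TYPE vllm:num_requests_running gauge
vllm:num_requests_running 3.0
vllm:num_requests_waiting 7.0
"""

def parse_metrics(text):
    """Return {metric_name: value} for simple unlabeled samples."""
    metrics = {}
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        name, _, value = line.partition(" ")
        metrics[name] = float(value)
    return metrics

print(parse_metrics(sample))
# {'vllm:num_requests_running': 3.0, 'vllm:num_requests_waiting': 7.0}
```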
