Platform for AI: Horizontal auto scaling feature

Last Updated: Jan 19, 2026

Manually managing replicas to handle traffic peaks and valleys is inefficient and can lead to slow response times, service overloads, or idle resources. Horizontal auto scaling in Elastic Algorithm Service (EAS) automatically adjusts the number of service replicas based on real-time load. This ensures service stability while maximizing resource utilization, striking an optimal balance between cost and performance.

How it works

Horizontal auto scaling dynamically adjusts the number of replicas based on configured metric thresholds.

  • Calculate the target number of replicas: The system computes the target number of replicas (desiredReplicas) by multiplying the current number of replicas (currentReplicas) by the ratio of the current metric value (currentMetricValue) to the desired metric value (desiredMetricValue), and then rounding up to the nearest integer.

    • Formula: desiredReplicas = ceil[currentReplicas × ( currentMetricValue / desiredMetricValue )]

    • Example: Assume you have 2 current replicas and the QPS Threshold of Individual Instance is set to 10. When the average QPS per replica rises to 23, the target number of replicas becomes 5 = ceil[2 * (23/10)]. Later, if the average QPS drops to 2, the target number of replicas becomes 1 = ceil[5 * (2/10)].

    • If you configure multiple metrics, the system calculates the target number of replicas for each metric and uses the maximum of these values as the final target number of replicas.

  • Trigger logic: When the calculated target number of replicas is greater than the current number, the system triggers a scale-out. When the target is less than the current number, it triggers a scale-in.

    Important

    To prevent frequent scale-out and scale-in operations due to metric fluctuations, the system applies a 10% toleration range to the threshold. For example, if the queries per second (QPS) threshold is set to 10, a scale-out operation is triggered only when the QPS is consistently above 11 (10 × 1.1). This means:

    • If the QPS briefly fluctuates between 10 and 11, the system does not scale out.

    • A scale-out operation is triggered only when the QPS remains above 11.

    This mechanism reduces unnecessary resource changes and improves system stability and cost-effectiveness.

  • Delayed execution: Scaling operations support a delay mechanism to prevent frequent adjustments caused by brief traffic fluctuations.
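
The calculation and toleration rules above can be sketched in a few lines of Python. This is a minimal illustration, not the EAS implementation: the function names are hypothetical, and applying the 10% toleration symmetrically to scale-in is an assumption (the text above only describes it for scale-out).

```python
import math


def desired_replicas(current_replicas, current_value, target_value, tolerance=0.1):
    """Target replica count for a single metric, following the formula above."""
    ratio = current_value / target_value
    # Inside the toleration band the replica count is left unchanged.
    # Assumption: the band is symmetric around the threshold.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    return math.ceil(current_replicas * current_value / target_value)


def final_target(current_replicas, metrics):
    """metrics is a list of (current_value, target_value) pairs.

    With several metrics configured, the largest per-metric target wins.
    """
    return max(desired_replicas(current_replicas, cur, tgt) for cur, tgt in metrics)


print(desired_replicas(2, 23, 10))            # 5, scale-out from the example above
print(desired_replicas(5, 2, 10))             # 1, scale-in from the example above
print(desired_replicas(2, 10.5, 10))          # 2, QPS between 10 and 11 changes nothing
print(final_target(2, [(23, 10), (50, 80)]))  # 5, the QPS metric dominates the CPU metric
```

The service's Minimum Replicas and Maximum Replicas settings would additionally bound the result; that clamping is left out here to keep the sketch focused on the formula.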

User guide

Configure horizontal auto scaling policies using the PAI console or the eascmd client.

Enable or update horizontal auto scaling

Use the console

  1. Log on to the PAI console. Select a region at the top of the page. Then, select the desired workspace and click Elastic Algorithm Service (EAS).

  2. In the service list, click the name of the target service to go to the service details page.

  3. On the Auto Scaling tab, in the Auto Scaling section, click Enable Auto Scaling or Update.

  4. In the Auto Scaling Settings dialog box, configure the following parameters.

Parameter description
  • Basic configuration

    • Minimum Replicas

      Description: The minimum number of replicas a service can scale in to. The minimum value is 0.

      Recommendations and risk warnings: For production services that require continuous availability, it is strongly recommended to set this value to 1 or higher.

      Important

      Setting this to 0 removes all service replicas when there is no traffic. New requests will then face a full cold start delay (which can range from tens of seconds to several minutes), during which the service is unavailable. Additionally, services that use a dedicated gateway do not support setting this value to 0.

    • Maximum Replicas

      Description: The maximum number of replicas a service can scale out to. The maximum value is 1000.

      Recommendations and risk warnings: Set this value based on your estimated peak traffic and account resource quota to prevent unexpected traffic spikes from causing cost overruns.

    • General Scaling Metrics

      Description: Built-in performance metrics used to trigger scaling.

      Recommendations and risk warnings:

      • QPS Threshold of Individual Instance: Set based on stress test results, typically at 70% to 80% of a single replica's optimal performance.

        Important

        To set the single-replica QPS threshold to a decimal value, use the client (eascmd) and set the qps1k field.

      • CPU Utilization Threshold: Setting this too low can waste resources, while setting it too high can negatively impact request latency. Set this value based on response time (RT) metrics.

      • GPU Utilization Threshold: Set this value based on RT metrics.

      • Asynchronous Queue Length: Applies only to asynchronous services. Set this based on the average task processing time and acceptable latency. For more information, see Configure horizontal auto scaling for asynchronous inference services.

    • Custom Scaling Metric

      Description: You can report custom metrics and use them for auto scaling. For more information, see Custom monitoring and scaling metrics.

      Recommendations and risk warnings: Suitable for complex scenarios where built-in metrics do not meet business requirements.

  • Advanced configuration

    • Scale-out Starts in

      Description: The observation window for scale-out decisions. After a scale-out is triggered, the system observes the metric during this period. If the metric value falls back below the threshold, the scale-out is canceled. The unit is seconds.

      Recommendations and risk warnings: The default is 0 seconds, which means scale-out occurs immediately. Increase this value (for example, to 60 seconds) to prevent unnecessary scaling caused by transient traffic spikes.

    • Scale-in Starts in

      Description: The observation window for scale-in decisions, which is a key parameter to prevent service jitter. A scale-in only occurs after the metric remains below the threshold for this entire duration. The unit is seconds.

      Recommendations and risk warnings: The default is 300 seconds. This is the core safeguard against frequent scale-in events due to traffic fluctuations. Do not set this value too low, as it may affect service stability.

    • Scale-in to 0 Instance Starts in

      Description: When Minimum Replicas is 0, this parameter defines the wait time before the replica count is reduced to 0.

      Recommendations and risk warnings: Delays the complete shutdown of the service, providing a buffer for potential traffic recovery.

    • Scale-from-Zero Replica Count

      Description: The number of replicas to add when the service scales out from 0 replicas.

      Recommendations and risk warnings: Set to a value that can handle the initial traffic burst and reduce service unavailability during a cold start.
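
To make the scale-in observation window concrete, the following Python sketch checks whether a metric stayed below its threshold for an entire window. It is only an illustration of the rule described for Scale-in Starts in: the function name, the sample format, and the 60-second sampling interval are assumptions, and the actual EAS controller may evaluate the window differently.

```python
def scale_in_allowed(samples, threshold, window_seconds, now):
    """Decide whether a scale-in may proceed.

    samples: list of (timestamp_seconds, metric_value), oldest first.
    Returns True only if the history covers the whole window and every
    sample inside the window is below the threshold.
    """
    window_start = now - window_seconds
    # Without history reaching back to the start of the window, keep waiting.
    if not samples or samples[0][0] > window_start:
        return False
    recent = [value for ts, value in samples if ts >= window_start]
    return bool(recent) and all(value < threshold for value in recent)


# One sample per minute for five minutes (hypothetical data).
quiet = [(t, 4) for t in range(0, 301, 60)]
spiky = [(t, 12 if t == 120 else 4) for t in range(0, 301, 60)]

print(scale_in_allowed(quiet, threshold=10, window_seconds=300, now=300))  # True
print(scale_in_allowed(spiky, threshold=10, window_seconds=300, now=300))  # False
```

A single spike inside the window is enough to block the scale-in, which is why a longer window dampens jitter at the cost of holding idle replicas longer.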

Use the client

Before you run the commands, ensure you have downloaded and authenticated the client. Both enabling and updating use the autoscale command. Set the policy by using the -D parameter or a JSON configuration file.

  • Parameter format:

    # Format: eascmd autoscale [region]/[service_name] -D[attr_name]=[attr_value]
    # Example: Set the minimum number of replicas to 2, the maximum to 5, and the QPS threshold to 10.
    eascmd autoscale cn-shanghai/test_autoscaler -Dmin=2 -Dmax=5 -Dstrategies.qps=10
    # Example: Set the scale-in delay to 100 seconds.
    eascmd autoscale cn-shanghai/test_autoscaler -Dbehavior.scaleDown.stabilizationWindowSeconds=100
  • Configuration file format:

    # Step 1: Create a configuration file (for example, scaler.json).
    # Step 2: Run the command: eascmd autoscale [region]/[service_name] -s [desc_json]
    # Example
    eascmd autoscale cn-shanghai/test_autoscaler -s scaler.json
Configuration example

The following scaler.json example includes common configuration options:

scaler.json

{
    "min": 1,
    "max": 2,
    "behavior": {
        "onZero": {
            "interceptTraffic": false,
            "scaleDownGracePeriodSeconds": 700,
            "scaleUpActivationReplicas": 2
        },
        "scaleDown": {
            "stabilizationWindowSeconds": 20
        },
        "scaleUp": {
            "stabilizationWindowSeconds": 10
        }
    },
    "scaleStrategies": [
        {
            "metricName": "queue[backlog]",
            "threshold": 10
        },
        {
            "metricName": "qps",
            "threshold": 1
        },
        {
            "metricName": "cpu",
            "threshold": 80
        },
        {
            "metricName": "gpu[util]",
            "threshold": 60
        }
    ]
}
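
Before passing a file like this to eascmd, a quick local sanity check can catch obvious mistakes. The Python sketch below is a hypothetical helper, not part of eascmd. It applies the 0 and 1000 bounds stated in the console parameter description; whether the client enforces the same limits is an assumption.

```python
import json


def check_policy(policy):
    """Return a list of problems found in an autoscale policy dict."""
    problems = []
    lo, hi = policy.get("min"), policy.get("max")
    if not isinstance(lo, int) or lo < 0:
        problems.append("min must be an integer >= 0")
    if not isinstance(hi, int) or hi > 1000:
        problems.append("max must be an integer <= 1000")
    if isinstance(lo, int) and isinstance(hi, int) and lo > hi:
        problems.append("min must not exceed max")
    for strategy in policy.get("scaleStrategies", []):
        if strategy.get("threshold", 0) <= 0:
            problems.append(f"{strategy.get('metricName')}: threshold must be positive")
    return problems


good = json.loads('{"min": 1, "max": 2, "scaleStrategies": [{"metricName": "qps", "threshold": 1}]}')
bad = json.loads('{"min": 3, "max": 2, "scaleStrategies": [{"metricName": "cpu", "threshold": 0}]}')

print(check_policy(good))  # []
print(check_policy(bad))   # ['min must not exceed max', 'cpu: threshold must be positive']
```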
Parameter description

Parameter

Description

min

Minimum number of replicas.

max

Maximum number of replicas.

scaleStrategies

Scaling metrics and thresholds.

  • qps: QPS Threshold of Individual Instance.

    Important

    The QPS Threshold of Individual Instance metric supports decimal values with up to 2 decimal places, such as 1.25. To set a decimal value, use the qps1k field. For example, setting qps1k to 1250 means that a scale-out is triggered when the average QPS of a single replica is greater than 1.25. The configuration example is as follows:

    {
        "min": 1,
        "max": 2,
        "scaleStrategies": [
            {
                "metricName": "qps1k",
                "threshold": 1250
            }
        ]
    }
  • cpu: CPU Utilization Threshold.

  • gpu[util]: GPU Utilization Threshold.

  • queue[backlog]: Asynchronous Queue Length.

behavior.scaleUp.stabilizationWindowSeconds

Corresponds to the Scale-out Starts in parameter in the console.

behavior.scaleDown.stabilizationWindowSeconds

Corresponds to the Scale-in Starts in parameter in the console.
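
The qps1k conversion described under scaleStrategies is a multiplication by 1000. The Python sketch below shows one way to derive the value and build the strategy entry. The helper names are hypothetical, and the rejection of more than 2 decimal places simply mirrors the limit stated above.

```python
from decimal import Decimal


def qps_to_qps1k(qps):
    """Convert a per-replica QPS threshold (up to 2 decimals) to a qps1k value."""
    value = Decimal(str(qps))
    if value <= 0:
        raise ValueError("QPS threshold must be positive")
    if value != value.quantize(Decimal("0.01")):
        raise ValueError("at most 2 decimal places are supported")
    return int(value * 1000)


def qps1k_strategy(qps):
    """Build a scaleStrategies entry that uses the qps1k metric."""
    return {"metricName": "qps1k", "threshold": qps_to_qps1k(qps)}


print(qps_to_qps1k(1.25))   # 1250
print(qps1k_strategy(0.5))  # {'metricName': 'qps1k', 'threshold': 500}
```

Decimal is used instead of float arithmetic so that values such as 1.25 convert exactly.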

Disable horizontal auto scaling

Use the client

  • Command format

    eascmd autoscale rm [region]/[service_name]
  • Example

    eascmd autoscale rm cn-shanghai/test_autoscaler

Production best practices

Scenario-specific configuration guide

  • CPU-intensive online inference services: Configure both the CPU Utilization Threshold and the QPS Threshold of Individual Instance. CPU utilization reflects resource consumption and QPS reflects business load. Combining these metrics enables more precise scaling.

  • GPU-intensive online inference services: Focus primarily on the GPU Utilization Threshold. When GPU computing units are saturated, scale out promptly to allow the service to handle more concurrent tasks.

  • Asynchronous task processing services: Use the Asynchronous Queue Length as the core metric. When the number of backlogged tasks in the queue exceeds the threshold, scaling out increases processing capacity and reduces task waiting times.
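
As a rough illustration, the three scenarios above map onto scaleStrategies entries as follows. This Python sketch only assembles policy dictionaries in the scaler.json shape. The scenario keys, the helper name, and every threshold and replica count are placeholder assumptions, not recommendations; real values should come from stress tests.

```python
import json

# Metric names follow the scaler.json format; thresholds are placeholders.
SCENARIO_STRATEGIES = {
    "cpu_online_inference": [
        {"metricName": "cpu", "threshold": 80},
        {"metricName": "qps", "threshold": 10},
    ],
    "gpu_online_inference": [
        {"metricName": "gpu[util]", "threshold": 60},
    ],
    "async_tasks": [
        {"metricName": "queue[backlog]", "threshold": 10},
    ],
}


def build_policy(scenario, min_replicas=1, max_replicas=5):
    """Assemble a policy dict for one of the scenarios above."""
    return {
        "min": min_replicas,
        "max": max_replicas,
        "scaleStrategies": list(SCENARIO_STRATEGIES[scenario]),
    }


print(json.dumps(build_policy("cpu_online_inference"), indent=4))
```

The printed JSON can be saved to a file and applied with the -s option of the autoscale command.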

Stability best practices

  • Avoid scaling in to zero: For synchronous services in a production environment, always set the Minimum Replicas to 1 or higher to ensure continuous availability and low latency.

  • Set a reasonable delay: Use the Scale-in Starts in parameter to prevent service jitter caused by normal traffic fluctuations. The default value of 300 seconds is suitable for most scenarios.

FAQ

Why does my service not scale out even when the threshold is met?

Possible reasons include:

  • Insufficient resource quota: The available vCPU or GPU quota in your account for the current region is exhausted.

  • Scale-out delay is active: If you configured Scale-out Starts in, the system waits for this period to end to confirm that the traffic increase is sustained.

  • Replica health check failed: The newly scaled-out replicas failed their health checks, causing the operation to fail.

  • Maximum number of replicas reached: The current number of replicas has reached the configured Maximum Replicas limit.

Why does my service scale in and out frequently (jitter)?

This is usually caused by an improperly configured scaling policy:

  • Threshold is too sensitive: The threshold is set too close to the normal load level, causing minor fluctuations to trigger scaling events.

  • Scale-in delay is too short: A short delay period makes the system overreact to brief drops in traffic, leading to unnecessary scale-ins. When traffic recovers, another scale-out is immediately triggered. Increase the Scale-in Starts in value.

References