You can configure auto scaling rules based on the total number of on-demand and provisioned instances, subject to limits on the instance scale-out speed. For provisioned instances, you can create scheduled scaling policies and water-level scaling policies to improve resource utilization.
Instance scaling behavior
Function Compute preferentially uses existing instances to process requests. If the existing instances are fully loaded, Function Compute creates new instances to process requests. As the number of invocations increases, Function Compute continues to create new instances until enough instances are created to handle requests or the upper limit is reached. The scale-out of instances is subject to the limitations of the scaling speed. For more information, see Limits on the scale-out speed of instances in different regions.
This section describes instance scaling behavior of on-demand and provisioned instances. After you configure provisioned instances for a function, a specific number of instances are reserved prior to function invocations so that execution of requests are not delayed by cold starts.
Scaling of on-demand instances
When the total number of instances or the scale-out speed of instances reaches the upper limit, Function Compute reports a throttling error with HTTP status code 429. The following figure shows how Function Compute performs throttling in a scenario in which the number of invocations rapidly increases.
①: Before the upper limit for burst instances is reached, Function Compute immediately creates instances when the number of requests increases. During this process, a cold start occurs but no throttling error is reported.
②: After the upper limit for burst instances is reached, the creation of new instances is limited by the instance growth rate. Throttling errors are reported for some requests.
③: When the upper limit of instances is reached, some requests are throttled.
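Because throttled requests fail with HTTP status code 429, callers typically retry them with backoff. The following sketch shows one common client-side pattern; `invoke` here is a placeholder for your own invocation call, not a Function Compute SDK method.

```python
import random
import time

def invoke_with_backoff(invoke, max_retries=5, base_delay=0.2):
    """Retry an invocation while it is throttled (HTTP 429).

    `invoke` is a hypothetical callable that performs the request and
    returns a dict with a "status_code" key.
    """
    for attempt in range(max_retries + 1):
        resp = invoke()
        if resp["status_code"] != 429:
            return resp
        if attempt == max_retries:
            break
        # Exponential backoff with jitter spreads out retries so that
        # the scale-out speed limit is less likely to be hit again.
        time.sleep(base_delay * (2 ** attempt) * (0.5 + random.random() / 2))
    return resp
```

Retries give Function Compute time to create new instances within the growth-rate limit, so short throttling bursts are often absorbed without surfacing errors to end users.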
Scaling of provisioned instances
If a large number of invocations arrive in a short period of time, the creation of a large number of instances is throttled, which causes requests to fail. Cold starts of new instances also increase request latency. To prevent these issues, you can use provisioned instances in Function Compute. Provisioned instances are reserved before invocations arrive.
The following figure shows the throttling behavior of provisioned instances in a scenario that has the same amount of traffic as the preceding figure.
①: Before the provisioned instances are fully loaded, requests are immediately processed. During this process, no cold start occurs and no throttling error is reported.
②: When the provisioned instances are fully loaded, Function Compute immediately creates instances before the upper limit for burst instances is reached. During this process, a cold start occurs but no throttling error is reported.
Limits on the scale-out speed of instances in different regions
| Region | Upper limit of burst instances | Upper limit of instance growth rate |
| --- | --- | --- |
| China (Hangzhou), China (Shanghai), China (Beijing), China (Zhangjiakou), and China (Shenzhen) | 300 | 300 per minute |
| Other regions | 100 | 100 per minute |
In the same region, the limits on instance scale-out speed for provisioned instances and on-demand instances are the same.
The scaling speed of GPU-accelerated instances is lower than that of elastic instances. We recommend that you use GPU-accelerated instances together with the provisioned mode.
If you need higher scaling speed, join the DingTalk group 64970014484 for technical support.
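The two limits in the preceding table combine as follows: instances up to the burst limit can be created immediately, and additional instances are created at the growth rate. The following sketch is an illustrative model of this behavior, not an official formula.

```python
import math

def max_instances(elapsed_minutes, burst_limit=300, growth_per_minute=300,
                  instance_cap=None):
    """Rough upper bound on the number of instances that can exist
    `elapsed_minutes` after a traffic surge starts.

    Defaults model the China regions in the table above (burst limit 300,
    growth rate 300 per minute). `instance_cap` is an optional total
    instance limit, such as a per-function maximum you configure.
    """
    n = burst_limit + math.floor(elapsed_minutes * growth_per_minute)
    return min(n, instance_cap) if instance_cap is not None else n
```

For example, two minutes into a surge in the China (Hangzhou) region, up to about 300 + 2 × 300 = 900 instances may have been created, unless a lower total limit applies.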
Auto scaling of provisioned instances
A fixed number of provisioned instances may lead to insufficient utilization of resources. You can configure scheduled scaling and water-level scaling to resolve this issue.
If you configure both a scheduled scaling policy and a water-level scaling policy, the larger of the values specified by the two policies is used as the number of provisioned instances.
Scheduled scaling
Scenarios
Scheduled instance scaling applies to functions that have noticeable cyclical patterns or predictable traffic peaks. If the number of concurrent invocations is greater than the concurrency capacity of the scheduled scaling policy, excess requests are sent to on-demand instances for processing.
Sample configuration
You can configure two scheduled scaling policies. The first one increases the number of provisioned instances before traffic surges. The second policy decreases the number of provisioned instances when the traffic declines. The following figure shows the details.
The following code snippet provides an example of the request parameters of the PutProvisionConfig operation that is called to create scheduled scaling policies. In this example, two scheduled scaling policies are configured for function_1. The time zone is set to Asia/Shanghai (UTC+8). The policies take effect from 10:00:00 on August 1, 2024 to 10:00:00 on August 30, 2024. They increase the number of provisioned instances to 50 at 20:00 every day and reduce the number to 10 at 22:00 every day.
"scheduledActions": [
{
"name": "scale_up_action",
"startTime": "2024-08-01T10:00:00",
"endTime": "2024-08-30T10:00:00",
"target": 50,
"scheduleExpression": "cron(0 0 20 * * *)",
"timeZone": "Asia/Shanghai"
},
{
"name": "scale_down_action",
"startTime": "2024-08-01T10:00:00",
"endTime": "2024-08-30T10:00:00",
"target": 10,
"scheduleExpression": "cron(0 0 22 * * *)",
"timeZone": "Asia/Shanghai"
}
]
Cron expressions
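A scheduled action sets the provisioned-instance count to its target when its cron expression fires, and that count persists until the next action fires. The following sketch models the effective target implied by the two sample actions above; the actual evaluation is performed by Function Compute, so this is only an illustration.

```python
def scheduled_target(hour, default=10):
    """Provisioned-instance target implied by the sample policies above:
    scale_up_action sets the target to 50 at 20:00 and scale_down_action
    sets it back to 10 at 22:00, Asia/Shanghai time (simplified model).
    """
    return 50 if 20 <= hour < 22 else default
```

Between 20:00 and 22:00 the function holds 50 provisioned instances to absorb the evening peak; outside that window it holds 10, which reduces idle resource costs.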
Water-level scaling
Scenarios
Function Compute periodically collects values of metrics, such as the concurrency utilization of provisioned instances or resource utilization of provisioned instances. Provisioned instances are scaled based on values of the metrics and the minimum and maximum numbers of provisioned instances you specify.
Sample configuration
Assume that you configure a water-level scaling policy in which you specify a concurrency utilization threshold for provisioned instances. When the traffic volume increases, the scale-out threshold is triggered and the system starts to scale out provisioned instances. The scale-out stops when the specified maximum number of instances is reached, and excess requests are sent to on-demand instances. When the traffic volume decreases, the scale-in threshold is triggered and the system starts to scale in provisioned instances. The following figure shows the details.
If you use water-level scaling, you must enable the instance-level metrics feature. Otherwise, the 400 InstanceMetricsRequired error is reported. For more information about how to enable instance-level metrics, see Configure instance-level metrics. The concurrency utilization metric includes only the concurrency of provisioned instances and excludes the concurrency of on-demand instances.
The concurrency utilization of provisioned instances is the ratio of the number of concurrent requests that provisioned instances are processing to their maximum concurrency. Valid values: [0,1].
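The utilization metric can be sketched as follows; the function name and parameters here are illustrative, not part of the Function Compute API.

```python
def provisioned_concurrency_utilization(in_flight_requests,
                                        provisioned_instances,
                                        per_instance_concurrency=1):
    """Ratio of concurrent requests handled by provisioned instances to
    their maximum concurrency (instances x per-instance concurrency).

    Clamped to 1.0 because requests beyond the provisioned capacity
    overflow to on-demand instances and are not counted by the metric.
    """
    capacity = provisioned_instances * per_instance_concurrency
    return min(in_flight_requests / capacity, 1.0)
```

For example, 100 provisioned instances that each handle one request at a time and currently process 30 requests have a utilization of 0.3.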
The following code snippet provides an example of the request parameters of the PutProvisionConfig operation that is called to create a water-level scaling policy. In this example, a water-level scaling policy is configured for function_1. The time zone is set to Asia/Shanghai (UTC+8). The policy takes effect from 10:00:00 on August 1, 2024 to 10:00:00 on August 30, 2024. The policy tracks the ProvisionedConcurrencyUtilization metric and triggers a scale-out when the concurrency utilization exceeds 60% and a scale-in when the concurrency utilization falls below 60%. The maximum number of provisioned instances is 100 and the minimum number is 10.
"targetTrackingPolicies": [
{
"name": "action_1",
"startTime": "2024-08-01T10:00:00",
"endTime": "2024-08-30T10:00:00",
"metricType": "ProvisionedConcurrencyUtilization",
"metricTarget": 0.6,
"minCapacity": 10,
"maxCapacity": 100,
"timeZone": "Asia/Shanghai"
}
]
Scaling principles
A relatively conservative scale-in process is achieved by using a scale-in coefficient, whose value is in the range (0,1]. The scale-in coefficient is a system parameter that slows down the scale-in speed. You do not need to set it. The target values for scaling operations are the smallest integers that are greater than or equal to the following calculation results:
Scale-out target = Number of current provisioned instances × (Current metric value/Specified utilization threshold)
Scale-in target = Number of current provisioned instances × Scale-in coefficient × (1 − Current metric value/Specified utilization threshold)
Example:
If the current metric value is 80%, the specified utilization threshold is 40%, and the number of current provisioned instances is 100, the target number of instances is 100 × (80%/40%) = 200. If the specified maximum number of provisioned instances is 200 or greater, the number of provisioned instances is increased to 200. This keeps the utilization close to the 40% threshold.
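The two formulas can be sketched as follows. The function names are illustrative, and because the scale-in coefficient is managed by the system, the `coefficient` parameter here is only a stand-in for that internal value.

```python
import math

def scale_out_target(current, metric, threshold, max_capacity):
    """Scale-out target: smallest integer >= current x (metric / threshold),
    clamped to the configured maximum number of provisioned instances."""
    return min(math.ceil(current * (metric / threshold)), max_capacity)

def scale_in_target(current, metric, threshold, min_capacity, coefficient=1.0):
    """Scale-in target per the formula above. `coefficient` stands in for
    the system-managed scale-in coefficient in (0, 1]."""
    target = math.ceil(current * coefficient * (1 - metric / threshold))
    return max(target, min_capacity)
```

With the worked example above (metric 80%, threshold 40%, 100 current instances, maximum 200), the scale-out target evaluates to 200.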
Maximum concurrency
The following items describe how to calculate the maximum number of concurrent invocations for different instance concurrency values:
A single instance processes a single request at a time
Maximum concurrency = Number of instances.
A single instance concurrently processes multiple requests at a time
Maximum concurrency = Number of instances × Instance concurrency.
For more information about the scenarios, benefits, configurations, and impacts of the instance concurrency feature, see Configure instance concurrency.
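The two cases reduce to a single expression, since an instance concurrency of 1 corresponds to processing one request at a time. This helper is illustrative, not part of any SDK.

```python
def max_concurrency(instance_count, instance_concurrency=1):
    """Maximum number of concurrent invocations: equal to the instance
    count when each instance handles one request at a time (concurrency 1),
    otherwise instance count x per-instance concurrency."""
    return instance_count * instance_concurrency
```

For example, 100 instances with an instance concurrency of 10 can serve up to 1,000 concurrent invocations.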
More information
For more information about the basic concepts and billing methods of on-demand instances and provisioned instances, see Instance types and usage modes.
For more information about how to improve resource utilization of provisioned instances, see Configure provisioned instances.
By default, all functions within an Alibaba Cloud account in the same region share the preceding scaling limits. To limit the number of instances for a function, you can specify an upper limit for concurrent instances. For more information, see Specify the maximum number of concurrent instances. After the maximum number of on-demand instances is specified, Function Compute returns a throttling error if the total number of running instances for the function reaches the specified limit.