The system monitors the amount of data in queues and automatically adjusts the number of inference service instances accordingly. This topic describes how to configure automatic scaling for an asynchronous inference service.
How it works
In asynchronous inference scenarios, the system can automatically add or remove service instances based on the queue status. If the queue is empty, the system can remove all service instances to reduce costs. The following figure shows how the automatic scaling mechanism works in an asynchronous inference service.
Configure automatic scaling for the asynchronous inference service
Log on to the EASCMD client and run the following command to enable the auto scaling feature for the asynchronous inference service. For information about how to log on to the EASCMD client, see Download the EASCMD client and complete user authentication.
Syntax
eascmd autoscale <service_name> -Dmin=[attr_value] -Dmax=[attr_value] -Dstrategies.queue[avgbacklog]=[attr_value]
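In the preceding code:
<service_name>: the name of the asynchronous inference service.
min: the minimum number of instances to which the service can be scaled in.
max: the maximum number of instances to which the service can be scaled out.
queue[avgbacklog]: the average number of queued requests per service instance. This value is the threshold that triggers auto scaling for the queue service.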
Example
eascmd autoscale pmmlasync -Dmin=0 -Dmax=10 -Dstrategies.queue[avgbacklog]=10
In the preceding code:
queue[avgbacklog]=10: indicates that the scaling threshold is an average of 10 queued requests for each instance of the asynchronous inference service.
max=10: indicates that the asynchronous inference service can be scaled out to up to 10 instances.
min=0: indicates that the asynchronous inference service can be scaled in to zero instances.
For example, assume that the asynchronous inference service has three instances. If the number of requests in the queue exceeds 30 (more than 10 requests for each instance), a scale-out activity is triggered, and the service can be scaled out to up to 10 instances. If the number of requests in the queue is less than or equal to 30, a scale-in activity is triggered. If the queue is empty, the service can be scaled in to zero instances. The queue still functions as expected in this case, and the service is scaled out again when new requests are delivered to the queue.
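The thresholds described above can be sketched in a few lines of Python. This is only an illustration of the documented conditions, not the actual scaling logic of the service. The function name scaling_direction and its parameters are hypothetical, and the sketch does not model how many instances are added or removed in each activity, because this topic does not specify that.

```python
def scaling_direction(queued, instances, avg_backlog=10,
                      min_instances=0, max_instances=10):
    """Return 'scale-out', 'scale-in', or 'none' for the current queue state.

    Hypothetical helper that mirrors the thresholds described in this topic.
    It is not part of EASCMD or of the inference service itself.
    """
    if queued > instances * avg_backlog:
        # More queued requests than the per-instance threshold allows.
        return "scale-out" if instances < max_instances else "none"
    # At or below the threshold: scale in unless already at the minimum.
    return "scale-in" if instances > min_instances else "none"


if __name__ == "__main__":
    print(scaling_direction(31, 3))  # more than 30 requests on 3 instances
    print(scaling_direction(30, 3))  # exactly at the threshold
    print(scaling_direction(5, 0))   # new requests arrive at zero instances
    print(scaling_direction(0, 0))   # empty queue, already at the minimum
```

The defaults match the pmmlasync example (min=0, max=10, avgbacklog=10). Once the service reaches the configured maximum or minimum, the sketch reports that no further activity is possible in that direction.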
You can run the following command to configure waiting periods for scale-in or scale-out activities.
Syntax
eascmd autoscale <service_name> -Dbehavior.<attr_name>.stabilizationWindowSeconds=<attr_value>
In the preceding code:
<service_name>: the name of the asynchronous inference service.
<attr_name>: the value is scaleDown for scale-in activities and scaleUp for scale-out activities.
<attr_value>: the waiting period in seconds. The value must be an integer.
The default waiting period for scale-in activities is 300 seconds. We recommend that you do not set the waiting period for scale-in activities to a small value to prevent frequent scale-in activities.
The default waiting period for scale-out activities is 0 seconds. In most cases, we recommend that you set the waiting period for scale-out activities to a small value to scale out the service at the earliest opportunity when resources are exhausted. This helps prevent service interruptions.
Examples
Configure the waiting period for scale-in activities
eascmd autoscale pmmlasync -Dbehavior.scaleDown.stabilizationWindowSeconds=100
The system must wait 100 seconds before it can trigger a scale-in activity.
Configure the waiting period for scale-out activities
eascmd autoscale pmmlasync -Dbehavior.scaleUp.stabilizationWindowSeconds=100
The system must wait 100 seconds before it can trigger a scale-out activity.
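The effect of a waiting period can be illustrated with a small Python sketch. It assumes that the waiting period is counted from the moment the scaling condition is first met, which this topic does not state explicitly. The names DEFAULT_WINDOWS and may_trigger are hypothetical and exist only for this illustration.

```python
# Default waiting periods stated in this topic, in seconds.
DEFAULT_WINDOWS = {"scaleUp": 0, "scaleDown": 300}


def may_trigger(activity, seconds_waited, windows=None):
    """Return True once the waiting period for the activity has elapsed.

    Assumption: seconds_waited is measured from the moment the scaling
    condition was first met. Hypothetical helper, not part of EASCMD.
    """
    # Values passed in windows override the defaults for that activity only.
    effective = {**DEFAULT_WINDOWS, **(windows or {})}
    return seconds_waited >= effective[activity]
```

With the defaults, may_trigger("scaleUp", 0) is True because scale-out activities have no waiting period, whereas may_trigger("scaleDown", 100) is False until a shorter period is configured, for example by passing {"scaleDown": 100}.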
You can also use a configuration file to configure the waiting periods for scale-in and scale-out activities.
Syntax
eascmd autoscale <service_name> -s <scale.json>
In the preceding code:
<service_name>: the name of the asynchronous inference service.
<scale.json>: the configuration file. Sample file content:
{
  "behavior": {
    "scaleUp": {
      "stabilizationWindowSeconds": 20
    },
    "scaleDown": {
      "stabilizationWindowSeconds": 150
    }
  }
}
Example
eascmd autoscale pmmlasync -s scale.json
The system must wait 20 seconds before it can trigger a scale-out activity and 150 seconds before it can trigger a scale-in activity.
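If the configuration file is generated by a script, a minimal Python sketch such as the following could produce it. The function names build_scale_config and write_scale_config are hypothetical. The sketch covers only the behavior fields shown in the sample file, and the rejection of negative values is an assumption of this sketch rather than a documented rule.

```python
import json


def build_scale_config(scale_up_seconds, scale_down_seconds):
    """Build the behavior section shown in the sample configuration file."""
    for value in (scale_up_seconds, scale_down_seconds):
        # The waiting period must be an integer number of seconds.
        # Rejecting negative values is an assumption of this sketch.
        if isinstance(value, bool) or not isinstance(value, int) or value < 0:
            raise ValueError("waiting period must be a non-negative integer")
    return {
        "behavior": {
            "scaleUp": {"stabilizationWindowSeconds": scale_up_seconds},
            "scaleDown": {"stabilizationWindowSeconds": scale_down_seconds},
        }
    }


def write_scale_config(path, scale_up_seconds, scale_down_seconds):
    """Write the configuration to the given path as indented JSON."""
    config = build_scale_config(scale_up_seconds, scale_down_seconds)
    with open(path, "w", encoding="utf-8") as f:
        json.dump(config, f, indent=2)
```

Calling write_scale_config("scale.json", 20, 150) writes a file with the same content as the sample, which can then be passed to the -s option.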