In scenarios that involve complex model inference, such as AI content generation and video processing, waiting for the result of a time-consuming inference service may cause the request to fail due to connection timeouts, or may cause load imbalance between service instances. To avoid these issues, Platform for AI (PAI) provides the asynchronous inference feature, which allows you to obtain inference results by subscribing to requests or by polling. This topic describes how to use an asynchronous inference service.
Background information
Feature description
Asynchronous inference
In most cases, synchronous inference is used for online inference services because online inference requires real-time interaction. When synchronous inference is used, a client waits for the response and remains idle after sending a request.
Asynchronous inference is commonly used for time-consuming inference services to avoid issues that occur while a client waits for the response, such as a closed HTTP persistent connection or a client timeout. When asynchronous inference is used, a client no longer waits for the response after it sends a request to the server. Instead, the client periodically polls for the inference result or subscribes to the request so that the server automatically pushes the inference result to the client.
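The following Python sketch illustrates the client-side polling pattern described above. The endpoint, paths, and response fields are hypothetical placeholders used only to show the pattern; they are not the actual API of a PAI asynchronous inference service. For the queue-based workflow that PAI uses, see the access example later in this topic.

import time
import requests

ENDPOINT = "http://example-service.example.com"  # hypothetical endpoint, for illustration only
HEADERS = {"Authorization": "<service-token>"}   # placeholder token

# Submit the request and receive a task identifier immediately instead of blocking on the result.
resp = requests.post(f"{ENDPOINT}/submit", headers=HEADERS, json={"input": "..."})
task_id = resp.json()["task_id"]  # hypothetical response field

# Poll for the result instead of keeping one long-lived request open.
while True:
    result = requests.get(f"{ENDPOINT}/result/{task_id}", headers=HEADERS)
    if result.status_code == 200:
        print(result.json())
        break
    time.sleep(5)  # wait before the next poll to avoid busy-waiting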
Queue service
Quasi-real-time inference services, such as services that process short videos, video streams, and audio streams, and other complex graphics processing services, do not need to return inference results in real time. However, these services need to return inference results within a specific period of time. In this case, the following issues may occur:
You cannot use round robin as the load balancing algorithm. You need to distribute requests based on the actual load of each instance.
When an instance fails, other instances must take over the uncompleted tasks of the failed instance.
PAI provides a queue service framework to help resolve the preceding request distribution issues.
How the asynchronous inference service and queue service work
When you create an asynchronous inference service, the service is integrated with an inference subservice and a queue subservice. By default, each queue subservice creates two queues: an input queue and an output queue (sink queue). When the service receives a request, the request is sent to the input queue. The EAS service framework of the inference subservice automatically subscribes to the queue and obtains the request data in a streaming manner, calls the operations to process the request data, and writes the inference results to the output queue.
When the output queue is full, the service framework can no longer write data to the output queue. In this case, the service framework stops receiving data from the input queue.
If you want to skip the output queue and deliver the inference results to other services, such as Object Storage Service (OSS) or your message middleware, you can configure the service to return an empty string as the response. This way, the output queue is automatically ignored.
The queue subservice is highly available and receives client requests. A client subscribes to requests within its upper concurrency limit, and the queue subservice ensures that the number of requests being processed on each instance does not exceed the subscription window size of the client. This ensures that the instances of the inference subservice are not overloaded and can return the inference results to the client as expected.
Note: For example, if each instance can process up to five audio streams, set the subscription window size to 5. After an instance finishes processing an audio stream and commits the result, the queue subservice pushes another audio stream to the instance. This ensures that the instance processes no more than five audio streams at the same time.
The queue subservice checks the status of the connections between the inference subservice instances and the client to assess the health status of the instances. If the client is unexpectedly disconnected from an instance, the queue subservice considers the instance unhealthy and distributes the uncompleted requests to other healthy instances. This ensures that all requests are handled as expected.
Create an asynchronous inference service
When you create an asynchronous inference service, the system automatically creates a service group that has the same name as the asynchronous inference service to facilitate usage. The system also automatically creates a queue subservice and integrates the queue subservice into the asynchronous inference service. By default, the queue subservice starts one instance and automatically scales based on the number of instances of the inference subservice. The queue subservice can start up to two instances. By default, each instance is equipped with 1 vCPU and 4 GB of memory. If the default number of instances for the queue subservice cannot meet your business requirements, you can configure the related parameters of the instances. For more information, see the Parameter configuration for a queue subservice section of this topic.
You can use one of the following methods to create an asynchronous inference service.
Use the PAI console
Go to the Custom Deployment page and configure the following parameters. For information about other parameters, see Deploy a model service in the PAI console.
Deployment Method: Select Image-based Deployment or Processor-based Deployment.
Asynchronous Services: Turn on Asynchronous Services.
After you configure the parameters, click Deploy.
Use an EASCMD client
Prepare a service configuration file named service.json.
Deployment method: Deploy Service by Using Model and Processor.
{ "processor": "pmml", "model_path": "http://example.oss-cn-shanghai.aliyuncs.com/models/lr.pmml", "metadata": { "name": "pmmlasync", "type": "Async", "cpu": 4, "instance": 1, "memory": 8000 } }
Take note of the following parameters. For information about other parameters, see All Parameters of model services.
type: Set the value to Async to create an asynchronous inference service.
model_path: Replace the value with the endpoint of your model.
Deployment method: Select Deploy Service by Using Image.
{ "metadata": { "name": "image_async", "instance": 1, "rpc.worker_threads": 4, "type": "Async" }, "cloud": { "computing": { "instance_type": "ecs.gn6i-c16g1.4xlarge" } }, "queue": { "cpu": 1, "min_replica": 1, "memory": 4000, "resource": "" }, "containers": [ { "image": "eas-registry-vpc.cn-beijing.cr.aliyuncs.com/pai-eas/chat-llm-webui:3.0.1", "script": "python webui/webui_server.py --port=8000 --model-path=Qwen/Qwen-7B-Chat", "port": 8000 } ] }
Take note of the following parameters. For information about other parameters, see All Parameters of model services.
type: Set the value to Async to create an asynchronous inference service.
instance: the number of instances of the inference subservice.
rpc.worker_threads: the number of threads of the EAS service framework of the asynchronous inference service. The value also determines the subscription window size for the input queue. In the preceding sample code, the value 4 indicates that up to four data entries can be subscribed to from the queue at the same time. The queue subservice does not push new data to the inference subservice until the four data entries are processed.
For example, if a single inference subservice instance of a video stream processing service can process only two video streams at the same time, you can set this parameter to 2. This way, the queue subservice pushes the endpoints of up to two video streams to the inference subservice, and does not push new video stream endpoints until the inference subservice returns results. If the inference subservice finishes processing one of the video streams and returns a result, the queue subservice pushes a new video stream endpoint to the inference subservice instance. This ensures that the inference subservice instance can process up to two video streams at the same time.
Create an asynchronous inference service.
Log on to the EASCMD client and run the create command to create an asynchronous inference service. For more information about how to log on to the EASCMD client, see Download the EASCMD client and complete user authentication.
eascmd create service.json
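After the command succeeds, you can check the status of the service. The following usage sketch assumes the service name pmmlasync from the preceding model-and-processor example; replace it with the value of the metadata.name field in your service.json file.

eascmd desc pmmlasync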
Access an asynchronous inference service
By default, the system creates a service group that has the same name as the asynchronous inference service, and the queue subservice serves as the data ingress of the group. The queue subservice exposes two endpoints: the input queue endpoint, to which you send requests, and the output queue (sink queue) endpoint, from which you obtain inference results. For the endpoint formats and examples, see Access the queue service.
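The following Python sketch shows one way to write requests to the input queue and subscribe to results from the output queue. It assumes the QueueClient interface of the eas_prediction SDK; the endpoint, queue names, token, and request payload are placeholders, and the queue paths of your service may differ. For the authoritative usage, see Access the queue service.

from eas_prediction import QueueClient

# Placeholders; replace them with the values of your own service.
ENDPOINT = '<your-endpoint>.<region>.pai-eas.aliyuncs.com'
INPUT_QUEUE = '<service_name>'        # input queue of the asynchronous service (assumed path)
SINK_QUEUE = '<service_name>/sink'    # output queue (sink queue) of the service (assumed path)
TOKEN = '<your-service-token>'

# Write a request to the input queue.
input_queue = QueueClient(ENDPOINT, INPUT_QUEUE)
input_queue.set_token(TOKEN)
input_queue.init()
index, request_id = input_queue.put('[{"prompt": "hello"}]')

# Subscribe to the output queue to receive inference results.
sink_queue = QueueClient(ENDPOINT, SINK_QUEUE)
sink_queue.set_token(TOKEN)
sink_queue.init()
watcher = sink_queue.watch(0, 5, auto_commit=False)  # subscription window size of 5
for entry in watcher.run():
    print(entry.data.decode('utf-8'))
    sink_queue.commit(entry.index)  # commit the entry so that it is not delivered again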
Manage asynchronous inference services
You can manage an asynchronous inference service in the same manner that you manage other services. The subservices of the asynchronous inference service are managed by the system. For example, when you delete an asynchronous inference service, the queue subservice and the inference subservice are also deleted. When you update the inference subservice, the update does not affect the queue subservice. This ensures service availability.
In addition to the instances that you configure for the inference subservice, the instance list of the service displays the instances of the queue subservice.
The total number of instances of an asynchronous inference service varies based on the number of inference subservice instances, because the number of queue subservice instances automatically changes with the number of inference subservice instances. For example, if you increase the number of inference subservice instances to 3, the number of queue subservice instances increases to 2.
The following rules take effect between the two subservices:
If the asynchronous inference service is stopped, the number of instances of the queue subservice and inference subservice decreases to 0. The instance list is empty.
If the number of instances of the inference subservice is 1, the number of instances of the queue subservice is also 1. You can modify the number of queue subservice instances by using a custom configuration.
If the number of instances of the inference subservice is 2 or more, the number of instances of the queue subservice remains 2. You can modify the number of queue subservice instances by using a custom configuration.
If you enable the automatic scaling feature for the asynchronous inference service and set the minimum number of instances to 0, the queue subservice retains one instance when the number of instances of the inference subservice decreases to 0.
Parameter configuration for a queue subservice
In most cases, you can use the default configuration for the queue subservice. If you have special requirements, you can configure the queue subservice by modifying the queue field in the JSON file. Sample file:
{
  "queue": {
    "sink": {
      "memory_ratio": 0.3
    },
    "source": {
      "auto_evict": true
    }
  }
}
The following sections describe specific configuration items.
Configure resources for the queue subservice
By default, the resources of queue subservices are configured based on metadata fields. To modify the configuration of the resources, perform the following steps:
Specify the resource group used by the subservice by using the queue.resource parameter.
{ "queue": { "resource": eas-r-slzkbq4tw0p6xd**** # By default, the resource group of the inference subservice is used. } }
By default, queue subservices use the resource group of inference subservices.
If you want to use a public resource group to deploy the queue subservice, you can leave the resource parameter empty. You can use this method when the CPU and memory resources in your dedicated resource group are insufficient.
Note: We recommend that you use the public resource group to deploy the queue subservice.
Specify the number of CPU cores and the memory of each instance by using the queue.cpu and queue.memory parameters.
{ "queue": { "cpu": 2, # Default value: 1. "memory": 8000 # Default value: 4000. } }
By default, the system creates instances that have 1 vCPU and 4 GB of memory for the queue subservice. In most cases, this specification can meet business requirements.
Important: If the number of subscribers, such as the number of instances of an inference subservice, exceeds 200, we recommend that you set the number of CPU cores to a value greater than 2. In addition, we recommend that you do not use small memory sizes in the production environment.
Specify the minimum number of queue subservice instances by using the queue.min_replica parameter.
{ "queue": { "min_replica": 3 # Default value: 1. } }
When you use an asynchronous inference service, the number of queue subservice instances is automatically adjusted based on the number of running instances of the inference subservice and stays within the following range: [1, min(2, Number of inference subservice instances)]. If you configure an auto scaling rule that allows the number of inference subservice instances to decrease to 0, the system still retains one queue subservice instance. To raise the minimum number of queue subservice instances, set the queue.min_replica parameter.
Note: You can increase the number of queue subservice instances to improve service availability. The number of instances does not affect service performance.
Configure queue subservice features
This section describes the feature configurations of a queue subservice.
Configure automatic data eviction for output or input queues by using the queue.sink.auto_evict or queue.source.auto_evict parameter.
{ "queue": { "sink": { "auto_evict": true # Enable automatic eviction for the output queue. Default value: false. }, "source": { "auto_evict": true # Enable automatic eviction for the input queue. Default value: false. } } }
By default, automatic data eviction is disabled for the queues. When a queue reaches the upper limit of its capacity, data can no longer be written to the queue. In specific scenarios, you can enable automatic data eviction so that the queue automatically evicts the oldest data to make room for new data.
Configure the maximum number of data deliveries by using the queue.max_delivery parameter.
{ "queue": { "max_delivery": 10 # Set the maximum number of data deliveries to 10. Default value: 5. If you set the value to 0, data can be delivered for an unlimited number of times. } }
If the number of times that a single data entry is delivered exceeds this value, the data entry is considered unprocessable and is handled as a dead-letter message. For more information, see dead-letter policies.
Specify the maximum data processing time by using the queue.max_idle parameter.
{ "queue": { "max_idle": "1m" # Set the maximum processing time of a single data entry to 1 minute. If the processing time exceeds 1 minute, the data entry is delivered to other subscribers. After the data is delivered, the delivery count increases by one. # The default value is 0, which specifies that the data processing time has no limit. } }
In this example, the parameter is set to 1 minute. You can use other units, such as h for hours, m for minutes, and s for seconds. If the processing time of a single data entry exceeds the value of this parameter, one of the following situations occurs:
If the delivery count of the data entry does not exceed the value of the queue.max_delivery parameter, the data entry is delivered to another subscriber.
If the delivery count of the data entry exceeds the value of the queue.max_delivery parameter, the system applies the dead-letter policy to the data entry.
Configure a dead-letter policy by using the queue.dead_message_policy parameter.
{ "queue": { "dead_message_policy": Valid values: Rear and Drop. The value Rear indicates that the data is placed in the tail queue. The value Drop indicates that the data is deleted. # Default value: Rear. } }
Configure the maximum queue length or data volume
The maximum queue length and the maximum size of a single data entry are related as follows:
Maximum queue length = Instance memory × 90% ÷ 2 ÷ Maximum size of a single data entry
The memory of a queue subservice instance is fixed, and the system reserves approximately 10% of the memory. By default, the remaining memory is equally divided between the input and output queues. Therefore, the larger the maximum size of a single data entry, the shorter the maximum queue length. With the default memory of 4,000 MB and the default maximum data entry size of 8 KB, each of the input and output queues can store approximately 230,399 data entries. If you want to store more data entries in the queue subservice, increase the memory size as described in the preceding section.
You cannot configure both the maximum length and the maximum size of a single data entry for the same queue at the same time.
Specify the maximum length of the output queue or input queue by using the queue.sink.max_length or queue.source.max_length parameter.
{ "queue": { "sink": { "max_length": 8000 # Set the maximum length of the output queue to 8000 entries. }, "source": { "max_length": 2000 # Set the maximum length of the input queue to 2000 entries. } } }
Specify the maximum size of a single data entry in the output or input queue by using the queue.sink.max_payload_size_kb or queue.source.max_payload_size_kb parameter.
{ "queue": { "sink": { "max_payload_size_kb": Set the maximum size of a single data entry in the output queue to 10 KB. Default value: 8 KB. }, "source": { "max_payload_size_kb": 1024 # Set the maximum size of a single data entry in the input queue to 1024KB (1 MB). Default value: 8 KB. } } }
Configure memory allocation
Adjust the memory size occupied by the input and output queues by using the queue.sink.memory_ratio parameter.
{ "queue": { "sink": { "memory_ratio": 0.9 # Specify the memory ratio of the output queue. Default value: 0.5. } } }
Note: By default, the memory of a queue subservice instance is equally divided between the input and output queues. If the input of your service is text and the output is images, and you want to store more data in the output queue, increase the value of the queue.sink.memory_ratio parameter. If the input of your service is images and the output is text, decrease the value of the queue.sink.memory_ratio parameter.
Configure automatic scaling
For information about how to configure automatic scaling for an asynchronous inference service, see Automatic scaling.