In scenarios that involve complex model inference, such as AI content generation or video processing, waiting for the result of a time-consuming inference service may cause requests to fail because of connection timeouts or may lead to load imbalance across service instances. To avoid these issues, Platform for AI (PAI) provides the asynchronous inference feature, which allows you to obtain inference results by subscribing to requests or by polling. This topic describes how to use an asynchronous inference service.
How it works

When you create an asynchronous inference service, the service is integrated with an inference subservice and a queue subservice. By default, each queue subservice creates two queues: an input queue and an output queue (sink queue). When the service receives a request, the request is sent to the input queue. The EAS service framework of the inference subservice automatically subscribes to the queue and obtains the request data in a streaming manner, calls the operations to process the request data, and writes the inference results to the output queue.
When the output queue is full, the service framework can no longer write data to the output queue. In this case, the service framework stops receiving data from the input queue.
If you want to bypass the output queue and deliver inference results to other services, such as Object Storage Service (OSS) or your own message middleware, configure your service to return an empty string as the response. Empty responses are automatically ignored and are not written to the output queue.
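For example, if your service is deployed by using a custom image, the request handler can upload the result to OSS and return an empty response body. The following Python sketch only illustrates this pattern; the Flask server, the OSS endpoint, bucket name, credentials, and the run_inference function are placeholders and are not provided by PAI.
# Minimal sketch: deliver results to OSS and return an empty string so that
# nothing is written to the output queue. The endpoint, bucket, credentials,
# and run_inference() are placeholders for this illustration.
import uuid
import oss2
from flask import Flask, request

app = Flask(__name__)
auth = oss2.Auth("<access_key_id>", "<access_key_secret>")
bucket = oss2.Bucket(auth, "https://oss-cn-shanghai.aliyuncs.com", "<your-bucket>")

def run_inference(payload: bytes) -> bytes:
    # Placeholder for your model inference logic.
    return payload

@app.route("/", methods=["POST"])
def handle():
    result = run_inference(request.get_data())
    # Deliver the result to OSS (or any other downstream service).
    bucket.put_object(f"results/{uuid.uuid4()}.bin", result)
    return ""  # An empty response is not written to the output queue.

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8000)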
A high-availability queue subservice receives client requests. Each inference subservice instance subscribes to requests within its concurrency limit, and the queue subservice ensures that the number of requests being processed on an instance does not exceed the subscription window size of that instance. This prevents the inference subservice instances from being overloaded and ensures that they return inference results as expected.
Note
For example, if each instance can process up to five audio streams, set the subscription window size to 5. After an instance finishes processing an audio stream and commits the result, the queue subservice pushes another audio stream to the instance. This ensures that the instance processes no more than five audio streams at the same time.
The queue subservice monitors the status of the connections to the inference subservice instances to assess the health of the instances. If an instance is unexpectedly disconnected, the queue subservice considers the instance unhealthy and redistributes its uncompleted requests to other healthy instances. This ensures that all requests are handled as expected.
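The following Python sketch illustrates the subscription mechanism that is described above. In an asynchronous inference service, the EAS service framework performs these steps automatically inside each inference subservice instance; the sketch assumes the eas_prediction Python SDK, and the endpoint, service name, token, and run_inference function are placeholders.
# Illustration of window-based subscription. Assumes the eas_prediction SDK;
# the endpoint, service name, and token are placeholders.
from eas_prediction import QueueClient

input_queue = QueueClient("xxx.cn-shanghai.pai-eas.aliyuncs.com", "<service_name>")
input_queue.set_token("<service_token>")
input_queue.init()

def run_inference(payload: bytes) -> bytes:
    # Placeholder for your model inference logic.
    return payload

# Subscribe with a window size of 5: at most 5 uncommitted entries are pushed
# to this subscriber at a time.
watcher = input_queue.watch(0, 5, auto_commit=False)
for entry in watcher.run():
    result = run_inference(entry.data)
    input_queue.commit(entry.index)  # Committing frees a slot in the window so that
                                     # the queue subservice can push the next entry.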
Create an asynchronous inference service
When you create an asynchronous inference service, the system automatically creates a service group that has the same name as the asynchronous inference service to facilitate usage. The system also automatically creates a queue subservice and integrates the queue subservice into the asynchronous inference service. By default, the queue subservice starts one instance and automatically scales based on the number of instances of the inference subservice. The queue subservice can start up to two instances. By default, each instance is equipped with 1 vCPU and 4 GB of memory. If the default number of instances for the queue subservice cannot meet your business requirements, you can configure the related parameters of the instances. For more information, see the Parameter configuration for a queue subservice section of this topic.
You can use one of the following methods to create an asynchronous inference service.
Use the PAI console
Use an EASCMD client
Prepare a service configuration file named service.json.
Deployment method: Deploy Service by Using Model and Processor.
{
  "processor": "pmml",
  "model_path": "http://example.oss-cn-shanghai.aliyuncs.com/models/lr.pmml",
  "metadata": {
    "name": "pmmlasync",
    "type": "Async",
    "cpu": 4,
    "instance": 1,
    "memory": 8000
  }
}
Take note of the type parameter, which must be set to Async to create an asynchronous inference service. For information about other parameters, see All Parameters of model services.
Deployment method: Select Deploy Service by Using Image.
{
  "metadata": {
    "name": "image_async",
    "instance": 1,
    "rpc.worker_threads": 4,
    "type": "Async"
  },
  "cloud": {
    "computing": {
      "instance_type": "ecs.gn6i-c16g1.4xlarge"
    }
  },
  "queue": {
    "cpu": 1,
    "min_replica": 1,
    "memory": 4000,
    "resource": ""
  },
  "containers": [
    {
      "image": "eas-registry-vpc.cn-beijing.cr.aliyuncs.com/pai-eas/chat-llm-webui:3.0.1",
      "script": "python webui/webui_server.py --port=8000 --model-path=Qwen/Qwen-7B-Chat",
      "port": 8000
    }
  ]
}
Take note of the following parameters. For information about other parameters, see All Parameters of model services.
type: Set the value to Async to create an asynchronous inference service.
instance: the number of instances of the inference subservice.
rpc.worker_threads: Specify the number of threads for the EAS service framework of the asynchronous inference service. The value is equal to the window size of the subscribed data in the input queue. In the preceding sample code, a value of 4 indicates that you can subscribe to up to four data entries from the queue at the same time. The queue subservice does not push new data to the inference subservice until the four data entries are processed.
For example, if a single inference subservice instance of a video stream processing service can process only two video streams at the same time, you can set this parameter to 2. This way, the queue subservice pushes the endpoints of up to two video streams to the inference subservice, and does not push new video stream endpoints until the inference subservice returns results. If the inference subservice finishes processing one of the video streams and returns a result, the queue subservice pushes a new video stream endpoint to the inference subservice instance. This ensures that the inference subservice instance can process up to two video streams at the same time.
Run the eascmd create service.json command to create the asynchronous inference service.
Access an asynchronous inference service
By default, the system creates a service group that has the same name as the asynchronous inference service. The queue subservice in the group serves as the data ingress of the group. You can access the queue subservice by using the endpoints in the following table. For more information, see Access the queue service.
Endpoint type | Endpoint format | Example
Input queue | {domain}/api/predict/{service_name} | xxx.cn-shanghai.pai-eas.aliyuncs.com/api/predict/{service_name}
Output queue | {domain}/api/predict/{service_name}/sink | xxx.cn-shanghai.pai-eas.aliyuncs.com/api/predict/{service_name}/sink
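The following Python sketch shows how a client can use these endpoints to send a request to the input queue and subscribe to the results in the output queue. It assumes the eas_prediction Python SDK; the endpoint, service name, token, and request payload are placeholders.
# Send a request to the input queue and subscribe to the output queue (sink).
# Assumes the eas_prediction SDK; endpoint, service name, token, and payload are placeholders.
from eas_prediction import QueueClient

endpoint = "xxx.cn-shanghai.pai-eas.aliyuncs.com"
service_name = "<service_name>"
token = "<service_token>"

# Input queue: {domain}/api/predict/{service_name}
input_queue = QueueClient(endpoint, service_name)
input_queue.set_token(token)
input_queue.init()
index, request_id = input_queue.put('{"prompt": "hello"}')
print(f"request {request_id} enqueued at index {index}")

# Output queue (sink): {domain}/api/predict/{service_name}/sink
sink_queue = QueueClient(endpoint, service_name + "/sink")
sink_queue.set_token(token)
sink_queue.init()
watcher = sink_queue.watch(0, 5, auto_commit=True)
for entry in watcher.run():
    print(entry.data.decode("utf-8"))  # One inference result per entry.
    break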
Manage asynchronous inference services
You can manage an asynchronous inference service in the same manner that you manage other services. The subservices of the asynchronous inference service are managed by the system. For example, when you delete an asynchronous inference service, the queue subservice and the inference subservice are also deleted. When you update the inference subservice, the update does not affect the queue subservice. This ensures service availability.
In addition to the instances that you configured for the inference subservice, the instance list in the subservice architecture also displays the instances of the queue subservice.

The number of queue subservice instances automatically changes based on the number of inference subservice instances. For example, if you increase the number of inference subservice instances to 3, the number of queue subservice instances increases to 2.

The following rules take effect between the two subservices, as summarized in the sketch after this list:
If the asynchronous inference service is stopped, the number of instances of the queue subservice and inference subservice decreases to 0. The instance list is empty.
If the number of instances of the inference subservice is 1, the number of instances of the queue subservice is also 1. You can modify the number of instances of the queue subservice by custom configuration.
If the number of instances of the inference subservice is 2 or more, the number of instances of the queue subservice remains 2. You can modify the number of instances of the queue subservice by custom configuration.
If you enable the automatic scaling feature for the asynchronous inference service and set the minimum number of instances to 0, the queue subservice retains one instance when the number of instances of the inference subservice decreases to 0.
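The following Python sketch summarizes the preceding rules. It is only an illustration derived from this topic, not a PAI API, and it assumes that you do not override the defaults with a custom queue.min_replica value.
# Illustration only: the default number of queue subservice instances that
# corresponds to a given number of inference subservice instances.
def default_queue_replicas(inference_replicas: int, service_stopped: bool = False) -> int:
    if service_stopped:
        return 0                        # Stopped service: the instance list is empty.
    if inference_replicas <= 0:
        return 1                        # Scaled to 0 by auto scaling: one queue instance is retained.
    return min(2, inference_replicas)   # 1 instance -> 1, 2 or more instances -> 2.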
Parameter configuration for a queue subservice
In most cases, you can use the default configuration for the queue subservice. If you have special requirements, you can configure the queue subservice by modifying the queue field in the JSON file. Sample file:
{
  "queue": {
    "sink": {
      "memory_ratio": 0.3
    },
    "source": {
      "auto_evict": true
    }
  }
}
The following sections describe specific configuration items.
Configure resources for the queue subservice
By default, the resources of the queue subservice are configured based on the metadata fields. To modify the resource configuration, configure the following parameters:
Specify the resource group used by the subservice by using the queue.resource parameter.
{
  "queue": {
    "resource": "eas-r-slzkbq4tw0p6xd****" # By default, the resource group of the inference subservice is used.
  }
}
By default, queue subservices use the resource group of inference subservices.
If you want to deploy the queue subservice in the public resource group, set the resource parameter to an empty string. This method is useful when the CPU and memory resources in your dedicated resource group are insufficient.
Note
We recommend that you use the public resource group to deploy the queue subservice.
Specify the number of CPU cores and the memory of each instance by using the queue.cpu and queue.memory parameters.
{
  "queue": {
    "cpu": 2, # Default value: 1.
    "memory": 8000 # Default value: 4000.
  }
}
By default, the system creates instances that have 1 vCPU and 4 GB of memory for the queue subservice. In most cases, this specification can meet business requirements.
Important
If the number of subscribers, such as the number of instances of an inference subservice, exceeds 200, we recommend that you set the number of CPU cores to a value greater than 2.
We recommend that you do not use small memory sizes in the production environment.
Specify the minimum number of queue subservice instances by using the queue.min_replica parameter.
{
  "queue": {
    "min_replica": 3 # Default value: 1.
  }
}
When you use an asynchronous inference service, the number of queue subservice instances is automatically adjusted based on the number of running inference subservice instances. The number of queue subservice instances stays within the range [1, min(2, Number of inference subservice instances)]. If you configure an auto scaling rule that allows the number of inference subservice instances to decrease to 0, the system still retains one instance of the queue subservice. To retain more instances, specify the minimum number of queue subservice instances by using the queue.min_replica parameter.
Note
You can increase the number of queue subservice instances to improve service availability. The number of instances does not affect service performance.
Configure queue subservice features
This section describes the feature configurations of a queue subservice.
Configure automatic data eviction for output or input queues by using the queue.sink.auto_evict or queue.source.auto_evict parameter.
{
  "queue": {
    "sink": {
      "auto_evict": true # Enable automatic eviction for the output queue. Default value: false.
    },
    "source": {
      "auto_evict": true # Enable automatic eviction for the input queue. Default value: false.
    }
  }
}
By default, automatic data eviction is disabled for the queues. When a queue reaches its maximum capacity, no more data can be written to it. In specific scenarios, you can enable automatic data eviction so that the queue automatically evicts the oldest data to make room for new data.
Configure the maximum number of data deliveries by using the queue.max_delivery parameter.
{
  "queue": {
    "max_delivery": 10 # Set the maximum number of data deliveries to 10. Default value: 5. If you set the value to 0, data can be delivered an unlimited number of times.
  }
}
If the number of times that a single data entry is delivered exceeds this value, the data is considered unprocessable and is handled based on the dead-letter policy. For more information, see the description of dead-letter policies below.
Specify the maximum data processing time by using the queue.max_idle parameter.
{
  "queue": {
    "max_idle": "1m" # Set the maximum processing time of a single data entry to 1 minute. If the processing time exceeds 1 minute, the entry is delivered to another subscriber and its delivery count increases by one. Default value: 0, which specifies that the processing time is unlimited.
  }
}
In this example, the parameter is set to 1 minute. You can specify other units, such as h for hours, m for minutes, and s for seconds. If the processing time of a single data entry exceeds the value of this parameter, one of the following situations occurs, as summarized in the sketch after this list:
If the number of times that the data entry has been delivered does not exceed the value of the queue.max_delivery parameter, the data is delivered to other subscribers.
If the number of times that the data entry has been delivered exceeds the value of the queue.max_delivery parameter, the system applies the dead-letter policy to the data.
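The following Python sketch summarizes this behavior. It is only an illustration derived from this topic, not PAI code.
# Illustration only: what happens when the processing time of a data entry exceeds max_idle.
def on_max_idle_exceeded(delivery_count: int, max_delivery: int, dead_message_policy: str = "Rear") -> str:
    if max_delivery == 0 or delivery_count <= max_delivery:
        return "redeliver"  # The entry is delivered to another subscriber and the count increases by one.
    # Otherwise, the dead-letter policy applies.
    return "drop" if dead_message_policy == "Drop" else "move to the end of the queue"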
Configure a dead-letter policy by using the queue.dead_message_policy parameter.
{
  "queue": {
    "dead_message_policy": "Rear" # Valid values: Rear and Drop. Rear specifies that the data is placed at the end of the queue. Drop specifies that the data is deleted. Default value: Rear.
  }
}
Configure the maximum queue length or data volume
The maximum queue length and the maximum size of a single data entry are related: the memory of a queue subservice instance is fixed, so the maximum queue length decreases as the maximum size of a single data entry increases.
Note
The default memory is 4 GB and the default maximum size of a single data entry is 8 KB. The system consumes about 10% of the total memory. With these defaults, the input and output queues can store 230,399 data entries. If you want to store more data entries in the queue subservice, increase the memory size as described in the preceding sections.
You cannot configure both the maximum length and maximum data volume for the same queue at the same time.
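The following back-of-the-envelope calculation reproduces the default capacity described in the preceding note. It assumes that about 10% of the memory is reserved for the system and that the remaining memory is split evenly between the input and output queues; the even split is an assumption for this illustration.
# Rough capacity estimate under the assumed defaults.
memory_bytes = 4000 * 1024 * 1024           # Default queue instance memory (4 GB).
usable_bytes = memory_bytes * 0.9           # The system consumes about 10% of the memory.
per_queue_bytes = usable_bytes / 2          # Assumed even split between the input and output queues.
entry_bytes = 8 * 1024                      # Default maximum size of a single data entry (8 KB).
print(int(per_queue_bytes // entry_bytes))  # Approximately 230,400, in line with the value above.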
Specify the maximum length of the output queue or input queue by using the queue.sink.max_length or queue.source.max_length parameter.
{
  "queue": {
    "sink": {
      "max_length": 8000 # Set the maximum length of the output queue to 8000 entries.
    },
    "source": {
      "max_length": 2000 # Set the maximum length of the input queue to 2000 entries.
    }
  }
}
Specify the maximum data volume of a single data entry in the output or input queue by using the queue.sink.max_payload_size_kb or queue.source.max_payload_size_kb parameter.
{
  "queue": {
    "sink": {
      "max_payload_size_kb": 10 # Set the maximum size of a single data entry in the output queue to 10 KB. Default value: 8 KB.
    },
    "source": {
      "max_payload_size_kb": 1024 # Set the maximum size of a single data entry in the input queue to 1,024 KB (1 MB). Default value: 8 KB.
    }
  }
}
Configure memory allocation
Configure automatic scaling
For information about how to configure automatic scaling for an asynchronous inference service, see Automatic scaling.