In large language model (LLM) scenarios, resource demand is difficult to predict and load is often unevenly distributed across backend inference instances. To resolve these issues, Elastic Algorithm Service (EAS) provides the LLM Intelligent Router component, which dynamically distributes requests at the request scheduling layer based on LLM-specific metrics. This helps ensure an even allocation of computing power and GPU memory across backend inference instances and improves the resource utilization of clusters.
Background information
In LLM scenarios, the amount of GPU resources that a single request occupies is uncertain because the lengths of user requests and model responses vary and the number of tokens that the model processes in the prefill (input) and generation (output) phases is unpredictable. The load balancing policies of traditional gateways, such as Round Robin and Least Connections, cannot detect the real-time load on backend computing resources. As a result, the load on backend inference instances becomes unbalanced, which degrades system throughput and response latency. In particular, long-tail requests that are time-consuming, compute-intensive, or memory-intensive exacerbate the uneven resource allocation and reduce overall cluster performance.
To resolve the preceding issues, EAS provides LLM Intelligent Router as a basic component at the request scheduling layer. The component dynamically distributes requests based on metrics for LLM scenarios. This ensures an even allocation of computing power and GPU memory across inference instances, and significantly improves the efficiency and stability of cluster resources.
LLM Intelligent Router significantly improves the speed and throughput of inference services. For more information, see Appendix 1: Comparison of test results.
How it works
LLM Intelligent Router is essentially a special EAS service that can intelligently schedule requests to backend inference services. LLM Intelligent Router is associated with inference instances through service groups. The following section describes how LLM Intelligent Router works:
By default, an LLM Intelligent Router service has a built-in LLM scheduler, which collects metrics from the inference instances and uses a scheduling algorithm to select the globally optimal instance based on these metrics (see the illustrative sketch after this list). LLM Intelligent Router then forwards requests to that instance. For more information about the Metrics APIs, see Appendix 2: Metrics APIs.
The LLM scheduler also establishes keepalive connections with inference instances. If an exception occurs in an inference instance, the LLM scheduler can immediately detect the exception and stop distributing traffic to the instance.
The LLM gateway forwards requests as instructed by the LLM scheduler. Server-Sent Events (SSE) over HTTP and the WebSocket protocol are supported.
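The exact scheduling algorithm is not described in this topic. The following Python sketch is purely illustrative: it shows one possible least-load policy that scores instances by the kinds of metrics listed in the monitoring section, such as GPU cache usage and in-flight requests. The metric names, weights, and data structure are assumptions, not the actual LLM scheduler implementation.
from dataclasses import dataclass

@dataclass
class InstanceMetrics:
    """Illustrative per-instance metrics; field names are assumptions."""
    name: str
    gpu_cache_usage: float   # fraction of KV cache memory in use, 0.0-1.0
    running_requests: int    # requests currently being processed
    waiting_requests: int    # requests queued on the engine

def pick_instance(instances: list[InstanceMetrics]) -> InstanceMetrics:
    """Select the least-loaded instance by a simple weighted score.

    Conceptual example only; the real LLM scheduler may use different
    metrics, weights, and selection logic.
    """
    def score(m: InstanceMetrics) -> float:
        return m.gpu_cache_usage + 0.1 * m.running_requests + 0.2 * m.waiting_requests

    return min(instances, key=score)

# Example: the instance with the lowest combined load is selected.
candidates = [
    InstanceMetrics("instance-a", gpu_cache_usage=0.8, running_requests=6, waiting_requests=2),
    InstanceMetrics("instance-b", gpu_cache_usage=0.3, running_requests=2, waiting_requests=0),
]
print(pick_instance(candidates).name)  # instance-b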
Limits
LLM Intelligent Router applies to LLM inference scenarios. Only BladeLLM or vLLM can be used as the inference framework for backend instances.
LLM Intelligent Router provides benefits only when it is deployed in the same service group as multiple inference instances.
Deploy services
Deploy an LLM Intelligent Router service
The following deployment methods are supported:
Log on to the PAI console. Select a region on the top of the page. Then, select the desired workspace and click Enter Elastic Algorithm Service (EAS).
On the Elastic Algorithm Service (EAS) page, click Deploy Service. On the Deploy Service page, select one of the following deployment methods:
Choose Custom Model Deployment > Custom Deployment.
Choose Scenario-based Model Deployment > LLM Deployment.
In the Features section, turn on LLM Intelligent Router. Then, click Create LLM Intelligent Router in the drop-down list.
In the Create LLM Intelligent Router panel, configure the parameters and click Deploy. The following table describes the parameters.
Parameter
Description
Basic Information
Service Name
Specify a name for the LLM Intelligent Router service. Example: llm_router.
Resource Configuration
Deployment
Configure resources for the LLM Intelligent Router service. Default configurations:
Minimum Instances: 2. To ensure that the LLM Intelligent Router service can run on multiple instances, we recommend that you set Minimum Instances to at least 2.
CPU: 2 Cores.
Memory: 4 GB.
Schedule Resource
Configure scheduling resources for the LLM scheduler. Default configurations:
CPU: 2 Cores.
Memory: 4 GB.
Inference Acceleration
Select the inference framework that you use in the image. An LLM gateway supports the following two frameworks:
BladeLLM Inference Acceleration
Open-source vLLM Inference Acceleration
After the LLM Intelligent Router service is deployed, a service group is created at the same time. You can view the service group on the Canary Release tab of the Elastic Algorithm Service (EAS) page. The service group is named in the format group_<name of the LLM Intelligent Router service>.
LLM Intelligent Router conflicts with service queues. Do not add both to the same service group.
Log on to the PAI console. Select a region on the top of the page. Then, select the desired workspace and click Enter Elastic Algorithm Service (EAS).
On the Elastic Algorithm Service (EAS) page, click Deploy Service. In the Custom Model Deployment section of the Deploy Service page, click JSON Deployment.
In the configuration editor section of the JSON Deployment page, configure the parameters and click Deploy.
The following sample code provides examples:
To prevent a single point of failure (SPOF), we recommend that you set the metadata.instance parameter to at least 2 to ensure that LLM Intelligent Router can run on multiple instances.
Basic configurations:
{ "cloud": { "computing": { "instance_type": "ecs.c7.large" } }, "metadata": { "type": "LLMGatewayService", "cpu": 4, "group": "llm_group", "instance": 2, "memory": 4000, "name": "llm_router" } }
To deploy an LLM Intelligent Router service, set the metadata.type parameter to LLMGatewayService. For information about other parameters, see Parameters of model services. After the service is deployed, EAS automatically creates a composite service that contains LLM Intelligent Router and an LLM scheduler. LLM Intelligent Router uses the resource configurations that you specify for the LLM Intelligent Router service. The default resource configurations of the LLM scheduler are 4 vCPUs and 4 GiB of memory.
Advanced configurations: If the basic configurations cannot meet your business requirements, you can prepare a JSON file to specify the following advanced configurations.
{ "cloud": { "computing": { "instance_type": "ecs.c7.large" } }, "llm_gateway": { "infer_backend": "vllm", "max_queue_size": 128, "retry_count": 2, "wait_schedule_timeout": 5000, "wait_schedule_try_period": 500 }, "llm_scheduler": { "cpu": 4, "memory": 4000 }, "metadata": { "cpu": 2, "group": "llm_group", "instance": 2, "memory": 4000, "name": "llm_router", "type": "LLMGatewayService" } }
The following table describes the key parameters. For information about other parameters, see Parameters of model services.
Parameter
Description
llm_gateway.infer_backend
The inference framework used by the LLM. Valid values:
vllm (default)
bladellm
llm_gateway.max_queue_size
The maximum length of the cache queue in LLM Intelligent Router. Default value: 128.
If the processing capability of the backend inference framework is exceeded, LLM Intelligent Router caches requests in the queue and forwards the cached requests when an inference instance is available.
llm_gateway.retry_count
The number of retries. Default value: 2. If the backend inference instance to which a request is forwarded is abnormal, LLM Intelligent Router tries to forward the request to another instance.
llm_gateway.wait_schedule_timeout
The timeout period. Default value: 5000. Unit: milliseconds. If the LLM scheduler remains unavailable after this timeout period elapses, LLM Intelligent Router falls back to a simple Round Robin policy to distribute requests.
llm_gateway.wait_schedule_try_period
The interval at which LLM Intelligent Router retries to connect to the LLM scheduler during the timeout period specified by the wait_schedule_timeout parameter. Default value: 500. Unit: milliseconds.
llm_scheduler.cpu
The number of vCPUs of the LLM scheduler. Default value: 4.
llm_scheduler.memory
The memory of the LLM scheduler. Default value: 4. Unit: GiB.
llm_scheduler.instance_type
The instance type of the LLM scheduler. An instance type defines the number of vCPUs and the memory size. If you specify this parameter, you do not need to separately specify the number of vCPUs and the memory size, as shown in the following example.
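For example, to size the LLM scheduler by instance type instead of by vCPUs and memory, replace the llm_scheduler block in the preceding advanced configuration with the following sketch. The instance type shown here is only an illustration; choose one that fits your workload.
{
  "llm_scheduler": {
    "instance_type": "ecs.c7.xlarge"
  }
}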
Deploy an LLM service
The following deployment methods are supported:
Log on to the PAI console. Select a region on the top of the page. Then, select the desired workspace and click Enter Elastic Algorithm Service (EAS).
On the Elastic Algorithm Service (EAS) page, click Deploy Service.
On the Deploy Service page, select one of the following deployment methods and configure the key parameters. For information about other parameters, see Deploy a model service in the PAI console.
In the Custom Model Deployment section, select Custom Deployment. On the Create Service page, configure the key parameters described in the following table.
Parameter
Description
Environment Information
Deployment Method
Select Image-based Deployment and Enable Web App.
Select Image
Alibaba Cloud images and custom images are supported.
If you select Alibaba Cloud Image, use chat-llm-webui:3.0-vllm or chat-llm-webui:3.0-blade.
If you select Image Address, enter the address of a custom image in the field. The inference framework of the custom image must be BladeLLM or vLLM.
Features
LLM Intelligent Router
Turn on LLM Intelligent Router and select the LLM Intelligent Router service that you deployed.
In the Scenario-based Model Deployment section, click LLM Deployment. On the LLM Deployment page, configure the key parameters described in the following table.
Parameter
Description
Basic Information
Version
Select one of the two versions:
Open-source Model Quick Deployment: When using this version, select a model type that supports vLLM or BladeLLM.
High-performance Deployment: Use the BladeLLM engine for quick deployment. You must select an image version and configure the model settings.
Service Configuration
LLM Intelligent Router
Turn on LLM Intelligent Router and select the LLM Intelligent Router service that you deployed.
After you configure the parameters, click Deploy.
In this example, the open source vLLM-0.3.3 image, which is a built-in image provided by PAI, is used. Perform the following steps:
On the Elastic Algorithm Service (EAS) page, click Deploy Service. In the Custom Model Deployment section of the Deploy Service page, click JSON Deployment.
In the configuration editor section of the JSON Deployment page, configure the parameters and click Deploy.
The following sample code provides an example:
{ "cloud": { "computing": { "instance_type": "ecs.gn7i-c16g1.4xlarge" } }, "containers": [ { "image": "eas-registry-vpc.<regionid>.cr.aliyuncs.com/pai-eas/chat-llm:vllm-0.3.3", "port": 8000, "script": "python3 -m vllm.entrypoints.openai.api_server --served-model-name llama2 --model /huggingface/models--meta-llama--Llama-2-7b-chat-hf/snapshots/c1d3cabadba7ec7f1a9ef2ba5467ad31b3b84ff0/" } ], "features": { "eas.aliyun.com/extra-ephemeral-storage": "50Gi" }, "metadata": { "cpu": 16, "gpu": 1, "instance": 5, "memory": 60000, "group": "llm_group", "name": "vllm_service" } }
The following table describes the key parameters. For information about other parameters, see Parameters of model services.
metadata.group: the name of the service group to which the LLM service belongs. The LLM service must belong to the same service group as the LLM Intelligent Router service. This way, the LLM service can register with the LLM scheduler and report related metrics, and LLM Intelligent Router can forward traffic.
If you deployed the LLM Intelligent Router service in the console, view the name of its service group on the Canary Release tab of the Elastic Algorithm Service (EAS) page. The service group is named in the format group_<name of the LLM Intelligent Router service>.
If you deployed the LLM Intelligent Router service by using JSON, specify the group name from its configuration, which is llm_group in this example (see the fragment after this list).
containers.image: In this example, the preset image provided by PAI is used. You must replace <regionid> with the ID of the region in which you want to deploy the service. For example, if you want to deploy the service in the China (Beijing) region, replace <regionid> with cn-beijing.
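The following fragment shows how the two configurations relate: the metadata.group value of the LLM service must match the group of the LLM Intelligent Router service. The values are taken from the JSON examples in this topic, and only the relevant metadata fields are shown.
LLM Intelligent Router service:
"metadata": { "group": "llm_group", "name": "llm_router", "type": "LLMGatewayService" }
LLM service:
"metadata": { "group": "llm_group", "name": "vllm_service" }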
Access an LLM Intelligent Router service
Obtain the endpoint and token for accessing an LLM Intelligent Router service
Log on to the PAI console. Select a region on the top of the page. Then, select the desired workspace and click Enter Elastic Algorithm Service (EAS).
Find the LLM Intelligent Router service that you deployed and click Invocation Method in the Service Type column.
On the Public Endpoint tab of the Invocation Method dialog box, view the endpoint and token for accessing the service.
Configure the endpoint for accessing the LLM Intelligent Router service.
Configuration rule:
{domain}/api/predict/{group_name}.{router_service_name}_llm_gateway/{endpoint}
Replace {endpoint} with the API endpoint that is supported by your LLM service, such as v1/completions.
Example:
In this example, the LLM Intelligent Router service that is deployed by using JSON is used. If the endpoint obtained in Step 3 is http://175805416243****.cn-beijing.pai-eas.aliyuncs.com/api/predict/llm_group.llm_router, the endpoint for accessing the LLM Intelligent Router service is http://175805416243****.cn-beijing.pai-eas.aliyuncs.com/api/predict/llm_group.llm_router_llm_gateway/v1/completions.
Test the access
In the terminal, run the following command to access the LLM Intelligent Router service:
curl -H "Authorization: xxxxxx" \
     -H "Content-Type: application/json" \
     http://********.cn-beijing.pai-eas.aliyuncs.com/api/predict/{group_name}.{router_service_name}_llm_gateway/v1/completions \
     -d '{"model": "llama2", "prompt": "I need your help writing an article. I will provide you with some background information to begin with. And then I will provide you with directions to help me write the article.", "temperature": 0.0, "best_of": 1, "n_predict": 34, "max_tokens": 34, "stream": true}'
In the preceding command:
"Authorization: xxxxxx": Specify the token obtained in the preceding step.
http://********.cn-beijing.pai-eas.aliyuncs.com/api/predict/{group_name}.{router_service_name}_llm_gateway/v1/completions: Replace the value with the endpoint obtained in the preceding step.
Sample response:
data: {"id":"0d9e74cf-1025-446c-8aac-89711b2e****","choices":[{"finish_reason":"","index":0,"logprobs":null,"text":"\n"}],"object":"text_completion","usage":{"prompt_tokens":36,"completion_tokens":1,"total_tokens":37},"error_info":null}
data: {"id":"0d9e74cf-1025-446c-8aac-89711b2e****","choices":[{"finish_reason":"","index":0,"logprobs":null,"text":"\n"}],"object":"text_completion","usage":{"prompt_tokens":36,"completion_tokens":2,"total_tokens":38},"error_info":null}
data: {"id":"0d9e74cf-1025-446c-8aac-89711b2e****","choices":[{"finish_reason":"","index":0,"logprobs":null,"text":""}],"object":"text_completion","usage":{"prompt_tokens":36,"completion_tokens":3,"total_tokens":39},"error_info":null}
...
[DONE]
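You can also call the same endpoint programmatically. The following Python sketch sends the request shown above and reads the SSE stream line by line. It assumes that the requests package is installed; the endpoint and token are placeholders that you must replace with the values obtained earlier.
import json
import requests

# Placeholders: replace with the endpoint and token obtained in the preceding steps.
ENDPOINT = "http://********.cn-beijing.pai-eas.aliyuncs.com/api/predict/{group_name}.{router_service_name}_llm_gateway/v1/completions"
TOKEN = "xxxxxx"

payload = {
    "model": "llama2",
    "prompt": "I need your help writing an article. I will provide you with some background information to begin with.",
    "temperature": 0.0,
    "max_tokens": 34,
    "stream": True,
}

# Stream the response and print each generated text fragment.
with requests.post(
    ENDPOINT,
    headers={"Authorization": TOKEN, "Content-Type": "application/json"},
    json=payload,
    stream=True,
) as response:
    response.raise_for_status()
    for line in response.iter_lines(decode_unicode=True):
        if not line:
            continue
        if line.startswith("data: "):
            line = line[len("data: "):]
        if line.strip() == "[DONE]":
            break
        chunk = json.loads(line)
        print(chunk["choices"][0]["text"], end="", flush=True)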
View service monitoring metrics
After the test is complete, you can view the service monitoring metrics to understand the performance of the service. Perform the following steps:
On the Elastic Algorithm Service (EAS) page, find the LLM Intelligent Router service that you deployed and click the monitoring icon in the Monitoring column.
On the Monitoring tab of the page that appears, view the following metrics.
Token Throughput
The throughput of input and output tokens of the LLM.
IN: the throughput of input tokens of the LLM.
OUT: the throughput of output tokens of the LLM.
GPU Cache Usage
The percentage of GPU memory consumed by the key-value (KV) cache.
Engine Current Requests
The number of concurrent requests on the LLM engine in real time.
Running: the number of requests that the LLM engine is processing.
Waiting: the number of requests that are queuing in the LLM engine.
Gateway Current Requests
The number of requests on LLM Intelligent Router in real time.
Total: the total number of requests that are received by LLM Intelligent Router in real time.
Pending: the number of requests that are cached in LLM Intelligent Router, waiting to be processed by the LLM engine.
Time To First Token
The latency of the first output token.
Max: the maximum latency of the first output token.
Avg: the average latency of the first output token.
Min: the minimum latency of the first output token.
TPxx: the percentiles of the latencies of the first output token.
Time Per Output Token
The latency of an output token.
Max: the maximum latency of an output token.
Avg: the average latency of an output token.
Min: the minimum latency of an output token.
TPxx: the percentiles of the latencies of an output token.