In large language model (LLM) scenarios, resource demand is uncertain and the load on backend inference instances can become unbalanced. To resolve these issues, Elastic Algorithm Service (EAS) provides the LLM Intelligent Router component, which dynamically distributes requests at the request scheduling layer based on LLM-specific metrics. This helps ensure an even allocation of computing power and GPU memory across backend inference instances and improves the resource usage of clusters.
Background information
In LLM scenarios, the amount of GPU resources occupied by a single request is uncertain because the lengths of user requests and model responses vary, and the number of tokens that the model processes in the input and output generation phases is random. The load balancing policies of traditional gateways, such as Round Robin and Least Connections, cannot detect the load on backend computing resources in real time. As a result, the load on backend inference instances becomes unbalanced, which affects system throughput and response latency. In particular, long-tail requests that are time-consuming, require a large amount of GPU computing power, or occupy a large amount of GPU memory exacerbate the uneven resource allocation and reduce overall cluster performance.
To resolve the preceding issues, EAS provides LLM Intelligent Router as a basic component at the request scheduling layer. The component dynamically distributes requests based on metrics for LLM scenarios. This ensures an even allocation of computing power and GPU memory across inference instances, and significantly improves the efficiency and stability of cluster resources.
LLM Intelligent Router significantly improves the speed and throughput of inference services. For more information, see Appendix 1: Comparison of test results.
How it works
LLM Intelligent Router is essentially a special EAS service that can intelligently schedule requests to backend inference services. LLM Intelligent Router is associated with inference instances through service groups. The following section describes how LLM Intelligent Router works:
By default, an LLM Intelligent Router service has a built-in LLM scheduler object, which collects metrics from inference instances and uses a scheduling algorithm to select the globally optimal instance based on these metrics. LLM Intelligent Router then forwards requests to the optimal instance. For more information about the Metrics APIs, see Appendix 2: Metrics APIs.
The LLM scheduler also establishes keepalive connections with inference instances. If an exception occurs in an inference instance, the LLM scheduler can immediately detect the exception and stop distributing traffic to the instance.
The LLM gateway forwards requests as instructed by the LLM scheduler. The HTTP-based Server-Sent Events (SSE) and WebSocket protocols are supported.
Limits
LLM Intelligent Router applies to LLM inference scenarios. Only BladeLLM or vLLM can be used as the inference framework for backend instances.
LLM Intelligent Router takes effect only when it is deployed in the same service group as multiple inference instances.
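The following minimal sketch illustrates this relationship: the LLM Intelligent Router service and the LLM service declare the same metadata.group value. The names used here (llm_group, llm_router, and vllm_service) match the complete configuration examples later in this topic.
Excerpt of the LLM Intelligent Router service configuration:
{
  "metadata": {
    "name": "llm_router",
    "type": "LLMGatewayService",
    "group": "llm_group"
  }
}
Excerpt of the LLM service configuration:
{
  "metadata": {
    "name": "vllm_service",
    "group": "llm_group"
  }
}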
Deploy services
Deploy an LLM Intelligent Router service
The following deployment methods are supported:
Method 1: Deploy a service in the PAI console
Go to the EAS page.
Log on to the Platform for AI (PAI) console.
In the left-side navigation pane, click Workspaces. On the Workspaces page, find the workspace to which you want to deploy the model and click its name to go to the Workspace Details page.
In the left-side navigation pane, choose Model Deployment > Elastic Algorithm Service (EAS) to go to the Elastic Algorithm Service (EAS) page.
On the Elastic Algorithm Service (EAS) page, click Deploy Service.
On the Deploy Service page, select one of the following deployment methods:
Choose Custom Model Deployment > Custom Deployment.
Choose Scenario-based Model Deployment > LLM Deployment. In the Basic Information section of the LLM Deployment page, select a model type that supports inference acceleration, such as Qwen2-7b, Qwen1.5-1.8b, Qwen1.5-7b, Qwen1.5-14b, llama3-8b, llama2-7b, llama2-13b, chatglm3-6b, baichuan2-7b, baichuan2-13b, falcon-7b, yi-6b, mistral-7b-instruct-v0.2, gemma-2b-it, gemma-7b-it, or deepseek-coder-7b-instruct-v1.5.
In the Service Configuration section, click LLM Intelligent Router, turn on LLM Intelligent Router, and then click Create LLM Intelligent Router in the drop-down list.
In the Create LLM Intelligent Router panel, configure the parameters and click Deploy. The following table describes the parameters.
Parameter
Description
Basic Information
Service Name
Specify a name for the LLM Intelligent Router service as prompted. Example: llm_router.
Resource Configuration
Deployment
Configure resources for the LLM Intelligent Router service. Default configurations:
Minimum Instances: 2. To ensure that the LLM Intelligent Router service can run on multiple instances, we recommend that you set Minimum Instances to 2.
CPU: 2 Cores.
Memory: 4 GB.
Schedule Resource
Configure scheduling resources for the LLM scheduler. Default configurations:
CPU: 2 Cores.
Memory: 4 GB.
Inference Acceleration
Select the inference framework that you use in the image. An LLM gateway supports the following two frameworks:
BladeLLM Inference Acceleration
Open-source vLLM Inference Acceleration
After the LLM Intelligent Router service is deployed, a service group is created at the same time. You can view the service group on the Group Service tab of the Elastic Algorithm Service (EAS) page. The service group is named in the group_<LLM Intelligent Router service name> format.
Intelligent routing conflicts with service queues. We recommend that you do not create a queue service in a service group.
Method 2: Deploy a service by using JSON
Go to the EAS page.
Log on to the Platform for AI (PAI) console.
In the left-side navigation pane, click Workspaces. On the Workspaces page, find the workspace to which you want to deploy the model and click its name to go to the Workspace Details page.
In the left-side navigation pane, choose Model Deployment > Elastic Algorithm Service (EAS) to go to the Elastic Algorithm Service (EAS) page.
On the Elastic Algorithm Service (EAS) page, click Deploy Service.
In the Custom Model Deployment section of the Deploy Service page, click JSON Deployment.
In the configuration editor section of the JSON Deployment page, configure the parameters and click Deploy.
The following sample code provides examples:
Note: To prevent a single point of failure (SPOF), we recommend that you set the metadata.instance parameter to at least 2 to ensure that LLM Intelligent Router can run on multiple instances.
If an LLM Intelligent Router service is deployed before an LLM service is deployed, the LLM Intelligent Router service remains in the waiting state until the LLM service is deployed.
Basic configurations:
{ "cloud": { "computing": { "instance_type": "ecs.c7.large" } }, "metadata": { "type": "LLMGatewayService", "cpu": 4, "group": "llm_group", "instance": 2, "memory": 4000, "name": "llm_router" } }
To deploy an LLM Intelligent Router service, set the metadata.type parameter to LLMGatewayService. For information about other parameters, see Parameters of model services. After the service is deployed, EAS automatically creates a composite service that contains LLM Intelligent Router and an LLM scheduler. LLM Intelligent Router uses the resource configurations that you specify for the LLM Intelligent Router service. The default resource configurations of the LLM scheduler are 4 vCPUs and 4 GiB of memory.
Advanced configurations: If the basic configurations cannot meet your business requirements, you can prepare a JSON file to specify the following advanced configurations.
{ "cloud": { "computing": { "instance_type": "ecs.c7.large" } }, "llm_gateway": { "infer_backend": "vllm", "max_queue_size": 128, "retry_count": 2, "wait_schedule_timeout": 5000, "wait_schedule_try_period": 500 }, "llm_scheduler": { "cpu": 4, "memory": 4000 }, "metadata": { "cpu": 2, "group": "llm_group", "instance": 2, "memory": 4000, "name": "llm_router", "type": "LLMGatewayService" } }
The following table describes the key parameters. For information about other parameters, see Parameters of model services.
Parameter
Description
llm_gateway.infer_backend
The inference framework used by the LLM. Valid values:
vllm (default)
bladellm
llm_gateway.max_queue_size
The maximum length of the cache queue in LLM Intelligent Router. Default value: 128.
If the processing capability of the backend inference framework is exceeded, LLM Intelligent Router caches requests in the queue and forwards the cached requests when an inference instance is available.
llm_gateway.retry_count
The number of retries. Default value: 2. If the backend inference instance to which a request is forwarded is abnormal, LLM Intelligent Router tries to forward the request to another instance.
llm_gateway.wait_schedule_timeout
The period of time to wait for the LLM scheduler. Default value: 5000. Unit: milliseconds. If the LLM scheduler remains unavailable after this period elapses, LLM Intelligent Router falls back to the simple Round Robin policy to distribute requests.
llm_gateway.wait_schedule_try_period
The interval at which LLM Intelligent Router retries to connect to the LLM scheduler during the timeout period specified by the wait_schedule_timeout parameter. Default value: 500. Unit: milliseconds.
llm_scheduler.cpu
The number of vCPUs of the LLM scheduler. Default value: 4.
llm_scheduler.memory
The memory of the LLM scheduler. Default value: 4. Unit: GiB.
llm_scheduler.instance_type
The instance type of the LLM scheduler. An instance type defines a specific number of vCPUs and amount of memory. If you specify this parameter, you do not need to specify the llm_scheduler.cpu and llm_scheduler.memory parameters.
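For example, the following minimal sketch shows an advanced configuration in which llm_scheduler.instance_type replaces the llm_scheduler.cpu and llm_scheduler.memory parameters. The ecs.c7.large value is reused from the preceding examples only for illustration; choose an instance type that provides the vCPUs and memory that the LLM scheduler requires.
{
  "cloud": {
    "computing": {
      "instance_type": "ecs.c7.large"
    }
  },
  "llm_gateway": {
    "infer_backend": "vllm"
  },
  "llm_scheduler": {
    "instance_type": "ecs.c7.large"
  },
  "metadata": {
    "cpu": 2,
    "group": "llm_group",
    "instance": 2,
    "memory": 4000,
    "name": "llm_router",
    "type": "LLMGatewayService"
  }
}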
Deploy an LLM service
The following deployment methods are supported:
Method 1: Deploy a service in the PAI console
Go to the EAS page.
Log on to the Platform for AI (PAI) console.
In the left-side navigation pane, click Workspaces. On the Workspaces page, find the workspace to which you want to deploy the model and click its name to go to the Workspace Details page.
In the left-side navigation pane, choose Model Deployment > Elastic Algorithm Service (EAS) to go to the Elastic Algorithm Service (EAS) page.
On the Elastic Algorithm Service (EAS) page, click Deploy Service.
On the Deploy Service page, select one of the following deployment methods and configure the key parameters. For information about other parameters, see Deploy a model service in the PAI console.
In the Custom Model Deployment section, select Custom Deployment. On the Create Service page, configure the key parameters described in the following table.
Parameter
Description
Model Service Information
Deployment Method
Select Deploy Web App by Using Image.
Select Image
PAI images and custom images are supported.
If you set Select Image to PAI Image, select the chat-llm-webui image and select 3.0-vllm or 3.0-blade for the image version.
If you set Select Image to Image Address, enter a custom image address in the field. The inference framework of the custom image must be BladeLLM or vLLM.
Service Configuration
LLM Intelligent Router
Turn on LLM Intelligent Router and select the LLM Intelligent Router service that you deployed.
In the Scenario-based Model Deployment section, click LLM Deployment. On the LLM Deployment page, configure the key parameters described in the following table.
Parameter
Description
Basic Information
Model Type
Select a model type that supports inference acceleration from the following model types: Qwen2-7b, Qwen1.5-1.8b, Qwen1.5-7b, Qwen1.5-14b, llama3-8b, llama2-7b, llama2-13b, chatglm3-6b, baichuan2-7b, baichuan2-13b, falcon-7b, yi-6b, mistral-7b-instruct-v0.2, gemma-2b-it, gemma-7b-it, and deepseek-coder-7b-instruct-v1.5.
Service Configuration
LLM Intelligent Router
Turn on LLM Intelligent Router and select the LLM Intelligent Router service that you deployed.
After you configure the parameters, click Deploy.
Method 2: Deploy a service by using JSON
In this example, the open source vLLM-0.3.3 image, which is a built-in image provided by PAI, is used. Perform the following steps:
On the Elastic Algorithm Service (EAS) page, click Deploy Service.
In the Custom Model Deployment section of the Deploy Service page, click JSON Deployment.
In the configuration editor section of the JSON Deployment page, configure the parameters and click Deploy.
The following sample code provides an example:
{ "cloud": { "computing": { "instance_type": "ecs.gn7i-c16g1.4xlarge" } }, "containers": [ { "image": "eas-registry-vpc.<regionid>.cr.aliyuncs.com/pai-eas/chat-llm:vllm-0.3.3", "port": 8000, "script": "python3 -m vllm.entrypoints.openai.api_server --served-model-name llama2 --model /huggingface/models--meta-llama--Llama-2-7b-chat-hf/snapshots/c1d3cabadba7ec7f1a9ef2ba5467ad31b3b84ff0/" } ], "features": { "eas.aliyun.com/extra-ephemeral-storage": "50Gi" }, "metadata": { "cpu": 16, "gpu": 1, "instance": 5, "memory": 60000, "group": "llm_group", "name": "vllm_service" } }
The following table describes the key parameters. For information about other parameters, see Parameters of model services.
metadata.group: the name of the service group to which the LLM service belongs. The LLM service must belong to the same service group as the LLM Intelligent Router service. This way, the LLM service can register with the LLM scheduler and report related metrics, and LLM Intelligent Router can forward traffic.
If you deploy an LLM Intelligent Router service in the console, view the name of the service group to which the LLM Intelligent Router service belongs on the Group Service tab of the Elastic Algorithm Service (EAS) page. The name of the service group is in the group_<LLM Intelligent Router service name> format.
If you deploy an LLM service by using JSON, specify the name of a service group as llm_group.
containers.image: In this example, the preset image provided by PAI is used. You must replace <regionid> with the ID of the region in which you want to deploy the service. For example, if you want to deploy the service in the China (Beijing) region, replace <regionid> with cn-beijing.
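For example, if you deploy the service in the China (Beijing) region, the image value in the preceding configuration becomes the following:
"image": "eas-registry-vpc.cn-beijing.cr.aliyuncs.com/pai-eas/chat-llm:vllm-0.3.3"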
Access an LLM Intelligent Router service
Obtain the endpoint and token for accessing an LLM Intelligent Router service
Go to the EAS-Online Model Services page.
Log on to the PAI console.
In the left-side navigation pane, click Workspaces. On the Workspaces page, click the name of the workspace to which you want to deploy the model.
In the left-side navigation pane, choose Model Deployment > Elastic Algorithm Service (EAS) to go to the Elastic Algorithm Service (EAS) page.
Find the LLM Intelligent Router service that you deployed and click Invocation Method in the Service Type column.
On the Public Endpoint tab of the Invocation Method dialog box, view the endpoint and token for accessing the service.
Configure the endpoint for accessing the LLM Intelligent Router service.
Configuration rule
Format:
{domain}/api/predict/{group_name}.{router_service_name}_llm_gateway/{endpoint}
Replace {endpoint} with the API endpoint that is supported by your LLM service. Example: v1/completions.
Example
In this example, the LLM Intelligent Router service that is deployed by using JSON is used. If the endpoint obtained in Step 3 is http://175805416243****.cn-beijing.pai-eas.aliyuncs.com/api/predict/llm_group.llm_router, the endpoint for accessing the LLM Intelligent Router service is http://175805416243****.cn-beijing.pai-eas.aliyuncs.com/api/predict/llm_group.llm_router_llm_gateway/v1/completions.
Test the access
In the terminal, run the following command to access the LLM Intelligent Router service:
$curl -H "Authorization: xxxxxx" -H "Content-Type: application/json" http://***http://********.cn-beijing.pai-eas.aliyuncs.com/api/predict/{group_name}.{router_service_name}_llm_gateway/v1/completions -d '{"model": "llama2", "prompt": "I need your help writing an article. I will provide you with some background information to begin with. And then I will provide you with directions to help me write the article.", "temperature": 0.0, "best_of": 1, "n_predict": 34, "max_tokens": 34, "stream": true}'
In the preceding command:
"Authorization: xxxxxx": Specify the token obtained in the preceding step.
http://********.cn-beijing.pai-eas.aliyuncs.com/api/predict/{group_name}.{router_service_name}_llm_gateway/v1/completions: Replace the value with the endpoint obtained in the preceding step.
Sample response:
data: {"id":"0d9e74cf-1025-446c-8aac-89711b2e9a38","choices":[{"finish_reason":"","index":0,"logprobs":null,"text":"\n"}],"object":"text_completion","usage":{"prompt_tokens":36,"completion_tokens":1,"total_tokens":37},"error_info":null}
data: {"id":"0d9e74cf-1025-446c-8aac-89711b2e9a38","choices":[{"finish_reason":"","index":0,"logprobs":null,"text":"\n"}],"object":"text_completion","usage":{"prompt_tokens":36,"completion_tokens":2,"total_tokens":38},"error_info":null}
data: {"id":"0d9e74cf-1025-446c-8aac-89711b2e9a38","choices":[{"finish_reason":"","index":0,"logprobs":null,"text":""}],"object":"text_completion","usage":{"prompt_tokens":36,"completion_tokens":3,"total_tokens":39},"error_info":null}
...
[DONE]
View service monitoring metrics
After the test is complete, you can view the service monitoring metrics to understand the performance of the service. Perform the following steps:
On the Elastic Algorithm Service (EAS) page, find the LLM Intelligent Router service that you deployed and click the icon in the Monitoring column.
On the Monitoring tab of the page that appears, view the following metrics.
Token Throughput
The throughput of input and output tokens of the LLM.
IN: the throughput of input tokens of the LLM.
OUT: the throughput of output tokens of the LLM.
GPU Cache Usage
The percentage of GPU memory consumed by the key-value (KV) cache.
Engine Current Requests
The number of concurrent requests on the LLM engine in real time.
Running: the number of requests that the LLM engine is processing.
Waiting: the number of requests that are queuing in the LLM engine.
Gateway Current Requests
The number of requests on LLM Intelligent Router in real time.
Total: the total number of requests that are received by LLM Intelligent Router in real time.
Pending: the number of requests that are cached in LLM Intelligent Router, waiting to be processed by the LLM engine.
Time To First Token
The latency of the first output token.
Max: the maximum latency of the first output token.
Avg: the average latency of the first output token.
Min: the minimum latency of the first output token.
TPxx: the percentiles of the latencies of the first output token.
Time Per Output Token
The latency of an output token.
Max: the maximum latency of an output token.
Avg: the average latency of an output token.
Min: the minimum latency of an output token.
TPxx: the percentiles of the latencies of an output token.