
Platform For AI: Use the LLM intelligent router to improve inference efficiency

Last Updated: Feb 26, 2026

Optimize large language model (LLM) inference performance through intelligent traffic routing based on real-time resource metrics and scheduling policies. The LLM Intelligent Router dynamically distributes requests to balance computing resources and memory allocation across inference instances, improving cluster utilization and service reliability.

How it works

Architecture overview

The LLM Intelligent Router consists of three core components that work together to provide intelligent traffic distribution and management for backend LLM inference instance clusters:

  • LLM Gateway: Primary traffic entry point and request processor that supports HTTP/SSE and WebSocket protocols

  • LLM Scheduler: Central intelligence that aggregates real-time metrics and makes policy-based scheduling decisions

  • LLM Agent: Distributed monitoring probe deployed as sidecar container with each inference instance

Important

The LLM Intelligent Router is a special EAS service that must be deployed in the same service group as the inference service to function properly.


The core workflow is as follows:

  1. Instance registration: After the inference service starts, the LLM Agent waits for the inference engine to become ready, then registers the instance with the LLM Scheduler and periodically reports health status and performance metrics.

  2. Traffic ingestion: User requests are first received by the LLM Gateway.

  3. Scheduling decision: The LLM Gateway sends a scheduling request to the LLM Scheduler.

  4. Intelligent scheduling: The LLM Scheduler selects the optimal backend instance based on the scheduling policy and real-time metrics reported by all LLM Agents.

  5. Request forwarding: The LLM Scheduler returns the decision to the LLM Gateway, which forwards the user's original request to the target instance.

  6. Request buffering: If all backend instances are under high load, new requests are temporarily buffered in the LLM Gateway's queue, waiting for the LLM Scheduler to find a suitable forwarding opportunity to prevent request failures.
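The buffering behavior in step 6 can be sketched in a few lines of Python. This is an illustrative simplification, not the gateway's actual implementation; the function and parameter names (`try_schedule`, `timeout_s`, `try_period_s`) are hypothetical stand-ins for the `wait_schedule_timeout` and `wait_schedule_try_period` settings described later in this topic:

```python
import time

def schedule_with_buffering(try_schedule, timeout_s=10.0, try_period_s=1.0):
    """Repeatedly ask the scheduler for a target instance, mirroring how the
    gateway buffers a request until capacity frees up or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while True:
        instance = try_schedule()  # returns an instance id, or None if all instances are busy
        if instance is not None:
            return instance
        if time.monotonic() >= deadline:
            return None  # give up: no capacity became available in time
        time.sleep(try_period_s)

# A toy scheduler that frees up capacity on the third attempt.
attempts = {"n": 0}
def fake_scheduler():
    attempts["n"] += 1
    return "instance-2" if attempts["n"] >= 3 else None

print(schedule_with_buffering(fake_scheduler, timeout_s=1.0, try_period_s=0.05))  # instance-2
```

The real gateway also enforces a maximum queue length (`max_queue_size`), which this sketch omits.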

Core components

  • LLM Gateway: Primary traffic entry point and request processor. Receives all user requests and forwards them to designated backend inference instances based on decisions from the LLM Scheduler.

    • Supports HTTP (HTTP_SSE) and WebSocket protocols, and can buffer requests when backend inference instances are under high load.

    • Enables Anthropic API protocol conversion by default, automatically transforming Anthropic-compliant requests into OpenAI-compatible format. You can directly use ecosystem tools such as Claude Code to call model services that follow the OpenAI interface standard without modifying your code. For more information, see Appendix: Using Claude Code.

  • LLM Scheduler: Central intelligence for request distribution. Aggregates real-time metrics reported by all LLM Agents and calculates the optimal target instance for each request based on the user-selected scheduling policy.

  • LLM Agent: Distributed monitoring and reporting probe. Deployed as a sidecar container with each inference instance, it collects performance metrics from the inference engine, maintains a heartbeat connection with the LLM Scheduler, and reports instance health status and load data.

Failover mechanism

The system implements multiple layers of fault tolerance mechanisms to ensure service stability:

  • LLM Gateway (high availability): The LLM Gateway is a stateless traffic ingress layer. For high availability, you should deploy at least two instances. If one instance fails, traffic automatically switches to the healthy instances, which ensures continuous service availability.

  • LLM Scheduler (degraded fault tolerance): The LLM Scheduler is the request scheduling component and runs as a single instance to enable global scheduling. If the LLM Scheduler fails, the LLM Gateway automatically enters a degraded mode after a heartbeat failure. In this mode, it uses a round-robin policy to forward requests directly to backend instances. This ensures service availability but sacrifices scheduling performance. After the LLM Scheduler recovers, the LLM Gateway automatically resumes the intelligent scheduling mode.

  • Inference instance or LLM Agent (automatic removal): If an inference instance or its associated LLM Agent fails, the heartbeat between the LLM Agent and the LLM Scheduler is interrupted. The LLM Scheduler then immediately removes the instance from the list of available instances and stops assigning new traffic to it. After the instance recovers and re-establishes its heartbeat, it is automatically added back to the service list.
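The degraded-mode fallback can be illustrated with a minimal sketch (hypothetical names, not the actual EAS implementation): while the scheduler heartbeat is healthy the gateway follows the scheduler's decision, otherwise it falls back to round-robin.

```python
import itertools

class GatewayRouter:
    """Sketch of the gateway's failover behavior: use the scheduler's decision
    while its heartbeat is healthy, otherwise fall back to plain round-robin."""
    def __init__(self, instances):
        self.instances = list(instances)
        self._rr = itertools.cycle(self.instances)
        self.scheduler_alive = True  # flipped to False after a heartbeat failure

    def pick(self, scheduler_decision=None):
        if self.scheduler_alive and scheduler_decision in self.instances:
            return scheduler_decision  # intelligent scheduling mode
        return next(self._rr)          # degraded mode: round-robin

router = GatewayRouter(["i-0", "i-1", "i-2"])
print(router.pick("i-1"))                  # scheduler healthy -> follows the decision: i-1
router.scheduler_alive = False             # heartbeat lost
print([router.pick() for _ in range(4)])   # degraded -> ['i-0', 'i-1', 'i-2', 'i-0']
```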

Multi-inference engine support

Because different LLM inference engines return different metric information through their /metrics endpoints, the LLM Agent collects these metrics and standardizes their format before reporting them. This means the LLM Scheduler does not need to understand the implementation details of a specific inference engine. It only needs to implement scheduling algorithms based on the standardized metrics. The currently supported LLM inference engines and their corresponding collected metrics are as follows:

| LLM inference engine | Metric | Description |
| --- | --- | --- |
| vLLM | vllm:num_requests_running | Number of running requests. |
| | vllm:num_requests_waiting | Number of requests waiting in the queue. |
| | vllm:gpu_cache_usage_perc | GPU KV Cache usage percentage. |
| | vllm:prompt_tokens_total | Total number of prompt tokens. |
| | vllm:generation_tokens_total | Total number of generated tokens. |
| SGLang | sglang:num_running_reqs | Number of running requests. |
| | sglang:num_queue_reqs | Number of requests waiting in the queue. |
| | sglang:token_usage | KV Cache usage percentage. |
| | sglang:prompt_tokens_total | Total number of prompt tokens. |
| | sglang:gen_throughput | Number of tokens generated per second. |
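The standardization step can be illustrated with a small Python sketch. The mapping table and function names here are hypothetical, not the LLM Agent's actual code; the point is that engine-specific metric names collapse into one engine-agnostic schema that the scheduler consumes:

```python
# Hypothetical mapping from engine-specific /metrics names to a standard schema,
# in the spirit of what the LLM Agent reports to the LLM Scheduler.
METRIC_MAP = {
    "vllm": {
        "running_requests": "vllm:num_requests_running",
        "waiting_requests": "vllm:num_requests_waiting",
        "kv_cache_usage": "vllm:gpu_cache_usage_perc",
    },
    "sglang": {
        "running_requests": "sglang:num_running_reqs",
        "waiting_requests": "sglang:num_queue_reqs",
        "kv_cache_usage": "sglang:token_usage",
    },
}

def standardize(engine, raw_metrics):
    """Translate raw engine metrics into the engine-agnostic form; missing
    metrics default to 0.0 so the scheduler always sees a complete record."""
    mapping = METRIC_MAP[engine]
    return {std: raw_metrics.get(raw, 0.0) for std, raw in mapping.items()}

print(standardize("sglang", {"sglang:num_running_reqs": 3, "sglang:token_usage": 0.42}))
# -> {'running_requests': 3, 'waiting_requests': 0.0, 'kv_cache_usage': 0.42}
```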

Limitations

  • Cannot add during update: The LLM intelligent router feature can only be configured when you create a new service. You cannot add the intelligent routing feature to an existing inference service by performing an update service operation.

  • Inference engine limitation: This feature currently supports only the vLLM or SGLang inference engines.

  • Deploy multiple inference instances: The LLM intelligent router is most effective when deployed with multiple inference instances.

Quick start: Use LLM intelligent router

Step 1: Deploy LLM intelligent router service

  1. Log on to the PAI console and select the destination region at the top of the page.

  2. In the navigation pane on the left, click Elastic Algorithm Service (EAS), select the target workspace, and go to the EAS page.

  3. Click Deploy Service, and then choose Scenario-based Model Deployment > Deploy LLM gateway.

  4. Configure the parameters:

    | Parameter | Description |
    | --- | --- |
    | Basic Information | |
    | Service Name | Customize a service name, such as llm_gateway. |
    | Resource Information | |
    | Deployment Resources | Resource configuration for the LLM Gateway. To ensure high availability, the default Number of Replicas is 2; keep this setting. Default CPU is 4 cores and memory is 8 GB. |
    | Scheduling Configuration | Resource configuration for the LLM Scheduler. Default CPU is 2 cores and memory is 4 GB. |
    | Scheduling Policy | Select the load balancing policy for backend inference instances. Default is Prefix cache. For detailed comparisons and selection guidance, see Scheduling policy details and selection. |

  5. Click Deploy. When the service status changes to Running, the deployment is successful.

After a successful deployment, the system automatically creates a service group named group_<LLM intelligent router service name>. You can go to the Elastic Algorithm Service (EAS) page and view the service group on the Canary Release tab.

Note

Because intelligent routing conflicts with service queues, they cannot coexist in the same service group.

Step 2: Deploy an LLM service

You must configure the LLM intelligent router feature when you deploy a new LLM service. You cannot add this feature by updating an existing LLM service.

The following steps show how to deploy Qwen3-8B:

  1. Click Deploy Service and select Scenario-based Model Deployment > LLM Deployment.

  2. Configure the following key parameters:

    | Parameter | Value |
    | --- | --- |
    | Basic Information | |
    | Model Settings | Select Public Model. Then, search for and select Qwen3-8B. |
    | Inference Engine | Select vLLM (Recommended, compatible with the OpenAI API). Note: If you select the Prefix cache scheduling policy for the LLM intelligent router service, you must enable the prefix cache feature when you deploy an LLM service that uses vLLM as the inference engine. |
    | Deployment Template | Select Standalone. The system automatically fills in recommended parameters, such as instance type and runtime image, based on the template. |
    | Features | |
    | LLM Intelligent Router | Turn on the switch and select the LLM intelligent router service deployed in Step 1 from the drop-down list. |

  3. Click Deploy. The deployment takes about 5 minutes. When the service status changes to Running, the deployment is successful.

Step 3: Test the service

You must send all requests to the LLM intelligent router service endpoint, not to the endpoints of specific backend inference services.

  1. Obtain access credentials.

    1. Click the LLM intelligent router service to go to the Overview page. In the Basic Information section, click View Endpoint Information.

    2. On the Endpoint Information page, copy the Internet Endpoint and Token from the Service-specific Traffic Entry section.

  2. Construct the request URL and call the service.

    • URL structure: <LLM intelligent router endpoint>/<LLM service API path>

    • Example: http://********.pai-eas.aliyuncs.com/api/predict/group_llm_gateway.llm_gateway/v1/chat/completions

    Request example:

    # Replace <YOUR_GATEWAY_URL> and <YOUR_TOKEN> with your actual values
    # Replace <model_name> with your actual model name
    curl -X POST "<YOUR_GATEWAY_URL>/v1/chat/completions" \
         -H "Authorization: Bearer <YOUR_TOKEN>" \
         -H "Content-Type: application/json" \
         -N \
         -d '{
               "model": "<model_name>",
               "messages": [{"role": "user", "content": "Hello"}],
               "stream": true
             }'

    Sample response:

    data: {"id":"chatcmpl-9a9f8299*****","object":"chat.completion.chunk","created":1762245102,"model":"Qwen3-8B","choices":[{"index":0,"delta":{"role":"assistant","content":""},"logprobs":null,"finish_reason":null}]}
    data: {"id":"chatcmpl-9a9f8299*****","object":"chat.completion.chunk","created":1762245102,"model":"Qwen3-8B","choices":[{"index":0,"delta":{"content":"<think>","tool_calls":[]}}]}
    
    ...
    data: [DONE]
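The URL structure above can be expressed as a small helper function. This is an illustrative convenience only; the endpoint string in the example is a placeholder, not a real service address:

```python
def build_chat_url(gateway_endpoint: str, api_path: str = "/v1/chat/completions") -> str:
    """Join the LLM intelligent router endpoint with the LLM service API path,
    tolerating a trailing slash on the endpoint."""
    return gateway_endpoint.rstrip("/") + api_path

# Placeholder endpoint in the same shape as the example above.
print(build_chat_url("http://example.pai-eas.aliyuncs.com/api/predict/group_llm_gateway.llm_gateway"))
# -> http://example.pai-eas.aliyuncs.com/api/predict/group_llm_gateway.llm_gateway/v1/chat/completions
```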

Advanced Configuration

A JSON-based standalone deployment provides more flexible configuration options that allow you to specify resource specifications for the LLM Gateway and fine-tune request processing behavior.

  • Procedure: On the Inference Service page, click Deploy Service. Then, in the Custom Model Deployment section, click JSON Deployment.

  • Configuration example:

    {
        "cloud": {
            "computing": {
                "instance_type": "ecs.c7.large"
            }
        },
        "llm_gateway": {
            "max_queue_size": 128,
            "retry_count": 2,
            "wait_schedule_timeout": 5000,
            "wait_schedule_try_period": 500
        },
        "llm_scheduler": {
            "cpu": 2,
            "memory": 4000,
            "policy": "prefix-cache"
        },
        "metadata": {
            "group": "group_llm_gateway",
            "instance": 2,
            "name": "llm_gateway",
            "type": "LLMGatewayService"
        }
    }
  • Parameters:

    | Parameter | Description |
    | --- | --- |
    | metadata.type | Required. Fixed as LLMGatewayService, which indicates deployment of an LLM intelligent router service. |
    | metadata.instance | Required. Number of replicas for the LLM Gateway. Set to at least 2 to avoid a single point of failure. |
    | metadata.cpu | Number of CPU cores per LLM Gateway replica. |
    | metadata.memory | Memory (GB) per LLM Gateway replica. |
    | metadata.group | Service group to which the LLM intelligent router service belongs. |
    | cloud.computing.instance_type | Resource specification for the LLM Gateway. If you configure this parameter, do not configure metadata.cpu and metadata.memory. |
    | llm_gateway.max_queue_size | Maximum length of the LLM Gateway buffer queue. Default: 512. When backend inference frameworks exceed their processing capacity, excess requests are buffered in this queue while waiting for scheduling. |
    | llm_gateway.retry_count | Number of retry attempts. Default: 2. When a backend inference instance fails, the request is retried on a different instance. |
    | llm_gateway.wait_schedule_timeout | Total duration over which scheduling is attempted when backend engines are at full capacity. Requests retry scheduling at intervals until this timeout expires. Default: 10 seconds. |
    | llm_gateway.wait_schedule_try_period | Interval between scheduling attempts. Default: 1 second. |
    | llm_scheduler.cpu | Number of CPU cores for the LLM Scheduler. Default: 4. |
    | llm_scheduler.memory | Memory for the LLM Scheduler, specified in MB in the JSON configuration (for example, 4000). Default: 4 GB. |
    | llm_scheduler.policy | Scheduling policy. Default: prefix-cache. For valid values and descriptions, see Scheduling policy details and selection. |
    | llm_scheduler.prefill_policy | When policy is set to pd-split, specifies the scheduling policy for the Prefill phase. Valid values: prefix-cache, llm-metric-based, least-request, least-token. |
    | llm_scheduler.decode_policy | When policy is set to pd-split, specifies the scheduling policy for the Decode phase. Valid values are the same as for prefill_policy. |

Scheduling policy details and selection

Selecting the right scheduling policy is crucial for maximizing the effectiveness of the LLM intelligent router. The following table compares the logic, applicable scenarios, advantages, and considerations for each policy to help you choose the most suitable one.

| Policy name | JSON value | Core logic | Applicable scenario | Advantages | Considerations |
| --- | --- | --- | --- | --- | --- |
| Prefix cache | prefix-cache | (Recommended) A comprehensive policy that prioritizes sending requests with an identical historical context (prompt) to instances that have already cached the corresponding KV Cache. | Multi-turn conversation bots and Retrieval-Augmented Generation (RAG) systems that use fixed system prompts. | Significantly reduces Time To First Token (TTFT), which improves multi-turn conversation performance and throughput. | The inference engine must have prefix caching enabled. |
| LLM Metrics | llm-metric-based | Intelligently schedules requests based on comprehensive backend instance load metrics, including the number of queued requests, running requests, and KV Cache usage. | General LLM workloads that have diverse request patterns and no clear conversational characteristics. | Effectively balances the load across instances and improves resource utilization. | The scheduling logic is relatively complex and may not deliver optimal results in specific scenarios compared with the Prefix cache policy. |
| Minimum requests | least-request | Sends new requests to the instance that is currently handling the fewest requests. | Scenarios where the computational complexity of requests, such as token length and generation length, is relatively uniform. | Simple and efficient. Quickly balances the number of requests across instances. | Does not perceive the actual request load. Instances that handle short requests may sit idle while instances that handle long requests become overloaded. |
| Minimum tokens | least-token | Sends new requests to the instance that is currently processing the fewest total tokens (input and output). | Scenarios where the token count is a reasonable reflection of the request processing cost. | Reflects the actual instance load more accurately than the Minimum requests policy. | Relies on token count estimation. Not all engines report this metric. |
| Static PD disaggregation | pd-split | Requires you to pre-divide instances into Prefill and Decode groups and specify a separate scheduling policy for each group. | Scenarios where the Prefill and Decode phases have vastly different computing and memory characteristics, and a separated deployment provides significant benefits. | Maximizes hardware utilization through deep optimization. | Complex configuration: requires a deep understanding of the models and business logic, in addition to the independent deployment of Prefill and Decode services. |
| Dynamic PD disaggregation | dynamic-pd-split | Instances do not require predefined roles. The scheduler dynamically assigns the Prefill or Decode phase of a request to the most suitable instance based on real-time load. | Same as the static disaggregation policy, but suitable for scenarios with dynamically changing loads. | More flexible than static disaggregation. Adapts automatically to load changes. | Extremely complex configuration: places higher requirements on the scheduler and the engine. |
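For intuition, the two simplest policies can be sketched as selection functions. This is an illustrative simplification, not the scheduler's actual code; the field names (`running`, `waiting`, `tokens`) are hypothetical stand-ins for the standardized metrics reported by the LLM Agents:

```python
def least_request(instances):
    """Minimum requests policy: pick the instance with the fewest in-flight requests."""
    return min(instances, key=lambda i: i["running"] + i["waiting"])

def least_token(instances):
    """Minimum tokens policy: pick the instance processing the fewest total tokens."""
    return min(instances, key=lambda i: i["tokens"])

fleet = [
    {"id": "i-0", "running": 4, "waiting": 1, "tokens": 9000},
    {"id": "i-1", "running": 2, "waiting": 0, "tokens": 15000},
]
print(least_request(fleet)["id"])  # i-1: fewest requests, even though it holds more tokens
print(least_token(fleet)["id"])    # i-0: fewest tokens, even though it holds more requests
```

The example also shows why the two policies can disagree: a few very long requests can dominate token load while contributing little to the request count.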

View service monitoring metrics

After you deploy the service, you can view core performance metrics in the EAS console to evaluate the effectiveness of the intelligent routing.

On the Elastic Algorithm Service (EAS) page, click the name of the deployed LLM intelligent router service to go to the service details page. On the Monitoring tab, you can monitor the following core metrics:

Token Throughput

The throughput of input and output tokens for the LLM.

  • IN: The throughput of LLM input tokens.

  • OUT: The throughput of LLM output tokens.

GPU Cache Usage

The GPU KV Cache usage percentage for the LLM Engine.


Engine Current Requests

The number of real-time concurrent requests for the LLM Engine.


  • Running: The number of requests that are currently being executed by the LLM Engine.

  • Waiting: The number of requests in the LLM Engine's waiting queue.

Gateway Current Requests

The number of real-time requests for the LLM intelligent router.


  • Total: The total number of requests that are currently received by the LLM intelligent router. This is the total real-time concurrency.

  • Pending: The number of requests that are buffered in the LLM intelligent router and have not been processed by the LLM Engine.

Time To First Token

The latency of the first token for requests.


  • Max: The maximum latency of the first token.

  • Avg: The average latency of the first token.

  • Min: The minimum latency of the first token.

  • TPxx: The percentile values for the latency of the first token.

Time Per Output Token

The latency between consecutive output tokens of a request.

  • Max: The maximum per-token latency.

  • Avg: The average per-token latency.

  • Min: The minimum per-token latency.

  • TPxx: The percentile values for the per-token latency.
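The TTFT and TPOT metrics above can be derived from the arrival times of streamed tokens. The following sketch is illustrative only, not the console's actual computation:

```python
def ttft_and_tpot(request_start, token_timestamps):
    """Compute Time To First Token and mean Time Per Output Token
    from the arrival times (in seconds) of a request's streamed tokens."""
    ttft = token_timestamps[0] - request_start
    if len(token_timestamps) > 1:
        # Mean inter-token gap over the generation phase.
        span = token_timestamps[-1] - token_timestamps[0]
        tpot = span / (len(token_timestamps) - 1)
    else:
        tpot = 0.0  # a single token has no inter-token gap
    return ttft, tpot

# First token arrives at 0.50 s, then one token every 40 ms.
ttft, tpot = ttft_and_tpot(0.0, [0.50, 0.54, 0.58, 0.62])
print(ttft)            # 0.5  (500 ms TTFT)
print(round(tpot, 3))  # 0.04 (40 ms TPOT)
```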

Appendix: Using Claude Code

  1. Set the BASE URL and TOKEN that are provided by the EAS intelligent router service for Claude Code.

    # Replace <YOUR_GATEWAY_URL> and <YOUR_TOKEN> with your actual values
    export ANTHROPIC_BASE_URL=<YOUR_GATEWAY_URL>
    export ANTHROPIC_AUTH_TOKEN=<YOUR_TOKEN>
  2. Run the Claude Code tool directly.

    claude "Write a Python Hello World"

Appendix: Performance test comparison

Tests on the Distill-Qwen-7B, QwQ-32B, and Qwen2.5-72B models show that the LLM intelligent router significantly improves the speed and throughput of inference services. The test environment and results are as follows.

Important

The following test results are for reference only. Actual performance may vary depending on your environment and workload.

Test environment

  • Scheduling Policy: prefix-cache

  • Test dataset: ShareGPT_V3_unfiltered_cleaned_split.json (multi-turn conversation dataset)

  • Inference engine: vLLM (0.7.3)

  • Number of backend instances: 5

Test results

Distill-Qwen-7B (instance type: ml.gu8tf.8.40xlarge, concurrency: 500)

| Metric | Without LLM Intelligent Router | With LLM Intelligent Router | Improvement |
| --- | --- | --- | --- |
| Successful requests | 3698 | 3612 | - |
| Benchmark duration | 460.79 s | 435.70 s | - |
| Total input tokens | 6605953 | 6426637 | - |
| Total generated tokens | 4898730 | 4750113 | - |
| Request throughput | 8.03 req/s | 8.29 req/s | +3.2% |
| Output token throughput | 10631.17 tok/s | 10902.30 tok/s | +2.5% |
| Total token throughput | 24967.33 tok/s | 25652.51 tok/s | +2.7% |
| Mean TTFT | 532.79 ms | 508.90 ms | +4.5% |
| Median TTFT | 274.23 ms | 246.30 ms | - |
| P99 TTFT | 3841.49 ms | 3526.62 ms | - |
| Mean TPOT | 40.65 ms | 39.20 ms | +3.5% |
| Median TPOT | 41.14 ms | 39.61 ms | - |
| P99 TPOT | 62.57 ms | 58.71 ms | - |

QwQ-32B (instance type: ml.gu8tf.8.40xlarge, concurrency: 100)

| Metric | Without LLM Intelligent Router | With LLM Intelligent Router | Improvement |
| --- | --- | --- | --- |
| Successful requests | 1194 | 1194 | - |
| Benchmark duration | 1418.54 s | 1339.04 s | - |
| Total input tokens | 2646701 | 2645010 | - |
| Total generated tokens | 1908956 | 1902894 | - |
| Request throughput | 0.84 req/s | 0.89 req/s | +5.95% |
| Output token throughput | 1345.72 tok/s | 1421.08 tok/s | +5.6% |
| Total token throughput | 3211.52 tok/s | 3396.38 tok/s | +5.8% |
| Mean TTFT | 1144.62 ms | 859.42 ms | +25.0% |
| Median TTFT | 749.39 ms | 565.61 ms | - |
| P99 TTFT | 5339.61 ms | 5027.39 ms | - |
| Mean TPOT | 68.78 ms | 65.73 ms | +4.4% |
| Median TPOT | 69.19 ms | 66.33 ms | - |
| P99 TPOT | 100.35 ms | 95.55 ms | - |

Qwen2.5-72B (instance type: ml.gu7xf.8xlarge-gu108, concurrency: 100)

| Metric | Without LLM Intelligent Router | With LLM Intelligent Router | Improvement |
| --- | --- | --- | --- |
| Successful requests | 1194 | 1194 | - |
| Benchmark duration | 479.53 s | 456.69 s | - |
| Total input tokens | 1336301 | 1337015 | - |
| Total generated tokens | 924856 | 925208 | - |
| Request throughput | 2.49 req/s | 2.61 req/s | +4.8% |
| Output token throughput | 1928.66 tok/s | 2025.92 tok/s | +5.0% |
| Total token throughput | 4715.34 tok/s | 4953.56 tok/s | +5.0% |
| Mean TTFT | 508.55 ms | 389.66 ms | +23.4% |
| Median TTFT | 325.33 ms | 190.04 ms | - |
| P99 TTFT | 2802.26 ms | 2678.70 ms | - |
| Mean TPOT | 46.83 ms | 43.97 ms | +4.4% |
| Median TPOT | 45.37 ms | 43.30 ms | - |
| P99 TPOT | 62.29 ms | 54.79 ms | - |