Platform for AI: Use LLM Intelligent Router to improve inference efficiency

Last Updated: Sep 26, 2024

Large language model (LLM) scenarios face issues such as uncertain resource demand and unbalanced load across backend inference instances. To resolve these issues, Elastic Algorithm Service (EAS) provides the LLM Intelligent Router component, which dynamically distributes requests at the request scheduling layer based on LLM-specific metrics. This helps ensure an even allocation of computing power and GPU memory across backend inference instances and improves the resource utilization of clusters.

Background information

In LLM scenarios, the amount of GPU resources that a single request occupies is uncertain because the lengths of user requests and model responses differ and the number of tokens that the model processes in the input and output generation phases is random. The load balancing policies of traditional gateways, such as Round Robin and Least Connections, cannot detect the load on backend computing resources in real time. As a result, the load on backend inference instances becomes unbalanced, which affects system throughput and response latency. In particular, long-tail requests that are time-consuming, require a large amount of GPU computing power, or occupy a large amount of GPU memory exacerbate the uneven resource allocation and reduce overall cluster performance.

To resolve the preceding issues, EAS provides LLM Intelligent Router as a basic component at the request scheduling layer. The component dynamically distributes requests based on metrics for LLM scenarios. This ensures an even allocation of computing power and GPU memory across inference instances, and significantly improves the efficiency and stability of cluster resources.

LLM Intelligent Router significantly improves the speed and throughput of inference services. For more information, see Appendix 1: Comparison of test results.

How it works


LLM Intelligent Router is essentially a special EAS service that can intelligently schedule requests to backend inference services. LLM Intelligent Router is associated with inference instances through service groups. The following section describes how LLM Intelligent Router works:

  • By default, an LLM Intelligent Router service has a built-in LLM scheduler object, which collects the metrics of inference instances and uses a scheduling algorithm to select the globally optimal instance based on these metrics. LLM Intelligent Router then forwards requests to the selected instance (a minimal scoring sketch follows this list). For more information about the Metrics APIs, see Appendix 2: Metrics APIs.

  • The LLM scheduler also establishes keepalive connections with inference instances. If an exception occurs in an inference instance, the LLM scheduler can immediately detect the exception and stop distributing traffic to the instance.

  • The LLM gateway forwards requests as instructed by the LLM scheduler. The HTTP-based Server-Sent Events (SSE) and WebSocket protocols are supported.
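
The exact selection algorithm of the LLM scheduler is not published. As a rough illustration only, the following Python sketch ranks instances by a hypothetical load score computed from the metrics listed in Appendix 2: Metrics APIs. The weights and the scoring function are assumptions, not the actual implementation.

    # Hypothetical example of metric-based instance selection; not the actual LLM scheduler algorithm.
    from dataclasses import dataclass

    @dataclass
    class InstanceMetrics:
        name: str
        num_requests_running: float  # for example, vllm:num_requests_running
        num_requests_waiting: float  # for example, vllm:num_requests_waiting
        gpu_cache_usage: float       # for example, vllm:gpu_cache_usage_perc (0.0-1.0)

    def load_score(m: InstanceMetrics) -> float:
        """Lower is better. The weights below are purely illustrative."""
        return (
            1.0 * m.num_requests_running
            + 2.0 * m.num_requests_waiting  # queued requests hurt latency the most
            + 10.0 * m.gpu_cache_usage      # a nearly full KV cache risks preemption
        )

    def pick_instance(instances):
        """Return the instance with the lowest load score, that is, the 'optimal' one."""
        return min(instances, key=load_score)

    candidates = [
        InstanceMetrics("instance-a", 30, 0, 0.84),
        InstanceMetrics("instance-b", 12, 0, 0.35),
        InstanceMetrics("instance-c", 25, 3, 0.60),
    ]
    print(pick_instance(candidates).name)  # instance-b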

Limits

  • LLM Intelligent Router applies to LLM inference scenarios. Only BladeLLM or vLLM can be used as the inference framework for backend instances.

  • LLM Intelligent Router provides benefits only when it is deployed in the same service group as multiple inference instances.

Deploy services

Deploy an LLM Intelligent Router service

The following deployment methods are supported:

Method 1: Deploy a service in the PAI console

  1. Go to the EAS page.

    1. Log on to the Platform for AI (PAI) console.

    2. In the left-side navigation pane, click Workspaces. On the Workspaces page, find the workspace to which you want to deploy the model and click its name to go to the Workspace Details page.

    3. In the left-side navigation pane, choose Model Deployment > Elastic Algorithm Service (EAS) to go to the Elastic Algorithm Service (EAS) page.

  2. On the Elastic Algorithm Service (EAS) page, click Deploy Service.

  3. On the Deploy Service page, select one of the following deployment methods:

    • Choose Custom Model Deployment > Custom Deployment.

    • Choose Scenario-based Model Deployment > LLM Deployment. In the Basic Information section of the LLM Deployment page, select a model type that supports inference acceleration, such as Qwen2-7b, Qwen1.5-1.8b, Qwen1.5-7b, Qwen1.5-14b, llama3-8b, llama2-7b, llama2-13b, chatglm3-6b, baichuan2-7b, baichuan2-13b, falcon-7b, yi-6b, mistral-7b-instruct-v0.2, gemma-2b-it, gemma-7b-it, or deepseek-coder-7b-instruct-v1.5.

  4. In the Service Configuration section, click LLM Intelligent Router, turn on LLM Intelligent Router, and then click Create LLM Intelligent Router in the drop-down list.

  5. In the Create LLM Intelligent Router panel, configure the parameters and click Deploy. The following list describes the parameters.

    Basic Information

    • Service Name: Specify a name for the LLM Intelligent Router service as prompted. Example: llm_router.

    Resource Configuration

    • Deployment: Configure resources for the LLM Intelligent Router service. Default configurations:

      • Minimum Instances: 2. To ensure that the LLM Intelligent Router service can run on multiple instances, we recommend that you set Minimum Instances to 2.

      • CPU: 2 Cores.

      • Memory: 4 GB.

    • Schedule Resource: Configure scheduling resources for the LLM scheduler. Default configurations:

      • CPU: 2 Cores.

      • Memory: 4 GB.

    • Inference Acceleration: Select the inference framework that you use in the image. The following two frameworks are supported:

      • BladeLLM Inference Acceleration

      • Open-source vLLM Inference Acceleration

After the LLM Intelligent Router service is deployed, a service group is created at the same time. You can view the service group on the Group Service tab of the Elastic Algorithm Service (EAS) page. The service group is named in the format group_<LLM Intelligent Router service name>.

Note

Intelligent routing conflicts with service queues. We recommend that you do not create a queue service in a service group.

Method 2: Deploy a service by using JSON

  1. Go to the EAS page.

    1. Log on to the Platform for AI (PAI) console.

    2. In the left-side navigation pane, click Workspaces. On the Workspaces page, find the workspace to which you want to deploy the model and click its name to go to the Workspace Details page.

    3. In the left-side navigation pane, choose Model Deployment > Elastic Algorithm Service (EAS) to go to the Elastic Algorithm Service (EAS) page.

  2. On the Elastic Algorithm Service (EAS) page, click Deploy Service.

  3. In the Custom Model Deployment section of the Deploy Service page, click JSON Deployment.

  4. In the configuration editor section of the JSON Deployment page, configure the parameters and click Deploy.

    The following sample code provides examples:

    Note
    • To prevent a single point of failure (SPOF), we recommend that you set the metadata.instance parameter to at least 2 to ensure that LLM Intelligent Router can run on multiple instances.

    • If an LLM Intelligent Router service is deployed before an LLM service is deployed, the LLM Intelligent Router service remains in the waiting state until the LLM service is deployed.

    • Basic configurations:

      {
          "cloud": {
              "computing": {
                  "instance_type": "ecs.c7.large"
              }
          },
          "metadata": {
              "type": "LLMGatewayService",
              "cpu": 4,
              "group": "llm_group",
              "instance": 2,
              "memory": 4000,
              "name": "llm_router"
          }
      }

      To deploy an LLM Intelligent Router service, set the metadata.type parameter to LLMGatewayService. For information about other parameters, see Parameters of model services. After the service is deployed, EAS automatically creates a composite service that contains LLM Intelligent Router and an LLM scheduler. LLM Intelligent Router uses the resource configuration that you specify in the metadata section. The default resource configuration of the LLM scheduler is 4 vCPUs and 4 GiB of memory.

    • Advanced configurations: If the basic configurations cannot meet your business requirements, you can prepare a JSON file to specify the following advanced configurations.

      {
          "cloud": {
              "computing": {
                  "instance_type": "ecs.c7.large"
              }
          },
          "llm_gateway": {
              "infer_backend": "vllm",
              "max_queue_size": 128,
              "retry_count": 2,
              "wait_schedule_timeout": 5000,
              "wait_schedule_try_period": 500
          },
          "llm_scheduler": {
              "cpu": 4,
              "memory": 4000
          },
          "metadata": {
              "cpu": 2,
              "group": "llm_group",
              "instance": 2,
              "memory": 4000,
              "name": "llm_router",
              "type": "LLMGatewayService"
          }
      }

      The following list describes the key parameters. For information about other parameters, see Parameters of model services.

      • llm_gateway.infer_backend: The inference framework used by the LLM. Valid values:

        • vllm (default)

        • bladellm

      • llm_gateway.max_queue_size: The maximum length of the cache queue in LLM Intelligent Router. Default value: 128. If the processing capability of the backend inference framework is exceeded, LLM Intelligent Router caches requests in the queue and forwards them when an inference instance becomes available.

      • llm_gateway.retry_count: The number of retries. Default value: 2. If the backend inference instance to which a request is forwarded is abnormal, LLM Intelligent Router retries the request on another instance.

      • llm_gateway.wait_schedule_timeout: The timeout period. Default value: 5000. Unit: milliseconds. If the LLM scheduler is unavailable for longer than this period, LLM Intelligent Router falls back to a simple round-robin policy to distribute requests.

      • llm_gateway.wait_schedule_try_period: The interval at which LLM Intelligent Router retries the connection to the LLM scheduler during the timeout period specified by wait_schedule_timeout. Default value: 500. Unit: milliseconds.

      • llm_scheduler.cpu: The number of vCPUs of the LLM scheduler. Default value: 4.

      • llm_scheduler.memory: The memory size of the LLM scheduler. Default value: 4. Unit: GiB.

      • llm_scheduler.instance_type: The instance type of the LLM scheduler. An instance type defines the number of vCPUs and the memory size. If you specify this parameter, you do not need to separately specify llm_scheduler.cpu and llm_scheduler.memory.

Deploy an LLM service

The following deployment methods are supported:

Method 1: Deploy a service in the PAI console

  1. Go to the EAS page.

    1. Log on to the Platform for AI (PAI) console.

    2. In the left-side navigation pane, click Workspaces. On the Workspaces page, find the workspace to which you want to deploy the model and click its name to go to the Workspace Details page.

    3. In the left-side navigation pane, choose Model Deployment > Elastic Algorithm Service (EAS) to go to the Elastic Algorithm Service (EAS) page.

  2. On the Elastic Algorithm Service (EAS) page, click Deploy Service.

  3. On the Deploy Service page, select one of the following deployment methods and configure the key parameters. For information about other parameters, see Deploy a model service in the PAI console.

    • In the Custom Model Deployment section, select Custom Deployment. On the Create Service page, configure the key parameters described below.

      Model Service Information

      • Deployment Method: Select Deploy Web App by Using Image.

      • Select Image: PAI images and custom images are supported.

        • If you set Select Image to PAI Image, select the chat-llm-webui image and select 3.0-vllm or 3.0-blade as the image version.

        • If you set Select Image to Image Address, enter a custom image address in the field. The inference framework of the custom image must be BladeLLM or vLLM.

      Service Configuration

      • LLM Intelligent Router: Turn on LLM Intelligent Router and select the LLM Intelligent Router service that you deployed.

    • In the Scenario-based Model Deployment section, click LLM Deployment. On the LLM Deployment page, configure the key parameters described below.

      Basic Information

      • Model Type: Select a model type that supports inference acceleration from the following model types: Qwen2-7b, Qwen1.5-1.8b, Qwen1.5-7b, Qwen1.5-14b, llama3-8b, llama2-7b, llama2-13b, chatglm3-6b, baichuan2-7b, baichuan2-13b, falcon-7b, yi-6b, mistral-7b-instruct-v0.2, gemma-2b-it, gemma-7b-it, and deepseek-coder-7b-instruct-v1.5.

      Service Configuration

      • LLM Intelligent Router: Turn on LLM Intelligent Router and select the LLM Intelligent Router service that you deployed.

  4. After you configure the parameters, click Deploy.

Method 2: Deploy a service by using JSON

In this example, the open source vLLM-0.3.3 image, which is a built-in image provided by PAI, is used. Perform the following steps:

  1. On the Elastic Algorithm Service (EAS) page, click Deploy Service.

  2. In the Custom Model Deployment section of the Deploy Service page, click JSON Deployment.

  3. In the configuration editor section of the JSON Deployment page, configure the parameters and click Deploy.

    The following sample code provides an example:

    {
        "cloud": {
            "computing": {
                "instance_type": "ecs.gn7i-c16g1.4xlarge"
            }
        },
        "containers": [
            {
                "image": "eas-registry-vpc.<regionid>.cr.aliyuncs.com/pai-eas/chat-llm:vllm-0.3.3",
                "port": 8000,
                "script": "python3 -m vllm.entrypoints.openai.api_server --served-model-name llama2 --model /huggingface/models--meta-llama--Llama-2-7b-chat-hf/snapshots/c1d3cabadba7ec7f1a9ef2ba5467ad31b3b84ff0/"
            }
        ],
        "features": {
            "eas.aliyun.com/extra-ephemeral-storage": "50Gi"
        },
        "metadata": {
            "cpu": 16,
            "gpu": 1,
            "instance": 5,
            "memory": 60000,
            "group": "llm_group",
            "name": "vllm_service"
        }
    }

    The following table describes the key parameters. For information about other parameters, see Parameters of model services.

    • metadata.group: the name of the service group to which the LLM service belongs. The LLM service must belong to the same service group as the LLM Intelligent Router service. This way, the LLM service can register with the LLM scheduler and report related metrics, and LLM Intelligent Router can forward traffic.

      • If you deployed the LLM Intelligent Router service in the console, view the name of its service group on the Group Service tab of the Elastic Algorithm Service (EAS) page. The name of the service group is in the format group_<LLM Intelligent Router service name>.

      • If you deployed the LLM Intelligent Router service by using JSON, use the group name that you specified in its configuration, which is llm_group in this example.

    • containers.image: In this example, the preset image provided by PAI is used. You must replace <regionid> with the ID of the region in which you want to deploy the service. For example, if you want to deploy the service in the China (Beijing) region, replace <regionid> with cn-beijing. A sketch that renders this configuration for a given region follows this list.
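
    The following Python sketch renders the preceding JSON configuration for a given region. It reuses the group name llm_group and the service name vllm_service from this example; the helper function is only a convenience for generating the JSON that you paste into the configuration editor and is not part of any PAI SDK.

      import json

      def build_llm_service_config(region_id: str, group: str = "llm_group") -> dict:
          """Build the JSON deployment configuration used in this example."""
          image = f"eas-registry-vpc.{region_id}.cr.aliyuncs.com/pai-eas/chat-llm:vllm-0.3.3"
          return {
              "cloud": {"computing": {"instance_type": "ecs.gn7i-c16g1.4xlarge"}},
              "containers": [
                  {
                      "image": image,
                      "port": 8000,
                      "script": (
                          "python3 -m vllm.entrypoints.openai.api_server "
                          "--served-model-name llama2 "
                          "--model /huggingface/models--meta-llama--Llama-2-7b-chat-hf/"
                          "snapshots/c1d3cabadba7ec7f1a9ef2ba5467ad31b3b84ff0/"
                      ),
                  }
              ],
              "features": {"eas.aliyun.com/extra-ephemeral-storage": "50Gi"},
              "metadata": {
                  "cpu": 16,
                  "gpu": 1,
                  "instance": 5,
                  "memory": 60000,
                  "group": group,  # must match the service group of LLM Intelligent Router
                  "name": "vllm_service",
              },
          }

      # Example: generate the configuration for the China (Beijing) region.
      print(json.dumps(build_llm_service_config("cn-beijing"), indent=4))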

Access an LLM Intelligent Router service

Obtain the endpoint and token for accessing an LLM Intelligent Router service

  1. Go to the EAS-Online Model Services page.

    1. Log on to the PAI console.

    2. In the left-side navigation pane, click Workspaces. On the Workspaces page, click the name of the workspace to which you want to deploy the model.

    3. In the left-side navigation pane, choose Model Deployment > Elastic Algorithm Service (EAS) to go to the Elastic Algorithm Service (EAS) page.

  2. Find the LLM Intelligent Router service that you deployed and click Invocation Method in the Service Type column.

  3. On the Public Endpoint tab of the Invocation Method dialog box, view the endpoint and token for accessing the service.

  4. Configure the endpoint for accessing the LLM Intelligent Router service.

    Configuration rule:

    • Format: {domain}/api/predict/{group_name}.{router_service_name}_llm_gateway/{endpoint}

    • Replace {endpoint} with the API endpoint that is supported by your LLM service. Example: v1/completions.

    Example: This example uses the LLM Intelligent Router service that is deployed by using JSON. If the endpoint obtained in Step 3 is http://175805416243****.cn-beijing.pai-eas.aliyuncs.com/api/predict/llm_group.llm_router, the endpoint for accessing the LLM Intelligent Router service is http://175805416243****.cn-beijing.pai-eas.aliyuncs.com/api/predict/llm_group.llm_router_llm_gateway/v1/completions. A small helper that assembles this URL follows this step.
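
    The following Python helper assembles the access URL according to the preceding rule. The function name and the example values, which are taken from the JSON deployment in this topic, are for illustration only.

      def router_url(domain: str, group_name: str, router_service_name: str, endpoint: str) -> str:
          """Build {domain}/api/predict/{group_name}.{router_service_name}_llm_gateway/{endpoint}."""
          return (
              f"{domain}/api/predict/"
              f"{group_name}.{router_service_name}_llm_gateway/"
              f"{endpoint.lstrip('/')}"
          )

      print(router_url(
          "http://175805416243****.cn-beijing.pai-eas.aliyuncs.com",
          "llm_group",
          "llm_router",
          "v1/completions",
      ))
      # http://175805416243****.cn-beijing.pai-eas.aliyuncs.com/api/predict/llm_group.llm_router_llm_gateway/v1/completions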

Test the access

In the terminal, run the following command to access the LLM Intelligent Router service:

$curl -H "Authorization: xxxxxx" -H "Content-Type: application/json" http://********.cn-beijing.pai-eas.aliyuncs.com/api/predict/{group_name}.{router_service_name}_llm_gateway/v1/completions -d '{"model": "llama2", "prompt": "I need your help writing an article. I will provide you with some background information to begin with. And then I will provide you with directions to help me write the article.", "temperature": 0.0, "best_of": 1, "n_predict": 34, "max_tokens": 34, "stream": true}'

In the preceding command:

  • "Authorization: xxxxxx": Specify the token obtained in the preceding step.

  • http://********.cn-beijing.pai-eas.aliyuncs.com/api/predict/{group_name}.{router_service_name}_llm_gateway/v1/completions: Replace the value with the endpoint obtained in the preceding step.

Sample response:

data: {"id":"0d9e74cf-1025-446c-8aac-89711b2e9a38","choices":[{"finish_reason":"","index":0,"logprobs":null,"text":"\n"}],"object":"text_completion","usage":{"prompt_tokens":36,"completion_tokens":1,"total_tokens":37},"error_info":null}

data: {"id":"0d9e74cf-1025-446c-8aac-89711b2e9a38","choices":[{"finish_reason":"","index":0,"logprobs":null,"text":"\n"}],"object":"text_completion","usage":{"prompt_tokens":36,"completion_tokens":2,"total_tokens":38},"error_info":null}

data: {"id":"0d9e74cf-1025-446c-8aac-89711b2e9a38","choices":[{"finish_reason":"","index":0,"logprobs":null,"text":""}],"object":"text_completion","usage":{"prompt_tokens":36,"completion_tokens":3,"total_tokens":39},"error_info":null}

...

[DONE]
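
The following Python sketch is equivalent to the preceding curl command. It assumes the third-party requests library, streams the Server-Sent Events response line by line, and stops at the [DONE] marker. Replace the URL and token placeholders with the values obtained in the preceding steps.

    import json
    import requests

    # Replace with the endpoint and token obtained in the preceding steps.
    URL = "http://********.cn-beijing.pai-eas.aliyuncs.com/api/predict/{group_name}.{router_service_name}_llm_gateway/v1/completions"
    TOKEN = "xxxxxx"

    payload = {
        "model": "llama2",
        "prompt": "I need your help writing an article. I will provide you with some background information to begin with. And then I will provide you with directions to help me write the article.",
        "temperature": 0.0,
        "best_of": 1,
        "n_predict": 34,
        "max_tokens": 34,
        "stream": True,
    }

    with requests.post(
        URL,
        headers={"Authorization": TOKEN, "Content-Type": "application/json"},
        json=payload,
        stream=True,  # read the streaming response incrementally
    ) as response:
        response.raise_for_status()
        for line in response.iter_lines(decode_unicode=True):
            if not line:
                continue  # skip blank keep-alive lines
            if line.strip() == "[DONE]":
                break  # end-of-stream marker, as shown in the sample response
            if line.startswith("data:"):
                chunk = json.loads(line[len("data:"):].strip())
                print(chunk["choices"][0]["text"], end="", flush=True)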

View service monitoring metrics

After the test is complete, you can view the service monitoring metrics to understand the performance of the service. Perform the following steps:

  1. On the Elastic Algorithm Service (EAS) page, find the LLM Intelligent Router service that you deployed and click the icon in the Monitoring column.

  2. On the Monitoring tab of the page that appears, view the following metrics.

    • Token Throughput: the throughput of input and output tokens of the LLM.

      • IN: the throughput of input tokens of the LLM.

      • OUT: the throughput of output tokens of the LLM.

    • GPU Cache Usage: the percentage of GPU memory consumed by the key-value (KV) cache.

    • Engine Current Requests: the number of concurrent requests on the LLM engine in real time.

      • Running: the number of requests that the LLM engine is processing.

      • Waiting: the number of requests that are queuing in the LLM engine.

    • Gateway Current Requests: the number of requests on LLM Intelligent Router in real time.

      • Total: the total number of requests that are received by LLM Intelligent Router in real time.

      • Pending: the number of requests that are cached in LLM Intelligent Router and are waiting to be processed by the LLM engine.

    • Time To First Token: the latency of the first output token.

      • Max: the maximum latency of the first output token.

      • Avg: the average latency of the first output token.

      • Min: the minimum latency of the first output token.

      • TPxx: the percentiles of the latency of the first output token.

    • Time Per Output Token: the latency of an output token.

      • Max: the maximum latency of an output token.

      • Avg: the average latency of an output token.

      • Min: the minimum latency of an output token.

      • TPxx: the percentiles of the latency of an output token.

Appendix 1: Comparison of test results

The following test results are for reference only. Actual performance improvements depend on your environment and workload.

Test environment

  • Model: Llama2-7B

  • Data: ShareGPT_V3_unfiltered_cleaned_split.json

  • Client code (modified): vllm/benchmarks/benchmark_serving.py

  • GPU type: ecs.gn7i-c8g1.2xlarge (A10, 24 GB)

  • Inference engine: vLLM

  • Number of backend instances: 5

Code for stress testing

This section provides the test code for the vLLM 0.3.3 LLM service used in this example.

  • Download address: benchmarks.tgz.

  • Main files:

    benchmarks
    ├── benchmark.sh # The test script. 
    ├── backend_request_func.py
    ├── benchmark_serving.py
    ├── samples.txt  # The request data excerpted from the file ShareGPT_V3_unfiltered_cleaned_split.json. 
    └── tokenizer    # The Llama2-7B tokenizer.
        ├── tokenizer_config.json
        ├── tokenizer.json
        └── tokenizer.model
  • Test code:

    # Install vLLM of the required version. 
    $pip install vllm==0.3.3 
    # Replace {router_service_token} with the token for accessing the LLM Intelligent Router service. 
    $export OPENAI_API_KEY={router_service_token}
    
    # Run the client test code. 
    $python ./benchmark_serving.py --base-url http://********.cn-beijing.pai-eas.aliyuncs.com/api/predict/{group_name}.{router_service_name}_llm_gateway \
    --endpoint /v1/completions \
    --tokenizer ./tokenizer/ \
    --model-id llama2 \
    --load-inputs ./samples.txt \
    --request-count 10000 \
    --max-concurrent 160

Test results

Metric                    Without LLM Intelligent Router   With LLM Intelligent Router   Improvement
Successful requests       9851                             9851                          -
Benchmark duration        1138.429295 s                    1060.888924 s                 +6.8%
Total input tokens        1527985                          1527985                       -
Total generated tokens    2808261                          2812861                       -
Input token throughput    1342.19 tokens/s                 1440.29 tokens/s              +7.3%
Output token throughput   2466.79 tokens/s                 2651.42 tokens/s              +7.5%
Mean TTFT                 1981.86 ms                       304.00 ms                     +84%
Median TTFT               161.69 ms                        158.67 ms                     +1.8%
P99 TTFT                  19396.84 ms                      3534.64 ms                    +81%
Mean TPOT                 120.33 ms                        69.41 ms                      +42%
P99 TPOT                  852.49 ms                        260.33 ms                     +69%

Appendix 2: Metrics APIs

The scheduling algorithm of LLM Intelligent Router schedules requests to idle instances based on the metrics of different backend inference instances. To use LLM Intelligent Router, you must implement the Metrics API in the inference instances to report relevant metrics based on your business requirements. LLM Intelligent Router is compatible with the Metrics APIs of vLLM and BladeLLM.

The following code shows the sample output of the Metrics API of vLLM:

# HELP vllm:num_requests_running Number of requests currently running on GPU.
# TYPE vllm:num_requests_running gauge
vllm:num_requests_running{model_name="Llama-2-7b-chat-hf"} 30.0
# HELP vllm:num_requests_swapped Number of requests swapped to CPU.
# TYPE vllm:num_requests_swapped gauge
vllm:num_requests_swapped{model_name="Llama-2-7b-chat-hf"} 0.0
# HELP vllm:num_requests_waiting Number of requests waiting to be processed.
# TYPE vllm:num_requests_waiting gauge
vllm:num_requests_waiting{model_name="Llama-2-7b-chat-hf"} 0.0
# HELP vllm:gpu_cache_usage_perc GPU KV-cache usage. 1 means 100 percent usage.
# TYPE vllm:gpu_cache_usage_perc gauge
vllm:gpu_cache_usage_perc{model_name="Llama-2-7b-chat-hf"} 0.8426270136307311
# HELP vllm:prompt_tokens_total Number of prefill tokens processed.
# TYPE vllm:prompt_tokens_total counter
vllm:prompt_tokens_total{model_name="Llama-2-7b-chat-hf"} 15708.0
# HELP vllm:generation_tokens_total Number of generation tokens processed.
# TYPE vllm:generation_tokens_total counter
vllm:generation_tokens_total{model_name="Llama-2-7b-chat-hf"} 13419.0

The LLM scheduler converts and aggregates the preceding metric data and prioritizes the backend inference instances based on the results. LLM Intelligent Router queries the LLM scheduler for the backend inference instance that has the highest priority and forwards requests to that instance.
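
For illustration, the following Python sketch parses Prometheus-style exposition text, such as the sample above, into a simple dictionary of metric values. The parser and the set of watched metric names are illustrative and are not the actual implementation of the LLM scheduler.

    # Illustrative parser for Prometheus-style metric text; not the actual LLM scheduler code.
    WATCHED_METRICS = {
        "vllm:num_requests_running",
        "vllm:num_requests_waiting",
        "vllm:gpu_cache_usage_perc",
        "vllm:prompt_tokens_total",
        "vllm:generation_tokens_total",
    }

    def parse_vllm_metrics(text: str) -> dict:
        """Parse exposition text into {metric_name: value}, keeping only watched metrics."""
        values = {}
        for line in text.splitlines():
            line = line.strip()
            if not line or line.startswith("#"):
                continue  # skip the # HELP and # TYPE comment lines
            name_and_labels, _, raw_value = line.rpartition(" ")
            name = name_and_labels.split("{", 1)[0]
            if name in WATCHED_METRICS:
                values[name] = float(raw_value)
        return values

    sample = '''
    vllm:num_requests_running{model_name="Llama-2-7b-chat-hf"} 30.0
    vllm:num_requests_waiting{model_name="Llama-2-7b-chat-hf"} 0.0
    vllm:gpu_cache_usage_perc{model_name="Llama-2-7b-chat-hf"} 0.8426270136307311
    '''
    print(parse_vllm_metrics(sample))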

The following lists the supported LLM inference frameworks and the corresponding metrics.

Note

LLM Intelligent Router can also be used with other inference frameworks that expose metrics compatible with those listed below. In this case, LLM Intelligent Router can operate as expected and schedule requests in an efficient manner. A minimal sketch of such a compatible metrics endpoint follows the lists.

vLLM

  • vllm:num_requests_running: the number of requests that are running.

  • vllm:num_requests_waiting: the number of requests that are waiting in the queue.

  • vllm:gpu_cache_usage_perc: the percentage of GPU memory consumed by the KV cache.

  • vllm:prompt_tokens_total: the total number of input tokens.

  • vllm:generation_tokens_total: the total number of output tokens.

BladeLLM

  • decode_batch_size_mean: the number of requests that are running.

  • wait_queue_size_mean: the number of requests that are waiting in the queue.

  • block_usage_gpu_mean: the percentage of GPU memory consumed by the KV cache.

  • tps_total: the total number of tokens that are processed per second.

  • tps_out: the number of output tokens that are generated per second.
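
The following Python sketch shows one way that a custom inference framework could expose a compatible metrics endpoint by using only the Python standard library. The metric values, the model_name label, the /metrics path, and the port are assumptions for illustration; report live values from your own engine and use the path and port that your service actually serves.

    # Minimal HTTP endpoint that exposes vLLM-compatible metric names.
    # Illustrative only; replace the static values with live engine statistics.
    from http.server import BaseHTTPRequestHandler, HTTPServer

    CURRENT_METRICS = {
        "vllm:num_requests_running": 3.0,
        "vllm:num_requests_waiting": 1.0,
        "vllm:gpu_cache_usage_perc": 0.42,
        "vllm:prompt_tokens_total": 15708.0,
        "vllm:generation_tokens_total": 13419.0,
    }

    class MetricsHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            if self.path != "/metrics":
                self.send_error(404)
                return
            labels = '{model_name="my-model"}'
            body = "".join(
                f"{name}{labels} {value}\n" for name, value in CURRENT_METRICS.items()
            ).encode("utf-8")
            self.send_response(200)
            self.send_header("Content-Type", "text/plain; version=0.0.4")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

    if __name__ == "__main__":
        HTTPServer(("0.0.0.0", 8080), MetricsHandler).serve_forever()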