This topic describes real-time inference scenarios and how to use GPU-accelerated instances in idle mode to build cost-effective real-time inference services with low latency.
Scenarios
Characteristics of real-time inference workloads
Real-time inference workloads often have one or more of the following characteristics:
Low latency
Real-time inference workloads have strict requirements on the response time of each request. For example, the long-tail latency of 90% of requests (P90) must stay within hundreds of milliseconds.
Core links
In most cases, real-time inference workloads are generated in core business links, so they require a high success rate and cannot afford to have extended retries. The following items serve as examples:
Promotional content on launch pages and homepages: Product advertisements and recommendations that match the individual preferences of users need to be promptly and prominently displayed on the users' launch pages and homepages.
Real-time streaming services: In scenarios such as co-streaming, live streaming, and ultra-low latency playback, audio and video streams must be transmitted at extremely low end-to-end latency. The performance of real-time AI-based video super resolution and video recognition must also be guaranteed.
Fluctuating traffic
Business traffic fluctuates with user habits, experiencing peak and off-peak hours.
Low resource utilization
In most cases, GPU resources are planned based on traffic peaks, which leads to a significant amount of resources sitting dormant during off-peak hours. The resource utilization rate is generally lower than 30%.
Benefits of using Function Compute in real-time inference scenarios
GPU-accelerated instances in idle mode
Function Compute offers an idle mode feature for provisioned GPU-accelerated instances. If you want to mitigate cold starts and meet the low latency requirements of real-time inference workloads, you can configure provisioned GPU-accelerated instances with idle mode enabled. For more information, see Instance types and usage modes. This idle mode feature for GPU-accelerated instances delivers the following benefits:
Quick instance wake-up: Function Compute freezes GPU-accelerated instances based on your real-time workloads and automatically wakes them up when requests arrive. The wake-up process typically takes from several hundred milliseconds to a few seconds, depending on the workload. For details, see Activation duration of idle GPU-accelerated instances in this topic.
Cost-effective services: Measurement of the execution duration of instances in provisioned and on-demand modes varies. Provisioned instances in idle mode are billed at lower unit prices compared to active ones. For more information, see How am I charged for using a real-time inference service in Function Compute? While the overall cost of using provisioned GPU-accelerated instances in idle mode is higher compared to using on-demand instances, it is still over 50% lower than the cost of building GPU clusters in your on-premises environments.
Optimized request scheduling mechanism for inference scenarios
Function Compute provides a built-in intelligent scheduling mechanism to achieve load balancing between different GPU-accelerated instances within a function. This intelligent scheduling of Function Compute evenly distributes inference requests to backend GPU-accelerated instances, which improves the overall utilization of inference clusters.
GPU-accelerated instances in idle mode
After a GPU function is deployed, you can use provisioned GPU-accelerated instances with idle mode enabled to provide infrastructure capabilities for real-time inference scenarios. Function Compute horizontally scales the provisioned GPU-accelerated instances, dynamically adjusting resources based on metric-based scaling policies and actual workloads. Inference requests are preferentially routed to provisioned GPU-accelerated instances. Provisioned instances help reduce cold starts, which allows your inference services to consistently respond with low latency.

Idle mode helps reduce costs
After you enable the idle mode feature, the billing for GPU-accelerated instances is determined by two separate unit prices: one for idle GPUs and another for active GPUs. Function Compute automatically collects statistics and charges fees based on the instance status.
In the example shown in the following figure, a GPU-accelerated instance goes through five time windows (T0 to T4) from creation to destruction. The instance is active in T1 and T3, and idle in T0, T2, and T4. The following formula is used to calculate the total cost: (T0 + T2 + T4) x Unit price of idle GPUs + (T1 + T3) x Unit price of active GPUs. For more information about the unit prices of idle GPUs and active GPUs, see Billing overview.
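The following Python sketch illustrates how the preceding formula works. The unit prices and window durations are hypothetical placeholders used only for illustration; see Billing overview for the actual unit prices of idle and active GPUs.
# Hypothetical unit prices (per GPU-second); replace them with the values from Billing overview.
IDLE_GPU_PRICE = 0.00011
ACTIVE_GPU_PRICE = 0.00055

def instance_cost(idle_seconds, active_seconds):
    # Total cost = idle duration x idle unit price + active duration x active unit price
    return idle_seconds * IDLE_GPU_PRICE + active_seconds * ACTIVE_GPU_PRICE

# Example windows (in seconds): T0, T2, and T4 are idle; T1 and T3 are active.
idle_duration = 600 + 1200 + 300   # T0 + T2 + T4
active_duration = 900 + 450        # T1 + T3
print(f"Total cost: {instance_cost(idle_duration, active_duration):.4f}")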

How it works
Function Compute implements instant freezing and restoration for GPU-accelerated instances based on advanced Alibaba Cloud technologies. When a GPU-accelerated instance is not processing requests, Function Compute automatically freezes the instance and charges you based on the idle GPU unit price. This mechanism optimizes resource utilization and minimizes costs. When a new inference request arrives, Function Compute activates the instance to seamlessly execute the request. In this case, you are charged based on the active GPU unit price.

This process is completely transparent to users and does not affect user experience. Function Compute also ensures the accuracy and reliability of your inference service even when instances are frozen, providing stable and cost-effective computing capabilities.

Activation duration of idle GPU-accelerated instances
The duration to activate idle GPU-accelerated instances varies based on workloads. The following table lists the durations in typical inference scenarios for your reference.
Inference workload type | Activation duration (seconds)
OCR/NLP | 0.5–1
Stable Diffusion | 2
LLM | 3
Important
The activation duration varies with the model size. The actual duration may differ from the values listed in the preceding table.
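If you want to verify the activation duration for your own model, you can time a request that is sent after the instance has been idle for a while and compare it with a request sent while the instance is active. The following Python sketch is a hypothetical measurement script; the function URL is a placeholder, and the requests library is assumed to be available in your test environment.
import time

import requests  # assumed to be installed in your test environment

# Placeholder endpoint; replace it with the URL of your deployed GPU function.
URL = "https://your-gpu-function.cn-shanghai.fcapp.run/invoke"

def timed_request(payload="activation test"):
    # Measure the end-to-end latency of a single invocation.
    start = time.perf_counter()
    response = requests.post(URL, data=payload, headers={"Content-Type": "text/plain"})
    return response.status_code, time.perf_counter() - start

# The first request after an idle period includes the instance activation time.
print("after idle period:", timed_request())
# Subsequent requests hit an active instance and reflect pure inference latency.
print("while active:", timed_request())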
Usage notes
CUDA version
We recommend that you use CUDA 12.2 or an earlier version.
Image permissions
We recommend that you run container images as the default root user.
Instance logon
You cannot log on to an idle GPU-accelerated instance because the GPUs are frozen.
Graceful instance rotation
Function Compute rotates idle GPU-accelerated instances based on the workload. To ensure service quality, we recommend that you add lifecycle hooks to function instances for model warm-up and pre-inference. This way, your inference service can be provided immediately after the launch of a new instance. For more information, see Model warm-up.
Model warm-up and pre-inference
To reduce the latency of the initial wake-up of an idle GPU-accelerated instance, we recommend that you use the initialize hook in your code to warm up or preload your model. For more information, see Model warm-up.
Provisioned instance configurations
When you turn on the Idle Mode switch, the existing provisioned GPU-accelerated instances of the function are gracefully shut down. After they are released, provisioned instances are reallocated within a short period of time.
Built-in Metrics Server of inference frameworks
To improve the compatibility and performance of idle GPUs, we recommend that you disable the built-in Metrics Server of your inference frameworks, such as NVIDIA Triton Inference Server and TorchServe.
GPU-accelerated instance specifications
Only functions that are configured with whole GPUs (full GPU specifications) support idle mode. For more information about the specifications of GPU-accelerated instances, see Instance specifications.
Optimized request scheduling mechanism for inference scenarios
How it works
Function Compute adopts workload-based intelligent scheduling, a strategy that is notably superior to conventional round-robin scheduling. The platform monitors the task execution status of GPU-accelerated instances in real time and sends new requests to instances as soon as they become idle. This mechanism ensures efficient use of GPU resources, reduces resource waste and hot spots, and keeps the load on GPU-accelerated instances consistent with the actual utilization of GPU computing power.
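The following Python sketch is a conceptual illustration of the difference between round-robin scheduling and load-aware scheduling. It is a simplification for explanation only, not the actual scheduler implementation of Function Compute, and all names in it are illustrative.
import itertools

class Instance:
    def __init__(self, name, concurrency_limit):
        self.name = name
        self.in_flight = 0                      # requests currently being processed
        self.concurrency_limit = concurrency_limit

def round_robin_picker(instances):
    # Round robin: sends requests to instances in a fixed order,
    # regardless of how busy each instance currently is.
    cycle = itertools.cycle(instances)
    return lambda: next(cycle)

def load_aware_picker(instances):
    # Load-aware: prefers the instance with the fewest in-flight requests
    # that still has spare capacity, which keeps GPU utilization balanced.
    def pick():
        candidates = [i for i in instances if i.in_flight < i.concurrency_limit]
        return min(candidates, key=lambda i: i.in_flight) if candidates else None
    return pick
In Function Compute, this load-aware behavior is applied transparently on the platform side; you do not need to implement any scheduling logic in your function code.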
Scheduling effect
The built-in scheduling logic of Function Compute implements load balancing among different GPU-accelerated instances, and the scheduling is imperceptible to users. The following figure shows an example in which Tesla T4 GPUs are used.
Instance 1 | Instance 2 | Instance 3
Container support
GPU-accelerated instances of Function Compute can be used only in Custom Container runtimes. For more information about Custom Container runtimes, see Introduction to Custom Container.
Custom Container functions require a web server to be included in the image so that different code paths can be executed and functions can be invoked through events or HTTP requests. The web server mode is suitable for multi-path request execution scenarios such as AI training and inference.
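For reference, the following is a minimal sketch of such a web server, assuming Flask. The DummyModel class and the /invoke route are illustrative assumptions, and port 9000 matches the customContainerConfig.port used in the s.yaml example later in this topic.
from flask import Flask, request

app = Flask(__name__)
model = None  # placeholder for the inference model loaded by your framework

class DummyModel:
    # Stand-in for a real model so that this sketch can run on its own.
    def predict(self, text):
        return {"length": len(text)}

@app.route('/initialize', methods=['POST'])
def initialize():
    # Load (or warm up) the model once when the instance starts.
    global model
    model = DummyModel()
    return "initialized\n"

@app.route('/invoke', methods=['POST'])
def invoke():
    # If the initialize lifecycle hook is configured, Function Compute calls
    # /initialize before invocations; the fallback below only guards against
    # running this sketch standalone.
    global model
    if model is None:
        model = DummyModel()
    text = request.get_data(as_text=True)
    return {"input": text, "result": model.predict(text)}

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=9000)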
Deployment methods
You can deploy your models in Function Compute by using Serverless Devs or the Function Compute console.
For more deployment examples, see start-fc-gpu.
Model warm-up
To address the long processing time of initial requests after a model is launched, Function Compute provides a model warm-up feature. This feature enables a model to enter the working state immediately after it is launched.
We recommend that you configure the initialize lifecycle hook in Function Compute. Function Compute automatically executes the business logic in the hook to warm up your models.
For more information, see Lifecycle hooks for function instances.
Add an /initialize invocation path that accepts the POST method to the HTTP server that you build, and place the model warm-up logic under the /initialize path. You can have the model perform simple inferences to achieve the warm-up effect.
The following sample code provides an example in Python:
from flask import Flask, request  # the sample HTTP server in this example is built with Flask

app = Flask(__name__)
# `model` is assumed to be the inference model that your code has already loaded.

def prewarm_inference():
    # Run a simple inference so that the model is fully loaded and ready on the GPU.
    res = model.inference()

@app.route('/initialize', methods=['POST'])
def initialize():
    request_id = request.headers.get("x-fc-request-id", "")
    print("FC Initialize Start RequestId: " + request_id)
    prewarm_inference()
    print("FC Initialize End RequestId: " + request_id)
    return "Function is initialized, request_id: " + request_id + "\n"
On the Function Details page, choose , and then click Modify to configure lifecycle hooks.
Configure auto scaling in real-time inference scenarios
Use Serverless Devs
Use the Function Compute console
Prerequisites
Make sure that the following operations are complete in the region in which GPU-accelerated instances reside:
Serverless Devs is installed. For more information, see Quick start.
Serverless Devs is configured. For more information, see Configure Serverless Devs.
1. Deploy a function
Run the following command to clone the project:
git clone https://github.com/devsapp/start-fc-gpu.git
Run the following command to go to the directory of the project:
cd /root/start-fc-gpu/fc-http-gpu-inference-paddlehub-nlp-porn-detection-lstm/src/
The following code snippet shows the structure of the project.
.
├── hook
│ └── index.js
└── src
├── code
│ ├── Dockerfile
│ ├── app.py
│ ├── hub_home
│ │ ├── conf
│ │ ├── modules
│ │ └── tmp
│ └── test
│ └── client.py
└── s.yaml
Run the following command to use Docker to build an image and push the image to your image repository:
export IMAGE_NAME="registry.cn-shanghai.aliyuncs.com/fc-gpu-demo/paddle-porn-detection:v1"
# sudo docker build -f ./code/Dockerfile -t $IMAGE_NAME .
# sudo docker push $IMAGE_NAME
Important
The PaddlePaddle framework is large, and building the image for the first time takes about one hour. Function Compute provides a public image that can be pulled over a VPC for you to use directly. If you use the public image, you do not need to run the preceding docker build or docker push commands.
Edit the s.yaml file.
edition: 3.0.0
name: container-demo
access: default
vars:
  region: cn-shanghai
resources:
  gpu-best-practive:
    component: fc3
    props:
      region: ${vars.region}
      description: This is the demo function deployment
      handler: not-used
      timeout: 1200
      memorySize: 8192
      cpu: 2
      gpuMemorySize: 8192
      diskSize: 512
      instanceConcurrency: 1
      runtime: custom-container
      environmentVariables:
        FCGPU_RUNTIME_SHMSIZE: '8589934592'
      customContainerConfig:
        image: >-
          registry.cn-shanghai.aliyuncs.com/serverless_devs/gpu-console-supervising:paddle-porn-detection
        port: 9000
      internetAccess: true
      logConfig:
        enableRequestMetrics: true
        enableInstanceMetrics: true
        logBeginRule: DefaultRegex
        project: z****
        logstore: log****
      functionName: gpu-porn-detection
      gpuConfig:
        gpuMemorySize: 8192
        gpuType: fc.gpu.tesla.1
      triggers:
        - triggerName: httpTrigger
          triggerType: http
          triggerConfig:
            authType: anonymous
            methods:
              - GET
              - POST
Run the following command to deploy the function:
sudo s deploy --skip-push true -t s.yaml
After the execution, a URL is returned in the output. Copy this URL for subsequent tests. Example URL: https://gpu-poretection-****.cn-shanghai.fcapp.run.
2. Test the function and view monitoring data
Run the curl command to call the function. The following code snippet provides an example. The URL obtained in the previous step is used in this command.
curl https://gpu-poretection-gpu-****.cn-shanghai.fcapp.run/invoke -H "Content-Type: text/plain" --data "Nice to meet you"
If the following output is returned, the test is passed.
[{"text": "Nice to meet you", "porn_detection_label": 0, "porn_detection_key": "not_porn", "porn_probs": 0.0, "not_porn_probs": 1.0}]%
Log on to the Function Compute console. In the left-side navigation pane, click Functions. Select a region. Find the function that you want to manage and click the function name. On the Function Details page, choose to view the changes of GPU-related metrics.
3. Configure an auto scaling policy
In the directory in which the s.yaml file is located, create the provision.json template.
The following sample code provides an example template. This template uses the provisioned concurrency utilization as the tracking metric, with a target of 0.3. The minimum number of instances is 2 and the maximum number of instances is 30.
{
  "targetTrackingPolicies": [
    {
      "name": "scaling-policy-demo",
      "startTime": "2024-07-01T16:00:00.000Z",
      "endTime": "2024-07-30T16:00:00.000Z",
      "metricType": "ProvisionedConcurrencyUtilization",
      "metricTarget": 0.3,
      "minCapacity": 2,
      "maxCapacity": 30
    }
  ]
}
Run the following command to deploy the scaling policy:
sudo s provision put --target 1 --targetTrackingPolicies ./provision.json --qualifier LATEST -t s.yaml -a {access}
Run the sudo s provision list command for verification. If the values of target and current in the output are the same, the provisioned instances are allocated as expected and the auto scaling policy is deployed correctly.
[2023-05-10 14:49:03] [INFO] [FC] - Getting list provision: gpu-best-practive-service
gpu-best-practive:
  -
    serviceName: gpu-best-practive-service
    qualifier: LATEST
    functionName: gpu-porn-detection
    resource: 143199913651****#gpu-best-practive-service#LATEST#gpu-porn-detection
    target: 1
    current: 1
    scheduledActions: null
    targetTrackingPolicies:
      -
        name: scaling-policy-demo
        startTime: 2024-07-01T16:00:00.000Z
        endTime: 2024-07-30T16:00:00.000Z
        metricType: ProvisionedConcurrencyUtilization
        metricTarget: 0.3
        minCapacity: 2
        maxCapacity: 30
    currentError:
    alwaysAllocateCPU: true
Your model is successfully deployed and ready to use when your provisioned instance is created.
Release provisioned instances for a function.
Run the following command to disable an auto scaling policy and set the number of provisioned instances to 0:
sudo s provision put --target 0 --qualifier LATEST -t s.yaml -a {access}
Run the following command to check whether the auto scaling policy is disabled:
s provision list -a {access}
If the following output is returned, the auto scaling policy is disabled:
[2023-05-10 14:54:46] [INFO] [FC] - Getting list provision: gpu-best-practive-service
End of method: provision
Prerequisites
A GPU function is created. For more information, see Create a Custom Container function.
Procedure
Log on to the Function Compute console. In the left-side navigation pane, click Functions. In the top navigation bar, select a region. On the page that appears, find the function that you want to manage. In the function configurations, enable instance-level metrics for the function.

On the Function Details page, choose to obtain the URL of the HTTP trigger for subsequent tests.

Run the curl command to test the function. On the Function Details page, choose to view the changes of GPU-related metrics.
curl https://gpu-poretection****.cn-shanghai.fcapp.run/invoke -H "Content-Type: text/plain" --data "Nice to meet you"
On the Function Details page, choose . Then, click Create Provisioned Instance Policy to configure an auto scaling policy.

After the configuration is complete, you can choose on the Function Details page to view changes of provisioned instances.
Important
If you no longer require provisioned GPU-accelerated instances, delete them at your earliest opportunity.
FAQ
How am I charged for using a real-time inference service in Function Compute?
For information about the billing of Function Compute, see Billing overview. The billing method of provisioned instances is different from that of on-demand instances. Pay attention to your bill details.
Why do latencies still occur after I configured an auto scaling policy?
You can use a more aggressive auto scaling policy to provision instances in advance, which prevents performance strain caused by sudden bursts of requests.
Why is the number of instances not increased when the tracking metric reaches the threshold?
The metrics of Function Compute are collected at the minute level. The scale-out mechanism is triggered only when the metric value remains above the threshold for a period of time.