
Function Compute: Real-time inference scenarios

Last Updated: Nov 15, 2024

This topic describes real-time inference scenarios and how to use the idle mode feature of GPU-accelerated instances to build low-latency and cost-effective real-time inference services.

Scenarios

Features of real-time inference applications

Workloads of real-time inference scenarios feature one or more of the following characteristics:

  • Low latency

    Real-time inference workloads have strict requirements on the response time of each request: the long-tail latency must stay within hundreds of milliseconds for 90% of requests.

  • Core links

    In most cases, real-time inference occurs in core business links and requires a high success rate of inference. Long-term retries are not acceptable. The following items provide examples:

    • Launch page commercials and homepage recommendations: User-specific commercials and products can be displayed on launch pages and homepages based on user preferences.

    • Real-time production in streaming media: In scenarios such as interactive streaming, live streaming, and ultra-low latency playback, audio and video streams must be transmitted at extremely low end-to-end latency. Performance and user experience must also be guaranteed in scenarios such as real-time AI-based video super resolution and video recognition.

  • Fluctuating traffic

    Business traffic has peak hours and off-peak hours. The traffic fluctuation trend changes with user habits.

  • Low resource utilization

    In most cases, GPU resources are planned based on traffic peaks. Large amounts of resources sit idle during off-peak hours, and resource utilization is generally lower than 30%.

Benefits of using Function Compute in real-time inference scenarios

  • Idle mode of GPU-accelerated instances

    Function Compute provides the idle mode feature for GPU-accelerated instances. If you want to eliminate the impacts of cold starts and meet the requirements of low latency for real-time inference, you can configure provisioned GPU-accelerated instances with idle mode enabled. For more information, see Instance types and usage modes. The idle mode feature of GPU-accelerated instances delivers the following benefits:

    • Quick instance wake-up: Function Compute automatically freezes GPU-accelerated instances based on your real-time workloads. The platform automatically wakes up the frozen instances before requests are sent to the instances. Note that frozen instances take two to three seconds to wake up.

    • High-quality services at low costs: The billing cycle of idle GPU-accelerated instances is different from that of on-demand GPU-accelerated instances. GPU-accelerated instances are billed at different prices during idle and active periods. For more information, see How do I calculate the cost of using Function Compute for real-time inference? Although the overall cost is still higher than the cost of on-demand GPU-accelerated instances, the cost is reduced by more than 50% compared with self-built GPU clusters.

  • Optimized request scheduling mechanism for inference scenarios

    Function Compute provides a built-in intelligent scheduling mechanism to achieve load balancing between different GPU-accelerated instances within a function. The intelligent scheduling of Function Compute evenly distributes inference requests to backend GPU-accelerated instances. This improves the overall utilization of inference clusters.

Idle mode of GPU-accelerated instances

After a GPU function is deployed, you can use provisioned GPU-accelerated instances with idle mode enabled to provide infrastructure capabilities for real-time inference scenarios. Function Compute horizontally scales provisioned GPU-accelerated instances based on the configured metric-based scaling policy and your workloads. Inference requests are preferentially routed to provisioned GPU-accelerated instances for processing. The platform eliminates cold starts so that your inference service keeps responding with low latency.


Idle mode helps reduce costs

After you enable idle mode, GPU-accelerated instances are billed at the idle GPU price while they are in the idle state and at the active GPU price while they are in the active state. Function Compute automatically tracks the state of each instance and charges fees accordingly.

In the example shown in the following figure, a GPU-accelerated instance goes through five time windows (T0 to T4) from creation to destruction. The instance is active in T1 and T3, and idle in T0, T2, and T4. The following formula is used to calculate the total cost: (T0 + T2 + T4) x Unit price of idle GPUs + (T1 + T3) x Unit price of active GPUs. For more information about the unit prices of idle and active GPUs, see Billing overview.

(Figure: instance lifecycle with active time windows T1 and T3 and idle time windows T0, T2, and T4.)
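
As a concrete illustration of this formula, the following Python sketch calculates the total cost of a single instance. The unit prices and window durations used here are hypothetical values for illustration only; for the actual idle and active GPU prices, see Billing overview.

    # Hypothetical worked example of the billing formula above.
    # The unit prices and durations are made up for illustration;
    # see Billing overview for the actual idle and active GPU prices.
    IDLE_GPU_PRICE = 0.00011    # hypothetical price per GPU-second in the idle state
    ACTIVE_GPU_PRICE = 0.00055  # hypothetical price per GPU-second in the active state

    idle_windows = [1200, 1800, 600]  # durations of T0, T2, and T4 in seconds (idle)
    active_windows = [300, 900]       # durations of T1 and T3 in seconds (active)

    total_cost = (sum(idle_windows) * IDLE_GPU_PRICE
                  + sum(active_windows) * ACTIVE_GPU_PRICE)
    print(f"Total cost for this instance: {total_cost:.4f}")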

How it works

Function Compute implements instant freezing and restoration of GPU-accelerated instances based on advanced Alibaba Cloud technologies. When a GPU-accelerated instance is not processing requests, the Function Compute platform automatically freezes the instance and charges you at the idle GPU price. This optimizes resource utilization and reduces costs. When a new inference request arrives, Function Compute quickly activates the instance and seamlessly executes the request. During this period, you are charged at the active GPU price.


This process is completely transparent to users and does not affect user experience. At the same time, Function Compute ensures that the accuracy and reliability of your inference service are not affected even when the instance is frozen, providing users with stable and cost-effective computing capabilities.


Activation duration of idle GPU-accelerated instances

The time required to activate an idle GPU-accelerated instance varies based on the workload. The following table lists typical activation times for reference.

Inference workload type    Activation duration (seconds)
OCR/NLP                    0.5 - 1
Stable Diffusion           2
LLM                        3

Important

The activation time varies with model size. The actual activation time that you observe prevails.

Usage notes

  • CUDA version

    We recommend that you use CUDA 12.2 or an earlier version.

  • Image permissions

    We recommend that you run container images as the default root user.

  • Instance logon

    You cannot log on to an idle GPU-accelerated instance because the GPU cards are frozen.

  • Model warmup and pre-inference

    To ensure that the latency of the initial wake-up of an idle GPU-accelerated instance meets your business requirements, we recommend that you use the initialize hook in your business code to warm up or preload your model. For more information, see Model warmup.

  • Provisioned instance configurations

    When you turn on the Idle Mode switch, the existing provisioned GPU-accelerated instances of the function are gracefully shut down. New provisioned instances are allocated shortly after the existing ones are released.

GPU specifications

Only GPU functions that are configured with full GPU cards support idle mode. For more information about the specifications of GPU-accelerated instances, see Instance types and usage modes.

Optimized request scheduling mechanism for inference scenarios

How it works

Function Compute adopts workload-based intelligent scheduling, a strategy that is notably superior to conventional round-robin scheduling. The platform monitors the task execution status of GPU-accelerated instances in real time and immediately routes new requests to instances that have become idle. This mechanism ensures efficient use of GPU resources and reduces resource waste and hot spots, so that the load distribution across GPU-accelerated instances stays consistent with the utilization of GPU computing power. The following figure shows an example in which Tesla T4 GPU cards are used.

(Figure: request scheduling example with Tesla T4 GPU cards.)

Scheduling effect

The built-in scheduling logic of Function Compute implements load balancing among different GPU-accelerated instances. The scheduling is transparent to users.

(Figures: GPU metrics of Instance 1, Instance 2, and Instance 3, showing balanced load across the instances.)

Container support

GPU-accelerated instances of Function Compute can be used only in Custom Container runtimes. For more information about Custom Container runtimes, see Overview.

A Custom Container function requires a web server inside the image to execute different code paths and to handle invocations through events or HTTP requests. The web server mode is suitable for multi-path request execution scenarios such as AI learning and inference.
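
To make this requirement concrete, the following is a minimal sketch of such a web server in Python. Flask, the DummyModel class, and the load_model helper are assumptions made for illustration only; replace them with your own framework and model code, and keep the port consistent with the port configured for your function (9000 in the example s.yaml later in this topic).

    # Minimal sketch of a web server for a Custom Container GPU function.
    # Flask, DummyModel, and load_model are illustrative assumptions only.
    import json

    from flask import Flask, request

    app = Flask(__name__)
    model = None

    class DummyModel:
        def inference(self, text):
            # Replace with your real inference logic.
            return {"input": text, "label": "placeholder"}

    def load_model():
        # Replace with your real model-loading code, for example, loading framework weights.
        return DummyModel()

    @app.route('/initialize', methods=['POST'])
    def initialize():
        # Load and warm up the model when the instance is initialized.
        global model
        model = load_model()
        return "Initialized\n"

    @app.route('/invoke', methods=['POST'])
    def invoke():
        # Run inference on the request body and return the result as JSON.
        data = request.get_data(as_text=True)
        result = model.inference(data)
        return json.dumps(result)

    if __name__ == '__main__':
        # Listen on the port that is configured for the custom container image.
        app.run(host='0.0.0.0', port=9000)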

Deployment methods

You can deploy your models in Function Compute by using Serverless Devs or the Function Compute console, as described in the following sections.

For more deployment examples, see start-fc-gpu.

Model warmup

To resolve the issue that initial requests take a long time after a model is released, Function Compute provides the model warmup feature. The model warmup feature enables a model to enter the working state immediately after it is launched.

We recommend that you configure the initialize lifecycle hook in Function Compute to warm up models. Function Compute automatically executes the business logic in initialize to warm up models. For more information, see Lifecycle hooks for function instances.

  1. Add the /initialize invocation path of the POST method to the HTTP server that you build, and place the model warmup logic under the /initialize path. You can have the model perform simple inferences to achieve the warmup effect.

    The following sample code provides an example in Python:

    # This snippet assumes a Flask-based HTTP server: `app` is the Flask
    # application of your service and `model` is the model object that is
    # loaded when the process starts.
    from flask import request

    def prewarm_inference():
        # Run a simple inference so that the model is fully loaded and ready to serve.
        res = model.inference()

    @app.route('/initialize', methods=['POST'])
    def initialize():
        request_id = request.headers.get("x-fc-request-id", "")
        print("FC Initialize Start RequestId: " + request_id)

        # Prewarm the model by performing a naive inference task.
        prewarm_inference()

        print("FC Initialize End RequestId: " + request_id)
        return "Function is initialized, request_id: " + request_id + "\n"
  2. On the function details page, choose Configurations > Lifecycle, and then click Modify to configure lifecycle hooks.


Configure auto scaling in real-time inference scenarios

Use Serverless Devs

Before you start

1. Deploy a function.

  1. Run the following command to clone the project:

    git clone https://github.com/devsapp/start-fc-gpu.git
  2. Run the following command to go to the project directory:

    cd /root/start-fc-gpu/fc-http-gpu-inference-paddlehub-nlp-porn-detection-lstm/src/

    The following code snippet shows the structure of the project.

    .
    ├── hook
    │   └── index.js
    └── src
        ├── code
        │   ├── Dockerfile
        │   ├── app.py
        │   ├── hub_home
        │   │   ├── conf
        │   │   ├── modules
        │   │   └── tmp
        │   └── test
        │       └── client.py
        └── s.yaml
  3. Run the following command to use Docker to build an image and push the image to your image repository:

    export IMAGE_NAME="registry.cn-shanghai.aliyuncs.com/fc-gpu-demo/paddle-porn-detection:v1"
    # sudo docker build -f ./code/Dockerfile -t $IMAGE_NAME .
    # sudo docker push $IMAGE_NAME
    Important

    The PaddlePaddle framework is large, so building the image for the first time takes about one hour. Function Compute provides a VPC-based public image that you can directly use. If you use the public image, you do not need to run the preceding docker build and docker push commands.

  4. Edit the s.yaml file.

    edition: 3.0.0
    name: container-demo
    access: default
    vars:
      region: cn-shanghai
    resources:
      gpu-best-practive:
        component: fc3
        props:
          region: ${vars.region}
          description: This is the demo function deployment
          handler: not-used
          timeout: 1200
          memorySize: 8192
          cpu: 2
          gpuMemorySize: 8192
          diskSize: 512
          instanceConcurrency: 1
          runtime: custom-container
          environmentVariables:
            FCGPU_RUNTIME_SHMSIZE: '8589934592'
          customContainerConfig:
            image: >-
              registry.cn-shanghai.aliyuncs.com/serverless_devs/gpu-console-supervising:paddle-porn-detection
            port: 9000
          internetAccess: true
          logConfig:
            enableRequestMetrics: true
            enableInstanceMetrics: true
            logBeginRule: DefaultRegex
            project: z****
            logstore: log****
          functionName: gpu-porn-detection
          gpuConfig:
            gpuMemorySize: 8192
            gpuType: fc.gpu.tesla.1
          triggers:
            - triggerName: httpTrigger
              triggerType: http
              triggerConfig:
                authType: anonymous
                methods:
                  - GET
                  - POST
  5. Run the following command to deploy the function:

    sudo s deploy --skip-push true -t s.yaml

    After the execution, a URL is returned in the output. Copy this URL for subsequent tests. Example URL: https://gpu-poretection-****.cn-shanghai.fcapp.run.

2. Test the function and view monitoring data

  1. Run the curl command to invoke the function. The following code snippet provides an example that uses the URL obtained in the previous step. (A Python client sketch that sends the same request programmatically is provided after these steps.)

    curl https://gpu-poretection-gpu-****.cn-shanghai.fcapp.run/invoke -H "Content-Type: text/plain" --data "Nice to meet you"

    If the following output is returned, the test is passed.

    [{"text": "Nice to meet you", "porn_detection_label": 0, "porn_detection_key": "not_porn", "porn_probs": 0.0, "not_porn_probs": 1.0}]%
  2. Log on to the Function Compute console. In the left-side navigation pane, click Functions. Select a region. Find the function that you want to manage and click the function name. On the function details page, choose Monitoring > Instance Metrics to view the changes of GPU-related metrics.

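
    The following Python sketch sends the same request as the curl command programmatically. The URL is a placeholder that you must replace with your own function URL, and the requests library is assumed to be installed. (The project also ships a test client at src/code/test/client.py.)

    import requests

    # Replace the placeholder with the URL returned when you deployed the function.
    FUNCTION_URL = "https://<your-function-domain>.cn-shanghai.fcapp.run/invoke"

    response = requests.post(
        FUNCTION_URL,
        data="Nice to meet you",
        headers={"Content-Type": "text/plain"},
    )
    print(response.status_code)
    print(response.text)  # expected to contain porn_detection_label and probability fields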

3. Configure an auto scaling policy

  1. In the directory where the s.yaml file is located, create the provision.json template.

    The following code snippet provides an example. This template uses provisioned concurrency utilization as the tracking metric. The minimum number of instances is 2 and the maximum number of instances is 30.

    {
      "targetTrackingPolicies": [
        {
          "name": "scaling-policy-demo",
          "startTime": "2024-07-01T16:00:00.000Z",
          "endTime": "2024-07-30T16:00:00.000Z",
          "metricType": "ProvisionedConcurrencyUtilization",
          "metricTarget": 0.3,
          "minCapacity": 2,
          "maxCapacity": 30
        }
      ]
    }
  2. Run the following command to deploy the scaling policy:

    sudo s provision put --target 1 --targetTrackingPolicies ./provision.json --qualifier LATEST -t s.yaml -a {access}
  3. Run the sudo s provision list command to verify the deployment. If output similar to the following is returned and the values of target and current are the same, the provisioned instances are allocated as expected and the auto scaling policy is deployed correctly.

    [2023-05-10 14:49:03] [INFO] [FC] - Getting list provision: gpu-best-practive-service
    gpu-best-practive:
      -
        serviceName:            gpu-best-practive-service
        qualifier:              LATEST
        functionName:           gpu-porn-detection
        resource:               143199913651****#gpu-best-practive-service#LATEST#gpu-porn-detection
        target:                 1
        current:                1
        scheduledActions:       null
        targetTrackingPolicies:
          -
            name:         scaling-policy-demo
            startTime:    2024-07-01T16:00:00.000Z
            endTime:      2024-07-30T16:00:00.000Z
            metricType:   ProvisionedConcurrencyUtilization
            metricTarget: 0.3
            minCapacity:  2
            maxCapacity:  30
        currentError:
        alwaysAllocateCPU:      true

    Your model is successfully deployed and ready to use when your provisioned instance is created.

  4. Release provisioned instances for a function.

    1. Run the following command to disable an auto scaling policy and set the number of provisioned instances to 0:

      sudo s provision put --target 0 --qualifier LATEST -t s.yaml -a {access}
    2. Run the following command to check whether the auto scaling policy is disabled:

      s provision list -a {access}

      If the following output is returned, the auto scaling policy is disabled:

      [2023-05-10 14:54:46] [INFO] [FC] - Getting list provision: gpu-best-practive-service
      End of method: provision

Use the Function Compute console

Before you start

Create a GPU function. For more information, see Create a Custom Container function.

Procedure

  1. Log on to the Function Compute console. In the left-side navigation pane, click Functions. In the top navigation bar, select a region. On the page that appears, find the function that you want to manage. In the function configurations, enable instance-level metrics for the function.


  2. On the function details page, choose Configurations > Triggers to obtain the URL of the HTTP trigger for subsequent tests.


  3. Run the curl command to test the function. On the function details page, choose Monitoring > Instance Metrics to view the changes of GPU-related metrics.

    curl https://gpu-poretection****.cn-shanghai.fcapp.run/invoke -H "Content-Type: text/plain" --data "Nice to meet you"
  4. On the function details page, choose Configurations > Provisioned Instances. Then, click Create Provisioned Instance Policy to configure an auto scaling policy.


    After the configuration is complete, you can choose Monitoring > Function Metrics on the function details page to view changes of provisioned instances.

Important

If you no longer require provisioned GPU-accelerated instances, delete the provisioned GPU-accelerated instances at your earliest opportunity.

FAQ

How am I charged for using a real-time inference service in Function Compute?

For information about the billing of Function Compute, see Billing overview. The billing method of provisioned instances is different from that of on-demand instances. Pay attention to your bill details.

Why do latencies still occur after I configure an auto scaling policy?

You can use a more aggressive auto scaling policy, for example, a lower metricTarget value or a larger minCapacity value, to provision instances in advance and prevent performance strain caused by sudden bursts of requests.

Why is the number of instances not increased when the tracking metric reaches the threshold?

The metrics of Function Compute are collected at the minute level. The scale-out mechanism is triggered only when the metric value remains above the threshold for a period of time.