ENS: Best practices for AI inference in the edge cloud

Last Updated: Feb 08, 2025

The development of large models is driving the transformation of AI from large-scale centralized training to distributed inference applications. Edge Node Service (ENS) is an all-in-one edge service that provides distributed, elastic computing resources deployed close to end users. Computing close to end users helps reduce computing costs, response latency, and the load on the central data center. This topic describes how to use Edge Node Service to build a basic resource environment for AI inference.

Prerequisites

  • Edge Node Service is activated. For more information, see Activate ENS.

  • A virtual private cloud (VPC) is created and CIDR blocks are planned for edge nodes.

Evaluate resources

Heterogeneous computing power of the edge cloud

The edge cloud provides four types of heterogeneous computing power at different prices and for different scenarios on widely distributed edge nodes. Memory per GPU ranges from 12 GB to 48 GB. More computing power specifications will be provided as GPUs evolve.

| GPU type | GPU memory (GB) | Resource specifications |
| --- | --- | --- |
| A | 12 | A*1 |
| B | 16 | B*1, B*2, B*4 |
| C | 24 | C*1, C*2, C*4, C*8 |
| D | 48 | D*1, D*2 |

For example, Tongyi Qianwen and Llama 2 models cover large model applications with 0.5 billion to 72 billion parameters. Supported scenarios include lightweight dialogue, intelligent customer service, text-to-image generation, video understanding, code generation, content creation, and intelligent assistants. You can select heterogeneous computing resources of different specifications based on your business requirements.

| Number of model parameters | Available inference resources |
| --- | --- |
| 1B | A*1, B*1, C*1, D*1 |
| 7B | B*1, C*1, D*1 |
| 14B | B*2, C*2, D*1 |
| 32B | B*4, C*4, D*2 |
| 72B | C*8 |

Note

The GPU memory requirements in the preceding table are evaluated based on FP16 precision. If you lower the precision, the same GPU memory can support models with more parameters.
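
As a rough rule of thumb (an illustrative assumption, not an official sizing rule from this topic), FP16 weights occupy about 2 bytes per parameter, and additional GPU memory is needed for activations and the KV cache. For example:

  # Back-of-the-envelope FP16 memory estimate: ~2 bytes per parameter, plus ~20% headroom (illustrative only)
  for p in 1 7 14 32 72; do
    echo "${p}B parameters: ~$((p * 2)) GB of weights, plan for ~$((p * 2 * 12 / 10)) GB or more"
  done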

Edge cloud capabilities

The edge cloud provides value-added capabilities at different layers to better support AI inference at the edge. You can select suitable capabilities to build a basic environment for edge AI inference based on your business requirements.

(Figure: layered capabilities of the edge cloud for AI inference)
  1. Basic resource layer: provides heterogeneous computing power, CPU computing power, and storage services to help customers persist data.

  2. Resource configuration layer: provides network services such as Network Address Translation (NAT), Edge Load Balancer (ELB), and Elastic IP Address (EIP), as well as container services. The container multiboxing technology allows customers to run multiple container services on a single heterogeneous computing instance, which improves the resource utilization of the instance and reduces costs (see the sketch after this list).

  3. Service acceleration layer: provides the AIACC inference acceleration engine developed by Alibaba Cloud to improve inference performance, as well as the open source TensorRT toolkit, so that services can adopt different acceleration solutions.

  4. Service scheduling layer: serves requests from nearby users based on scheduling policies, and redirects requests to available edge nodes if an edge node fails or its resources are insufficient. Collaborative storage synchronizes user data between edge nodes, which ensures a consistent user experience even when requests are served by different edge nodes.
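
The following is a minimal sketch of the container multiboxing idea: several inference containers started on one GPU-accelerated ENS instance so that they share its GPU. The image name and ports are placeholders for illustration, not values from this topic.

  # Hypothetical example: two inference services sharing the GPU of one instance (image name and ports are placeholders)
  docker run -d --runtime=nvidia --gpus all -p 8001:8000 --name infer-svc-1 my-inference-image:latest
  docker run -d --runtime=nvidia --gpus all -p 8002:8000 --name infer-svc-2 my-inference-image:latest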

Build a basic environment for AI inference

Step 1: Create heterogeneous computing resources

  1. Log on to the ENS console.

  2. In the left-side navigation pane, choose Computing Capacity and Images > Instances.

  3. On the Instances page, click Create Instance. On the instance buy page, configure the parameters and pay for the order. For more information, see Create an instance.

Note
  • Heterogeneous computing resources support only the subscription billing method.

  • Confirm the instance specification with your account manager before you create an instance.
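
If you prefer to create instances programmatically instead of through the console, you can call the ENS API, for example through the Alibaba Cloud CLI. The following is only a hedged sketch: the operation name CreateInstance and the parameter names are assumptions to verify against the ENS API reference, and all values are placeholders.

  # Hedged sketch: verify the operation and parameter names in the ENS API reference before use.
  # Placeholders: edge node ID, instance type (confirm with your account manager), and image ID.
  aliyun ens CreateInstance \
    --EnsRegionId <edge-node-id> \
    --InstanceType <instance-type> \
    --ImageId <image-id> \
    --Quantity 1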

Step 2: Create network resources

  1. In the left-side navigation pane, choose Network Management and create network resources based on your business requirements.

  2. If you want the ENS instance to access the Internet or provide Internet services, apply for an edge elastic IP address (EIP). For more information, see Create and manage edge EIPs.

  3. If you want to distribute incoming traffic across multiple backend compute instances for concurrent processing and eliminate single points of failure (SPOFs) in the system, apply for an Edge Load Balancer (ELB) instance. For more information, see Create an ELB instance.

  4. If you want to translate between private and public IP addresses, apply for an edge NAT gateway. For more information, see Create and manage edge NAT gateways.
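
Network resources can also be created by calling the ENS API. The operation names below (CreateEipInstance, CreateLoadBalancer, CreateNatGateway) and their parameters are assumptions for illustration only; check the ENS API reference for the exact operations and parameters.

  # Hedged sketch: operation and parameter names are assumptions; verify them in the ENS API reference.
  aliyun ens CreateEipInstance --EnsRegionId <edge-node-id> --Bandwidth 5
  aliyun ens CreateLoadBalancer --EnsRegionId <edge-node-id> --VSwitchId <vswitch-id>
  aliyun ens CreateNatGateway --EnsRegionId <edge-node-id> --VSwitchId <vswitch-id>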

Step 3: Create storage resources

  1. In the left-side navigation pane, choose Storage and Snapshots and create storage resources based on your business requirements.

  2. If you want the ENS instance to have storage capacity for data persistence, apply for disks. For more information, see Create and manage a disk and Create an instance.

  3. If you want to share data among ENS instances, apply for NAS. For more information, see Create and manage a file system.

Step 4: Deploy the inference acceleration engine

  1. Deploy the AIACC inference acceleration engine developed by Alibaba Cloud on an ENS instance. An NVIDIA Tesla T4 GPU on the Ubuntu 20.04 operating system is used as an example.

    1. Log on to the ENS instance that you have applied for. For more information, see Connect to an instance.

    2. Install the CUDA Toolkit.

      1. The NVIDIA Tesla T4 GPU requires CUDA Toolkit 11.8.

      2. Update the PATH and LD_LIBRARY_PATH environment variables.
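
      For example, assuming CUDA Toolkit 11.8 is installed in the default /usr/local/cuda-11.8 directory:

      # Add the CUDA 11.8 binaries and libraries to the environment (default installation path assumed)
      echo 'export PATH=/usr/local/cuda-11.8/bin:$PATH' >> /etc/profile
      echo 'export LD_LIBRARY_PATH=/usr/local/cuda-11.8/lib64:$LD_LIBRARY_PATH' >> /etc/profile
      source /etc/profile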

    3. Install the GPU driver.

      1. The NVIDIA Tesla T4 GPU requires a driver that supports CUDA 12.2 or later. We recommend that you install the latest version of the driver.

      2. Download the driver from the NVIDIA official website: NVIDIA-Linux-x86_64-535.154.05.run.
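
      After the download completes, you can install the driver and verify it, for example:

      # Run the downloaded installer non-interactively and confirm that the driver is loaded
      chmod +x NVIDIA-Linux-x86_64-535.154.05.run
      sh ./NVIDIA-Linux-x86_64-535.154.05.run --silent
      nvidia-smi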

    4. Update the environment variable to authorize the current user to use the AIACC inference acceleration engine.

      # Authorize the current user to use the AIACC inference acceleration engine.
      echo 'export DEEPGPU_EXT_CURL=MzkuOTguMjIuMTI2OjcwNzA=' >> /etc/profile
      source /etc/profile
    5. Confirm component versions.

      1. Python: 3.8.

      2. PyTorch: 2.1.0.

      3. Deepytorch Inference: 0.7.18+pt2.1.0cu118-cp38-cp38.

      4. You can run the pip3 command to install Python-related components. Example:

      pip3 install torch==2.1.0 torchvision==0.16.0 numpy transformers
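
      After the installation, you can run a quick check to confirm that PyTorch detects the GPU, for example:

      # Print the PyTorch version and whether CUDA is available on this instance
      python3 -c "import torch; print(torch.__version__, torch.cuda.is_available())"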
  2. Use a container to deploy the AIACC inference acceleration engine on the VM. An NVIDIA Tesla T4 GPU on the Ubuntu 20.04 operating system is used as an example.

    1. Log on to the ENS instance that you have applied for.

    2. Install the GPU driver.

      1. GPU driver version: 535.154.05. CUDA version: 12.2.

    3. Install the NVIDIA Container Toolkit. For more information, visit https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html.
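
    The following condensed example assumes that the NVIDIA package repository has already been configured as described in the linked installation guide:

      # Install the toolkit, register the NVIDIA runtime with Docker, and restart Docker
      sudo apt-get update
      sudo apt-get install -y nvidia-container-toolkit
      sudo nvidia-ctk runtime configure --runtime=docker
      sudo systemctl restart docker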

    4. Obtain the AIACC container image deepytorch_t4_ubuntu20.04.tar.gz. Container-based deployment simplifies the environment configuration. For more information about how to obtain the image, contact your account manager.

    5. Load the container image. You can also use Kubernetes to create and manage container instances.

      [root]# docker load -i deepytorch_t4_ubuntu20.04.tar
      
      [root]# docker images
      REPOSITORY   TAG       IMAGE ID       CREATED      SIZE
      <none>       <none>    7aef27446ff0   3 days ago   22.6GB
    6. Create a container and deploy the AIACC inference acceleration engine.

      #!/bin/bash
      
      # Start a detached container from the loaded image (ID 7aef27446ff0) with GPU access and host networking.
      # The long sleep keeps the container running so that inference services can be started inside it.
      /usr/bin/docker run --runtime=nvidia -ti -d --gpus all --network=host \
                 7aef27446ff0 sleep 86400000
  3. Deploy the open source TensorRT suite in the container.

    1. Install the NVIDIA GPU driver.

    2. Install the NVIDIA Container Toolkit.

    3. Download the TensorRT container image from https://docs.nvidia.com/deeplearning/frameworks/pytorch-release-notes/index.html.

      Note

      The versions of components such as CUDA and TensorRT for the image must correspond to the version of the installed GPU driver.

    4. Create a container and deploy the TensorRT suite.
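
    For example, you can start a container from an NVIDIA NGC PyTorch image that bundles TensorRT. The tag below is only an example; choose a release whose CUDA and TensorRT versions match your installed GPU driver, as described in the preceding note.

      # Start an interactive container from an NGC PyTorch image that includes TensorRT (example tag)
      docker run --runtime=nvidia --gpus all -it --rm nvcr.io/nvidia/pytorch:23.10-py3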

Step 5: Associate resources to finish environment setup

Associate the resources that you created by using the ENS console or API to complete the environment setup. This general architecture supports AI inference scenarios such as image classification, object detection, speech recognition, and semantic analysis.

(Figure: general architecture for edge AI inference after resources are associated)
Note

If you want the edge cloud to manage scheduling for you, contact your account manager.