Container Service for Kubernetes: Overview of AI inference service deployment in Kubernetes clusters

Last Updated: Feb 27, 2025

The cloud-native AI suite provides the Arena command-line tool for scheduling AI workloads, which gives you an efficient way to deploy trained models as inference services in Container Service for Kubernetes (ACK) clusters. ACK also provides auto scaling, GPU sharing and scheduling, and performance monitoring to reduce the O&M costs of these inference services. This topic describes how to use the cloud-native AI suite to deploy a model as an inference service in an ACK cluster.
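
For illustration, the following is a minimal sketch of an Arena command that deploys a model as an inference service. The release name, volume name, image tag, and paths are hypothetical placeholders, and the flags available depend on the Arena version installed with the cloud-native AI suite, so treat this as an assumption and use the topics listed below for the authoritative commands.

    # Deploy a model stored on a shared volume as a TensorFlow Serving inference service.
    # "demo-serving", "model-pvc", and the paths are hypothetical placeholders.
    arena serve tensorflow \
      --name=demo-serving \
      --gpus=1 \
      --image=tensorflow/serving:latest \
      --data=model-pvc:/models \
      --model-name=demo \
      --model-path=/models/demo

    # List deployed inference services and check their status.
    arena serve list

After the service is running, you can expose it through a Kubernetes Service or an Ingress and send prediction requests to the serving endpoint.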

Important

NVIDIA Triton Inference Server and TensorFlow Serving in ack-arena are free open source components provided by third-party open source communities or enterprises. You can install these components and configure the corresponding servers to deploy models as inference services, and then use the relevant model testing and optimization tools.

However, Alibaba Cloud is not responsible for the stability, service limits, and security compliance of third-party components. You shall pay close attention to the official websites of the third-party open source communities or enterprises and updates on code hosting platforms, and read and comply with the open source licenses. You are liable for any potential risks related to application development, maintenance, troubleshooting, and security due to the use of third-party components.

The following table describes the types of inference services that are supported by the cloud-native AI suite.

| Inference service type | Description | References |
| --- | --- | --- |
| Inference tasks that use shared GPUs | If you want to improve GPU utilization, you can use Arena to submit multiple inference tasks that share the memory and computing power of the same GPU (a sample command follows this table). | Submit an inference task to use shared GPU resources |
| Inference services deployed from TensorFlow models | You can use Arena and TensorFlow Serving to deploy a TensorFlow model as an inference service. | Deploy a TensorFlow model as an inference service |
| Inference services deployed from PyTorch models | You can use NVIDIA Triton Inference Server or TorchServe to deploy a PyTorch model as an inference service (a sample command follows this table). | Deploy a PyTorch model as an inference service |
| Containerized elastic inference services | You can deploy elastic inference services on Elastic Compute Service (ECS) or Elastic Container Instance to improve elasticity and reduce costs. | |
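
The shared-GPU row above refers to packing several inference tasks onto one GPU by requesting an amount of GPU memory instead of a whole device. The sketch below assumes Arena's GPU-memory flag and uses placeholder names and paths; check Submit an inference task to use shared GPU resources for the authoritative syntax.

    # Submit an inference task that requests 4 GiB of GPU memory instead of a whole GPU,
    # so that other tasks can share the remaining memory and computing power of the same GPU.
    # Names, image, and paths are hypothetical placeholders.
    arena serve tensorflow \
      --name=shared-serving-1 \
      --gpumemory=4 \
      --image=tensorflow/serving:latest \
      --data=model-pvc:/models \
      --model-name=model1 \
      --model-path=/models/model1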
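
For the PyTorch row, a common approach is to export the model (for example, to TorchScript), place it in a Triton model repository, and start NVIDIA Triton Inference Server against that repository. The subcommand, flags, image tag, and paths below are assumptions for illustration only; see Deploy a PyTorch model as an inference service for the authoritative steps.

    # Deploy an exported PyTorch model with NVIDIA Triton Inference Server.
    # "torch-serving", the image tag, and the repository path are hypothetical placeholders.
    arena serve triton \
      --name=torch-serving \
      --gpus=1 \
      --image=nvcr.io/nvidia/tritonserver:23.10-py3 \
      --data=model-pvc:/models \
      --model-repository=/models/triton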