Container Service for Kubernetes: Overview of AI inference service deployment in Kubernetes clusters

Last Updated: Feb 27, 2025

The cloud-native AI suite provides the Arena command-line tool for scheduling AI workloads, which gives you an efficient way to deploy trained models as inference services in Container Service for Kubernetes (ACK) clusters. ACK also provides auto scaling, GPU sharing and scheduling, and performance monitoring to reduce the O&M costs of these inference services. This topic describes how to use the cloud-native AI suite to deploy a model as an inference service in an ACK cluster.
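
For illustration, the following is a minimal sketch of an Arena command that deploys a model as an inference service. The release name, volume name, image tag, and paths are hypothetical placeholders, and the flags available depend on the Arena version installed with the cloud-native AI suite, so treat this as an assumption and use the topics listed below for the authoritative commands.

    # Deploy a model stored on a shared volume as a TensorFlow Serving inference service.
    # "demo-serving", "model-pvc", and the paths are hypothetical placeholders.
    arena serve tensorflow \
      --name=demo-serving \
      --gpus=1 \
      --image=tensorflow/serving:latest \
      --data=model-pvc:/models \
      --model-name=demo \
      --model-path=/models/demo

    # List deployed inference services and check their status.
    arena serve list

After the service is running, you can expose it through a Kubernetes Service or an Ingress and send prediction requests to the serving endpoint.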

Important

NVIDIA Triton Inference Server and TensorFlow Serving in ack-arena are free open source components provided by third-party open source communities or enterprises. You can install these components and configure the corresponding servers to deploy models as inference services, and then use the relevant model testing and optimization tools.

However, Alibaba Cloud is not responsible for the stability, service limits, and security compliance of third-party components. You shall pay close attention to the official websites of the third-party open source communities or enterprises and updates on code hosting platforms, and read and comply with the open source licenses. You are liable for any potential risks related to application development, maintenance, troubleshooting, and security due to the use of third-party components.

The following table describes the types of inference services that are supported by the cloud-native AI suite.

| Inference service type | Description | References |
| --- | --- | --- |
| Inference tasks that use shared GPUs | If you want to improve GPU utilization, you can use Arena to submit multiple inference tasks that share the memory and computing power of the same GPU (a sample command follows this table). | Submit an inference task to use shared GPU resources |
| Inference services deployed from TensorFlow models | You can use Arena and TensorFlow Serving to deploy a TensorFlow model as an inference service. | Deploy a TensorFlow model as an inference service |
| Inference services deployed from PyTorch models | You can use NVIDIA Triton Inference Server or TorchServe to deploy a PyTorch model as an inference service (a sample command follows this table). | Deploy a PyTorch model as an inference service |
| Containerized elastic inference services | You can deploy elastic inference services on Elastic Compute Service (ECS) or Elastic Container Instance to improve elasticity and reduce costs. | |
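
The shared-GPU row above refers to packing several inference tasks onto one GPU by requesting an amount of GPU memory instead of a whole device. The sketch below assumes Arena's GPU-memory flag and uses placeholder names and paths; check Submit an inference task to use shared GPU resources for the authoritative syntax.

    # Submit an inference task that requests 4 GiB of GPU memory instead of a whole GPU,
    # so that other tasks can share the remaining memory and computing power of the same GPU.
    # Names, image, and paths are hypothetical placeholders.
    arena serve tensorflow \
      --name=shared-serving-1 \
      --gpumemory=4 \
      --image=tensorflow/serving:latest \
      --data=model-pvc:/models \
      --model-name=model1 \
      --model-path=/models/model1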
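
For the PyTorch row, a common approach is to export the model (for example, to TorchScript), place it in a Triton model repository, and start NVIDIA Triton Inference Server against that repository. The subcommand, flags, image tag, and paths below are assumptions for illustration only; see Deploy a PyTorch model as an inference service for the authoritative steps.

    # Deploy an exported PyTorch model with NVIDIA Triton Inference Server.
    # "torch-serving", the image tag, and the repository path are hypothetical placeholders.
    arena serve triton \
      --name=torch-serving \
      --gpus=1 \
      --image=nvcr.io/nvidia/tritonserver:23.10-py3 \
      --data=model-pvc:/models \
      --model-repository=/models/triton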