You can use the Arena command-line tool provided by the cloud-native AI suite to schedule AI workloads and deploy trained models as inference services in Container Service for Kubernetes (ACK) clusters. ACK provides auto scaling, GPU sharing and scheduling, and performance monitoring to reduce the O&M costs of inference services. This topic describes how to use the cloud-native AI suite to deploy a model as an inference service in an ACK cluster.
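For example, the following is a minimal sketch of deploying a TensorFlow SavedModel with TensorFlow Serving through Arena. It assumes that the cloud-native AI suite and the Arena client are already installed and that the model is stored on a PersistentVolumeClaim. The service name `bert-tfserving`, the model name, the PVC name `model-pvc`, the model path, and the image tag are placeholders that you must replace with values from your own cluster.

```bash
# Deploy a TensorFlow SavedModel with TensorFlow Serving through Arena.
# All names, paths, and the image tag below are placeholders.
arena serve tensorflow \
  --name=bert-tfserving \
  --model-name=chnsenticorp \
  --gpus=1 \
  --image=tensorflow/serving:1.15.0 \
  --data=model-pvc:/data \
  --model-path=/data/models/tensorflow/chnsenticorp

# Query the status and endpoint of the inference service.
arena serve list
arena serve get bert-tfserving
```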
NVIDIA Triton Inference Server and TensorFlow Serving in ack-arena are free, open source components provided by third-party open source communities or enterprises. You can install these components and configure model servers to deploy inference models as services, and then use the relevant model testing and optimization tools.
However, Alibaba Cloud is not responsible for the stability, service limits, and security compliance of third-party components. You shall pay close attention to the official websites of the third-party open source communities or enterprises and updates on code hosting platforms, and read and comply with the open source licenses. You are liable for any potential risks related to application development, maintenance, troubleshooting, and security due to the use of third-party components.
The following table describes the types of inference services that are supported by the cloud-native AI suite.
| Inference service type | Description | References |
| --- | --- | --- |
| Inference tasks that use shared GPUs | If you want to improve GPU utilization, you can use Arena to submit multiple inference tasks that share the memory and computing power of the same GPU (see the GPU sharing example after this table). | |
| Inference services deployed from TensorFlow models | You can use Arena and TensorFlow Serving to deploy a TensorFlow model as an inference service. | |
| Inference services deployed from PyTorch models | You can use NVIDIA Triton Inference Server or TorchServe to deploy a PyTorch model as an inference service (see the Triton example after this table). | |
| Containerized elastic inference services | You can deploy elastic inference services on Elastic Compute Service (ECS) instances or Elastic Container Instance. This improves elasticity and reduces costs. | |
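For shared-GPU inference, Arena lets a service request a portion of GPU memory instead of an exclusive card. The following sketch assumes that GPU sharing is enabled on the target node pool and that your Arena version supports the `--gpumemory` flag; the service name, memory size, image tag, and paths are placeholders.

```bash
# Request 4 GiB of GPU memory instead of a whole GPU, so that multiple
# inference services can share the memory and computing power of one card.
# All names, paths, and the image tag below are placeholders.
arena serve tensorflow \
  --name=bert-tfserving-shared \
  --model-name=chnsenticorp \
  --gpumemory=4 \
  --image=tensorflow/serving:1.15.0 \
  --data=model-pvc:/data \
  --model-path=/data/models/tensorflow/chnsenticorp
```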
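For PyTorch models, one common approach is to export the model (for example, to TorchScript) into a Triton model repository and serve it with NVIDIA Triton Inference Server. The following sketch assumes the repository is stored on a PVC named `model-pvc`; the service name, image tag, and repository path are placeholders.

```bash
# Serve a Triton model repository (for example, TorchScript models exported
# from PyTorch) with NVIDIA Triton Inference Server through Arena.
# All names, paths, and the image tag below are placeholders.
arena serve triton \
  --name=resnet50-triton \
  --gpus=1 \
  --image=nvcr.io/nvidia/tritonserver:21.05-py3 \
  --data=model-pvc:/data \
  --model-repository=/data/triton/model_repository
```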