
Container Service for Kubernetes:Overview of the cloud-native AI suite

Last Updated:Sep 11, 2024

The cloud-native AI suite is a Container Service for Kubernetes (ACK) solution that leverages cloud-native AI technologies and products. The cloud-native AI suite can help you fully utilize cloud-native architectures and technologies to quickly develop an AI-assisted production system in ACK. It also provides full-stack optimization for AI and machine learning applications and systems. ACK Edge clusters retain the full functionality of the AI suite that is available in cloud environments, but certain features are limited in edge environments. This topic describes the capabilities and usage limits of the AI suite in ACK Edge clusters across different node and network types.

Usage limits

| Item | Limit |
| --- | --- |
| AI suite components | Be aware of the usage limits for the specific components that you use, including the cluster version and NVIDIA driver version. For more information, see Component introduction and release notes. |
| ACK Edge clusters | Only specific operating systems and GPU models on edge nodes are supported. For more information, see Add an edge node. |

Capability overview


ACK Edge clusters and ACK Pro clusters have two differences:

  1. Network connectivity: ACK Pro clusters require all nodes within the cluster to be in the same virtual private cloud (VPC) and connected, while ACK Edge clusters have a more complex network configuration based on node pools. AI suite capabilities may vary under different network conditions.

    1. On-cloud node pool: The network configuration for the on-cloud node pool is the same as that of the ACK Pro clusters. It manages connected Elastic Compute Service (ECS) nodes within the same VPC.

    2. Dedicated edge node pool: The dedicated edge node pool manages edge nodes connected to the cloud over Express Connect circuits, which provide network connectivity between data centers and the cloud.

    3. Basic edge node pool: The basic edge node pool manages edge nodes connected over the Internet. Network connectivity between these edge nodes is not guaranteed.

  2. Node environment: ACK Edge clusters are mainly used to manage your on-premises resources. Compared with ECS instances, edge node environments are more heterogeneous: GPU models, GPU driver versions, and OS versions vary across nodes. Additionally, GPU memory isolation is not supported.
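The GPU sharing capability mentioned below is typically requested through an extended resource rather than a whole GPU. The following is a minimal sketch; the `aliyun.com/gpu-mem` resource name is the one used by ACK shared GPU scheduling, while the pod name and image are placeholders:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-share-sample                          # placeholder name
spec:
  containers:
  - name: worker
    image: registry.example.com/ai/train:latest   # placeholder image
    resources:
      limits:
        # Request 4 GiB of GPU memory instead of a whole GPU.
        # On edge node pools, this amount is used for scheduling
        # decisions only, because GPU memory isolation is not
        # supported on edge nodes.
        aliyun.com/gpu-mem: 4
```

On cloud (ECS) GPU nodes, the same request can additionally be enforced with memory isolation; on edge nodes, workloads that share a GPU must be trusted not to exceed their requested memory.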

| AI suite capability | Component | Cloud environment: on-cloud node pool | Edge environment: dedicated edge node pool | Edge environment: basic edge node pool | References |
| --- | --- | --- | --- | --- | --- |
| Elasticity | ack-alibaba-cloud-metrics-adapter | Supported | Supported | Supported | |
| Acceleration | ack-fluid | Supported | Supported | Supported | |
| Scheduling (batch task scheduling, GPU sharing, and GPU topology awareness) | ack-ai-installer | Supported | Supported, except GPU memory isolation | Supported, except GPU memory isolation | |
| Scheduling (task queue) | ack-kube-queue | Supported | Supported | Supported | Use ack-kube-queue to manage job queues |
| Interaction mode (Arena) | ack-arena | Supported | Supported | Supported | Configure the Arena client |
| Interaction mode (console) | ack-ai-dashboard, ack-ai-dev-console, ack-mysql | Supported | Supported | Supported | |
| Workflow | ack-ai-pipeline | Supported | Supported | Supported | Deploy the cloud-native AI suite |
| Monitoring | ack-arena-exporter | Supported | Supported | Supported | Work with cloud-native AI dashboards |

Note

The acceleration capability of the AI suite can be used only in edge node pools that have network connectivity between nodes.
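The acceleration capability is delivered by ack-fluid, whose workflow centers on a Dataset and a cache runtime whose workers exchange cached data with each other, which is why inter-node connectivity is required. The following is a minimal sketch assuming the open source Fluid CRDs (`data.fluid.io/v1alpha1`); the Dataset name and OSS bucket path are placeholders:

```yaml
apiVersion: data.fluid.io/v1alpha1
kind: Dataset
metadata:
  name: demo-dataset              # placeholder name
spec:
  mounts:
  # Placeholder OSS path; any supported storage backend works here.
  - mountPoint: oss://example-bucket/training-data
    name: training-data
---
apiVersion: data.fluid.io/v1alpha1
kind: AlluxioRuntime
metadata:
  name: demo-dataset              # must match the Dataset name
spec:
  replicas: 2                     # cache workers spread across nodes
  tieredstore:
    levels:
    - mediumtype: MEM             # cache hot data in memory
      path: /dev/shm
      quota: 2Gi
```

Because the cache workers in this sketch run on multiple nodes and serve data to each other, deploy them only in edge node pools whose nodes can reach one another.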

Usage method

Based on the cloud-edge architecture of ACK Edge clusters, we recommend that you manage different resources through node pools when using the AI suite.

  1. Management node pool: the on-cloud node pool used for deploying the management components of the AI suite.

    1. This type of node pool does not need GPU resources.

    2. By default, the on-cloud node pool default-nodepool automatically created by ACK Edge clusters is used as the management node pool.

    3. To utilize all features of the AI suite, scale out the node pool to at least four nodes so that the components have sufficient resources to run properly. For more information, see Create and scale out a node pool.

  2. Elastic node pool: the on-cloud node pool with auto scaling enabled.

    For elastic inference, you can use this type of node pool to achieve dynamic server scaling based on your business requirements.

  3. Edge node pool: manages different types of nodes in data centers.

    We recommend that you use edge node pools to group nodes by their properties. For example, you can categorize them into x86 (AMD64) node pools and ARM-based node pools according to CPU architecture, or into node pools that use Express Connect circuits and node pools that use the Internet based on network conditions.
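After edge nodes are grouped into pools by CPU architecture as suggested above, workloads can be steered to the right pool with the standard `kubernetes.io/arch` node label, which the kubelet sets automatically. The following is a minimal sketch; the Deployment name and image are placeholders:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: arm-inference                     # placeholder name
spec:
  replicas: 1
  selector:
    matchLabels:
      app: arm-inference
  template:
    metadata:
      labels:
        app: arm-inference
    spec:
      nodeSelector:
        # Standard well-known label; restricts scheduling to
        # ARM-based edge nodes.
        kubernetes.io/arch: arm64
      containers:
      - name: server
        # Placeholder image; must be built for the arm64 architecture.
        image: registry.example.com/ai/infer:latest-arm64
```

The same pattern applies to the network-based grouping: attach a custom label to each node pool (for example, a label distinguishing Express Connect pools from Internet pools) and select on it in the workload spec.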