Build AI/Machine Learning on Kubernetes
Cloud-native AI provides a set of essential features and services that help you build an AI platform, accelerate AI workloads, and simplify MLOps.
Resource Scalability
Improves the utilization of GPUs and CPUs and enhances the scalability of heterogeneous resources.
Efficient Scheduling
Schedules AI and big data tasks in an efficient manner and provides end-to-end support for the entire AI production process.
Accelerated Data Access
Improves data access performance and integrates heterogeneous data sources.
Observability
Supports various methods that are used to observe tasks, user quotas, and resources.
Flexibility and Extensibility
Allows you to build a custom cloud-native AI platform on top of a component-based, extensible architecture.
Standard Kubernetes
Runs on standard Kubernetes and is compatible with public clouds, Apsara Stack, hybrid clouds, and ACK edge clusters.
Features
Efficient Utilization of Heterogeneous Resources
Supports GPU scheduling, GPU sharing, and GPU memory isolation. You can configure various policies to allocate GPUs and monitor GPU resource consumption across multiple dimensions.
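As an illustration of GPU sharing, the sketch below shows a pod that requests a slice of GPU memory rather than a whole device. It assumes ACK's shared GPU scheduling, which exposes GPU memory through the aliyun.com/gpu-mem extended resource; the image, command, and memory amount are illustrative.

```yaml
# Hedged sketch: a pod that shares a GPU by requesting GPU memory (in GiB)
# through the aliyun.com/gpu-mem extended resource instead of a full GPU.
apiVersion: v1
kind: Pod
metadata:
  name: gpu-share-sample
spec:
  containers:
  - name: trainer
    image: tensorflow/tensorflow:2.13.0-gpu   # illustrative image
    command: ["python", "train.py"]           # hypothetical training script
    resources:
      limits:
        aliyun.com/gpu-mem: 4   # 4 GiB of GPU memory, not a whole GPU
```

Pods that together request less GPU memory than a device provides can be placed on the same GPU, which raises utilization for inference and small training jobs.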
Improved Data Access Performance
Decouples computing from storage. Fluid provides a data abstraction for cloud-native AI and big data applications that accelerates data access, enhances the security isolation of data, and eliminates the data silos caused by heterogeneous storage types.
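As a minimal sketch of the Fluid abstraction, the manifests below define a Dataset that mounts an OSS bucket (the bucket path is hypothetical) and an AlluxioRuntime that caches it in memory on worker nodes, assuming the Fluid data.fluid.io/v1alpha1 CRDs are installed.

```yaml
# Hedged sketch: a Fluid Dataset backed by OSS, cached by Alluxio workers.
apiVersion: data.fluid.io/v1alpha1
kind: Dataset
metadata:
  name: demo-dataset
spec:
  mounts:
  - mountPoint: oss://example-bucket/training-data   # hypothetical bucket
    name: training-data
---
apiVersion: data.fluid.io/v1alpha1
kind: AlluxioRuntime
metadata:
  name: demo-dataset          # must match the Dataset name
spec:
  replicas: 2                 # number of cache workers, illustrative
  tieredstore:
    levels:
    - mediumtype: MEM
      path: /dev/shm
      quota: 2Gi              # per-worker cache capacity, illustrative
      high: "0.95"
      low: "0.7"
```

Once the Dataset is bound to the runtime, Fluid exposes it as a PersistentVolumeClaim with the same name, so training pods mount the cached data like any other volume.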
AI Task Scheduling
Supports various scheduling policies (such as gang scheduling, capacity scheduling, and binpack scheduling) to meet the requirements of AI tasks and enhance cluster resource utilization.
Auto Scaling of Heterogeneous Resources
Performs intelligent scaling of heterogeneous resources to prevent the waste of cloud resources. Cloud-native AI also supports elastic model training and elastic model inference.
Observable Cluster Tasks, Users, and Resources
Provides monitoring dashboards for tasks, user quotas, and cluster resources to help you evaluate inputs and outputs.
Scenarios
Benefits
- Resource Allocation Based on Project Groups
You can divide project members into isolated groups. Then, you can allocate and isolate resources based on groups or manage the permissions of different groups.
- Isolation and Sharing among Users
You can allocate cluster resources to user groups based on your business requirements and manage the permissions of users in each group, including users' read and write permissions on jobs and jobs' read and write permissions on data.
- Elastic Quotas
You can use elastic quota groups for capacity scheduling to share idle resources and allocate resources to users appropriately, which improves the overall resource utilization of clusters, as shown in the sketch below.
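A hedged sketch of elastic quota groups, assuming the ElasticQuotaTree resource used by ACK capacity scheduling; the group names, namespaces, and quantities are illustrative.

```yaml
# Hedged sketch: two project groups share a cluster. Each group is
# guaranteed its "min"; idle capacity up to "max" can be borrowed and is
# reclaimed when the owning group needs it again.
apiVersion: scheduling.sigs.k8s.io/v1beta1
kind: ElasticQuotaTree
metadata:
  name: elasticquotatree
  namespace: kube-system
spec:
  root:
    name: root
    max:
      cpu: 64
      nvidia.com/gpu: 8
    min:
      cpu: 64
      nvidia.com/gpu: 8
    children:
    - name: team-a                 # hypothetical project group
      namespaces:
      - team-a
      max:
        cpu: 48
        nvidia.com/gpu: 6
      min:
        cpu: 16
        nvidia.com/gpu: 2
    - name: team-b                 # hypothetical project group
      namespaces:
      - team-b
      max:
        cpu: 48
        nvidia.com/gpu: 6
      min:
        cpu: 16
        nvidia.com/gpu: 2
```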
Benefits
- Native Support for Dataset Abstraction
Cloud-native AI packages the fundamental capabilities required by data-intensive applications into reusable functions, which enables efficient data access and reduces the cost of managing heterogeneous data.
- Data Preloading and Acceleration on the Cloud
Fluid uses distributed caching engines to preload and accelerate data on the cloud while keeping the cached data observable, portable, and automatically horizontally scalable (see the preloading sketch after this list).
- Collaborative Orchestration of Data and Applications
When you schedule applications and data on the cloud, you can coordinate their orchestration based on their characteristics and locations to improve overall performance.
- Management of Heterogeneous Data Sources
You can access data from multiple data sources at the same time in one dataset. The data sources include Object Storage Service (OSS), Hadoop Distributed File System (HDFS), Ceph, and other storage services. This is suitable for hybrid cloud scenarios.
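To make the preloading item above concrete, here is a hedged sketch that warms the cache of a Fluid dataset before training starts, assuming Fluid's DataLoad resource; the dataset name and target path are illustrative.

```yaml
# Hedged sketch: preload a Fluid dataset into the distributed cache.
apiVersion: data.fluid.io/v1alpha1
kind: DataLoad
metadata:
  name: demo-dataload
spec:
  dataset:
    name: demo-dataset      # the Dataset to warm up (defined elsewhere)
    namespace: default
  target:
  - path: /                 # preload everything under the mount point
    replicas: 1             # number of cache replicas to load, illustrative
```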
Benefits
- GPU Sharing and Scheduling
You can run multiple containers on one GPU by using GPU sharing and scheduling.
- Topology-Aware GPU Scheduling
The scheduler selects the GPU combination with the most suitable interconnect topology to achieve optimal training speed.
- Binpack Scheduling
Jobs are packed onto as few nodes as possible: a job is first allocated to one node, and only when that node's resources are insufficient is it allocated to the next. This minimizes cross-node data transmission and prevents resource fragmentation.
- Gang Scheduling
Resources are allocated to a job only when all subtasks of the job can obtain sufficient resources. This prevents the resource deadlocks in which large jobs preempt the resources of small jobs. A minimal sketch follows this list.
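A minimal sketch of gang scheduling, assuming the pod-group labels recognized by the ACK scheduler; the group name, threshold, and image are illustrative. Every pod of a distributed job carries the same labels, and none is scheduled until min-available of them can all be placed.

```yaml
# Hedged sketch: one worker pod of a gang-scheduled distributed job.
apiVersion: v1
kind: Pod
metadata:
  name: dist-train-worker-0
  labels:
    pod-group.scheduling.sigs.k8s.io/name: dist-train     # shared group name
    pod-group.scheduling.sigs.k8s.io/min-available: "3"   # all-or-nothing threshold
spec:
  containers:
  - name: worker
    image: tensorflow/tensorflow:2.13.0-gpu   # illustrative image
    command: ["python", "train.py"]           # hypothetical training script
    resources:
      limits:
        nvidia.com/gpu: 1
  restartPolicy: Never
```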
Benefits
- Arena AI Toolkit
The Arena command-line tool and SDKs for Go, Java, and Python are compatible with heterogeneous underlying resources, which allows you to manage environments, schedule tasks, allocate GPUs, and monitor resources in a simplified manner (see the usage sketch after this list).
- The toolkit is compatible with various deep learning frameworks and communication libraries, such as TensorFlow, PyTorch, Caffe, Message Passing Interface (MPI), and Horovod. It covers the entire machine learning operations (MLOps) process, including training dataset management, AI task management, model development, distributed training, evaluation, and inference model release.
- The R&D console provides an on-demand algorithm development environment in which you can perform management operations throughout the entire R&D lifecycle, including notebook management, AI task management, model management, and model release.
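For a sense of the workflow, the sketch below shows typical Arena CLI usage for submitting and tracking a training job; the job name, image, and script are hypothetical, and the flags are abbreviated rather than exhaustive.

```shell
# Hedged sketch of the Arena CLI workflow (names and flags illustrative).
arena submit tf \
  --name=mnist-train \
  --gpus=1 \
  --image=tensorflow/tensorflow:2.13.0-gpu \
  "python train.py"

arena list                # list submitted jobs and their status
arena get mnist-train     # inspect the job's status, pods, and endpoints
arena top node            # view per-node GPU allocation
```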
Benefits
- Dashboards for Real-Time GPU Utilization
You can monitor resource utilization from multiple dimensions in real time.
- Dataset Management and Acceleration
You can accelerate access to existing datasets with one click to improve efficiency.
- User and User Group Management
You can create users and user groups based on projects and manage user permissions and quotas in a fine-grained manner.
- Elastic Quota Management
Capacity scheduling allows user groups to dynamically share resources.