This topic describes the architecture of Platform for AI (PAI).
The architecture of PAI consists of the following layers, as shown in the preceding figure:
Basic resources layer (computing resources and infrastructure):
Infrastructure: includes CPU, GPU, high-speed Remote Direct Memory Access (RDMA) network, and Container Service for Kubernetes (ACK) resources.
Computing resources: include cloud-native resources (Lingjun intelligent computing resources and general-purpose computing resources) and big data computing resources (MaxCompute and Flink resources).
Platform and framework layer (PAI-Lingjun AI Computing Service and AI frameworks):
AI frameworks: include frameworks for running distributed computing tasks, such as Alink, TensorFlow, PyTorch, Megatron, and DeepSpeed, as well as support for Reinforcement Learning from Human Feedback (RLHF).
Optimization and acceleration frameworks: include DatasetAcc for dataset acceleration, TorchAcc for training acceleration, Easy Parallel Library (EPL) for parallel training acceleration, Blade for inference acceleration, AIMaster for automatic fault-tolerant training, and EasyCkpt for asynchronous training snapshots that complete within seconds.
PAI provides services for end-to-end machine learning development, covering data preparation, model development and training, and model deployment:
Data preparation: iTAG allows you to label data and manage datasets in multiple scenarios.
Model development and training: PAI provides the following services to meet different modeling requirements: Machine Learning Designer, a visualized modeling service; Data Science Workshop (DSW), for building models through interactive programming; Deep Learning Containers (DLC), a cloud-native platform for training deep learning models; and FeatureStore, for managing model features.
Model deployment: You can use Elastic Algorithm Service (EAS) to deploy models as services.
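The mapping between development stages and PAI services described above can be summarized as a small lookup table. The following is a minimal sketch for illustration only: the names `PAI_SERVICES_BY_STAGE` and `services_for` are hypothetical and are not part of any PAI SDK or API.

```python
from __future__ import annotations

# Illustrative lookup table only; these names are hypothetical and
# do not correspond to any PAI SDK or API.
PAI_SERVICES_BY_STAGE = {
    "data preparation": ["iTAG"],
    "model development and training": [
        "Machine Learning Designer",
        "Data Science Workshop (DSW)",
        "Deep Learning Containers (DLC)",
        "FeatureStore",
    ],
    "model deployment": ["Elastic Algorithm Service (EAS)"],
}


def services_for(stage: str) -> list[str]:
    """Return the services listed for a development stage (hypothetical helper)."""
    key = stage.strip().lower()  # tolerate case and surrounding whitespace
    if key not in PAI_SERVICES_BY_STAGE:
        raise KeyError(f"unknown stage: {stage!r}")
    return list(PAI_SERVICES_BY_STAGE[key])  # copy so callers cannot mutate the table


if __name__ == "__main__":
    for stage in PAI_SERVICES_BY_STAGE:
        print(f"{stage}: {', '.join(services_for(stage))}")
```

A table like this can help when you decide which service to open for a given task. For example, `services_for("model deployment")` returns only EAS.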
Application layer (model services): includes model services such as the ModelScope community, PAI-DashScope, third-party Model as a Service (MaaS) platforms, and Alibaba Cloud Model Studio.
Business layer (scenario-based solutions): PAI is widely used in business scenarios such as autonomous driving, scientific research, financial risk management, and AI recommendations. Within Alibaba Group, the search, recommendation, and financial service systems use PAI to mine data and make informed business decisions.