Elastic Algorithm Service (EAS) lets you quickly deploy trained models as online inference services or AI web applications. EAS supports heterogeneous resources and integrates automatic scaling, one-click stress testing, canary releases, and real-time monitoring, so you can keep services stable under high concurrency while controlling costs.
Core capabilities
EAS provides end-to-end capabilities spanning resource management, model deployment, and service O&M to ensure stable and efficient business operations.
Flexible resource and cost management
Heterogeneous hardware support: EAS supports CPUs, GPUs, and specialized AI accelerator instances to meet the performance needs of different models.
Cost optimization: You can use preemptible instances to significantly reduce computing costs. The scheduled scaling feature lets you set policies in advance based on business cycles to precisely control resource allocation.
Elastic resource pools: When a dedicated resource group reaches capacity, new instances are automatically scheduled to a public resource group. This maintains service stability while keeping costs under control.
Comprehensive stability and high availability
Elastic scaling: Automatically adjusts the number of service replicas based on real-time load. This helps handle unpredictable traffic spikes and prevents resource idling or service overload.
High-availability mechanism: An automatic fault recovery mechanism ensures service continuity. Dedicated resources are physically isolated, which eliminates the risk of resource contention.
Safe releases: EAS supports canary releases, which allow you to allocate a percentage of traffic to a new version for validation. It also supports traffic mirroring. You can copy online traffic to a test service for reliability verification without affecting real user requests.
Efficient deployment and O&M
One-click stress testing: Supports gradually ramping up load and automatically detects the service's performance limits. You can view second-level monitoring data and stress test reports in real time to quickly evaluate service capabilities.
Real-time monitoring: Provides real-time monitoring for key metrics such as queries per second (QPS), response time, and CPU utilization. You can also enable service monitoring alerts to stay informed about service status.
Multiple deployment methods: You can deploy services using a runtime image (recommended) or a processor. This meets the needs of different technology stacks.
Diverse inference modes
Real-time synchronous inference: This mode features high throughput and low latency. It is suitable for scenarios sensitive to response delays, such as search and recommendation or chatbots.
Near real-time asynchronous inference: This mode has a built-in message queue and is suitable for time-consuming tasks such as text-to-image generation and video processing. It supports automatic scaling based on the queue backlog to prevent request pile-ups (the sketch after this list illustrates the submit-and-poll pattern).
Offline batch inference: This mode is suitable for batch processing scenarios that are not sensitive to response times, such as batch conversion of voice data. It also supports preemptible instances to control costs.
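Conceptually, asynchronous inference follows a submit-then-poll pattern. The Python sketch below is illustrative only: the /submit and /result URLs, the payload, and the response fields are hypothetical stand-ins, not the actual EAS asynchronous API, which is exposed through the built-in queue service.

```python
import time
import requests

BASE = "http://<your-endpoint>"          # placeholder: your service endpoint
HEADERS = {"Authorization": "<token>"}   # placeholder: your service token

# Hypothetical submit call: enqueue a long-running request, get back a task ID.
task = requests.post(f"{BASE}/submit", headers=HEADERS,
                     json={"prompt": "a cat in space"}).json()

# Hypothetical poll loop: the queue holds the request until a worker finishes.
while True:
    result = requests.get(f"{BASE}/result/{task['id']}", headers=HEADERS).json()
    if result["status"] == "done":
        print(result["output"])
        break
    time.sleep(2)
```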
How it works (runtime image deployment)
An EAS service runs in one or more isolated container instances. The service startup process involves the following core elements:
Runtime image: A read-only template that contains an operating system, base libraries such as CUDA, a language environment such as Python, and necessary dependencies. You can use an official image provided by PAI or create a custom image to meet specific business needs.
Code and model: Your business logic code and model files. Storing them in Object Storage Service (OSS) or File Storage NAS decouples the code and model from the environment. This allows you to update your business code and models without rebuilding the runtime image.
Storage mounting: When an EAS service starts, it mounts the external storage path you specify to a local directory in the container. This allows the code inside the container to access files on the external storage as if they were local files.
Run command: The first command to execute after the container starts. This command is typically used to start an HTTP service to receive inference requests.
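For example, the run command often launches a small web server. The following is a minimal sketch using Flask; the /predict route, port 8000, and the joblib model file are illustrative choices, not requirements of EAS.

```python
# app.py: a minimal inference server. Assumes the model file was made
# available at /models through storage mounting (see above).
from flask import Flask, request, jsonify
import joblib  # illustrative: any model-loading library works

app = Flask(__name__)
model = joblib.load("/models/model.pkl")  # loaded once at startup

@app.route("/predict", methods=["POST"])
def predict():
    # Expects a JSON body such as {"features": [1.0, 2.0, 3.0]}
    features = request.get_json()["features"]
    return jsonify({"prediction": model.predict([features]).tolist()})

if __name__ == "__main__":
    # EAS forwards requests to the port declared in the service configuration.
    app.run(host="0.0.0.0", port=8000)
```

The corresponding run command would then be something like `python /app/app.py`.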
The process is as follows:
1. EAS pulls the specified runtime image and creates a container.
2. EAS mounts the external storage to the specified path in the container.
3. EAS executes the run command inside the container.
4. After the command runs successfully, the service listens on the specified port and processes inference requests.
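Putting these elements together, a custom-image service description might look like the sketch below. The field names follow the general shape of an EAS service configuration file, but treat them as illustrative and check the EAS documentation for the exact schema; the image address, OSS path, and resource sizes are placeholders.

```json
{
  "name": "demo_inference",
  "containers": [
    {
      "image": "registry.cn-hangzhou.aliyuncs.com/<namespace>/my-runtime:v1",
      "script": "python /app/app.py",
      "port": 8000
    }
  ],
  "storage": [
    {
      "oss": { "path": "oss://my-bucket/models/" },
      "mount_path": "/models"
    }
  ],
  "metadata": {
    "instance": 2,
    "cpu": 4,
    "memory": 8000
  }
}
```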
EAS supports two deployment methods: runtime image deployment and processor deployment. We recommend runtime image deployment because it offers greater flexibility and maintainability, whereas processor deployment has known limitations in the environments and frameworks it supports.
Usage flow
Step 1: Preparations
Prepare inference resources: Select an appropriate EAS resource type based on your model size, concurrency requirements, and budget. For guidance on resource selection and purchase configuration, see Overview of EAS deployment resources.
Note: You must purchase dedicated EAS resources or Lingjun resources before you can use them.
Prepare files: Upload your trained model, code files, and dependencies to a cloud storage service such as OSS. You can then access these files in your service by using storage mounting.
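For example, assuming the ossutil CLI is installed and configured with your credentials, uploading a local model directory could look like this (the bucket name and paths are placeholders):

```shell
# Recursively upload the local model directory to OSS
ossutil cp -r ./model oss://my-bucket/models/
```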
Step 2: Deploy the service
You can deploy and manage services using the console, the EASCMD command line, or an SDK.
Console: Provides custom deployment and scenario-based deployment methods. The console is easy to use and suitable for beginners.
EASCMD command line: Supports operations such as creating, updating, and viewing services. Suitable for algorithm engineers who are familiar with EAS deployment (see the example after this list).
SDK: Suitable for large-scale, unified scheduling and O&M.
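As a sketch of the EASCMD flow, assuming the client is installed and authenticated, and that service.json is a configuration file like the one shown earlier (the service name demo_inference is a placeholder):

```shell
# Create the service from a JSON configuration file
eascmd create service.json

# Check the deployment status and endpoint information
eascmd desc demo_inference
```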
Step 3: Invoke and stress test the service
Web application: If you deploy your service as an AI-Web application, you can open the interactive page directly in a browser to test it.
API service: Use the online debugging feature to verify the service's functionality. You can also make synchronous or asynchronous invocations through the API (see the sketch after this list). For more information, see Service invocation.
Service stress testing: Use the built-in one-click stress testing tool to test the service's performance under pressure. For more information, see Service stress testing.
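As a minimal synchronous invocation sketch over plain HTTP: the endpoint URL, token, and payload below are placeholders; copy the real endpoint and token from the service details page.

```python
import requests

# Placeholders: find the real endpoint and token on the service details page.
url = "http://<your-endpoint>/api/predict/demo_inference"
token = "<your-service-token>"

resp = requests.post(
    url,
    headers={"Authorization": token},
    json={"features": [1.0, 2.0, 3.0]},  # payload format depends on your service
    timeout=10,
)
resp.raise_for_status()
print(resp.json())
```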
Step 4: Monitor and manage the service
Monitoring and alerts: View the running status of your service in the Inference Services list. Enable service monitoring alerts to track service health in real time.
Elastic scaling: Configure automatic scaling or scheduled scaling policies based on your business needs to dynamically manage compute resources.
Service updates: In the Actions column, click Update to deploy a new version. After the update is complete, you can view the version information or switch between versions.
Warning: The service is temporarily interrupted during an update, which may cause dependent requests to fail. Proceed with caution.
Important notes
If an EAS service remains in a non-Running state for 180 consecutive days, the system automatically deletes the service.
For information about the regions where EAS is available, see Regions and zones.
Billing
For more information, see Billing of Elastic Algorithm Service (EAS).
Quick start
For more information, see Quick Start for Elastic Algorithm Service (EAS).
FAQ
Q: What is the difference between dedicated resources and public resources?
Public resources: Suitable for development, testing, or small-scale applications that are cost-sensitive and can tolerate performance fluctuations. Public resources are low-cost but may experience resource contention during peak hours.
Dedicated resources: Suitable for core services in a production environment that require high service stability and performance. Dedicated resources are physically isolated, which eliminates the risk of preemption. The elastic resource pool feature allows traffic to automatically overflow to public resources when dedicated resources are fully utilized. This balances cost with business stability during peak hours. To reserve instance types that have limited inventory, you must purchase them as dedicated resources.
Q: What are the advantages of EAS compared to self-managed services?
EAS provides managed O&M. It automatically handles resource scheduling, fault recovery, and monitoring. It also provides standardized features such as elastic scaling and canary releases. This allows developers to focus on model development, which reduces O&M costs and accelerates time to market.