Platform for AI (PAI) provides an SDK for Python that contains easy-to-use high-level APIs. You can use the SDK to deploy models as inference services in PAI. This topic describes how to use the PAI SDK for Python to deploy inference services in PAI.
Introduction
The PAI SDK for Python contains the following high-level APIs: pai.model.Model and pai.predictor.Predictor. You can use the SDK to deploy models to Elastic Algorithm Service (EAS) of PAI and test the model services.
To use the SDK to deploy an inference service, perform the following steps:
1. Specify the configurations of the inference service in a pai.model.InferenceSpec object. The configurations include the processor or image that you want to use for service deployment.
2. Create a pai.model.Model object by using the InferenceSpec object and the model file.
3. Call the pai.model.Model.deploy() method to deploy the inference service. In the method, specify information about service deployment, such as the required resources and the service name.
4. Use the pai.predictor.Predictor object returned by the deploy method to send inference requests by calling the predict method.
Sample code:
from pai.model import InferenceSpec, Model, container_serving_spec
from pai.image import retrieve, ImageScope
# 1. Use a PyTorch image provided by PAI for model inference.
torch_image = retrieve("PyTorch", framework_version="latest",
image_scope=ImageScope.INFERENCE)
# 2. Specify the configurations of the inference service in the InferenceSpec object.
inference_spec = container_serving_spec(
# The startup command of the inference service.
command="python app.py",
source_dir="./src/"
# The image used for model inference.
image_uri=torch_image.image_uri,
)
# 3. Create a Model object.
model = Model(
# Use a model file stored in an Object Storage Service (OSS) bucket.
model_data="oss://<YourBucket>/path-to-model-data",
inference_spec=inference_spec,
)
# 4. Deploy the model as an online inference service in EAS and obtain a Predictor object.
predictor = model.deploy(
service_name="example_torch_service",
instance_type="ecs.c6.xlarge",
)
# 5. Test the inference service.
res = predictor.predict(data=data)
The following sections describe how to use the SDK for Python to deploy an inference service and provide the corresponding sample code.
Configure InferenceSpec
You can deploy an inference service by using a processor or an image. The pai.model.InferenceSpec
object defines the configurations of the inference service, such as the processor or image used for service deployment, service storage paths, warmup request files, and remote procedure call (RPC) batching.
Deploy an inference service by using a built-in processor
A processor is a package that contains online prediction logic. You can use a processor to directly deploy a model as an inference service. PAI provides built-in processors that support common machine learning model formats, such as TensorFlow SavedModel, PyTorch TorchScript, XGBoost, LightGBM, and PMML. For more information, see Built-in processors.
Sample InferenceSpec configurations:

# Use a built-in TensorFlow processor.
tf_infer_spec = InferenceSpec(processor="tensorflow_cpu_2.3")
# Use a built-in PyTorch processor.
torch_infer_spec = InferenceSpec(processor="pytorch_cpu_1.10")
# Use a built-in XGBoost processor.
xgb_infer_spec = InferenceSpec(processor="xgboost")
You can configure additional features for the inference service in the InferenceSpec object, such as warmup requests and RPC batching. For information about advanced configurations, see Parameters of model services.

# Configure the properties of InferenceSpec.
# Specify the path of the warmup request file.
tf_infer_spec.warm_up_data_path = "oss://<YourOssBucket>/path/to/warmup-data"
# Specify the maximum processing time for a single request. Unit: milliseconds.
tf_infer_spec.metadata.rpc.keepalive = 1000
print(tf_infer_spec.warm_up_data_path)
print(tf_infer_spec.metadata.rpc.keepalive)
Deploy an inference service by using an image
Processors simplify model deployment procedures, but cannot meet custom deployment requirements, especially when models or inference services have complex dependencies. To address this issue, PAI supports flexible model deployment by using an image.
You can package the code and dependencies of a model into a Docker image and push the Docker image to Alibaba Cloud Container Registry (ACR). Then, you can create an InferenceSpec object based on the Docker image.
from pai.model import InferenceSpec, Model, container_serving_spec

# Call the container_serving_spec method to create an InferenceSpec object from an image.
container_infer_spec = container_serving_spec(
    # The image used to run the inference service.
    image_uri="<CustomImageUri>",
    # The port on which the inference service listens. PAI forwards inference requests to this port.
    port=8000,
    environment_variables=environment_variables,
    # The startup command of the inference service.
    command=command,
    # The Python packages required by the inference service.
    requirements=[
        "scikit-learn",
        "fastapi==0.87.0",
    ],
)
print(container_infer_spec.to_dict())

m = Model(
    model_data="oss://<YourOssBucket>/path-to-tensorflow-saved-model",
    inference_spec=container_infer_spec,
)
p = m.deploy(
    instance_type="ecs.c6.xlarge",
)
If you want to use a custom image, you must integrate your inference code into a container, build an image, and then push the image to ACR. The PAI SDK for Python simplifies this process: you can add your code to a base image to build a custom image, which saves you from building an image from scratch. In the pai.model.container_serving_spec() method, set the source_dir parameter to an on-premises directory that contains the inference code. The SDK automatically packages and uploads the directory to an OSS bucket and mounts the OSS path to the container. You can then specify the startup command to start the inference service.

from pai.model import container_serving_spec

inference_spec = container_serving_spec(
    # The on-premises directory that contains the inference code. The directory is uploaded to an OSS bucket, and the OSS path is mounted to the container. Default container path: /ml/usercode/.
    source_dir="./src",
    # The startup command of the inference service. If you specify the source_dir parameter, the /ml/usercode directory is used as the working directory of the container by default.
    command="python run.py",
    image_uri="<ServingImageUri>",
    requirements=[
        "fastapi",
        "uvicorn",
    ],
)
print(inference_spec.to_dict())
If you want to add code or models to the container, you can call the pai.model.InferenceSpec.mount() method to mount an on-premises directory or an OSS path to the container.

# Upload the on-premises data to OSS and mount the OSS path to the /ml/tokenizers directory in the container.
inference_spec.mount("./bert_tokenizers/", "/ml/tokenizers/")
# Mount the OSS path to the /ml/data directory in the container.
inference_spec.mount("oss://<YourOssBucket>/path/to/data/", "/ml/data/")
Obtain public images provided by PAI
PAI provides multiple inference images based on common machine learning frameworks, such as TensorFlow, PyTorch, and XGBoost. You can set the image_scope parameter to ImageScope.INFERENCE in the pai.image.list_images and pai.image.retrieve methods to obtain the inference images.

from pai.image import retrieve, ImageScope, list_images

# Obtain all PyTorch inference images provided by PAI.
for image_info in list_images(framework_name="PyTorch", image_scope=ImageScope.INFERENCE):
    print(image_info)
# Obtain a PyTorch 1.12 image for CPU-based inference.
retrieve(framework_name="PyTorch", framework_version="1.12", image_scope=ImageScope.INFERENCE)
# Obtain a PyTorch 1.12 image for GPU-based inference.
retrieve(framework_name="PyTorch", framework_version="1.12", accelerator_type="GPU", image_scope=ImageScope.INFERENCE)
# Obtain the image that supports the latest version of PyTorch for GPU-based inference.
retrieve(framework_name="PyTorch", framework_version="latest", accelerator_type="GPU", image_scope=ImageScope.INFERENCE)
Deploy an inference service and send inference requests
Deploy an inference service
Create a pai.model.Model
object by using the pai.model.InferenceSpec
object and the model_data
parameter. Then, call the deploy
method to deploy the model. The model_data
parameter specifies the path of the model. The value of the parameter can be an OSS URI or an on-premises path. If you specify an on-premises path, the model file stored in the path is uploaded to an OSS bucket and then loaded from the OSS bucket to the inference service.
In the deploy
method, specify the parameters of the inference service, such as the required resources, the number of instances, and the service name. For information about advanced configurations, see Parameters of model services.
from pai.model import Model, InferenceSpec
from pai.predictor import Predictor
model = Model(
# The path of the model, which can be an OSS URI or an on-premises path. If you specify an on-premises path, the model file stored in the path is uploaded to an OSS bucket by default.
model_data="oss://<YourBucket>/path-to-model-data",
inference_spec=inference_spec,
)
# Deploy the inference service in EAS.
predictor = model.deploy(
# The name of the inference service.
service_name="example_xgb_service",
# The instance type used for the inference service.
instance_type="ecs.c6.xlarge",
# The number of instances.
instance_count=2,
# Optional. Use a dedicated resource group for service deployment. By default, the public resource group is used.
# resource_id="<YOUR_EAS_RESOURCE_GROUP_ID>",
options={
"metadata.rpc.batching": True,
"metadata.rpc.keepalive": 50000,
"metadata.rpc.max_batch_size": 16,
"warm_up_data_path": "oss://<YourOssBucketName>/path-to-warmup-data",
},
)
You can also use the resource_config parameter to specify the resources used by each service instance, such as the number of vCPUs and the memory size.
from pai.model import ResourceConfig
predictor = model.deploy(
service_name="dedicated_rg_service",
# Specify the number of vCPUs and the memory size of each service instance.
# In this example, each service instance has two vCPUs and 4,000 MB of memory.
resource_config=ResourceConfig(
cpu=2,
memory=4000,
),
)
Send requests to an inference service
The pai.model.Model.deploy method calls EAS API operations to deploy an inference service and returns a pai.predictor.Predictor object. You can use the predict and raw_predict methods of the Predictor object to send inference requests.
The input and output of the pai.predictor.Predictor.raw_predict
method do not need to be processed by a serializer.
from pai.predictor import Predictor, EndpointType, RawResponse
# Deploy an inference service.
predictor = model.deploy(
instance_type="ecs.c6.xlarge",
service_name="example_xgb_service",
)
# Create a Predictor object for an existing inference service to which you want to send requests.
predictor = Predictor(
service_name="example_xgb_service",
# By default, you can access the inference service over the Internet. To access the inference service over a virtual private cloud (VPC) endpoint, you can set the endpoint type to INTRANET. In this case, the client must be deployed in the VPC.
# endpoint_type=EndpointType.INTRANET
)
# Use the predict method to send a request to the inference service and obtain the result. The input and output are processed by a serializer.
res = predictor.predict(data_in_nested_list)
# Use the raw_predict method to send a request to the inference service in a more flexible manner.
response: RawResponse = predictor.raw_predict(
# The input data of the bytes type and file-like objects can be directly passed to the HTTP request body.
# Other data is serialized into JSON-formatted data and then passed to the HTTP request body.
data=data_in_nested_list
# path="predict", # The path of HTTP requests. Default value: "/".
# headers=dict(), # The request header.
# method="POST", # The HTTP request method.
# timeout=30, # The request timeout period. Unit: seconds.
)
# Obtain the returned HTTP body and header.
print(response.content, response.headers)
# Deserialize the returned JSON-formatted data into a Python object.
print(response.json())
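# A minimal sketch of sending raw data: input of the bytes type and file-like objects are passed
# to the HTTP request body as-is, without being processed by a serializer. The file name
# test.jpg and the assumption that the service accepts raw image bytes are hypothetical.
with open("test.jpg", "rb") as f:
    raw_response = predictor.raw_predict(data=f)
print(raw_response.content)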
# Stop the inference service.
predictor.stop_service()
# Start the inference service.
predictor.start_service()
# Delete the inference service.
predictor.delete_service()
Use a serializer to process the input and output
When you call the pai.predictor.Predictor.predict method for model inference, the input Python data must be serialized into a data format that is supported by the inference service, and the returned result must be deserialized into a readable or operable Python object. The Predictor object uses a serializer class to perform the serialization and deserialization.
- When you call the predict(data=<PredictionData>) method, the request data that you pass in the data parameter is serialized into the bytes format by the serializer.serialize method. The serialized data is then sent to the inference service in the HTTP request body.
- When the inference service returns an HTTP response, the Predictor object deserializes the response by calling the serializer.deserialize method. The predict method returns the deserialized result.
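The following is a minimal sketch of this round trip that calls the serializer methods directly by using the built-in JsonSerializer. In practice, the Predictor object performs these calls for you, and the sample request and response payloads are hypothetical.

from pai.serializers import JsonSerializer

serializer = JsonSerializer()
# serialize: convert the Python input into the payload that is sent in the HTTP request body.
request_body = serializer.serialize([[2, 3, 4], [4, 5, 6]])
# deserialize: convert a hypothetical JSON response body back into a Python object.
result = serializer.deserialize(b'[[0.1], [0.2]]')
print(request_body, result)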
The PAI SDK for Python provides multiple built-in serializers for common data formats. The serializers can process the input and output of the built-in processors provided by PAI.
JsonSerializer
JsonSerializer serializes objects into JSON strings and deserializes JSON strings into objects. The input data of the predict method can be a NumPy array or a list. The JsonSerializer.serialize method serializes the input data into a JSON string. The JsonSerializer.deserialize method deserializes the returned JSON string into a Python object.
Specific built-in processors, such as the XGBoost processor and the PMML processor, receive and return only JSON-formatted data. By default, JsonSerializer is used to process the input and output of these processors.
from pai.model import Model, InferenceSpec
from pai.predictor import Predictor
from pai.serializers import JsonSerializer
# In the deploy method, specify the serializer that you want to use.
p = Model(
inference_spec=InferenceSpec(processor="xgboost"),
model_data="oss://<YourOssBucket>/path-to-xgboost-model"
).deploy(
instance_type="ecs.c6.xlarge",
# Optional. By default, JsonSerializer is used to process the input and output of the XGBoost processor.
serializer=JsonSerializer()
)
# You can also specify a serializer when you create a Predictor object.
p = Predictor(
service_name="example_xgb_service"
serializer=JsonSerializer(),
)
# The returned result is a list.
res = p.predict([[2,3,4], [4,5,6]])
TensorFlowSerializer
You can use the built-in TensorFlow processor to deploy TensorFlow models in the SavedModel format in PAI. The input and output of the TensorFlow services are protocol buffers messages. For information about the data format, see tf_predict.proto.
The PAI SDK for Python provides a built-in TensorFlowSerializer, which allows you to send an inference request as a NumPy array. The serializer serializes NumPy arrays into protocol buffers messages and deserializes the returned protocol buffers messages into NumPy arrays.
import numpy

from pai.model import Model, InferenceSpec

# Deploy a model service by using the TensorFlow processor.
tf_predictor = Model(
inference_spec=InferenceSpec(processor="tensorflow_cpu_2.7"),
model_data="oss://<YourOssBucket>/path-to-tensorflow-saved-model"
).deploy(
instance_type="ecs.c6.xlarge",
# Optional. By default, TensorFlowSerializer is used to process the input and output of the TensorFlow processor.
# serializer=TensorFlowSerializer(),
)
# You can obtain the service signature by calling an API operation.
print(tf_predictor.inspect_signature_def())
# The input of the TensorFlow processor is of the dictionary type. The dictionary key is the name of the input signature. The dictionary value is the specific input data.
tf_result = tf_predictor.predict(data={
"flatten_input": numpy.zeros(28*28*2).reshape((-1, 28, 28))
})
assert result["dense_1"].shape == (2, 10)
PyTorchSerializer
You can use the built-in PyTorch processor to deploy PyTorch models in the TorchScript format in PAI. The input and output of the PyTorch services are protocol buffers messages. For information about the data format, see pytorch_predict.proto.
The PAI SDK for Python provides a built-in PyTorchSerializer, which allows you to send an inference request as a NumPy array. The serializer serializes NumPy arrays into protocol buffers messages and deserializes the returned protocol buffers messages into NumPy arrays.
import numpy

from pai.model import Model, InferenceSpec

# Deploy a model service by using the PyTorch processor.
torch_predictor = Model(
inference_spec=InferenceSpec(processor="pytorch_cpu_1.10"),
model_data="oss://<YourOssBucket>/path-to-torch_script-model"
).deploy(
instance_type="ecs.c6.xlarge",
# Optional. By default, PyTorchSerializer is used to process the input and output of the PyTorch processor.
# serializer=PyTorchSerializer(),
)
#1. Convert the input data into a format supported by the model service.
#2. Use a list or tuple for multiple inputs. Each element is a NumPy array.
torch_result = torch_predictor.predict(data=numpy.zeros(28 * 28 * 2).reshape((-1, 28, 28)))
assert torch_result.shape == (2, 10)
Custom serializer
You can use the pai.serializers.SerializerBase
class to create a custom serializer
based on the supported data formats of the inference service.
In this section, a custom NumpySerializer
is used as an example to show how a serializer performs serialization and deserialization.
1. Client: The NumpySerializer.serialize method is called to serialize the NumPy array or pandas DataFrame input into the .npy format. The serialized data is sent to the server.
2. Server: The inference service deserializes the received data in the .npy format, generates the inference result, and then serializes the result into the .npy format. The serialized result is returned to the client.
3. Client: The NumpySerializer.deserialize method is called to deserialize the returned result into a NumPy array.
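The following is a minimal sketch of such a NumpySerializer. It assumes that a subclass of pai.serializers.SerializerBase only needs to implement the serialize and deserialize methods described above; the actual base class may require additional hooks, and the usage line is hypothetical.

import io

import numpy as np
from pai.serializers import SerializerBase


class NumpySerializer(SerializerBase):
    """Serialize request data into the .npy format and deserialize .npy responses."""

    def serialize(self, data) -> bytes:
        # Convert the NumPy array or pandas DataFrame input into .npy bytes for the request body.
        buffer = io.BytesIO()
        np.save(buffer, np.asarray(data))
        return buffer.getvalue()

    def deserialize(self, data: bytes):
        # Load the .npy bytes returned by the inference service into a NumPy array.
        return np.load(io.BytesIO(data))


# Hypothetical usage: attach the custom serializer to a Predictor object for an existing service.
# predictor = Predictor(service_name="<YourServiceName>", serializer=NumpySerializer())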
Deploy an inference service in an on-premises environment
The PAI SDK for Python also allows you to deploy an inference service in an on-premises environment by using a custom image. To run an inference service in an on-premises environment, set the instance_type
parameter to local in the model.deploy
method. The SDK uses a Docker container
to run an inference service on your on-premises machine. The model is automatically downloaded from the OSS bucket and mounted to the container that runs on your on-premises machine.
from pai.predictor import LocalPredictor
from pai.serializers import JsonSerializer
p: LocalPredictor = model.deploy(
# Specify to deploy the inference service in an on-premises environment.
instance_type="local",
serializer=JsonSerializer()
)
p.predict(data)
# Delete the Docker container.
p.delete_service()
References
For information about how to use the PAI SDK for Python to train and deploy a PyTorch model, see Train and deploy a PyTorch model.