Platform for AI: Deploy inference services

Last Updated: Jul 23, 2024

Platform for AI (PAI) provides an SDK for Python that contains easy-to-use high-level APIs. You can use the SDK to deploy models as inference services in PAI. This topic describes how to use the PAI SDK for Python to deploy inference services in PAI.

Introduction

The PAI SDK for Python contains the following high-level APIs: pai.model.Model and pai.predictor.Predictor. You can use the SDK to deploy models to Elastic Algorithm Service (EAS) of PAI and test the model services.

To use the SDK to deploy an inference service, perform the following steps:

  • Specify the configurations of the inference service in the pai.model.InferenceSpec object. The configurations include the processor or image that you want to use for service deployment.

  • Create a pai.model.Model object by using the InferenceSpec object and the model file.

  • Call the pai.model.Model.deploy() method to deploy the inference service. In the method, specify information about service deployment, such as the required resources and the service name.

  • The deploy method returns a pai.predictor.Predictor object. Call the predict method of the Predictor object to send inference requests.

Sample code:

from pai.model import InferenceSpec, Model, container_serving_spec
from pai.image import retrieve, ImageScope

# 1. Use a PyTorch image provided by PAI for model inference. 
torch_image = retrieve("PyTorch", framework_version="latest",
    image_scope=ImageScope.INFERENCE)


# 2. Specify the configurations of the inference service in the InferenceSpec object. 
inference_spec = container_serving_spec(
    # The startup command of the inference service. 
    command="python app.py",
    # The on-premises directory that contains the inference code. 
    source_dir="./src/",
    # The image used for model inference. 
    image_uri=torch_image.image_uri,
)


# 3. Create a Model object. 
model = Model(
    # Use a model file stored in an Object Storage Service (OSS) bucket. 
    model_data="oss://<YourBucket>/path-to-model-data",
    inference_spec=inference_spec,
)

# 4. Deploy the model as an online inference service in EAS and obtain a Predictor object. 
predictor = model.deploy(
    service_name="example_torch_service",
    instance_type="ecs.c6.xlarge",
)

# 5. Test the inference service. 
# Replace data with request data in a format that is supported by the inference service. 
res = predictor.predict(data=data)

The following sections describe how to use the SDK for Python to deploy an inference service and provide the corresponding sample code.

Configure InferenceSpec

You can deploy an inference service by using a processor or an image. The pai.model.InferenceSpec object defines the configurations of the inference service, such as the processor or image used for service deployment, service storage paths, warmup request files, and remote procedure call (RPC) batching.

Deploy an inference service by using a built-in processor

A processor is a package that contains online prediction logic. You can use a processor to directly deploy a model as an inference service. PAI provides built-in processors that support common machine learning model formats, such as TensorFlow SavedModel, PyTorch TorchScript, XGBoost, LightGBM, and PMML. For more information, see Built-in processors.

  • Sample InferenceSpec configurations:

    # Use a built-in TensorFlow processor. 
    tf_infer_spec = InferenceSpec(processor="tensorflow_cpu_2.3")
    
    
    # Use a built-in PyTorch processor. 
    torch_infer_spec = InferenceSpec(processor="pytorch_cpu_1.10")
    
    # Use a built-in XGBoost processor. 
    xgb_infer_spec = InferenceSpec(processor="xgboost")
    
  • You can configure additional features for the inference service in the InferenceSpec object, such as warmup requests and RPC settings. For information about advanced configurations, see Parameters of model services.

    # Configure the properties of InferenceSpec. 
    tf_infer_spec.warm_up_data_path = "oss://<YourOssBucket>/path/to/warmup-data" # Specify the path of the warmup request file. 
    tf_infer_spec.metadata.rpc.keepalive = 1000 # Specify the maximum processing time for a single request. Unit: milliseconds. 
    
    print(tf_infer_spec.warm_up_data_path)
    print(tf_infer_spec.metadata.rpc.keepalive)
    

Deploy an inference service by using an image

Processors simplify the model deployment procedure, but they cannot meet custom deployment requirements, especially when a model or inference service has complex dependencies. To address this issue, PAI supports flexible model deployment by using an image.

  • You can package the code and dependencies of a model into a Docker image and push the Docker image to Alibaba Cloud Container Registry (ACR). Then, you can create an InferenceSpec object based on the Docker image.

    from pai.model import InferenceSpec, Model, container_serving_spec
    
    # Call the container_serving_spec method to create an InferenceSpec object from an image. 
    container_infer_spec = container_serving_spec(
        # The image used to run the inference service. 
        image_uri="<CustomImageUri>",
        # The port on which the inference service listens. The inference requests are forwarded to this port by PAI. 
        port=8000,
        # The environment variables used by the inference service. 
        environment_variables=environment_variables,
        # The startup command of the inference service. 
        command=command,
        # The Python package required by the inference service. 
        requirements=[
            "scikit-learn",
            "fastapi==0.87.0",
        ],
    )
    
    
    print(container_infer_spec.to_dict())
    
    m = Model(
        model_data="oss://<YourOssBucket>/path-to-tensorflow-saved-model",
        inference_spec=container_infer_spec,
    )
    p = m.deploy(
        instance_type="ecs.c6.xlarge"
    )
  • If you want to use a custom image, integrate your inference code into a container, build an image, and then push the image to ACR. The PAI SDK for Python simplifies this process: you can add your code to a base image to build a custom image, so that you do not need to build an image from scratch. In the pai.model.container_serving_spec() method, set the source_dir parameter to an on-premises directory that contains the inference code. The SDK automatically packages and uploads the directory to an OSS bucket and mounts the OSS path to the container. You can then specify the startup command to start the inference service. A minimal sketch of such an inference script is provided after this list.

    from pai.model import container_serving_spec
    
    inference_spec = container_serving_spec(
        # The on-premises directory that contains the inference code. The directory is uploaded to an OSS bucket, and the OSS path is mounted to the container. Default container path: /ml/usercode/. 
        source_dir="./src",
        # The startup command of the inference service. If you specify the source_dir parameter, the /ml/usercode directory is used as the working directory of the container by default. 
        command="python run.py",
        image_uri="<ServingImageUri>",
        requirements=[
            "fastapi",
            "uvicorn",
        ]
    )
    print(inference_spec.to_dict())
  • If you want to add code or models to the container, you can call the pai.model.InferenceSpec.mount() method to mount an on-premises directory or an OSS path to the container.

    # Upload the on-premises data to OSS and mount the OSS path to the /ml/tokenizers directory in the container. 
    inference_spec.mount("./bert_tokenizers/", "/ml/tokenizers/")
    
    # Mount the OSS path to the /ml/data directory in the container. 
    inference_spec.mount("oss://<YourOssBucket>/path/to/data/", "/ml/data/")
    
  • Obtain public images provided by PAI

    PAI provides multiple inference images based on common machine learning frameworks, such as TensorFlow, PyTorch, and XGBoost. You can set the image_scope parameter to ImageScope.INFERENCE in the pai.image.list_images and pai.image.retrieve methods to obtain the inference images.

    from pai.image import retrieve, ImageScope, list_images
    
    # Obtain all PyTorch inference images provided by PAI. 
    for image_info in list_images(framework_name="PyTorch", image_scope=ImageScope.INFERENCE):
        print(image_info)
    
    
    # Obtain PyTorch 1.12 images for CPU-based inference. 
    retrieve(framework_name="PyTorch", framework_version="1.12", image_scope=ImageScope.INFERENCE)
    
    # Obtain PyTorch 1.12 images for GPU-based inference. 
    retrieve(framework_name="PyTorch", framework_version="1.12", accelerator_type="GPU", image_scope=ImageScope.INFERENCE)
    
    # Obtain the images that support the latest version of PyTorch for GPU-based inference. 
    retrieve(framework_name="PyTorch", framework_version="latest", accelerator_type="GPU", image_scope=ImageScope.INFERENCE)
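
The preceding snippets assume that the source directory contains an inference script, such as app.py or run.py, that starts an HTTP server on the port configured in container_serving_spec. The following is a minimal, hypothetical sketch of such a script based on FastAPI and Uvicorn, which are listed in the requirements above. The route, handler logic, and response format are placeholders that you must adapt to your model and preprocessing logic.

# run.py: a hypothetical minimal inference server for container-based deployment. 
from fastapi import FastAPI, Request
import uvicorn

app = FastAPI()

@app.post("/")
async def predict(request: Request):
    # EAS forwards inference requests to the port configured in container_serving_spec. 
    payload = await request.body()
    # Replace this echo logic with preprocessing, model inference, and postprocessing. 
    return {"received_bytes": len(payload)}

if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=8000)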
    

Deploy an inference service and send inference requests

Deploy an inference service

Create a pai.model.Model object by using the pai.model.InferenceSpec object and the model_data parameter. Then, call the deploy method to deploy the model. The model_data parameter specifies the path of the model. The value of the parameter can be an OSS URI or an on-premises path. If you specify an on-premises path, the model file stored in the path is uploaded to an OSS bucket and then loaded from the OSS bucket to the inference service.

In the deploy method, specify the parameters of the inference service, such as the required resources, the number of instances, and the service name. For information about advanced configurations, see Parameters of model services.

from pai.model import Model, InferenceSpec
from pai.predictor import Predictor

model = Model(
    # The path of the model, which can be an OSS URI or an on-premises path. If you specify an on-premises path, the model file stored in the path is uploaded to an OSS bucket by default. 
    model_data="oss://<YourBucket>/path-to-model-data",
    inference_spec=inference_spec,
)

# Deploy the inference service in EAS. 
predictor = model.deploy(
    # The name of the inference service. 
    service_name="example_xgb_service",
    # The instance type used for the inference service. 
    instance_type="ecs.c6.xlarge",
    # The number of instances. 
    instance_count=2,
    # Optional. Use a dedicated resource group for service deployment. By default, the public resource group is used. 
    # resource_id="<YOUR_EAS_RESOURCE_GROUP_ID>",
    options={
        "metadata.rpc.batching": True,
        "metadata.rpc.keepalive": 50000,
        "metadata.rpc.max_batch_size": 16,
        "warm_up_data_path": "oss://<YourOssBucketName>/path-to-warmup-data",
    },
)

You can also use the resource_config parameter to specify the number of resources used for service deployment, such as the number of vCPUs and the memory size of each service instance.

from pai.model import ResourceConfig

predictor = model.deploy(
    service_name="dedicated_rg_service",
    # Specify the number of vCPUs and the memory size of each service instance. 
    # In this example, each service instance has two vCPUs and 4,000 MB of memory. 
    resource_config=ResourceConfig(
        cpu=2,
        memory=4000,
    ),
)

Send requests to an inference service

The pai.model.Model.deploy method calls EAS API operations to deploy an inference service and returns the corresponding pai.predictor.Predictor object. You can use the predict and raw_predict methods of the Predictor object to send inference requests.

Note

The input and output of the pai.predictor.Predictor.raw_predict method are not processed by a serializer.

from pai.predictor import Predictor, EndpointType, RawResponse

# Deploy an inference service. 
predictor = model.deploy(
    instance_type="ecs.c6.xlarge",
    service_name="example_xgb_service",
)

# You can also create a Predictor object for an existing inference service by specifying the service name. 
predictor = Predictor(
    service_name="example_xgb_service",
    # By default, you can access the inference service over the Internet. To access the inference service over a virtual private cloud (VPC) endpoint, you can set the endpoint type to INTRANET. In this case, the client must be deployed in the VPC. 
    # endpoint_type=EndpointType.INTRANET
)

# Use the predict method to send a request to the inference service and obtain the result. The input and output are processed by a serializer. 
res = predictor.predict(data_in_nested_list)


# Use the raw_predict method to send a request to the inference service in a more flexible manner. 
response: RawResponse = predictor.raw_predict(
    # Input data of the bytes type and file-like objects are directly passed to the HTTP request body. 
    # Other data is serialized into JSON-formatted data and then passed to the HTTP request body. 
    data=data_in_nested_list,
    # path="predict", # The path of HTTP requests. Default value: "/". 
    # headers=dict(), # The request header. 
    # method="POST", # The HTTP request method. 
    # timeout=30, # The request timeout period. Unit: seconds. 
)

# Obtain the returned HTTP body and header. 
print(response.content, response.headers)
# Deserialize the returned JSON-formatted data into a Python object. 
print(response.json())

    
# Stop the inference service. 
predictor.stop_service()
# Start the inference service. 
predictor.start_service()
# Delete the inference service. 
predictor.delete_service()

Use a serializer to process the input and output

When you call the pai.predictor.Predictor.predict method for model inference, you must serialize the input Python data into a data format that is supported by the inference service and deserialize the returned result into a readable or operable Python object. The Predictor object uses the serializer class to perform serialization and deserialization.

  • When you call the predict(data=<PredictionData>) method, the serializer.serialize method is called to serialize the request data that you pass in the data parameter into the bytes format. Then, the serialized data is passed to the inference service through the HTTP request body.

  • When the inference service returns an HTTP response, the Predictor object deserializes the response by calling the serializer.deserialize method. The predict method returns the deserialized result. The sketch after this list illustrates the flow.
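
The following sketch is not part of the SDK; it only illustrates the flow described above by combining the raw_predict method with an explicit serializer. The function name is hypothetical.

from pai.predictor import Predictor
from pai.serializers import JsonSerializer

def predict_with_explicit_serializer(predictor: Predictor, data):
    # Roughly what predictor.predict(data) does internally when JsonSerializer is configured. 
    serializer = JsonSerializer()
    request_body = serializer.serialize(data)             # Python object -> bytes
    response = predictor.raw_predict(data=request_body)   # The bytes are sent as the HTTP request body. 
    return serializer.deserialize(response.content)       # Response bytes -> Python object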

The PAI SDK for Python provides multiple built-in serializers for common data formats. The serializers can process the input and output of the built-in processors provided by PAI.

  • JsonSerializer

    JsonSerializer serializes objects into JSON strings and deserializes JSON strings into objects. The input data of the predict method can be a NumPy array or a list. The JsonSerializer.serialize method serializes the input data into a JSON string. The JsonSerializer.deserialize method deserializes the returned JSON string into a Python object.

    Specific built-in processors, such as XGBoost processors and PMML processors, receive and return only JSON-formatted data. By default, JsonSerializer is used to process the input and output of these processors.

    from pai.serializers import JsonSerializer
    
    # In the deploy method, specify the serializer that you want to use. 
    p = Model(
        inference_spec=InferenceSpec(processor="xgboost"),
        model_data="oss://<YourOssBucket>/path-to-xgboost-model"
    ).deploy(
        instance_type="ecs.c6.xlarge",
        # Optional. By default, JsonSerializer is used to process the input and output of the XGBoost processor. 
        serializer=JsonSerializer()
    )
    
    # You can also specify a serializer when you create a Predictor object. 
    p = Predictor(
        service_name="example_xgb_service",
        serializer=JsonSerializer(),
    )
    
    # The returned result is a list. 
    res = p.predict([[2,3,4], [4,5,6]])
  • TensorFlowSerializer

    You can use the built-in TensorFlow processor to deploy TensorFlow models in the SavedModel format in PAI. The input and output of the TensorFlow services are protocol buffers messages. For information about the data format, see tf_predict.proto.

    The PAI SDK for Python provides a built-in TensorFlowSerializer, which allows you to send an inference request as a NumPy array. The serializer serializes NumPy arrays into protocol buffers messages and deserializes the returned protocol buffers messages into NumPy arrays.

    import numpy
    
    # Deploy a model service by using the TensorFlow processor. 
    tf_predictor = Model(
        inference_spec=InferenceSpec(processor="tensorflow_cpu_2.7"),
        model_data="oss://<YourOssBucket>/path-to-tensorflow-saved-model"
    ).deploy(
        instance_type="ecs.c6.xlarge",
        # Optional. By default, TensorFlowSerializer is used to process the input and output of the TensorFlow processor. 
        # serializer=TensorFlowSerializer(),
    )
    
    # You can obtain the service signature by calling an API operation. 
    print(tf_predictor.inspect_signature_def())
    
    # The input of the TensorFlow processor is of the dictionary type. The dictionary key is the name of the input signature. The dictionary value is the specific input data. 
    tf_result = tf_predictor.predict(data={
        "flatten_input": numpy.zeros(28*28*2).reshape((-1, 28, 28))
    })
    
    assert tf_result["dense_1"].shape == (2, 10)
  • PyTorchSerializer

    You can use the built-in PyTorch processor to deploy PyTorch models in the TorchScript format in PAI. The input and output of the PyTorch services are protocol buffers messages. For information about the data format, see tf_predict.proto.

    The PAI SDK for Python provides a built-in PyTorchSerializer, which allows you to send an inference request as a NumPy array. The serializer serializes NumPy arrays into protocol buffers messages and deserializes the returned protocol buffers messages into NumPy arrays.

    import numpy
    
    # Deploy a model service by using the PyTorch processor. 
    torch_predictor = Model(
        inference_spec=InferenceSpec(processor="pytorch_cpu_1.10"),
        model_data="oss://<YourOssBucket>/path-to-torch_script-model"
    ).deploy(
        instance_type="ecs.c6.xlarge",
        # Optional. By default, PyTorchSerializer is used to process the input and output of the PyTorch processor. 
        # serializer=PyTorchSerializer(),
    )
    
    # 1. Convert the input data into a format that is supported by the model service. 
    # 2. For multiple inputs, use a list or tuple in which each element is a NumPy array. 
    torch_result = torch_predictor.predict(data=numpy.zeros(28 * 28 * 2).reshape((-1, 28, 28)))
    assert torch_result.shape == (2, 10)
  • Custom serializer

    You can use the pai.serializers.SerializerBase class to create a custom serializer based on the supported data formats of the inference service.

    In this section, a custom NumpySerializer is used as an example to show how a serializer performs serialization and deserialization. A sketch of the implementation is provided after the following steps.

    1. Client: The NumpySerializer.serialize method is called to serialize the NumPy array or pandas DataFrame input into the .npy format. The converted data is sent to the server.

    2. Server: The inference service deserializes the received data in the .npy format, generates the inference result, and then serializes the result into the .npy format. The result is returned to the client after serialization.

    3. Client: The NumpySerializer.deserialize method is called to deserialize the returned result into a NumPy array.
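
    The following is a minimal sketch of such a NumpySerializer. It assumes that pai.serializers.SerializerBase only requires the serialize and deserialize methods described above; check the base class in your SDK version before you use the sketch.

    import io
    
    import numpy as np
    import pandas as pd
    from pai.serializers import SerializerBase
    
    
    class NumpySerializer(SerializerBase):
        def serialize(self, data) -> bytes:
            # Serialize a NumPy array or pandas DataFrame into the .npy format. 
            if isinstance(data, pd.DataFrame):
                data = data.to_numpy()
            buffer = io.BytesIO()
            np.save(buffer, np.asarray(data))
            return buffer.getvalue()
    
        def deserialize(self, data: bytes):
            # Deserialize the returned .npy data into a NumPy array. 
            return np.load(io.BytesIO(data))

    To use the custom serializer, pass it in the serializer parameter of the deploy method or the Predictor object, in the same way as the built-in serializers in the preceding examples. The inference service itself must accept and return data in the .npy format for this serializer to work.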

Deploy an inference service in an on-premises environment

The PAI SDK for Python also allows you to deploy an inference service in an on-premises environment by using a custom image. To run an inference service in an on-premises environment, set the instance_type parameter to local in the model.deploy method. The SDK uses a Docker container to run an inference service on your on-premises machine. The model is automatically downloaded from the OSS bucket and mounted to the container that runs on your on-premises machine.

from pai.predictor import LocalPredictor
from pai.serializers import JsonSerializer

p: LocalPredictor = model.deploy(
    # Specify to deploy the inference service in an on-premises environment. 
    instance_type="local",
    serializer=JsonSerializer()
)

p.predict(data)

# Delete the Docker container. 
p.delete_service()

References

For information about how to use the PAI SDK for Python to train and deploy a PyTorch model, see Train and deploy a PyTorch model.