Container Service for Kubernetes: Deploy a PyTorch model as an inference service

Last Updated: Jul 03, 2024

PyTorch is a deep learning framework that can be used to train models. This topic describes how to use the NVIDIA Triton inference server or TorchServe to deploy a PyTorch model as an inference service.

Prerequisites

Step 1: Deploy a PyTorch model

(Recommended) Use the NVIDIA Triton inference server to deploy the model

A BERT model trained by using PyTorch 1.6 is used in this example. Convert the model to TorchScript, save it to the triton directory of a persistent volume (PV), and then use the NVIDIA Triton inference server to deploy the model.
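
A minimal sketch of the TorchScript conversion is shown below. The model object (bert_model), the dummy vocabulary range, and the sequence length of 128 are assumptions chosen to match the input_ids shape used later in this topic; replace them with the values from your own training job.

    # Minimal sketch: export a trained PyTorch model to TorchScript for Triton.
    # "bert_model" stands for the torch.nn.Module produced by the training job.
    import torch

    def export_torchscript(bert_model: torch.nn.Module, output_path: str = "model.pt") -> None:
        bert_model.eval()                                        # disable dropout and other training-only behavior
        dummy_input_ids = torch.randint(0, 1000, (1, 128), dtype=torch.int64)
        traced = torch.jit.trace(bert_model, dummy_input_ids)    # record the forward pass as TorchScript
        traced.save(output_path)                                 # Triton loads this file as <model>/<version>/model.pt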

The following model directory structure is required by Triton:

└── chnsenticorp # The name of the model. 
    ├── 1623831335 # The version of the model. 
    │   └── model.pt # The TorchScript model file. 
    └── config.pbtxt # The Triton configuration.

  1. Run a standalone PyTorch training job and convert the model to TorchScript. For more information, see Use Arena to submit standalone PyTorch training jobs.

  2. Run the following command to query the GPU resources available in the cluster:

    arena top node

    Expected output:

    NAME                      IPADDRESS      ROLE    STATUS  GPU(Total)  GPU(Allocated)
    cn-beijing.192.168.0.100  192.168.0.100  <none>  Ready   1           0
    cn-beijing.192.168.0.101  192.168.0.101  <none>  Ready   1           0
    cn-beijing.192.168.0.99   192.168.0.99   <none>  Ready   1           0
    ---------------------------------------------------------------------------------------------------
    Allocated/Total GPUs of nodes which own resource nvidia.com/gpu In Cluster:
    0/3 (0.0%)

    The preceding output shows that the cluster has three GPU-accelerated nodes on which you can deploy the model.

  3. Upload the model to a bucket in Object Storage Service (OSS).

    Important

    This example shows how to upload the model to OSS from a Linux system. If you use other operating systems, see ossutil.

    1. Install ossutil.

    2. Create a bucket named examplebucket.

      • Run the following command to create a bucket named examplebucket:

        ossutil64 mb oss://examplebucket
      • If the following output is displayed, the bucket named examplebucket is created:

        0.668238(s) elapsed
    3. Upload the triton directory that contains the model to the examplebucket bucket.

      ossutil64 cp -r triton oss://examplebucket/triton
  4. Create a PV and a persistent volume claim (PVC).

    1. Create a file named PyTorch.yaml and add the following content to the file:

      apiVersion: v1
      kind: PersistentVolume
      metadata:
        name: model-csi-pv
      spec:
        capacity:
          storage: 5Gi
        accessModes:
          - ReadWriteMany
        persistentVolumeReclaimPolicy: Retain
        csi:
          driver: ossplugin.csi.alibabacloud.com
          volumeHandle: model-csi-pv   # The value must be the same as the name of the PV. 
          volumeAttributes:
            bucket: "Your Bucket"
            url: "Your oss url"
            akId: "Your Access Key Id"
            akSecret: "Your Access Key Secret"
            otherOpts: "-o max_stat_cache_size=0 -o allow_other"
      ---
      apiVersion: v1
      kind: PersistentVolumeClaim
      metadata:
        name: model-pvc
      spec:
        accessModes:
        - ReadWriteMany
        resources:
          requests:
            storage: 5Gi

      The following list describes the parameters in the preceding template:

      • bucket: the name of the Object Storage Service (OSS) bucket, which is globally unique in OSS. For more information, see Bucket naming conventions.

      • url: the URL that is used to access an object in the bucket. For more information, see Obtain the URLs of multiple objects.

      • akId and akSecret: the AccessKey ID and AccessKey secret that are used to access the OSS bucket. We recommend that you access the OSS bucket as a Resource Access Management (RAM) user. For more information, see Create an AccessKey pair.

      • otherOpts: custom parameters for mounting the OSS bucket.

        • Set -o max_stat_cache_size=0 to disable metadata caching. If this feature is disabled, the system retrieves the latest metadata from OSS each time it attempts to access objects in OSS.

        • Set -o allow_other to allow other users to access the OSS bucket that you mounted.

        For more information about other parameters, see Custom parameters supported by ossfs.

    2. Run the following command to create a PV and a PVC:

      kubectl apply -f PyTorch.yaml
  5. Run the following command to deploy the model by using the NVIDIA Triton inference server:

    arena serve triton \
     --name=bert-triton \
     --namespace=inference \
     --gpus=1 \
     --replicas=1 \
     --image=nvcr.io/nvidia/tritonserver:20.12-py3 \
     --data=model-pvc:/models \
     --model-repository=/models/triton

    Expected output:

    configmap/bert-triton-202106251740-triton-serving created
    configmap/bert-triton-202106251740-triton-serving labeled
    service/bert-triton-202106251740-tritoninferenceserver created
    deployment.apps/bert-triton-202106251740-tritoninferenceserver created
    INFO[0001] The Job bert-triton has been submitted successfully
    INFO[0001] You can run `arena get bert-triton --type triton-serving` to check the job status

Use TorchServe to deploy the model

  1. Use torch-model-archiver to package the PyTorch model into a .mar file. For more information, see torch-model-archiver. An example invocation is shown after this procedure.

  2. Upload the model to a bucket in Object Storage Service (OSS).

    Important

    This example shows how to upload the model to OSS from a Linux system. If you use other operating systems, see ossutil.

    1. Install ossutil.

    2. Create a bucket named examplebucket.

      • Run the following command to create a bucket named examplebucket:

        ossutil64 mb oss://examplebucket
      • If the following output is displayed, the bucket named examplebucket is created:

        0.668238(s) elapsed
    3. Upload the .mar model file that you generated to the models directory of the examplebucket bucket.

      ossutil64 cp <model name>.mar oss://examplebucket/models/
  3. Create a persistent volume (PV) and a persistent volume claim (PVC).

    1. Create a file named PyTorch.yaml based on the following template:

      apiVersion: v1
      kind: PersistentVolume
      metadata:
        name: model-csi-pv
      spec:
        capacity:
          storage: 5Gi
        accessModes:
          - ReadWriteMany
        persistentVolumeReclaimPolicy: Retain
        csi:
          driver: ossplugin.csi.alibabacloud.com
          volumeHandle: model-csi-pv   # The value must be the same as the name of the PV. 
          volumeAttributes:
            bucket: "Your Bucket"
            url: "Your oss url"
            akId: "Your Access Key Id"
            akSecret: "Your Access Key Secret"
            otherOpts: "-o max_stat_cache_size=0 -o allow_other"
      ---
      apiVersion: v1
      kind: PersistentVolumeClaim
      metadata:
        name: model-pvc
      spec:
        accessModes:
        - ReadWriteMany
        resources:
          requests:
            storage: 5Gi

      The following list describes the parameters in the preceding template:

      • bucket: the name of the OSS bucket, which is globally unique in OSS. For more information, see Bucket naming conventions.

      • url: the URL that is used to access an object in the bucket. For more information, see Obtain the URL of a single object or the URLs of multiple objects.

      • akId and akSecret: the AccessKey ID and AccessKey secret that are used to access the OSS bucket. We recommend that you access the OSS bucket as a Resource Access Management (RAM) user. For more information, see Create an AccessKey pair.

      • otherOpts: custom parameters for mounting the OSS bucket.

        • Set -o max_stat_cache_size=0 to disable metadata caching. If this feature is disabled, the system retrieves the latest metadata from OSS each time it attempts to access objects in OSS.

        • Set -o allow_other to allow other users to access the OSS bucket that you mounted.

        For more information about other parameters, see Custom parameters supported by ossfs.

    2. Run the following command to create a PV and a PVC:

      kubectl apply -f PyTorch.yaml
  4. Run the following command to deploy the PyTorch model:

    arena serve custom \
      --name=torchserve-demo \
      --gpus=1 \
      --replicas=1 \
      --image=pytorch/torchserve:0.4.2-gpu \
      --port=8000 \
      --restful-port=8001 \
      --metrics-port=8002 \
      --data=model-pvc:/data \
      'torchserve --start --model-store /data/models --ts-config /data/config/ts.properties'
    Note
    • You can specify an official image or a custom TorchServe image.

    • You must set the --model-store field to the path where the PyTorch model is stored.

    Expected output:

    service/torchserve-demo-202109101624 created
    deployment.apps/torchserve-demo-202109101624-custom-serving created
    INFO[0001] The Job torchserve-demo has been submitted successfully
    INFO[0001] You can run `arena get torchserve-demo --type custom-serving` to check the job status
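
For reference, the torch-model-archiver invocation in step 1 can look like the following sketch. The model name bert, the built-in text_classifier handler, and the models output directory are example values, not values required by TorchServe; the generated .mar file is what you upload to the models directory of the OSS bucket.

    # Hypothetical packaging step: wrap the TorchScript file exported earlier into a
    # .mar archive by invoking the torch-model-archiver CLI. All names are examples.
    import subprocess

    subprocess.run(
        [
            "torch-model-archiver",
            "--model-name", "bert",            # name used in the /predictions/<model name> path
            "--version", "1.0",
            "--serialized-file", "model.pt",   # TorchScript model exported earlier
            "--handler", "text_classifier",    # one of the TorchServe built-in handlers
            "--export-path", "models",         # directory that is uploaded to OSS and mounted at /data/models
        ],
        check=True,
    )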
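
After the service is reachable (for example, through an Ingress or kubectl port-forward), you can send a request to TorchServe's standard inference endpoint, /predictions/<model name>. The following is a minimal sketch; the address is a placeholder and the model name bert matches the packaging example above.

    # Minimal sketch of a TorchServe inference call. Replace the address and the
    # model name with the values of your own deployment.
    import requests

    TORCHSERVE_ADDRESS = "http://<TorchServe address>"   # placeholder, not a real endpoint

    response = requests.post(
        f"{TORCHSERVE_ADDRESS}/predictions/bert",
        data="A sample sentence to classify".encode("utf-8"),
    )
    print(response.status_code, response.text)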

Step 2: Verify the deployment of the inference service

  1. Run the following command to check the deployment progress of the model:

    arena serve list -n inference

    Expected output:

    NAME            TYPE        VERSION       DESIRED  AVAILABLE  ADDRESS        PORTS
    bert-triton     Triton      202106251740  1        1          172.16.70.14   RESTFUL:8000,GRPC:8001
  2. Run the following command to query the details about the inference service:

    arena serve get bert-triton -n inference

    Expected output:

    Name:       bert-triton
    Namespace:  inference
    Type:       Triton
    Version:    202106251740
    Desired:    1
    Available:  1
    Age:        5m
    Address:    172.16.70.14
    Port:       RESTFUL:8000,GRPC:8001
    
    
    Instances:
      NAME                                                             STATUS   AGE  READY  RESTARTS  NODE
      ----                                                             ------   ---  -----  --------  ----
      bert-triton-202106251740-tritoninferenceserver-667cf4c74c-s6nst  Running  5m   1/1    0         cn-beijing.192.168.0.89

    The output indicates that the model is deployed by using the NVIDIA Triton inference server, which exposes port 8000 for the RESTful API and port 8001 for the gRPC API.

  3. By default, an inference service deployed by using the NVIDIA Triton inference server exposes only a cluster IP address, which is accessible only from within the cluster. To access the service over the Internet, configure an Internet-facing Ingress.

    1. On the Clusters page, click the name of the cluster that you want to manage and choose Network > Ingresses in the left-side navigation pane.

    2. In the top navigation bar, select the inference namespace that you specified when you deployed the model from the Namespace drop-down list.

    3. Click Create Ingress in the upper-right part of the page.

      • Set Name to the service name bert-triton specified in Step 1.

      • Set Port to 8000 (the RESTful port).

      • Configure other parameters based on your business requirements. For more information, see Create an NGINX Ingress.

  4. After you create the Ingress, go to the Ingresses page and find the Ingress. The value in the Rules column contains the address of the Ingress.

  5. Run the following command to query the model metadata from the inference service by using the address of the Ingress. The NVIDIA Triton inference server complies with the interface specifications of KFServing. For more information, see NVIDIA Triton Server API.

    curl "http://<Ingress address>/v2/models/chnsenticorp"

    Expected output:

    {
        "name":"chnsenticorp",
        "versions":[
            "1623831335"
        ],
        "platform":"tensorflow_savedmodel",
        "inputs":[
            {
                "name":"input_ids",
                "datatype":"INT64",
                "shape":[
                    -1,
                    128
                ]
            }
        ],
        "outputs":[
            {
                "name":"probabilities",
                "datatype":"FP32",
                "shape":[
                    -1,
                    2
                ]
            }
        ]
    }
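
    The metadata above shows one INT64 input_ids tensor of shape [-1, 128] and one FP32 probabilities tensor of shape [-1, 2]. Based on that, an inference request can be sent to the KFServing-compatible /v2/models/<model name>/infer endpoint, as in the following minimal sketch; the Ingress address is a placeholder and the token IDs are dummy values that only illustrate the request and response shapes.

    # Minimal sketch of a Triton inference call that matches the model metadata above.
    import requests

    TRITON_ADDRESS = "http://<Ingress address>"   # placeholder, not a real endpoint

    payload = {
        "inputs": [
            {
                "name": "input_ids",
                "shape": [1, 128],
                "datatype": "INT64",
                "data": [0] * 128,                # dummy token IDs; use real tokenizer output instead
            }
        ],
        "outputs": [{"name": "probabilities"}],
    }

    response = requests.post(f"{TRITON_ADDRESS}/v2/models/chnsenticorp/infer", json=payload)
    print(response.json())                        # the "outputs" entry contains the probabilities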