To ensure that a model meets the deployment standards before you deploy it in a production environment, you can use the model analysis and optimization commands supported by the cloud-native AI suite to benchmark, analyze, and optimize the model. In this topic, a ResNet18 model provided by PyTorch is used as an example and V100 GPUs are used to accelerate the model.
Prerequisites
A Container Service for Kubernetes (ACK) Pro cluster is created and the Kubernetes version of the cluster is 1.20 or later. The cluster contains at least one GPU-accelerated node. For more information about how to update an ACK cluster, see Update an ACK cluster.
An Object Storage Service (OSS) bucket is created. A persistent volume (PV) and a persistent volume claim (PVC) are created. For more information, see Mount a statically provisioned OSS volume.
The latest version of the Arena client is installed. For more information, see Configure the Arena client.
Background information
Data scientists focus on the precision of models, whereas R&D engineers are more concerned about the performance of models. As a result, a model may not meet the performance requirements after you release it as an online service. To prevent this issue, you may need to benchmark a model before you release it. If the model does not meet the performance requirements, you can identify the performance bottlenecks and optimize the model.
Introduction to the model analysis and optimization commands
The cloud-native AI suite supports multiple model analysis and optimization commands. You can run the commands to benchmark models, analyze the network structure, check the duration of each operator, and view the GPU utilization. Then, you can identify the performance bottlenecks of a model and use TensorRT to optimize the model. This helps you release models that meet the performance requirements of a production environment. The following figure shows the model lifecycle assisted by the model analysis and optimization commands.
Model Training: The model is trained based on a given dataset.
Model Benchmark: A benchmark is performed on the model to check whether the latency, throughput, and GPU utilization of the model meet the requirements.
Model Profile: The model is analyzed to identify performance bottlenecks.
Model Optimize: The GPU inference capability of the model is optimized by using tools such as TensorRT.
Model Serving: The model is deployed as an online service.
If the model still does not meet the performance requirements after you optimize the model, you can repeat the preceding phases.
How to run the commands
You can use Arena to submit model analysis, optimization, benchmark, and evaluation jobs to ACK Pro clusters. You can run the arena model analyze --help command to view the help information.
$ arena model analyze --help
submit a model analyze job.

Available Commands:
  profile      Submit a model profile job.
  evaluate     Submit a model evaluate job.
  optimize     Submit a model optimize job.
  benchmark    Submit a model benchmark job

Usage:
  arena model analyze [flags]
  arena model analyze [command]

Available Commands:
  benchmark    Submit a model benchmark job
  delete       Delete a model job
  evaluate     Submit a model evaluate job
  get          Get a model job
  list         List all the model jobs
  optimize     Submit a model optimize job, this is a experimental feature
  profile      Submit a model profile job
Step 1: Prepare a model
We recommend that you use TorchScript to deploy PyTorch models. In this topic, a ResNet18 model provided by PyTorch is used as an example.
Convert the model. The following code converts the ResNet18 model to a TorchScript model and saves it:
import torch
import torchvision

model = torchvision.models.resnet18(pretrained=True)

# Switch the model to eval mode.
model.eval()

# An example input that you would normally provide to the forward() method of your model.
dummy_input = torch.rand(1, 3, 224, 224)

# Use torch.jit.trace to generate a torch.jit.ScriptModule via tracing.
traced_script_module = torch.jit.trace(model, dummy_input)

# Save the TorchScript model.
traced_script_module.save("resnet18.pt")
The following table describes the parameters in the model configuration file.
Parameter         Description
model_name        The name of the model.
model_platform    The platform or framework used by the model, such as TorchScript and ONNX.
model_path        The path in which the model is stored.
inputs            The input parameters.
outputs           The output parameters.
After the model is converted, upload the model file resnet18.pt to OSS. The OSS path of the model file is oss://bucketname/models/resnet18/resnet18.pt. For more information, see Upload objects.
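You can also run a quick local check to confirm that the traced model loads and produces the expected output shape before you rely on it in later steps. The following is a minimal sketch; the file name resnet18.pt and the input shape [1, 3, 224, 224] come from the conversion code above.
import torch

# Load the TorchScript model that was saved by the conversion script.
loaded_model = torch.jit.load("resnet18.pt")
loaded_model.eval()

# Run a dummy input with the same shape as the one used for tracing.
dummy_input = torch.rand(1, 3, 224, 224)
with torch.no_grad():
    output = loaded_model(dummy_input)

# The ResNet18 classifier returns 1000 class scores per image.
print(output.shape)  # Expected: torch.Size([1, 1000])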
Step 2: Perform a benchmark
Before you deploy a model in a production environment, you can perform a benchmark to evaluate the performance of the model. In this step, a benchmark job is submitted by using Arena, and a PVC named oss-pvc in the default namespace of the cluster is used as an example. For more information, see Mount a statically provisioned OSS volume.
Prepare and upload the configuration file of the model.
Create a configuration file for the model. In this example, the configuration file is named config.json.
{
  "model_name": "resnet18",
  "model_platform": "torchscript",
  "model_path": "/data/models/resnet18/resnet18.pt",
  "inputs": [
    {
      "name": "input",
      "data_type": "float32",
      "shape": [1, 3, 224, 224]
    }
  ],
  "outputs": [
    {
      "name": "output",
      "data_type": "float32",
      "shape": [1000]
    }
  ]
}
Upload the configuration file to OSS. The OSS path of the configuration file is oss://bucketname/models/resnet18/config.json.
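If you prefer to upload the files programmatically instead of using the OSS console or ossutil, you can use the OSS Python SDK (oss2). The following is a minimal sketch: the access key values and the endpoint are placeholders that you must replace, and the bucket name and object keys follow the paths used in this topic.
import oss2

# Placeholder credentials and endpoint; replace them with your own values.
auth = oss2.Auth("<your-access-key-id>", "<your-access-key-secret>")
bucket = oss2.Bucket(auth, "https://oss-cn-beijing.aliyuncs.com", "bucketname")

# Upload the TorchScript model and its configuration file to the paths
# referenced by the benchmark job (oss://bucketname/models/resnet18/...).
bucket.put_object_from_file("models/resnet18/resnet18.pt", "resnet18.pt")
bucket.put_object_from_file("models/resnet18/config.json", "config.json")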
Run the following command to submit a benchmark job to the ACK Pro cluster:
arena model analyze benchmark \
    --name=resnet18-benchmark \
    --namespace=default \
    --image=registry.cn-beijing.aliyuncs.com/kube-ai/easy-inference:1.0.2 \
    --gpus=1 \
    --data=oss-pvc:/data \
    --model-config-file=/data/models/resnet18/config.json \
    --report-path=/data/models/resnet18 \
    --concurrency=5 \
    --duration=60
Parameter              Description
--gpus                 The number of GPUs that are used.
--data                 The PVC for the cluster and the path to which the PVC is mounted.
--model-config-file    The path of the configuration file.
--report-path          The path in which the benchmark report is stored.
--concurrency          The number of concurrent requests.
--duration             The duration of the benchmark job. Unit: seconds.
Important: You cannot specify the --requests and --duration parameters at the same time. Specify only one of them when you submit a benchmark job. If you specify both of them, the system uses the --duration parameter by default. To specify the total number of requests sent by the benchmark job, specify the --requests parameter.
Run the following command to query the status of the job:
arena model analyze list -A
Expected output:
NAMESPACE   NAME                 STATUS     TYPE        DURATION   AGE   GPU(Requested)
default     resnet18-benchmark   COMPLETE   Benchmark   0s         2d    1
View the benchmark report. If STATUS displays COMPLETE, the benchmark job is completed. Then, you can find a benchmark report named benchmark_result.txt in the path specified by the --report-path parameter.
Expected output:
{ "p90_latency":7.511, "p95_latency":7.86, "p99_latency":9.34, "min_latency":7.019, "max_latency":12.269, "mean_latency":7.312, "median_latency":7.206, "throughput":136, "gpu_mem_used":1.47, "gpu_utilization":21.280 }
The following table describes the metrics that are included in a benchmark report.
Metric            Description                      Unit
p90_latency       90th percentile response time    Milliseconds
p95_latency       95th percentile response time    Milliseconds
p99_latency       99th percentile response time    Milliseconds
min_latency       Fastest response time            Milliseconds
max_latency       Slowest response time            Milliseconds
mean_latency      Average response time            Milliseconds
median_latency    Median response time             Milliseconds
throughput        Throughput                       Times
gpu_mem_used      GPU memory usage                 GB
gpu_utilization   GPU utilization                  Percentage
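If you want to gate a release on these metrics, you can parse the benchmark report and compare it against your own thresholds. The following is a minimal sketch, assuming the report was downloaded locally as benchmark_result.txt; the threshold values are placeholders, not recommendations.
import json

# Assumed local copy of the report generated by the benchmark job.
with open("benchmark_result.txt") as f:
    report = json.load(f)

# Placeholder service-level targets; adjust them to your own requirements.
targets = {"p99_latency": 10.0, "throughput": 100, "gpu_utilization": 80.0}

passed = (
    report["p99_latency"] <= targets["p99_latency"]
    and report["throughput"] >= targets["throughput"]
    and report["gpu_utilization"] <= targets["gpu_utilization"]
)
print("benchmark passed" if passed else "benchmark failed", report)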
Step 3: Analyze the model
After you perform a benchmark, you can run the arena model analyze profile command to analyze the model and identify performance bottlenecks.
Run the following command to submit a model analysis job to the ACK Pro cluster:
arena model analyze profile \
    --name=resnet18-profile \
    --namespace=default \
    --image=registry.cn-beijing.aliyuncs.com/kube-ai/easy-inference:1.0.2 \
    --gpus=1 \
    --data=oss-pvc:/data \
    --model-config-file=/data/models/resnet18/config.json \
    --report-path=/data/models/resnet18/log/ \
    --tensorboard \
    --tensorboard-image=registry.cn-beijing.aliyuncs.com/kube-ai/easy-inference:1.0.2
Parameter              Description
--gpus                 The number of GPUs that are used.
--data                 The PVC for the cluster and the path to which the PVC is mounted.
--model-config-file    The path of the configuration file.
--report-path          The path in which the analysis report is stored.
--tensorboard          Specifies whether to view the analysis report in TensorBoard.
--tensorboard-image    The URL of the image that is used to deploy TensorBoard.
Run the following command to query the status of the job:
arena model analyze list -A
Expected output:
NAMESPACE   NAME               STATUS     TYPE      DURATION   AGE   GPU(Requested)
default     resnet18-profile   COMPLETE   Profile   13s        2d    1
Run the following command to query the status of TensorBoard:
kubectl get service -n default
Expected output:
NAME                           TYPE       CLUSTER-IP       EXTERNAL-IP   PORT(S)          AGE
resnet18-profile-tensorboard   NodePort   172.16.158.170   <none>        6006:30582/TCP   2d20h
Run the following command to enable port forwarding and access TensorBoard:
kubectl port-forward svc/resnet18-profile-tensorboard -n default 6006:6006
Expected output:
Forwarding from 127.0.X.X:6006 -> 6006
Forwarding from [::1]:6006 -> 6006
Enter http://localhost:6006 into the address bar of your browser to view the analysis results. In the left-side navigation pane, click Views to view the analysis results based on multiple dimensions and identify performance bottlenecks. You can optimize the model based on the analysis results.
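The profile job produces operator-level traces that TensorBoard visualizes. If you want to generate a similar trace locally before submitting a job, the PyTorch profiler can write TensorBoard-compatible traces. The following is a minimal local sketch, not part of the Arena workflow; the log directory name profiler_log is an assumption.
import torch
import torchvision
from torch.profiler import profile, ProfilerActivity, tensorboard_trace_handler

model = torchvision.models.resnet18(pretrained=True).eval()
dummy_input = torch.rand(1, 3, 224, 224)

# Record CPU (and GPU, if available) operator timings and export a
# TensorBoard-compatible trace to ./profiler_log.
activities = [ProfilerActivity.CPU]
if torch.cuda.is_available():
    model = model.cuda()
    dummy_input = dummy_input.cuda()
    activities.append(ProfilerActivity.CUDA)

with profile(activities=activities,
             on_trace_ready=tensorboard_trace_handler("./profiler_log")) as prof:
    with torch.no_grad():
        model(dummy_input)

# Print the operators sorted by time to find potential bottlenecks.
print(prof.key_averages().table(sort_by="self_cpu_time_total", row_limit=10))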
Step 4: Optimize the model
You can use Arena to optimize a model.
Run the following command to submit a model optimization job to the ACK Pro cluster:
arena model analyze optimize \
    --name=resnet18-optimize \
    --namespace=default \
    --image=registry.cn-beijing.aliyuncs.com/kube-ai/easy-inference:1.0.2 \
    --gpus=1 \
    --data=oss-pvc:/data \
    --optimizer=tensorrt \
    --model-config-file=/data/models/resnet18/config.json \
    --export-path=/data/models/resnet18
Parameter              Description
--gpus                 The number of GPUs that are used.
--data                 The PVC for the cluster and the path to which the PVC is mounted.
--optimizer            The optimization method. Valid values: tensorrt (default) and aiacc-torch.
--model-config-file    The path of the configuration file.
--export-path          The path in which the optimized model is stored.
Run the following command to query the status of the job:
arena model analyze list -A
Expected output:
NAMESPACE   NAME                STATUS     TYPE       DURATION   AGE   GPU(Requested)
default     resnet18-optimize   COMPLETE   Optimize   16s        2d    1
View the optimized model. If STATUS displays COMPLETE, the optimization job is completed. Then, you can find the optimized model file named opt_resnet18.pt in the path specified by the --export-path parameter.
Change the value of the model_path parameter in the model configuration file to the path of the optimized model that you obtained in the preceding step, and perform a benchmark again. For more information about how to perform a benchmark, see Step 2: Perform a benchmark.
The following table describes the metric values before and after the model is optimized.
Metric            Before optimization    After optimization
p90_latency       7.511 milliseconds     5.162 milliseconds
p95_latency       7.86 milliseconds      5.428 milliseconds
p99_latency       9.34 milliseconds      6.64 milliseconds
min_latency       7.019 milliseconds     4.827 milliseconds
max_latency       12.269 milliseconds    8.426 milliseconds
mean_latency      7.312 milliseconds     5.046 milliseconds
median_latency    7.206 milliseconds     4.972 milliseconds
throughput        136 times              198 times
gpu_mem_used      1.47 GB                1.6 GB
gpu_utilization   21.280%                10.912%
The statistics show that the latency and throughput of the model are greatly improved and the GPU utilization is reduced after optimization. If the model still does not meet the performance requirements, you can repeat the preceding steps to analyze and optimize the model.
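To benchmark the optimized model, you can point the model configuration file at the optimized artifact before you resubmit the benchmark job. The following is a minimal sketch, assuming a local copy of config.json; the new model_path follows the --export-path used above. Upload the updated file to OSS afterward.
import json

# Load the existing model configuration file.
with open("config.json") as f:
    config = json.load(f)

# Point model_path at the optimized model produced by the optimize job
# (stored under the --export-path directory).
config["model_path"] = "/data/models/resnet18/opt_resnet18.pt"

with open("config.json", "w") as f:
    json.dump(config, f, indent=2)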
Step 5: Deploy the model
If the model meets the performance requirements, you can deploy the model as an online service. Arena allows you to use NVIDIA Triton Inference Server to deploy TorchScript models. For more information, see Nvidia Triton Server.
Create a configuration file named config.pbtxt.
Important: Do not change the file name.
name: "resnet18" platform: "pytorch_libtorch" max_batch_size: 1 default_model_filename: "opt_resnet18.pt" input [ { name: "input__0" format: FORMAT_NCHW data_type: TYPE_FP32 dims: [ 3, 224, 224 ] } ] output [ { name: "output__0", data_type: TYPE_FP32, dims: [ 1000 ] } ]
Note: For more information about the parameters in the configuration file, see Model Repository.
Create the following directory structure in OSS:
oss://bucketname/triton/model-repository/
    resnet18/
        config.pbtxt
        1/
            opt_resnet18.pt
Note: 1/ is a convention of NVIDIA Triton Inference Server. The value indicates the version number of the model. A model repository can store different versions of a model. For more information, see Model Repository.
Use Arena to deploy the model. You can deploy a model in GPU sharing mode or GPU exclusive mode.
GPU exclusive mode: You can use this mode to deploy inference services that require high stability. In this mode, each GPU accelerates only one model. Models do not compete for GPU resources. You can run the following command to deploy a model in GPU exclusive mode:
arena serve triton \
    --name=resnet18-serving \
    --gpus=1 \
    --replicas=1 \
    --image=nvcr.io/nvidia/tritonserver:21.05-py3 \
    --data=oss-pvc:/data \
    --model-repository=/data/triton/model-repository \
    --allow-metrics=true
GPU sharing mode: You can use this mode to deploy long-tail inference services or inference services that require cost-efficiency. In this mode, a GPU is shared by multiple models. Each model is allowed to use only a specified amount of GPU memory. You can run the following command to deploy a model in GPU sharing mode:
If you deploy models in GPU sharing mode, you must set the --gpumemory parameter. This parameter specifies the amount of GPU memory that is allocated to each pod. You can specify a proper value based on the gpu_mem_used metric in the benchmark result. For example, if the value of the gpu_mem_used metric is 1.6 GB, you can set the --gpumemory parameter to 2 GB. The value of this parameter must be a positive integer.
arena serve triton \
    --name=resnet18 \
    --gpumemory=2 \
    --replicas=1 \
    --image=nvcr.io/nvidia/tritonserver:21.12-py3 \
    --data=oss-pvc:/data \
    --model-repository=/data/triton/model-repository \
    --allow-metrics=true
Run the following command to query the status of the deployment:
arena serve list -A
Expected output:
NAMESPACE   NAME               TYPE     VERSION        DESIRED   AVAILABLE   ADDRESS          PORTS                    GPU
default     resnet18-serving   Triton   202202141817   1         1           172.16.147.248   RESTFUL:8000,GRPC:8001   1
If the value of the AVAILABLE parameter equals the value of the DESIRED parameter, the model is deployed.
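After the service is available, you can send a test request to verify the deployment. The following is a minimal sketch that uses the Triton HTTP client from the tritonclient Python package. The endpoint address is an assumption based on the ADDRESS and RESTFUL port shown above, and the tensor names follow the config.pbtxt in this topic; you may need to expose the service (for example, with kubectl port-forward) before the address is reachable from your client.
import numpy as np
import tritonclient.http as httpclient

# Assumed endpoint; replace with the RESTFUL address of your service.
client = httpclient.InferenceServerClient(url="172.16.147.248:8000")

# Build a request that matches config.pbtxt: one 3x224x224 FP32 image.
image = np.random.rand(1, 3, 224, 224).astype(np.float32)
infer_input = httpclient.InferInput("input__0", list(image.shape), "FP32")
infer_input.set_data_from_numpy(image)

# Send the request to the resnet18 model and read the 1000 class scores.
response = client.infer(model_name="resnet18", inputs=[infer_input])
scores = response.as_numpy("output__0")
print(scores.shape)  # Expected: (1, 1000)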