To ensure that a model meets the deployment standards before you deploy it in a production environment, you can use the model analysis and optimization commands supported by the cloud-native AI suite to benchmark, analyze, and optimize the model. In this topic, a ResNet18 model provided by PyTorch is used as an example and V100 GPUs are used to accelerate the model.
Prerequisites
A Container Service for Kubernetes (ACK) Pro cluster is created and the Kubernetes version of the cluster is 1.20 or later. The cluster contains at least one GPU-accelerated node. For more information about how to update an ACK cluster, see Update an ACK cluster.
An Object Storage Service (OSS) bucket is created. A persistent volume (PV) and a persistent volume claim (PVC) are created. For more information, see Mount a statically provisioned OSS volume.
The latest version of the Arena client is installed. For more information, see Configure the Arena client.
Background information
Data scientists focus on the precision of models, whereas R&D engineers are more concerned about the performance of models. As a result, a model may not meet the performance requirements after you release it as an online service. To prevent this issue, you may need to benchmark a model before you release it. If the model does not meet the performance requirements, you can identify the performance bottlenecks and optimize the model.
Introduction to the model analysis and optimization commands
The cloud-native AI suite supports multiple model analysis and optimization commands. You can run the commands to benchmark models, analyze the network structure, check the duration of each operator, and view the GPU utilization. Then, you can identify the performance bottlenecks of a model and use TensorRT to optimize the model. This helps you release models that meet the performance requirements of a production environment. The following figure shows the model lifecycle assisted by the model analysis and optimization commands.
Model Training: The model is trained based on a given dataset.
Model Benchmark: A benchmark is performed on the model to check whether the latency, throughput, and GPU utilization of the model meet the requirements.
Model Profile: The model is analyzed to identify performance bottlenecks.
Model Optimize: The GPU inference capability of the model is optimized by using tools such as TensorRT.
Model Serving: The model is deployed as an online service.
If the model still does not meet the performance requirements after you optimize the model, you can repeat the preceding phases.
How to run the commands
You can use Arena to submit model analysis, optimization, benchmark, and evaluation jobs to ACK Pro clusters. You can run the arena model analyze --help command to view the help information.
$ arena model analyze --help
submit a model analyze job.

Available Commands:
  profile      Submit a model profile job.
  evaluate     Submit a model evaluate job.
  optimize     Submit a model optimize job.
  benchmark    Submit a model benchmark job

Usage:
  arena model analyze [flags]
  arena model analyze [command]

Available Commands:
  benchmark    Submit a model benchmark job
  delete       Delete a model job
  evaluate     Submit a model evaluate job
  get          Get a model job
  list         List all the model jobs
  optimize     Submit a model optimize job, this is a experimental feature
  profile      Submit a model profile job
Step 1: Prepare a model
We recommend that you use TorchScript to deploy PyTorch models. In this topic, a ResNet18 model provided by PyTorch is used as an example.
Convert the model. The following code converts the ResNet18 model to a TorchScript model and saves it:
import torch
import torchvision

model = torchvision.models.resnet18(pretrained=True)

# Switch the model to eval mode.
model.eval()

# An example input that you would normally provide to the forward() method of your model.
dummy_input = torch.rand(1, 3, 224, 224)

# Use torch.jit.trace to generate a torch.jit.ScriptModule via tracing.
traced_script_module = torch.jit.trace(model, dummy_input)

# Save the TorchScript model.
traced_script_module.save("resnet18.pt")
The following table describes the parameters in the model configuration file.
Parameter         Description
model_name        The name of the model.
model_platform    The platform or framework used by the model, such as TorchScript and ONNX.
model_path        The path in which the model is stored.
inputs            The input parameters.
outputs           The output parameters.
After the model is converted, upload the model file resnet18.pt to OSS. The OSS path of the model file is oss://bucketname/models/resnet18/resnet18.pt. For more information, see Upload objects.
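You can also run a quick local check to confirm that the traced model loads and produces the expected output shape before you rely on it in later steps. The following is a minimal sketch; the file name resnet18.pt and the input shape [1, 3, 224, 224] come from the conversion code above.
import torch

# Load the TorchScript model that was saved by the conversion script.
loaded_model = torch.jit.load("resnet18.pt")
loaded_model.eval()

# Run a dummy input with the same shape as the one used for tracing.
dummy_input = torch.rand(1, 3, 224, 224)
with torch.no_grad():
    output = loaded_model(dummy_input)

# The ResNet18 classifier returns 1000 class scores per image.
print(output.shape)  # Expected: torch.Size([1, 1000])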
Step 2: Perform a benchmark
Before you deploy a model in a production environment, you can perform a benchmark to evaluate the performance of the model. In this step, a benchmark job is submitted by using Arena, and a PVC named oss-pvc in the default namespace of the cluster is used as an example. For more information, see Mount a statically provisioned OSS volume.
Prepare and upload the configuration file of the model.
Create a configuration file for the model. In this example, the configuration file is named config.json.
{
  "model_name": "resnet18",
  "model_platform": "torchscript",
  "model_path": "/data/models/resnet18/resnet18.pt",
  "inputs": [
    {
      "name": "input",
      "data_type": "float32",
      "shape": [1, 3, 224, 224]
    }
  ],
  "outputs": [
    {
      "name": "output",
      "data_type": "float32",
      "shape": [1000]
    }
  ]
}
Upload the configuration file to OSS. The OSS path of the configuration file is oss://bucketname/models/resnet18/config.json.
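If you prefer to upload the files programmatically instead of using the OSS console or ossutil, you can use the OSS Python SDK (oss2). The following is a minimal sketch: the access key values and the endpoint are placeholders that you must replace, and the bucket name and object keys follow the paths used in this topic.
import oss2

# Placeholder credentials and endpoint; replace them with your own values.
auth = oss2.Auth("<your-access-key-id>", "<your-access-key-secret>")
bucket = oss2.Bucket(auth, "https://oss-cn-beijing.aliyuncs.com", "bucketname")

# Upload the TorchScript model and its configuration file to the paths
# referenced by the benchmark job (oss://bucketname/models/resnet18/...).
bucket.put_object_from_file("models/resnet18/resnet18.pt", "resnet18.pt")
bucket.put_object_from_file("models/resnet18/config.json", "config.json")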
Run the following command to submit a benchmark job to the ACK Pro cluster:
arena model analyze benchmark \
    --name=resnet18-benchmark \
    --namespace=default \
    --image=registry.cn-beijing.aliyuncs.com/kube-ai/easy-inference:1.0.2 \
    --gpus=1 \
    --data=oss-pvc:/data \
    --model-config-file=/data/models/resnet18/config.json \
    --report-path=/data/models/resnet18 \
    --concurrency=5 \
    --duration=60
Parameter              Description
--gpus                 The number of GPUs that are used.
--data                 The PVC for the cluster and the path to which the PVC is mounted.
--model-config-file    The path of the configuration file.
--report-path          The path in which the benchmark report is stored.
--concurrency          The number of concurrent requests.
--duration             The duration of the benchmark job. Unit: seconds.
Important: You cannot specify the --requests and --duration parameters at the same time. Specify only one of them when you submit a benchmark job. If you specify both of them, the system uses the --duration parameter by default. To specify the total number of requests sent by the benchmark job, specify the --requests parameter.
Run the following command to query the status of the job:
arena model analyze list -A
Expected output:
NAMESPACE   NAME                 STATUS     TYPE        DURATION   AGE   GPU(Requested)
default     resnet18-benchmark   COMPLETE   Benchmark   0s         2d    1
View the benchmark report. If STATUS displays COMPLETE, the benchmark job is completed. Then, you can find a benchmark report named benchmark_result.txt in the path specified by the --report-path parameter.
Expected output:
{ "p90_latency":7.511, "p95_latency":7.86, "p99_latency":9.34, "min_latency":7.019, "max_latency":12.269, "mean_latency":7.312, "median_latency":7.206, "throughput":136, "gpu_mem_used":1.47, "gpu_utilization":21.280 }
The following table describes the metrics that are included in a benchmark report.
Metric            Description                      Unit
p90_latency       90th percentile response time    Milliseconds
p95_latency       95th percentile response time    Milliseconds
p99_latency       99th percentile response time    Milliseconds
min_latency       Fastest response time            Milliseconds
max_latency       Slowest response time            Milliseconds
mean_latency      Average response time            Milliseconds
median_latency    Median response time             Milliseconds
throughput        Throughput                       Times
gpu_mem_used      GPU memory usage                 GB
gpu_utilization   GPU utilization                  Percentage
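If you want to gate a release on these metrics, you can parse the benchmark report and compare it against your own thresholds. The following is a minimal sketch, assuming the report was downloaded locally as benchmark_result.txt; the threshold values are placeholders, not recommendations.
import json

# Assumed local copy of the report generated by the benchmark job.
with open("benchmark_result.txt") as f:
    report = json.load(f)

# Placeholder service-level targets; adjust them to your own requirements.
targets = {"p99_latency": 10.0, "throughput": 100, "gpu_utilization": 80.0}

passed = (
    report["p99_latency"] <= targets["p99_latency"]
    and report["throughput"] >= targets["throughput"]
    and report["gpu_utilization"] <= targets["gpu_utilization"]
)
print("benchmark passed" if passed else "benchmark failed", report)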
Step 3: Analyze the model
After you perform a benchmark, you can run the arena model analyze profile command to analyze the model and identify performance bottlenecks.
Run the following command to submit a model analysis job to the ACK Pro cluster:
arena model analyze profile \
    --name=resnet18-profile \
    --namespace=default \
    --image=registry.cn-beijing.aliyuncs.com/kube-ai/easy-inference:1.0.2 \
    --gpus=1 \
    --data=oss-pvc:/data \
    --model-config-file=/data/models/resnet18/config.json \
    --report-path=/data/models/resnet18/log/ \
    --tensorboard \
    --tensorboard-image=registry.cn-beijing.aliyuncs.com/kube-ai/easy-inference:1.0.2
Parameter              Description
--gpus                 The number of GPUs that are used.
--data                 The PVC for the cluster and the path to which the PVC is mounted.
--model-config-file    The path of the configuration file.
--report-path          The path in which the analysis report is stored.
--tensorboard          Specifies whether to view the analysis report in TensorBoard.
--tensorboard-image    The URL of the image that is used to deploy TensorBoard.
Run the following command to query the status of the job:
arena model analyze list -A
Expected output:
NAMESPACE   NAME               STATUS     TYPE      DURATION   AGE   GPU(Requested)
default     resnet18-profile   COMPLETE   Profile   13s        2d    1
Run the following command to query the status of TensorBoard:
kubectl get service -n default
Expected output:
NAME                           TYPE       CLUSTER-IP       EXTERNAL-IP   PORT(S)          AGE
resnet18-profile-tensorboard   NodePort   172.16.158.170   <none>        6006:30582/TCP   2d20h
Run the following command to enable port forwarding and access TensorBoard:
kubectl port-forward svc/resnet18-profile-tensorboard -n default 6006:6006
Expected output:
Forwarding from 127.0.X.X:6006 -> 6006
Forwarding from [::1]:6006 -> 6006
Enter http://localhost:6006 into the address bar of your browser to view the analysis results. In the left-side navigation pane, click Views to view the analysis results based on multiple dimensions and identify performance bottlenecks. You can optimize the model based on the analysis results.
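The profile job produces operator-level traces that TensorBoard visualizes. If you want to generate a similar trace locally before submitting a job, the PyTorch profiler can write TensorBoard-compatible traces. The following is a minimal local sketch, not part of the Arena workflow; the log directory name profiler_log is an assumption.
import torch
import torchvision
from torch.profiler import profile, ProfilerActivity, tensorboard_trace_handler

model = torchvision.models.resnet18(pretrained=True).eval()
dummy_input = torch.rand(1, 3, 224, 224)

# Record CPU (and GPU, if available) operator timings and export a
# TensorBoard-compatible trace to ./profiler_log.
activities = [ProfilerActivity.CPU]
if torch.cuda.is_available():
    model = model.cuda()
    dummy_input = dummy_input.cuda()
    activities.append(ProfilerActivity.CUDA)

with profile(activities=activities,
             on_trace_ready=tensorboard_trace_handler("./profiler_log")) as prof:
    with torch.no_grad():
        model(dummy_input)

# Print the operators sorted by time to find potential bottlenecks.
print(prof.key_averages().table(sort_by="self_cpu_time_total", row_limit=10))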
Step 4: Optimize the model
You can use Arena to optimize a model.
Run the following command to submit a model optimization job to the ACK Pro cluster:
arena model analyze optimize \
    --name=resnet18-optimize \
    --namespace=default \
    --image=registry.cn-beijing.aliyuncs.com/kube-ai/easy-inference:1.0.2 \
    --gpus=1 \
    --data=oss-pvc:/data \
    --optimizer=tensorrt \
    --model-config-file=/data/models/resnet18/config.json \
    --export-path=/data/models/resnet18
Parameter              Description
--gpus                 The number of GPUs that are used.
--data                 The PVC for the cluster and the path to which the PVC is mounted.
--optimizer            The optimization method. Valid values: tensorrt (default) and aiacc-torch.
--model-config-file    The path of the configuration file.
--export-path          The path in which the optimized model is stored.
Run the following command to query the status of the job:
arena model analyze list -A
Expected output:
NAMESPACE   NAME                STATUS     TYPE       DURATION   AGE   GPU(Requested)
default     resnet18-optimize   COMPLETE   Optimize   16s        2d    1
View the optimized model. If STATUS displays COMPLETE, the optimization job is completed. Then, you can find the optimized model file named opt_resnet18.pt in the path specified by the --export-path parameter.
Change the value of the model_path parameter in the model configuration file to the path of the optimized model that you obtained in the preceding step, and perform a benchmark again. For more information about how to perform a benchmark, see Step 2: Perform a benchmark.
The following table describes the metric values before and after the model is optimized.
Metric            Before optimization    After optimization
p90_latency       7.511 milliseconds     5.162 milliseconds
p95_latency       7.86 milliseconds      5.428 milliseconds
p99_latency       9.34 milliseconds      6.64 milliseconds
min_latency       7.019 milliseconds     4.827 milliseconds
max_latency       12.269 milliseconds    8.426 milliseconds
mean_latency      7.312 milliseconds     5.046 milliseconds
median_latency    7.206 milliseconds     4.972 milliseconds
throughput        136 times              198 times
gpu_mem_used      1.47 GB                1.6 GB
gpu_utilization   21.280%                10.912%
The statistics show that the latency and throughput of the model are greatly improved and the GPU utilization is reduced after optimization. If the model still does not meet the performance requirements, you can repeat the preceding steps to analyze and optimize the model.
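To benchmark the optimized model, you can point the model configuration file at the optimized artifact before you resubmit the benchmark job. The following is a minimal sketch, assuming a local copy of config.json; the new model_path follows the --export-path used above. Upload the updated file to OSS afterward.
import json

# Load the existing model configuration file.
with open("config.json") as f:
    config = json.load(f)

# Point model_path at the optimized model produced by the optimize job
# (stored under the --export-path directory).
config["model_path"] = "/data/models/resnet18/opt_resnet18.pt"

with open("config.json", "w") as f:
    json.dump(config, f, indent=2)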
Step 5: Deploy the model
If the model meets the performance requirements, you can deploy the model as an online service. Arena allows you to use NVIDIA Triton Inference Server to deploy TorchScript models. For more information, see Nvidia Triton Server.
Create a configuration file named config.pbtxt.
Important: Do not change the file name.
name: "resnet18" platform: "pytorch_libtorch" max_batch_size: 1 default_model_filename: "opt_resnet18.pt" input [ { name: "input__0" format: FORMAT_NCHW data_type: TYPE_FP32 dims: [ 3, 224, 224 ] } ] output [ { name: "output__0", data_type: TYPE_FP32, dims: [ 1000 ] } ]
Note: For more information about the parameters in the configuration file, see Model Repository.
Create the following directory structure in OSS:
oss://bucketname/triton/model-repository/
    resnet18/
        config.pbtxt
        1/
            opt_resnet18.pt
Note: 1/ is a convention of NVIDIA Triton Inference Server. The value indicates the version number of the model. A model repository can store different versions of a model. For more information, see Model Repository.
Use Arena to deploy the model. You can deploy a model in GPU sharing mode or GPU exclusive mode.
GPU exclusive mode: You can use this mode to deploy inference services that require high stability. In this mode, each GPU accelerates only one model. Models do not compete for GPU resources. You can run the following command to deploy a model in GPU exclusive mode:
arena serve triton \
    --name=resnet18-serving \
    --gpus=1 \
    --replicas=1 \
    --image=nvcr.io/nvidia/tritonserver:21.05-py3 \
    --data=oss-pvc:/data \
    --model-repository=/data/triton/model-repository \
    --allow-metrics=true
GPU sharing mode: You can use this mode to deploy long-tail inference services or inference services that require cost-efficiency. In this mode, a GPU is shared by multiple models. Each model is allowed to use only a specified amount of GPU memory. You can run the following command to deploy a model in GPU sharing mode:
If you deploy models in GPU sharing mode, you must set the --gpumemory parameter. This parameter specifies the amount of GPU memory that is allocated to each pod. You can specify a proper value based on the gpu_mem_used metric in the benchmark result. For example, if the value of the gpu_mem_used metric is 1.6 GB, you can set the --gpumemory parameter to 2 GB. The value of this parameter must be a positive integer.
arena serve triton \
    --name=resnet18 \
    --gpumemory=2 \
    --replicas=1 \
    --image=nvcr.io/nvidia/tritonserver:21.12-py3 \
    --data=oss-pvc:/data \
    --model-repository=/data/triton/model-repository \
    --allow-metrics=true
Run the following command to query the status of the deployment:
arena serve list -A
Expected output:
NAMESPACE   NAME               TYPE     VERSION        DESIRED   AVAILABLE   ADDRESS          PORTS                    GPU
default     resnet18-serving   Triton   202202141817   1         1           172.16.147.248   RESTFUL:8000,GRPC:8001   1
If the value of the AVAILABLE parameter equals the value of the DESIRED parameter, the model is deployed.
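After the service is available, you can send a test request to verify the deployment. The following is a minimal sketch that uses the Triton HTTP client from the tritonclient Python package. The endpoint address is an assumption based on the ADDRESS and RESTFUL port shown above, and the tensor names follow the config.pbtxt in this topic; you may need to expose the service (for example, with kubectl port-forward) before the address is reachable from your client.
import numpy as np
import tritonclient.http as httpclient

# Assumed endpoint; replace with the RESTFUL address of your service.
client = httpclient.InferenceServerClient(url="172.16.147.248:8000")

# Build a request that matches config.pbtxt: one 3x224x224 FP32 image.
image = np.random.rand(1, 3, 224, 224).astype(np.float32)
infer_input = httpclient.InferInput("input__0", list(image.shape), "FP32")
infer_input.set_data_from_numpy(image)

# Send the request to the resnet18 model and read the 1000 class scores.
response = client.infer(model_name="resnet18", inputs=[infer_input])
scores = response.as_numpy("output__0")
print(scores.shape)  # Expected: (1, 1000)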