
Container Service for Kubernetes: Analyze and optimize models

Last Updated: Aug 22, 2024

To ensure that a model meets the deployment standards before you deploy it in a production environment, you can use the model analysis and optimization commands supported by the cloud-native AI suite to benchmark, analyze, and optimize the model. In this topic, a ResNet18 model provided by PyTorch is used as an example and V100 GPUs are used to accelerate the model.

Prerequisites

  • A Container Service for Kubernetes (ACK) Pro cluster is created and the Kubernetes version of the cluster is 1.20 or later. The cluster contains at least one GPU-accelerated node. For more information about how to update an ACK cluster, see Update an ACK cluster.

  • An Object Storage Service (OSS) bucket is created. A persistent volume (PV) and a persistent volume claim (PVC) are created. For more information, see Mount a statically provisioned OSS volume.

  • The latest version of the Arena client is installed. For more information, see Configure the Arena client.

Background information

Data scientists focus on the precision of models, whereas R&D engineers are more concerned about the performance of models. As a result, a model may not meet the performance requirements after you release the model as an online service. To prevent this issue, you may need to benchmark a model before you release the model. If the model does not meet the performance requirements, you can identify the performance bottlenecks and optimize the model.

Introduction to the model analysis and optimization commands

The cloud-native AI suite supports multiple model analysis and optimization commands. You can run the commands to benchmark models, analyze the network structure, check the duration of each operator, and view the GPU utilization. Then, you can identify the performance bottlenecks of a model and use TensorRT to optimize the model. This helps you release models that meet the performance requirements of a production environment. The following figure shows the model lifecycle assisted by the model analysis and optimization commands.

  1. Model Training: The model is trained based on a given dataset.

  2. Model Benchmark: A benchmark is performed on the model to check whether the latency, throughput, and GPU utilization of the model meet the requirements.

  3. Model Profile: The model is analyzed to identify performance bottlenecks.

  4. Model Optimize: The GPU inference capability of the model is optimized by using tools such as TensorRT.

  5. Model Serving: The model is deployed as an online service.

Note

If the model still does not meet the performance requirements after you optimize the model, you can repeat the preceding phases.

How to run the commands

You can use Arena to submit model analysis, optimization, benchmark, and evaluation jobs to ACK Pro clusters. You can run the arena model analyze --help command to view the help information.

$ arena model analyze --help
submit a model analyze job.

Available Commands:
  profile          Submit a model profile job.
  evaluate         Submit a model evaluate job.
  optimize         Submit a model optimize job.
  benchmark        Submit a model benchmark job

Usage:
  arena model analyze [flags]
  arena model analyze [command]

Available Commands:
  benchmark   Submit a model benchmark job
  delete      Delete a model job
  evaluate    Submit a model evaluate job
  get         Get a model job
  list        List all the model jobs
  optimize    Submit a model optimize job, this is a experimental feature
  profile     Submit a model profile job

Step 1: Prepare a model

We recommend that you use TorchScript to deploy PyTorch models. In this topic, a ResNet18 model provided by PyTorch is used as an example.

  1. Convert the model. Convert the ResNet18 model to a TorchScript model and save the model.

    import torch
    import torchvision
    
    model = torchvision.models.resnet18(pretrained=True)
    
    # Switch the model to eval mode
    model.eval()
    
    # An example input you would normally provide to your model's forward() method.
    dummy_input = torch.rand(1, 3, 224, 224)
    
    # Use torch.jit.trace to generate a torch.jit.ScriptModule via tracing.
    traced_script_module = torch.jit.trace(model, dummy_input)
    
    # Save the TorchScript model
    traced_script_module.save("resnet18.pt")
    

    The following parameters describe the model and are used in the model configuration file that you create in Step 2:

    Parameter         Description
    model_name        The name of the model.
    model_platform    The platform or framework used by the model, such as TorchScript or ONNX.
    model_path        The path in which the model is stored.
    inputs            The input parameters.
    outputs           The output parameters.

  2. After the model is converted, upload the model file resnet18.pt to OSS. The OSS path of the model file is oss://bucketname/models/resnet18/resnet18.pt. For more information, see Upload objects.
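
Before you submit jobs that reference the model, you can optionally verify the saved TorchScript file and upload it from Python instead of the console. The following is a minimal sketch that assumes the oss2 SDK is installed; the endpoint, AccessKey pair, and bucket name are placeholders that you must replace with your own values.

    import torch
    import oss2

    # Load the saved TorchScript model and run a dummy input to confirm
    # that it returns the expected classification output of shape (1, 1000).
    loaded = torch.jit.load("resnet18.pt")
    loaded.eval()
    with torch.no_grad():
        output = loaded(torch.rand(1, 3, 224, 224))
    print(output.shape)  # Expected: torch.Size([1, 1000])

    # Upload the model file to OSS with the oss2 SDK. Replace the placeholder
    # endpoint, AccessKey pair, and bucket name with your own values.
    auth = oss2.Auth("<your-access-key-id>", "<your-access-key-secret>")
    bucket = oss2.Bucket(auth, "https://oss-cn-hangzhou.aliyuncs.com", "bucketname")
    bucket.put_object_from_file("models/resnet18/resnet18.pt", "resnet18.pt")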

Step 2: Perform a benchmark

Before you deploy a model in a production environment, you can perform a benchmark to evaluate the performance of the model. In this step, a benchmark job is submitted in Arena and a PVC named oss-pvc in the default namespace of the cluster is used as an example. For more information, see Mount a statically provisioned OSS volume.

  1. Prepare and upload the configuration file of the model.

    1. Create a configuration file for the model. In this example, the configuration file is named config.json.

      {
        "model_name": "resnet18",
        "model_platform": "torchscript",
        "model_path": "/data/models/resnet18/resnet18.pt",
        "inputs": [
          {
            "name": "input",
            "data_type": "float32",
            "shape": [1, 3, 224, 224]
          }
        ],
        "outputs": [
          {
              "name": "output",
              "data_type": "float32",
              "shape": [ 1000 ]
          }
        ]
      }
    2. Upload the configuration file to OSS. The OSS path of the configuration file is oss://bucketname/models/resnet18/config.json.

  2. Run the following command to submit a benchmark job to the ACK Pro cluster:

    arena model analyze benchmark \
      --name=resnet18-benchmark \
      --namespace=default \
      --image=registry.cn-beijing.aliyuncs.com/kube-ai/easy-inference:1.0.2 \
      --gpus=1 \
      --data=oss-pvc:/data \
      --model-config-file=/data/models/resnet18/config.json \
      --report-path=/data/models/resnet18 \
      --concurrency=5 \
      --duration=60

    Parameter              Description
    --gpus                 The number of GPUs that are used.
    --data                 The PVC for the cluster and the path to which the PVC is mounted.
    --model-config-file    The path of the configuration file.
    --report-path          The path in which the benchmark report is stored.
    --concurrency          The number of concurrent requests.
    --duration             The duration of the benchmark job. Unit: seconds.

    Important
    • You cannot specify the --requests and the --duration parameters at the same time. Specify only one of them when you submit a benchmark job. If you specify both of them, the system uses the --duration parameter by default.

    • To specify the total number of requests sent by the benchmark job, specify the --requests parameter.

  3. Run the following command to query the status of the job:

    arena model analyze list -A

    Expected output:

    NAMESPACE      NAME                        STATUS    TYPE       DURATION  AGE  GPU(Requested)
    default        resnet18-benchmark          COMPLETE  Benchmark  0s        2d   1
  4. View the benchmark report. If the STATUS column displays COMPLETE, the benchmark job is completed. Then, you can find a benchmark report named benchmark_result.txt in the path specified by the --report-path parameter.

    Expected output:

    {
        "p90_latency":7.511,
        "p95_latency":7.86,
        "p99_latency":9.34,
        "min_latency":7.019,
        "max_latency":12.269,
        "mean_latency":7.312,
        "median_latency":7.206,
        "throughput":136,
        "gpu_mem_used":1.47,
        "gpu_utilization":21.280
    }

    The following table describes the metrics that are included in a benchmark report.

    Metric             Description                      Unit
    p90_latency        90th percentile response time    Milliseconds
    p95_latency        95th percentile response time    Milliseconds
    p99_latency        99th percentile response time    Milliseconds
    min_latency        Fastest response time            Milliseconds
    max_latency        Slowest response time            Milliseconds
    mean_latency       Average response time            Milliseconds
    median_latency     Median response time             Milliseconds
    throughput         Throughput                       Times
    gpu_mem_used       GPU memory usage                 GB
    gpu_utilization    GPU utilization                  Percentage
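
Because the benchmark report is plain JSON, you can check it against your own performance targets in a script before you decide whether to optimize the model. The following is a minimal Python sketch; the report path and the threshold values are examples, not requirements from this topic.

    import json

    # Example thresholds. Adjust them to the requirements of your service.
    LATENCY_LIMITS_MS = {"p99_latency": 10.0, "mean_latency": 8.0}
    MIN_THROUGHPUT = 100

    # Read the report written to the path specified by --report-path.
    with open("benchmark_result.txt") as f:
        report = json.load(f)

    for metric, limit in LATENCY_LIMITS_MS.items():
        status = "OK" if report[metric] <= limit else "EXCEEDED"
        print(f"{metric}: {report[metric]} ms (limit {limit} ms) -> {status}")

    if report["throughput"] < MIN_THROUGHPUT:
        print(f"throughput {report['throughput']} is below the target {MIN_THROUGHPUT}")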

Step 3: Analyze the model

After you perform a benchmark, you can run the arena model analyze profile command to analyze the model and identify performance bottlenecks.

  1. Run the following command to submit a model analysis job to the ACK Pro cluster:

    arena model analyze profile \
      --name=resnet18-profile \
      --namespace=default \
      --image=registry.cn-beijing.aliyuncs.com/kube-ai/easy-inference:1.0.2 \
      --gpus=1 \
      --data=oss-pvc:/data \
      --model-config-file=/data/models/resnet18/config.json \
      --report-path=/data/models/resnet18/log/ \
      --tensorboard \
      --tensorboard-image=registry.cn-beijing.aliyuncs.com/kube-ai/easy-inference:1.0.2

    Parameter              Description
    --gpus                 The number of GPUs that are used.
    --data                 The PVC for the cluster and the path to which the PVC is mounted.
    --model-config-file    The path of the configuration file.
    --report-path          The path in which the analysis report is stored.
    --tensorboard          Specifies whether to view the analysis report in TensorBoard.
    --tensorboard-image    The URL of the image that is used to deploy TensorBoard.

  2. Run the following command to query the status of the job:

    arena model analyze list -A

    Expected output:

    NAMESPACE      NAME                        STATUS    TYPE       DURATION  AGE  GPU(Requested)
    default        resnet18-profile            COMPLETE  Profile    13s       2d   1
  3. Run the following command to query the status of TensorBoard:

    kubectl get service -n default

    Expected output:

    NAME                           TYPE       CLUSTER-IP       EXTERNAL-IP   PORT(S)          AGE
    resnet18-profile-tensorboard   NodePort   172.16.158.170   <none>        6006:30582/TCP   2d20h
  4. Run the following command to enable port forwarding and access TensorBoard:

    kubectl port-forward svc/resnet18-profile-tensorboard -n default 6006:6006

    Expected output:

    Forwarding from 127.0.X.X:6006 -> 6006
    Forwarding from [::1]:6006 -> 6006
  5. Enter http://localhost:6006 into the address bar of your browser to view the analysis results. In the left-side navigation pane, click Views to view the analysis results based on multiple dimensions and identify performance bottlenecks. You can optimize the model based on the analysis results.
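
The TensorBoard views are generated from PyTorch profiler traces. If you want to reproduce a similar trace locally before you submit a profile job, you can run the PyTorch profiler yourself. The following is a minimal sketch that assumes a GPU and the torchvision package are available; the log directory and step counts are arbitrary examples.

    import torch
    import torchvision

    # Profile a few inference steps and write TensorBoard-compatible traces,
    # similar in spirit to what the profile job collects.
    model = torchvision.models.resnet18(pretrained=True).cuda().eval()
    dummy_input = torch.rand(1, 3, 224, 224).cuda()

    with torch.profiler.profile(
        activities=[
            torch.profiler.ProfilerActivity.CPU,
            torch.profiler.ProfilerActivity.CUDA,
        ],
        schedule=torch.profiler.schedule(wait=1, warmup=1, active=3),
        on_trace_ready=torch.profiler.tensorboard_trace_handler("./log/resnet18"),
    ) as prof:
        with torch.no_grad():
            for _ in range(5):
                model(dummy_input)
                prof.step()

    # Print the operators that spend the most time on the GPU.
    print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))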

Step 4: Optimize the model

You can use Arena to optimize a model.

  1. Run the following command to submit a model optimization job to the ACK Pro cluster:

    arena model analyze optimize \
      --name=resnet18-optimize \
      --namespace=default \
      --image=registry.cn-beijing.aliyuncs.com/kube-ai/easy-inference:1.0.2 \
      --gpus=1 \
      --data=oss-pvc:/data \
      --optimizer=tensorrt \
      --model-config-file=/data/models/resnet18/config.json \
      --export-path=/data/models/resnet18

    Parameter              Description
    --gpus                 The number of GPUs that are used.
    --data                 The PVC for the cluster and the path to which the PVC is mounted.
    --optimizer            The optimization method. Valid values: tensorrt (default) and aiacc-torch.
    --model-config-file    The path of the configuration file.
    --export-path          The path in which the optimized model is stored.

  2. Run the following command to query the status of the job:

    arena model analyze list -A

    Expected output:

    NAMESPACE      NAME                        STATUS    TYPE       DURATION  AGE  GPU(Requested)
    default        resnet18-optimize           COMPLETE  Optimize   16s       2d   1
  3. View the optimized model. If the STATUS column displays COMPLETE, the optimization job is completed. Then, you can find the optimized model file named opt_resnet18.pt in the path specified by the --export-path parameter.

  4. Change the value of the model_path field in the model configuration file to the path of the optimized model file that you obtained in the preceding step, and then perform a benchmark again. For more information about how to perform a benchmark, see Step 2: Perform a benchmark.

    The following table describes the metric values before and after the model is optimized.

    Metric             Before optimization     After optimization
    p90_latency        7.511 milliseconds      5.162 milliseconds
    p95_latency        7.86 milliseconds       5.428 milliseconds
    p99_latency        9.34 milliseconds       6.64 milliseconds
    min_latency        7.019 milliseconds      4.827 milliseconds
    max_latency        12.269 milliseconds     8.426 milliseconds
    mean_latency       7.312 milliseconds      5.046 milliseconds
    median_latency     7.206 milliseconds      4.972 milliseconds
    throughput         136 times               198 times
    gpu_mem_used       1.47 GB                 1.6 GB
    gpu_utilization    21.280%                 10.912%

    The statistics show that the latency and throughput of the model are greatly improved and the GPU utilization is reduced after optimization. If the model still does not meet the performance requirements, you can repeat the preceding steps to analyze and optimize the model.
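
If you keep the benchmark report from Step 2 and the report that you generate after optimization, you can compare them in a script instead of reading the two JSON files side by side. The following is a minimal sketch; the two file names are examples and should point to the reports in your --report-path directories.

    import json

    # Reports written by the benchmark jobs before and after optimization.
    with open("benchmark_before.txt") as f:
        before = json.load(f)
    with open("benchmark_after.txt") as f:
        after = json.load(f)

    # Print each metric side by side with the relative change in percent.
    print(f"{'metric':<18}{'before':>14}{'after':>14}{'change':>10}")
    for metric in sorted(before):
        b, a = before[metric], after[metric]
        change = (a - b) / b * 100
        print(f"{metric:<18}{b:>14}{a:>14}{change:>+9.1f}%")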

Step 5: Deploy the model

If the model meets the performance requirements, you can deploy the model as an online service. Arena allows you to use NVIDIA Triton Inference Server to deploy TorchScript models. For more information, see Nvidia Triton Server.

  1. Create a configuration file named config.pbtxt.

    Important

    Do not change the file name.

    name: "resnet18"
    platform: "pytorch_libtorch"
    max_batch_size: 1
    default_model_filename: "opt_resnet18.pt"
    input [
        {
            name: "input__0"
            format: FORMAT_NCHW
            data_type: TYPE_FP32
            dims: [ 3, 224, 224 ]
        }
    ]
    output [
        {
            name: "output__0",
            data_type: TYPE_FP32,
            dims: [ 1000 ]
        }
    ]
    Note

    For more information about the parameters in the configuration file, see Model Repository.

  2. Create the following directory structure in OSS:

    oss://bucketname/triton/model-repository/
        resnet18/
          config.pbtxt
          1/
            opt_resnet18.pt
    Note

    1/ is a convention of NVIDIA Triton Inference Server. The value indicates the version number of the model. A model repository can store different versions of a model. For more information, see Model Repository.

  3. Use Arena to deploy the model. You can deploy a model in GPU sharing mode or GPU exclusive mode.

    • GPU exclusive mode: You can use this mode to deploy inference services that require high stability. In this mode, each GPU accelerates only one model. Models do not compete for GPU resources. You can run the following command to deploy a model in GPU exclusive mode:

      arena serve triton \
        --name=resnet18-serving \
        --gpus=1 \
        --replicas=1 \
        --image=nvcr.io/nvidia/tritonserver:21.05-py3 \
        --data=oss-pvc:/data \
        --model-repository=/data/triton/model-repository \
        --allow-metrics=true
    • GPU sharing mode: You can use this mode to deploy long-tail inference services or inference services that require cost-efficiency. In this mode, a GPU is shared by multiple models. Each model is allowed to use only a specified amount of GPU memory.

      If you deploy models in GPU sharing mode, you must set the --gpumemory parameter. This parameter specifies the amount of GPU memory that is allocated to each pod. The value of this parameter must be a positive integer. Unit: GB. You can specify a proper value based on the gpu_mem_used metric in the benchmark report. For example, if the value of the gpu_mem_used metric is 1.6 GB, you can set the --gpumemory parameter to 2. Run the following command to deploy a model in GPU sharing mode:

      arena serve triton \
        --name=resnet18 \
        --gpumemory=2 \
        --replicas=1 \
        --image=nvcr.io/nvidia/tritonserver:21.12-py3 \
        --data=oss-pvc:/data \
        --model-repository=/data/triton/model-repository \
        --allow-metrics=true
  4. Run the following command to query the status of the deployment:

    arena serve list -A

    Expected output:

    NAMESPACE      NAME              TYPE    VERSION       DESIRED  AVAILABLE  ADDRESS         PORTS                   GPU
    default        resnet18-serving  Triton  202202141817  1        1          172.16.147.248  RESTFUL:8000,GRPC:8001  1

    If the value of the AVAILABLE parameter equals the value of the DESIRED parameter, the model is deployed.
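
After the service reaches the desired state, you can send a test request to verify that the deployment serves inference traffic. The following is a minimal sketch that uses the tritonclient Python package over HTTP; the service address is taken from the example output above (port 8000 is the RESTful port), and the input and output names match the config.pbtxt file created in this step. If you access the service from outside the cluster, use port forwarding or an Ingress instead of the cluster IP address.

    import numpy as np
    import tritonclient.http as httpclient

    # Cluster-internal address from the `arena serve list` output; 8000 is the RESTful port.
    client = httpclient.InferenceServerClient(url="172.16.147.248:8000")

    # Build a request whose input name, shape, and data type match config.pbtxt.
    image = np.random.rand(1, 3, 224, 224).astype(np.float32)
    infer_input = httpclient.InferInput("input__0", list(image.shape), "FP32")
    infer_input.set_data_from_numpy(image)

    response = client.infer(
        model_name="resnet18",
        inputs=[infer_input],
        outputs=[httpclient.InferRequestedOutput("output__0")],
    )
    scores = response.as_numpy("output__0")
    print(scores.shape)  # Expected: (1, 1000)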