Container Service for Kubernetes: Quickly deploy a model in ACK

Last Updated: Jul 02, 2024

When you deploy a model, you can choose the model source and the deployment platform based on your business requirements. This topic uses the Qwen1.5-4B-Chat model and a T4 GPU as an example to demonstrate how to quickly deploy a ModelScope model, a HuggingFace model, and a local model in Container Service for Kubernetes (ACK).

Warning

This topic is intended only for trial use of the model deployment feature. We recommend that you do not follow these steps to deploy models in a production environment.

Model overview

ModelScope

ModelScope is an open source AI model community and development platform that hosts a large number of industry-leading pretrained models, which helps developers reduce model development costs. ModelScope provides high-quality open source models that you can test online or download free of charge. For more information, see Introduction to ModelScope.

HuggingFace

HuggingFace is a platform that provides more than 350,000 models, 75,000 datasets, and 150,000 application demos, most of which are open source. You can collaborate with other developers on machine learning projects on HuggingFace. For more information, see HuggingFace documentation.

Prerequisites

  • An ACK Pro cluster that contains GPU-accelerated nodes is created. The Kubernetes version of the cluster is 1.22 or later. Each GPU-accelerated node must provide at least 16 GB of GPU memory. For more information, see Create an ACK managed cluster.

    We recommend that you use GPU driver version 525. You can add the ack.aliyun.com/nvidia-driver-version:525.105.17 label to GPU-accelerated nodes to specify the GPU driver version as 525.105.17. For more information, see Specify an NVIDIA driver version for nodes by adding a label.

  • The latest version of the Arena client is installed. For more information, see Configure the Arena client.
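
You can quickly check whether the cluster meets these requirements before you proceed. The following commands are a minimal sketch; the nvidia.com/gpu resource name is the default exposed by the NVIDIA device plugin and may differ in your environment.

# Check the Kubernetes version of the cluster. It must be 1.22 or later.
kubectl version

# List nodes and the number of allocatable GPUs on each node (assumes the default nvidia.com/gpu resource name).
kubectl get nodes -o custom-columns='NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu'

# Confirm that the Arena client is installed and check its version.
arena version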

Deploy a ModelScope model

Step 1: Deploy an inference service

  1. Run the following command to use Arena to deploy a custom service. The name of the service is modelscope and its version is v1.

    After the application launches, it downloads the qwen/Qwen1.5-4B-Chat model from ModelScope. To download a different model, modify the MODEL_ID startup parameter. You can set the DASHSCOPE_API_KEY environment variable to configure the token used by the ModelScope SDK.

    Important

    The model is downloaded to pods. Make sure that the GPU-accelerated node that hosts the pods has at least 30 GB of free disk space.

    arena serve custom \
        --name=modelscope \
        --version=v1 \
        --gpus=1 \
        --replicas=1 \
        --restful-port=8000 \
        --readiness-probe-action="tcpSocket" \
        --readiness-probe-action-option="port: 8000" \
        --readiness-probe-option="initialDelaySeconds: 30" \
        --readiness-probe-option="periodSeconds: 30" \
        --image=kube-ai-registry.cn-shanghai.cr.aliyuncs.com/kube-ai/modelscope:v1 \
        "MODEL_ID=qwen/Qwen1.5-4B-Chat python3 server.py"

    The following table describes the parameters.

    Parameter                        Description
    ---------                        -----------
    --name                           The name of the inference service.
    --version                        The version of the inference service.
    --gpus                           The number of GPUs for each inference service replica.
    --replicas                       The number of inference service replicas.
    --restful-port                   The port of the inference service to be exposed.
    --readiness-probe-action         The connection type of readiness probes. Valid values: HttpGet, Exec, gRPC, and TCPSocket.
    --readiness-probe-action-option  The connection method of readiness probes.
    --readiness-probe-option         The readiness probe configuration.
    --image                          The address of the inference service image.

    Expected output:

    service/modelscope-v1 created
    deployment.apps/modelscope-v1-custom-serving created
    INFO[0002] The Job modelscope has been submitted successfully
    INFO[0002] You can run `arena serve get modelscope --type custom-serving -n default` to check the job status

    The output indicates that Kubernetes resources related to the modelscope-v1 model are created.

  2. Run the following command to query the details of the inference service.

    Model downloading is time-consuming. To view the details of the inference service, wait about 10 minutes after the service is deployed.

    arena serve get modelscope

    Expected output:

    Name:       modelscope
    Namespace:  default
    Type:       Custom
    Version:    v1
    Desired:    1
    Available:  1
    Age:        10m
    Address:    172.16.XX.XX
    Port:       RESTFUL:8000
    GPU:        1
    
    Instances:
      NAME                                           STATUS   AGE  READY  RESTARTS  GPU  NODE
      ----                                           ------   ---  -----  --------  ---  ----
      modelscope-v1-custom-serving-5bb85d6555-2p6z9  Running  10m  1/1    0         1    cn-beijing.192.168.XX.XX

    The output indicates that the modelscope inference service is deployed on a GPU-accelerated node and is ready to accept requests.
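
Because downloading the model takes several minutes, you can follow the container logs while you wait. The following command is a minimal sketch that uses the Deployment name shown in the preceding output and assumes the default namespace:

# Stream the logs of the inference service to watch the model download and server startup.
kubectl logs deployment/modelscope-v1-custom-serving -n default -f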

Step 2: Verify the inference service

  1. Run the following command to set up port forwarding between the inference service and local environment.

    Important

    Port forwarding set up by using kubectl port-forward is not reliable, secure, or scalable in production environments. It is only for development and debugging. Do not use this command to set up port forwarding in production environments. For more information about networking solutions used for production in ACK clusters, see Ingress overview.

    kubectl port-forward svc/modelscope-v1 8000:8000

    Expected output:

    Forwarding from 127.0.0.1:8000 -> 8000
    Forwarding from [::1]:8000 -> 8000
  2. Run the following command to send a request to the modelscope inference service:

    curl -XPOST http://localhost:8000/generate -H "Content-Type: application/json"  -d '{"text_input": "What is AI? AI is", "parameters": {"stream": false, "temperature": 0.9, "seed": 10}}'

    Expected output:

    {"model_name":"/root/.cache/modelscope/hub/qwen/Qwen1___5-4B-Chat","text_output":"What is AI? AI is technology that enables computers and machines to simulate human intelligence and problem-solving capabilities."}

    The output indicates that the model can provide the definition of AI.
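
You can adjust the sampling parameters in the request body and post-process the JSON response on the client. The following is a sketch that assumes python3 is available locally; the prompt and parameter values are arbitrary examples:

# Send a request with different sampling parameters and print only the generated text.
curl -s -XPOST http://localhost:8000/generate \
    -H "Content-Type: application/json" \
    -d '{"text_input": "What is Kubernetes? Kubernetes is", "parameters": {"stream": false, "temperature": 0.7, "seed": 42}}' \
    | python3 -c 'import json,sys; print(json.load(sys.stdin)["text_output"])'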

(Optional) Step 3: Clear the inference service

If you no longer need the resources, run the following command to delete the inference service:

arena serve del modelscope
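
To confirm that the service is removed, you can list the remaining inference services. A quick check, assuming the default namespace:

# The modelscope service should no longer appear in the output.
arena serve list -n default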

Deploy a HuggingFace model

Step 1: Deploy an inference service

  1. Make sure that the pods can access the HuggingFace repository. A connectivity-check sketch is provided at the end of this step.

  2. Run the following command to use Arena to deploy a custom service. The name of the service is huggingface and its version is v1.

    In this example, the MODEL_SOURCE environment variable specifies HuggingFace as the model repository. After the application launches, it downloads the Qwen/Qwen1.5-4B-Chat model from HuggingFace. To download a different HuggingFace model, modify the MODEL_ID startup parameter. You can set the HUGGINGFACE_TOKEN environment variable to configure the token used to access HuggingFace.

    Important

    The model is downloaded to pods. Make sure that the GPU-accelerated node that hosts the pods has at least 30 GB of free disk space.

    arena serve custom \
        --name=huggingface \
        --version=v1 \
        --gpus=1 \
        --replicas=1 \
        --restful-port=8000 \
        --readiness-probe-action="tcpSocket" \
        --readiness-probe-action-option="port: 8000" \
        --readiness-probe-option="initialDelaySeconds: 30" \
        --readiness-probe-option="periodSeconds: 30" \
        --image=kube-ai-registry.cn-shanghai.cr.aliyuncs.com/kube-ai/modelscope:v1 \
        "MODEL_ID=Qwen/Qwen1.5-4B-Chat MODEL_SOURCE=Huggingface python3 server.py"

    The following table describes the parameters.

    Parameter                        Description
    ---------                        -----------
    --name                           The name of the inference service.
    --version                        The version of the inference service.
    --gpus                           The number of GPUs for each inference service replica.
    --replicas                       The number of inference service replicas.
    --restful-port                   The port of the inference service to be exposed.
    --readiness-probe-action         The connection type of readiness probes. Valid values: HttpGet, Exec, gRPC, and TCPSocket.
    --readiness-probe-action-option  The connection method of readiness probes.
    --readiness-probe-option         The readiness probe configuration.
    --image                          The address of the inference service image.

    Expected output:

    service/huggingface-v1 created
    deployment.apps/huggingface-v1-custom-serving created
    INFO[0003] The Job huggingface has been submitted successfully 
    INFO[0003] You can run `arena serve get huggingface --type custom-serving -n default` to check the job status 

    The output indicates that the inference service is deployed.

  3. Run the following command to query the details of the inference service.

    Model downloading is time-consuming. To view the details of the inference service, wait about 10 minutes after the service is deployed.

    arena serve get huggingface

    Expected output:

    Name:       huggingface
    Namespace:  default
    Type:       Custom
    Version:    v1
    Desired:    1
    Available:  1
    Age:        1h
    Address:    172.16.XX.XX
    Port:       RESTFUL:8000
    GPU:        1
    
    Instances:
      NAME                                           STATUS   AGE  READY  RESTARTS  GPU  NODE
      ----                                           ------   ---  -----  --------  ---  ----
      huggingface-v1-custom-serving-dcf6cf6c8-2lqzr  Running  1h   1/1    0         1    cn-beijing.192.168.XX.XX

    The output indicates that a pod (huggingface-v1-custom-serving-dcf6cf6c8-2lqzr) is deployed for the inference service and ready to provide services.
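
If the pod cannot become ready because the model download fails, verify network access to HuggingFace from inside the cluster. The following is a minimal sketch that starts a temporary pod; the curlimages/curl image is an assumption and can be replaced with any image that contains curl:

# Launch a temporary pod and check whether huggingface.co is reachable from within the cluster.
kubectl run hf-connectivity-test --rm -it --restart=Never \
    --image=curlimages/curl -- curl -sI https://huggingface.co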

Step 2: Verify the inference service

  1. Run the following command to set up port forwarding between the inference service and local environment.

    Important

    Port forwarding set up by using kubectl port-forward is not reliable, secure, or scalable in production environments. It is only for development and debugging. Do not use this command to set up port forwarding in production environments. For more information about networking solutions used for production in ACK clusters, see Ingress overview.

    kubectl port-forward svc/huggingface-v1 8000:8000

    Expected output:

    Forwarding from 127.0.0.1:8000 -> 8000
    Forwarding from [::1]:8000 -> 8000
  2. Run the following command to send a request to the HuggingFace inference service:

    curl -XPOST http://localhost:8000/generate -H "Content-Type: application/json"  -d '{"text_input": "What is AI? AI is", "parameters": {"stream": false, "temperature": 0.9, "seed": 10}}'

    Expected output:

    {"model_name":"Qwen/Qwen1.5-4B-Chat","text_output":"What is AI? AI is a branch of computer science that seeks to create machines to simulate human intelligence."}

    The output indicates that the model can provide the definition of AI.

(Optional) Step 3: Clear the inference service

If you no longer need the resources, run the following command to delete the inference service:

arena serve del huggingface

Deploy a local model

Step 1: Download a model file

This section uses the Qwen1.5-4B-Chat model as an example to demonstrate how to download a model, upload it to Object Storage Service (OSS), and create the corresponding persistent volume (PV) and persistent volume claim (PVC) in an ACK cluster.

  1. Download the model file.

    1. Run the following command to install Git:

      # Run yum install git or apt install git. 
      yum install git
    2. Run the following command to install the Git Large File Support (LFS) plug-in:

      # Run yum install git-lfs or apt install git-lfs. 
      yum install git-lfs
    3. Run the following command to clone the Qwen1.5-4B-Chat repository on ModelScope to the local environment:

      GIT_LFS_SKIP_SMUDGE=1 git clone https://www.modelscope.cn/qwen/Qwen1.5-4B-Chat.git
    4. Run the following command to enter the Qwen1.5-4B-Chat directory and pull large files managed by LFS:

      cd Qwen1.5-4B-Chat
      git lfs pull
  2. Upload the Qwen1.5-4B-Chat model file to OSS.

    1. Log on to the OSS console and record the name of the OSS bucket that you created.

      For more information about how to create an OSS bucket, see Create a bucket.

    2. Install and configure ossutil to manage OSS resources. For more information, see Install ossutil.

    3. Run the following command to create a directory named Qwen1.5-4B-Chat in OSS:

      ossutil mkdir oss://<Your-Bucket-Name>/Qwen1.5-4B-Chat
    4. Run the following command to upload the model file to OSS:

      ossutil cp -r ./Qwen1.5-4B-Chat oss://<Your-Bucket-Name>/Qwen1.5-4B-Chat
  3. Configure PVs and PVCs in the destination cluster. For more information, see Mount a statically provisioned OSS volume.

    • The following table describes the parameters of the PV.

      Parameter           Description
      ---------           -----------
      PV Type             OSS
      Volume Name         llm-model
      Access Certificate  Specify the AccessKey ID and AccessKey secret used to access the OSS bucket.
      Bucket ID           Specify the name of the OSS bucket that you created.
      OSS Path            Select the path of the model, such as /Qwen1.5-4B-Chat (the directory to which you uploaded the model file in the previous step).

    • The following table describes the parameters of the PVC.

      Parameter           Description
      ---------           -----------
      PVC Type            OSS
      Volume Name         llm-model
      Allocation Mode     Select Existing Volumes.
      Existing Volumes    Click the Existing Volumes hyperlink and select the PV that you created.
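
If you prefer to create the PV and PVC with kubectl instead of the console, the following is a minimal YAML sketch modeled on the statically provisioned OSS volume pattern referenced above. The Secret name, OSS endpoint, mount options, and capacity are assumptions that you must adjust to your environment; see Mount a statically provisioned OSS volume for the authoritative field reference.

# Create a Secret that stores the AccessKey pair, then a PV and a PVC backed by the OSS bucket.
# Replace <Your-Bucket-Name>, the endpoint, and the AccessKey values with your own.
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Secret
metadata:
  name: oss-secret                  # assumed name; referenced by the PV below
  namespace: default
stringData:
  akId: <Your-AccessKey-ID>
  akSecret: <Your-AccessKey-Secret>
---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: llm-model
  labels:
    alicloud-pvname: llm-model
spec:
  capacity:
    storage: 30Gi                   # assumed size; must satisfy the PVC request
  accessModes:
    - ReadOnlyMany
  persistentVolumeReclaimPolicy: Retain
  csi:
    driver: ossplugin.csi.alibabacloud.com
    volumeHandle: llm-model         # must be the same as the PV name
    nodePublishSecretRef:
      name: oss-secret
      namespace: default
    volumeAttributes:
      bucket: <Your-Bucket-Name>
      url: oss-cn-beijing-internal.aliyuncs.com   # assumed endpoint; use the endpoint of your bucket's region
      otherOpts: "-o umask=022 -o max_stat_cache_size=0 -o allow_other"
      path: /Qwen1.5-4B-Chat
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: llm-model
  namespace: default
spec:
  accessModes:
    - ReadOnlyMany
  resources:
    requests:
      storage: 30Gi
  selector:
    matchLabels:
      alicloud-pvname: llm-model
EOF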

Step 2: Deploy an inference service

  1. Run the following command to use Arena to deploy a custom service. The name of the service is local-model and its version is v1.

    The --data parameter mounts the PVC named llm-model that you created to the /model/Qwen1.5-4B-Chat directory in the container. After the application launches, it loads the model from this directory. To load a different local model, modify the MODEL_ID startup parameter.

    arena serve custom \
        --name=local-model \
        --version=v1 \
        --gpus=1 \
        --replicas=1 \
        --restful-port=8000 \
        --readiness-probe-action="tcpSocket" \
        --readiness-probe-action-option="port: 8000" \
        --readiness-probe-option="initialDelaySeconds: 30" \
        --readiness-probe-option="periodSeconds: 30" \
        --data=llm-model:/model/Qwen1.5-4B-Chat \
        --image=kube-ai-registry.cn-shanghai.cr.aliyuncs.com/kube-ai/modelscope:v1 \
        "MODEL_ID=/model/Qwen1.5-4B-Chat python3 server.py"

    The following table describes the parameters.

    Parameter                        Description
    ---------                        -----------
    --name                           The name of the inference service.
    --version                        The version of the inference service.
    --gpus                           The number of GPUs for each inference service replica.
    --replicas                       The number of inference service replicas.
    --restful-port                   The port of the inference service to be exposed.
    --readiness-probe-action         The connection type of readiness probes. Valid values: HttpGet, Exec, gRPC, and TCPSocket.
    --readiness-probe-action-option  The connection method of readiness probes.
    --readiness-probe-option         The readiness probe configuration.
    --data                           Mount a shared PVC to the runtime environment. The value consists of two
                                     parts separated by a colon (:). The left side specifies the name of the
                                     PVC. You can run the arena data list command to query the PVCs that exist
                                     in the cluster. The right side specifies the path in the container to
                                     which the PVC is mounted, which is also the local path from which your
                                     script loads the training data or model. This way, your script can access
                                     the data or model stored in the corresponding PV.
    --image                          The address of the inference service image.

    Expected output:

    service/local-model-v1 created
    deployment.apps/local-model-v1-custom-serving created
    INFO[0001] The Job local-model has been submitted successfully
    INFO[0001] You can run `arena serve get local-model --type custom-serving -n default` to check the job status

    The output indicates that the inference service is deployed.

  2. Run the following command to query the details of the inference service.

    Loading the model from the mounted volume takes some time. To view the details of the inference service, wait about 10 minutes after the service is deployed.

    arena serve get local-model

    Expected output:

    Name:       local-model
    Namespace:  default
    Type:       Custom
    Version:    v1
    Desired:    1
    Available:  1
    Age:        1m
    Address:    172.16.XX.XX
    Port:       RESTFUL:8000
    GPU:        1
    
    Instances:
      NAME                                            STATUS   AGE  READY  RESTARTS  GPU  NODE
      ----                                            ------   ---  -----  --------  ---  ----
      local-model-v1-custom-serving-8458fb6cf6-6mvzp  Running  1m   1/1    0         1    cn-beijing.192.168.XX.XX

    The output indicates that a pod (local-model-v1-custom-serving-8458fb6cf6-6mvzp) is deployed for the inference service and ready to provide services.
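
Before you send requests, you can also confirm that the model files are visible inside the container at the mount path specified by --data. A quick check that uses the Deployment name from the preceding output and assumes the default namespace:

# List the model files mounted from the OSS-backed PVC inside the serving container.
kubectl exec deployment/local-model-v1-custom-serving -n default -- ls -lh /model/Qwen1.5-4B-Chat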

Step 3: Verify the inference service

  1. Run the following command to set up port forwarding between the inference service and local environment:

    Important

    Port forwarding set up by using kubectl port-forward is not reliable, secure, or scalable in production environments. It is only for development and debugging. Do not use this command to set up port forwarding in production environments. For more information about networking solutions used for production in ACK clusters, see Ingress overview.

    kubectl port-forward svc/local-model-v1 8000:8000

    Expected output:

    Forwarding from 127.0.0.1:8000 -> 8000
    Forwarding from [::1]:8000 -> 8000
  2. Run the following command to send a request to the local-model inference service:

    curl -XPOST http://localhost:8000/generate -H "Content-Type: application/json"  -d '{"text_input": "What is AI? AI is", "parameters": {"stream": false, "temperature": 0.9, "seed": 10}}'

    Expected output:

    {"model_name":"/model/Qwen1.5-4B-Chat","text_output":"What is AI? AI is a branch of computer science that studies how to make computers intelligent."}

    The output indicates that the model can provide the definition of AI.

(Optional) Step 4: Clear the environment

If you no longer need the resources, clear the environment promptly.

  • Run the following command to delete the inference service:

    arena serve del local-model
  • Run the following command to delete the PV and PVC:

    kubectl delete pvc llm-model
    kubectl delete pv llm-model
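
To confirm that the volume resources are deleted, you can query them again. Both commands below should return a NotFound error:

# Verify that the PVC and the PV no longer exist.
kubectl get pvc llm-model
kubectl get pv llm-model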