KServe, formerly known as KFServing, is a model serving and inference platform for cloud-native environments. It supports automatic scaling, scale-to-zero, and canary deployments. Service Mesh (ASM) integrates the capabilities of the Knative Serving component that is deployed in a Container Service for Kubernetes (ACK) cluster or a serverless Kubernetes (ASK) cluster, and provides the KServe on ASM feature so that you can integrate KServe with ASM for AI Serving with a few clicks. Fluid is an open source, Kubernetes-native distributed dataset orchestrator and accelerator for data-intensive applications in cloud-native scenarios, such as big data and AI applications. You can integrate Fluid with the KServe on ASM feature to accelerate model loading. This topic describes how to combine the KServe on ASM feature with Fluid to implement AI Serving with accelerated data access.
Prerequisites
An ASM instance of version 1.17 or later is created, and a Kubernetes cluster is added to the ASM instance. For more information, see Create an ASM instance and Add a cluster to an ASM instance.
Note: For more information about how to update an ASM instance, see Update an ASM instance.
For the Kubernetes cluster:
If an ACK cluster is used, the version of the cluster must be 1.22 or later. For more information, see Create an ACK managed cluster or Update an ACK cluster. If you use a graphics processing unit (GPU) for AI Serving, make sure that the ACK cluster contains GPU-accelerated nodes, such as ecs.gn6i-c16g1.4xlarge.
If an ASK cluster is used, the version of the cluster must be 1.18 or later, and the CoreDNS component must be installed in the cluster. For more information, see Create an ACK Serverless cluster and Manage system components.
The feature of using the Kubernetes API of clusters on the data plane to access Istio resources is enabled for the ASM instance. For more information, see Use the Kubernetes API of clusters on the data plane to access Istio resources.
An ingress gateway is created for the cluster. In this example, an ASM ingress gateway is created for the cluster. The ASM ingress gateway uses the default name ingressgateway, and ports 80 and 443 are exposed. For more information, see Create an ingress gateway.
The Knative Serving component is deployed in the ACK or ASK cluster and the Knative on ASM feature is enabled. For more information, see Use Knative on ASM to deploy a serverless application.
Object Storage Service (OSS) is activated, and a bucket is created. For more information, see Activate OSS and Create buckets.
Step 1: Enable the KServe on ASM feature
Log on to the ASM console. In the left-side navigation pane, choose Service Mesh > Mesh Management.
On the Mesh Management page, click the name of the ASM instance. In the left-side navigation pane, go to the KServe on ASM page.
On the KServe on ASM page, turn on or off the Automatically install the CertManager component in the cluster switch and click Enable KServe on ASM.
cert-manager is a certificate lifecycle management system that is used to issue and deploy certificates. The deployment and use of the KServe on ASM feature depend on the CertManager component, which can be installed automatically when you enable KServe on ASM. If you are not sure whether CertManager is already installed in the cluster, you can check as shown after the following list.
If you have not installed CertManager in the cluster, turn on Automatically install the CertManager component in the cluster.
If you have installed CertManager in the cluster on the data plane, turn off Automatically install the CertManager component in the cluster.
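A quick check such as the following can help you decide how to set the switch. Run it with kubectl against the data-plane cluster; cert-manager is commonly, but not always, installed in the cert-manager namespace, so the check filters across all namespaces:
# List pods in all namespaces and filter for cert-manager components.
kubectl get pods -A | grep cert-manager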
Step 2: Install the ack-fluid component and enable AI model caching and acceleration
Deploy the ack-fluid component of version 0.9.10 or later in the cluster.
If the cluster on the data plane is an ACK cluster, install the cloud-native AI suite and deploy the ack-fluid component in the cluster.
Note: If you have installed open source Fluid, you must uninstall it before you can install the ack-fluid component.
If you have not installed the cloud-native AI suite, enable Fluid acceleration when you install the suite. For more information, see Deploy the cloud-native AI suite.
If you have already installed the cloud-native AI suite, log on to the ACK console, click the desired cluster, and choose to deploy the ack-fluid component.
If the cluster on the data plane is an ASK cluster, deploy the ack-fluid component in the cluster. For more information, see the Deploy the control plane components of Fluid section of the Accelerate Jobs topic.
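After the ack-fluid component is deployed by either method, you can verify that the Fluid control plane components are running. The following check assumes the fluid-system namespace, which is the namespace that the component is typically installed in:
# The Fluid controller and webhook pods should be in the Running state.
kubectl get pods -n fluid-system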
Prepare an AI model and upload it to the OSS bucket.
Prepare an AI SavedModel.
This topic uses the BLOOM model, which is an open source transformer large language model (LLM) based on PyTorch. For more information about the model data, see Hugging Face.
Upload the downloaded model data files to the OSS bucket and record their storage location.
The storage location is in the format oss://{bucket}/{path}. For example, if you create a bucket named fluid-demo and upload all the model data files to the models/bloom directory in the bucket, the storage location of the model data files is oss://fluid-demo/models/bloom.
Note: You can use the ossutil tool provided by OSS to upload data. For more information, see Install ossutil.
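For example, assuming the model files were downloaded to a local directory named bloom-560m (a hypothetical local path used only for illustration), an upload with ossutil might look like the following. Adjust the source and destination so that the model files end up under oss://fluid-demo/models/bloom:
# Recursively upload the local model directory to the example OSS location.
ossutil cp -r ./bloom-560m oss://fluid-demo/models/bloom/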
Create a namespace for deploying Fluid cache and AI Serving, and configure OSS access permissions.
Use kubectl to connect to the cluster on the data plane. For more information, see Connect to an ACK cluster by using kubectl.
Run the following command to create a namespace named kserve-fluid-demo for deploying the Fluid cache and KServe-based AI Serving:
kubectl create ns kserve-fluid-demo
Create a file named oss-secret.yaml and copy the following content to the file.
In the file, fs.oss.accessKeyId is the AccessKey ID of an account that can access OSS, and fs.oss.accessKeySecret is the AccessKey secret of the account.
apiVersion: v1
kind: Secret
metadata:
  name: access-key
stringData:
  fs.oss.accessKeyId: xxx # Replace the value with the AccessKey ID of an Alibaba Cloud account that can access OSS.
  fs.oss.accessKeySecret: xxx # Replace the value with the AccessKey secret of the Alibaba Cloud account.
Run the following command to apply the OSS AccessKey pair:
kubectl apply -f oss-secret.yaml -n kserve-fluid-demo
Declare the AI model data that you want to access in Fluid.
You must submit a Dataset custom resource (CR) and a JindoRuntime CR. The Dataset CR describes the URL of the data in the external storage system. The JindoRuntime CR describes the cache system and its specific configuration.
Create a file named oss-jindo.yaml and copy the following content to the file.
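A minimal sketch of such a file, based on the Fluid Dataset and JindoRuntime APIs with an in-memory cache, might look like the following. The mount name and path (bloom-560m) and the per-worker 50Gi cache quota are example values chosen to match the rest of this walkthrough; adjust them to your model and cluster.
apiVersion: data.fluid.io/v1alpha1
kind: Dataset
metadata:
  name: oss-data
spec:
  mounts:
    - mountPoint: "oss://{bucket}/{path}" # Replace with the storage location that you recorded when you uploaded the model.
      name: bloom-560m
      path: "/bloom-560m"
      options:
        fs.oss.endpoint: "{endpoint}" # Replace with an endpoint that can be used to access OSS.
      encryptOptions:
        - name: fs.oss.accessKeyId
          valueFrom:
            secretKeyRef:
              name: access-key
              key: fs.oss.accessKeyId
        - name: fs.oss.accessKeySecret
          valueFrom:
            secretKeyRef:
              name: access-key
              key: fs.oss.accessKeySecret
  accessModes:
    - ReadOnlyMany
---
apiVersion: data.fluid.io/v1alpha1
kind: JindoRuntime
metadata:
  name: oss-data
spec:
  replicas: 2
  tieredstore:
    levels:
      - mediumtype: MEM # Cache the model data in memory; an assumption made for this sketch.
        path: /dev/shm
        quota: 50Gi # Two workers with 50Gi each roughly match the 100GiB cache capacity shown in the expected output below.
        high: "0.99"
        low: "0.95"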
In the Dataset CR, replace oss://{bucket}/{path} with the storage location of the model data files that you recorded in step 2.b, and replace {endpoint} with an endpoint that can be used to access OSS. For more information about the endpoints that can be used to access OSS in different regions, see Regions and endpoints.
Run the following command to deploy the Dataset CR and the JindoRuntime CR:
kubectl create -f oss-jindo.yaml -n kserve-fluid-demo
Run the following command to check the deployment of the Dataset CR and the JindoRuntime CR:
kubectl get jindoruntime,dataset -n kserve-fluid-demo
Expected output:
NAME                                  MASTER PHASE   WORKER PHASE   FUSE PHASE   AGE
jindoruntime.data.fluid.io/oss-data   Ready          Ready          Ready        3m

NAME                             UFS TOTAL SIZE   CACHED   CACHE CAPACITY   CACHED PERCENTAGE   PHASE   AGE
dataset.data.fluid.io/oss-data   3.14GiB          0.00B    100.00GiB        0.0%                Bound   3m
The output shows that the PHASE of the Dataset CR is Bound and the FUSE PHASE of the JindoRuntime CR is Ready. This indicates that the Dataset CR and the JindoRuntime CR are deployed.
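After the Dataset is Bound, Fluid exposes it as a PersistentVolumeClaim (PVC) with the same name in the namespace. This PVC is what the inference service in Step 3 mounts through the pvc://oss-data/bloom-560m storage URI, and you can confirm that it exists:
# The PVC is created automatically by Fluid for the bound Dataset.
kubectl get pvc oss-data -n kserve-fluid-demo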
Prefetch data in Fluid to improve data access performance.
Create a file named oss-dataload.yaml and copy the following content to the file:
apiVersion: data.fluid.io/v1alpha1
kind: DataLoad
metadata:
  name: oss-dataload
spec:
  dataset:
    name: oss-data
    namespace: kserve-fluid-demo
  target:
    - path: /bloom-560m
      replicas: 2
Run the following command to deploy Dataload to prefetch data:
kubectl create -f oss-dataload.yaml -n kserve-fluid-demo
Run the following command to query the progress of data prefetching:
kubectl get dataload -n kserve-fluid-demo
Expected output:
NAME           DATASET    PHASE      AGE   DURATION
oss-dataload   oss-data   Complete   1m    45s
The output indicates that the data prefetching took about 45s. Wait until the data prefetching is complete before you proceed.
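If the PHASE is still Pending or Executing, you can watch the DataLoad until it reaches Complete, for example:
# Watch the DataLoad object; press Ctrl+C to stop watching once PHASE shows Complete.
kubectl get dataload oss-dataload -n kserve-fluid-demo -w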
Step 3: Deploy an inference service based on the AI model
Create a file named oss-fluid-isvc.yaml and copy the following content to the file based on your cluster.
ACK Cluster
apiVersion: "serving.kserve.io/v1beta1" kind: "InferenceService" metadata: name: "fluid-bloom" spec: predictor: timeout: 600 minReplicas: 0 containers: - name: kserve-container image: registry.cn-hangzhou.aliyuncs.com/acs/kserve-fluid:bloom-gpu resources: limits: cpu: "12" memory: 48Gi nvidia.com/gpu: 1 # If GPUs are used, set the value to the number of GPUs required. Otherwise, you do not need to set the value. requests: cpu: "12" memory: 48Gi env: - name: STORAGE_URI value: "pvc://oss-data/bloom-560m" - name: MODEL_NAME value: "bloom" # Set this parameter to True if GPUs are used. Otherwise, set this parameter to False. - name: GPU_ENABLED value: "True"
ASK Cluster
apiVersion: "serving.kserve.io/v1beta1" kind: "InferenceService" metadata: name: "fluid-bloom" labels: alibabacloud.com/fluid-sidecar-target: "eci" annotations: k8s.aliyun.com/eci-use-specs : "ecs.gn6i-c16g1.4xlarge" # Replace the value with the specifications of the used Elastic Compute Service (ECS) instance. knative.aliyun.com/reserve-instance-eci-use-specs: "ecs.gn6i-c16g1.4xlarge" # Replace the value with the specifications of the used ECS instance. spec: predictor: timeout: 600 minReplicas: 0 containers: - name: kserve-container image: registry.cn-hangzhou.aliyuncs.com/acs/kserve-fluid:bloom-gpu resources: limits: cpu: "12" memory: 48Gi requests: cpu: "12" memory: 48Gi env: - name: STORAGE_URI value: "pvc://oss-data/bloom-560m" - name: MODEL_NAME value: "bloom" # Set this parameter to True if GPUs are used. Otherwise, set this parameter to False. - name: GPU_ENABLED value: "True"
Note:
In this topic, an LLM is used. Therefore, 12 CPU cores and 48 GiB of memory are requested. Modify the resources field of the InferenceService based on the load of your cluster.
In this example, the image field is set to the registry.cn-hangzhou.aliyuncs.com/acs/kserve-fluid:bloom-gpu sample image. This image provides the interfaces for loading the model and serving inference requests. You can view the code of this sample image in the KServe open source community and customize the image. For more information, see Docker.
Run the following command to deploy the inference service based on the AI model:
kubectl create -f oss-fluid-isvc.yaml -n kserve-fluid-demo
Run the following command to check the deployment of the inference service based on the AI model:
kubectl get inferenceservice -n kserve-fluid-demo
Expected output:
NAME          URL                                                READY   PREV   LATEST   PREVROLLEDOUTREVISION   LATESTREADYREVISION           AGE
fluid-bloom   http://fluid-bloom.kserve-fluid-demo.example.com   True           100                              fluid-bloom-predictor-00001   2d
In the output, the READY field is True, which indicates that the inference service based on the AI model is deployed.
Step 4: Access the inference service based on the AI model
Obtain the address of the ASM ingress gateway.
Log on to the ASM console. In the left-side navigation pane, choose Service Mesh > Mesh Management.
On the Mesh Management page, click the name of the ASM instance. In the left-side navigation pane, go to the ASM gateway management page.
In the Service address section of ingressgateway, view and obtain the service address of the ASM gateway.
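Alternatively, you can query the address from the data-plane cluster. The Service name and namespace in the following command assume the default deployment of the ASM ingress gateway, which is exposed through the istio-ingressgateway Service in the istio-system namespace; adjust them if your deployment differs:
# Print the public IP address of the ingress gateway's LoadBalancer Service.
kubectl get svc istio-ingressgateway -n istio-system -o jsonpath='{.status.loadBalancer.ingress[0].ip}'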
Run the following command to access the fluid-bloom inference service that is based on the sample AI model. Replace {service address of the ASM gateway} with the address that you obtained in the previous substep:
curl -v -H "Content-Type: application/json" -H "Host: fluid-bloom.kserve-fluid-demo.example.com" "http://{service address of the ASM gateway}:80/v1/models/bloom:predict" -d '{"prompt": "It was a dark and stormy night", "result_length": 50}'
Expected output:
*   Trying xxx.xx.xx.xx:80...
* Connected to xxx.xx.xx.xx (xxx.xx.xx.xx) port 80 (#0)
> POST /v1/models/bloom:predict HTTP/1.1
> Host: fluid-bloom-predictor.kserve-fluid-demo.example.com
> User-Agent: curl/7.84.0
> Accept: */*
> Content-Type: application/json
> Content-Length: 65
>
* Mark bundle as not supporting multiuse
< HTTP/1.1 200 OK
< content-length: 227
< content-type: application/json
< date: Thu, 20 Apr 2023 09:49:00 GMT
< server: istio-envoy
< x-envoy-upstream-service-time: 1142
<
{
  "result": "It was a dark and stormy night, and the wind was blowing in the\ndirection of the west. The wind was blowing in the direction of the\nwest, and the wind was blowing in the direction of the west. The\nwind was"
}
* Connection #0 to host xxx.xx.xx.xx left intact
The output indicates that the inference service continued the sample prompt and returned the inference result.