KServe, formerly known as KFServing, is a model serving and inference platform for cloud-native environments. It supports automatic scaling, scale-to-zero, and canary deployments. Service Mesh (ASM) integrates the capabilities of the Knative Serving component that is deployed in a Container Service for Kubernetes (ACK) cluster or a serverless Kubernetes (ASK) cluster, and provides the KServe on ASM feature so that you can integrate KServe with ASM for AI Serving with a few clicks. Fluid is an open source, Kubernetes-native distributed dataset orchestrator and accelerator for data-intensive applications in cloud-native scenarios, such as big data and AI applications. You can integrate Fluid with the KServe on ASM feature to accelerate model loading. This topic describes how to combine the KServe on ASM feature with Fluid to implement AI Serving with accelerated data access.
Prerequisites
An ASM instance of version 1.17 or later is created, and a Kubernetes cluster is added to the ASM instance. For more information, see Create an ASM instance and Add a cluster to an ASM instance.
Note: For more information about how to update an ASM instance, see Update an ASM instance.
For the Kubernetes cluster:
If an ACK cluster is used, the version of the cluster must be 1.22 or later. For more information, see Create an ACK managed cluster or Update an ACK cluster. If you use a graphics processing unit (GPU) for AI Serving, make sure that the ACK cluster contains GPU-accelerated nodes, such as ecs.gn6i-c16g1.4xlarge.
If an ASK cluster is used, the version of the cluster must be 1.18 or later, and the CoreDNS component must be installed in the cluster. For more information, see Create an ACK Serverless cluster and Manage system components.
The feature of using the Kubernetes API of clusters on the data plane to access Istio resources is enabled for the ASM instance. For more information, see Use the Kubernetes API of clusters on the data plane to access Istio resources.
An ingress gateway is created for the cluster. In this example, an ASM ingress gateway is created for the cluster. The ASM ingress gateway uses the default name ingressgateway, and ports 80 and 443 are exposed. For more information, see Create an ingress gateway.
The Knative Serving component is deployed in the ACK or ASK cluster and the Knative on ASM feature is enabled. For more information, see Use Knative on ASM to deploy a serverless application.
Object Storage Service (OSS) is activated, and a bucket is created. For more information, see Activate OSS and Create buckets.
Step 1: Enable the KServe on ASM feature
Log on to the ASM console. In the left-side navigation pane, choose Service Mesh > Mesh Management.
On the Mesh Management page, click the name of the ASM instance. In the left-side navigation pane, go to the KServe on ASM page.
On the KServe on ASM page, turn on or off the Automatically install the CertManager component in the cluster switch and click Enable KServe on ASM.
cert-manager is a certificate lifecycle management system that is used to issue and deploy certificates. The deployment and use of the KServe on ASM feature depend on the CertManager component, which can be installed automatically when you enable KServe on ASM. If you are not sure whether CertManager is already installed in the cluster, you can check as shown after the following list.
If you have not installed CertManager in the cluster, turn on Automatically install the CertManager component in the cluster.
If you have installed CertManager in the cluster on the data plane, turn off Automatically install the CertManager component in the cluster.
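A quick check such as the following can help you decide how to set the switch. Run it with kubectl against the data-plane cluster; cert-manager is commonly, but not always, installed in the cert-manager namespace, so the check filters across all namespaces:
# List pods in all namespaces and filter for cert-manager components.
kubectl get pods -A | grep cert-manager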
Step 2: Install the ack-fluid component and enable AI model caching and acceleration
Deploy the ack-fluid component of version 0.9.10 or later in the cluster.
If the cluster on the data plane is an ACK cluster, install the cloud-native AI suite and deploy the ack-fluid component in the cluster.
Note: If you have installed open source Fluid, you must uninstall it before you can install the ack-fluid component.
If you have not installed the cloud-native AI suite, enable Fluid acceleration when you install the suite. For more information, see Deploy the cloud-native AI suite.
If you have already installed the cloud-native AI suite, log on to the ACK console, click the desired cluster, and choose to deploy the ack-fluid component.
If the cluster on the data plane is an ASK cluster, deploy the ack-fluid component in the cluster. For more information, see the Deploy the control plane components of Fluid section of the Accelerate Jobs topic.
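After the ack-fluid component is deployed by either method, you can verify that the Fluid control plane components are running. The following check assumes the fluid-system namespace, which is the namespace that the component is typically installed in:
# The Fluid controller and webhook pods should be in the Running state.
kubectl get pods -n fluid-system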
Prepare an AI model and upload it to the OSS bucket.
Prepare an AI SavedModel.
This topic uses the BLOOM model, which is an open source transformer large language model (LLM) based on PyTorch. For more information about the model data, see Hugging Face.
Upload the downloaded model data files to the OSS bucket and record their storage location.
The storage location is in the format oss://{bucket}/{path}. For example, if you create a bucket named fluid-demo and upload all the model data files to the models/bloom directory in the bucket, the storage location of the model data files is oss://fluid-demo/models/bloom.
Note: You can use the ossutil tool provided by OSS to upload data. For more information, see Install ossutil.
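For example, assuming the model files were downloaded to a local directory named bloom-560m (a hypothetical local path used only for illustration), an upload with ossutil might look like the following. Adjust the source and destination so that the model files end up under oss://fluid-demo/models/bloom:
# Recursively upload the local model directory to the example OSS location.
ossutil cp -r ./bloom-560m oss://fluid-demo/models/bloom/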
Create a namespace for deploying Fluid cache and AI Serving, and configure OSS access permissions.
Use kubectl to connect to the cluster on the data plane. For more information, see Connect to an ACK cluster by using kubectl.
Run the following command to create a namespace named kserve-fluid-demo for deploying the Fluid cache and KServe-based AI Serving:
kubectl create ns kserve-fluid-demo
Create a file named oss-secret.yaml and copy the following content to the file.
In the file, fs.oss.accessKeyId is the AccessKey ID of an account that can access OSS, and fs.oss.accessKeySecret is the AccessKey secret of the account.
apiVersion: v1
kind: Secret
metadata:
  name: access-key
stringData:
  fs.oss.accessKeyId: xxx # Replace the value with the AccessKey ID of an Alibaba Cloud account that can access OSS.
  fs.oss.accessKeySecret: xxx # Replace the value with the AccessKey secret of the Alibaba Cloud account.
Run the following command to apply the OSS AccessKey pair:
kubectl apply -f oss-secret.yaml -n kserve-fluid-demo
Declare the AI model data that you want to access in Fluid.
You must submit a Dataset custom resource (CR) and a JindoRuntime CR. The Dataset CR describes the URL of the data in the external storage system. The JindoRuntime CR describes the cache system and its specific configuration.
Create a file named oss-jindo.yaml and copy the following content to the file.
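A minimal sketch of such a file, based on the Fluid Dataset and JindoRuntime APIs with an in-memory cache, might look like the following. The mount name and path (bloom-560m) and the per-worker 50Gi cache quota are example values chosen to match the rest of this walkthrough; adjust them to your model and cluster.
apiVersion: data.fluid.io/v1alpha1
kind: Dataset
metadata:
  name: oss-data
spec:
  mounts:
    - mountPoint: "oss://{bucket}/{path}" # Replace with the storage location that you recorded when you uploaded the model.
      name: bloom-560m
      path: "/bloom-560m"
      options:
        fs.oss.endpoint: "{endpoint}" # Replace with an endpoint that can be used to access OSS.
      encryptOptions:
        - name: fs.oss.accessKeyId
          valueFrom:
            secretKeyRef:
              name: access-key
              key: fs.oss.accessKeyId
        - name: fs.oss.accessKeySecret
          valueFrom:
            secretKeyRef:
              name: access-key
              key: fs.oss.accessKeySecret
  accessModes:
    - ReadOnlyMany
---
apiVersion: data.fluid.io/v1alpha1
kind: JindoRuntime
metadata:
  name: oss-data
spec:
  replicas: 2
  tieredstore:
    levels:
      - mediumtype: MEM # Cache the model data in memory; an assumption made for this sketch.
        path: /dev/shm
        quota: 50Gi # Two workers with 50Gi each roughly match the 100GiB cache capacity shown in the expected output below.
        high: "0.99"
        low: "0.95"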
In the Dataset CR, replace oss://{bucket}/{path} with the storage location of the model data files that you recorded in step 2.b, and replace {endpoint} with an endpoint that can be used to access OSS. For more information about the endpoints that can be used to access OSS in different regions, see Regions and endpoints.
Run the following command to deploy the Dataset CR and the JindoRuntime CR:
kubectl create -f oss-jindo.yaml -n kserve-fluid-demo
Run the following command to check the deployment of the Dataset CR and the JindoRuntime CR:
kubectl get jindoruntime,dataset -n kserve-fluid-demo
Expected output:
NAME                                  MASTER PHASE   WORKER PHASE   FUSE PHASE   AGE
jindoruntime.data.fluid.io/oss-data   Ready          Ready          Ready        3m

NAME                             UFS TOTAL SIZE   CACHED   CACHE CAPACITY   CACHED PERCENTAGE   PHASE   AGE
dataset.data.fluid.io/oss-data   3.14GiB          0.00B    100.00GiB        0.0%                Bound   3m
The output shows that the PHASE of the Dataset CR is Bound and the FUSE PHASE of the JindoRuntime CR is Ready. This indicates that the Dataset CR and the JindoRuntime CR are deployed.
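After the Dataset is Bound, Fluid exposes it as a PersistentVolumeClaim (PVC) with the same name in the namespace. This PVC is what the inference service in Step 3 mounts through the pvc://oss-data/bloom-560m storage URI, and you can confirm that it exists:
# The PVC is created automatically by Fluid for the bound Dataset.
kubectl get pvc oss-data -n kserve-fluid-demo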
Prefetch data in Fluid to improve data access performance.
Create a file named oss-dataload.yaml and copy the following content to the file:
apiVersion: data.fluid.io/v1alpha1
kind: DataLoad
metadata:
  name: oss-dataload
spec:
  dataset:
    name: oss-data
    namespace: kserve-fluid-demo
  target:
    - path: /bloom-560m
      replicas: 2
Run the following command to deploy Dataload to prefetch data:
kubectl create -f oss-dataload.yaml -n kserve-fluid-demo
Run the following command to query the progress of data prefetching:
kubectl get dataload -n kserve-fluid-demo
Expected output:
NAME           DATASET    PHASE      AGE   DURATION
oss-dataload   oss-data   Complete   1m    45s
The output indicates that the data prefetching took about 45s. Wait until the data prefetching is complete before you proceed.
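If the PHASE is still Pending or Executing, you can watch the DataLoad until it reaches Complete, for example:
# Watch the DataLoad object; press Ctrl+C to stop watching once PHASE shows Complete.
kubectl get dataload oss-dataload -n kserve-fluid-demo -w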
Step 3: Deploy an inference service based on the AI model
Create a file named oss-fluid-isvc.yaml and copy the following content to the file based on your cluster.
ACK Cluster
apiVersion: "serving.kserve.io/v1beta1" kind: "InferenceService" metadata: name: "fluid-bloom" spec: predictor: timeout: 600 minReplicas: 0 containers: - name: kserve-container image: registry.cn-hangzhou.aliyuncs.com/acs/kserve-fluid:bloom-gpu resources: limits: cpu: "12" memory: 48Gi nvidia.com/gpu: 1 # If GPUs are used, set the value to the number of GPUs required. Otherwise, you do not need to set the value. requests: cpu: "12" memory: 48Gi env: - name: STORAGE_URI value: "pvc://oss-data/bloom-560m" - name: MODEL_NAME value: "bloom" # Set this parameter to True if GPUs are used. Otherwise, set this parameter to False. - name: GPU_ENABLED value: "True"
ASK Cluster
apiVersion: "serving.kserve.io/v1beta1" kind: "InferenceService" metadata: name: "fluid-bloom" labels: alibabacloud.com/fluid-sidecar-target: "eci" annotations: k8s.aliyun.com/eci-use-specs : "ecs.gn6i-c16g1.4xlarge" # Replace the value with the specifications of the used Elastic Compute Service (ECS) instance. knative.aliyun.com/reserve-instance-eci-use-specs: "ecs.gn6i-c16g1.4xlarge" # Replace the value with the specifications of the used ECS instance. spec: predictor: timeout: 600 minReplicas: 0 containers: - name: kserve-container image: registry.cn-hangzhou.aliyuncs.com/acs/kserve-fluid:bloom-gpu resources: limits: cpu: "12" memory: 48Gi requests: cpu: "12" memory: 48Gi env: - name: STORAGE_URI value: "pvc://oss-data/bloom-560m" - name: MODEL_NAME value: "bloom" # Set this parameter to True if GPUs are used. Otherwise, set this parameter to False. - name: GPU_ENABLED value: "True"
Note:
In this topic, an LLM is used. Therefore, 12 CPU cores and 48 GiB of memory are requested. Modify the resources field of the InferenceService based on the load of your cluster.
In this example, the image field is set to the registry.cn-hangzhou.aliyuncs.com/acs/kserve-fluid:bloom-gpu sample image. This image provides the interfaces for loading the model and serving inference requests. You can view the code of this sample image in the KServe open source community and customize the image. For more information, see Docker.
Run the following command to deploy the inference service based on the AI model:
kubectl create -f oss-fluid-isvc.yaml -n kserve-fluid-demo
Run the following command to check the deployment of the inference service based on the AI model:
kubectl get inferenceservice -n kserve-fluid-demo
Expected output:
NAME          URL                                                READY   PREV   LATEST   PREVROLLEDOUTREVISION   LATESTREADYREVISION           AGE
fluid-bloom   http://fluid-bloom.kserve-fluid-demo.example.com   True           100                              fluid-bloom-predictor-00001   2d
In the output, the READY field is True, which indicates that the inference service based on the AI model is deployed.
Step 4: Access the inference service based on the AI model
Obtain the address of the ASM ingress gateway.
Log on to the ASM console. In the left-side navigation pane, choose Service Mesh > Mesh Management.
On the Mesh Management page, click the name of the ASM instance. In the left-side navigation pane, go to the ASM gateway management page.
In the Service address section of ingressgateway, view and obtain the service address of the ASM gateway.
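Alternatively, you can query the address from the data-plane cluster. The Service name and namespace in the following command assume the default deployment of the ASM ingress gateway, which is exposed through the istio-ingressgateway Service in the istio-system namespace; adjust them if your deployment differs:
# Print the public IP address of the ingress gateway's LoadBalancer Service.
kubectl get svc istio-ingressgateway -n istio-system -o jsonpath='{.status.loadBalancer.ingress[0].ip}'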
Run the following command to access the fluid-bloom inference service that is based on the sample AI model. Replace {service address of the ASM gateway} with the address that you obtained in the previous substep:
curl -v -H "Content-Type: application/json" -H "Host: fluid-bloom.kserve-fluid-demo.example.com" "http://{service address of the ASM gateway}:80/v1/models/bloom:predict" -d '{"prompt": "It was a dark and stormy night", "result_length": 50}'
Expected output:
*   Trying xxx.xx.xx.xx:80...
* Connected to xxx.xx.xx.xx (xxx.xx.xx.xx) port 80 (#0)
> POST /v1/models/bloom:predict HTTP/1.1
> Host: fluid-bloom-predictor.kserve-fluid-demo.example.com
> User-Agent: curl/7.84.0
> Accept: */*
> Content-Type: application/json
> Content-Length: 65
>
* Mark bundle as not supporting multiuse
< HTTP/1.1 200 OK
< content-length: 227
< content-type: application/json
< date: Thu, 20 Apr 2023 09:49:00 GMT
< server: istio-envoy
< x-envoy-upstream-service-time: 1142
<
{
  "result": "It was a dark and stormy night, and the wind was blowing in the\ndirection of the west. The wind was blowing in the direction of the\nwest, and the wind was blowing in the direction of the west. The\nwind was"
}
* Connection #0 to host xxx.xx.xx.xx left intact
The output indicates that the inference service continued the sample prompt and returned the inference result.