Fluid allows you to use JindoRuntime to accelerate access to data stored in Object Storage Service (OSS) in serverless cloud computing scenarios. You can accelerate data access in cache mode and no cache mode. This topic describes how to accelerate Argo workflows in cache mode.
Prerequisites
Argo or the ack-workflow component is installed. For more information, see Argo or Argo Workflows.
Virtual nodes are deployed in an ACK Pro cluster. For more information, see Schedule pods to elastic container instances through virtual nodes.
A Container Service for Kubernetes (ACK) Pro cluster whose nodes do not run ContainerOS is created, and the Kubernetes version of the cluster is 1.18 or later. For more information, see Create an ACK Pro cluster.
Important: The ack-fluid component is not currently supported on ContainerOS.
The Cloud-native AI Suite is installed and the ack-fluid component is deployed.
Important: If you have already installed open-source Fluid, uninstall it before you deploy the ack-fluid component.
The ack-ai-pipeline component in the Cloud-native AI Suite is incompatible with Argo Workflows. To use accelerated data access for Argo tasks, you must deselect ack-ai-pipeline when you deploy the Cloud-native AI Suite.
If you have not installed the Cloud-native AI Suite, enable Fluid under Data Access Acceleration when you install the suite. For more information, see Deploy Cloud-native AI Suite.
If you have already installed the Cloud-native AI Suite, go to the Cloud-native AI Component Set page of the ACK console and deploy the ack-fluid component.
A kubectl client is connected to the ACK Pro cluster. For more information, see Connect to a cluster by using kubectl.
OSS is activated and a bucket is created. For more information, see Activate OSS and Create buckets.
Limits
This feature is mutually exclusive with the elastic scheduling feature of ACK. For more information about the elastic scheduling feature of ACK, see Configure priority-based resource scheduling.
Step 1: Upload the test dataset to the OSS bucket
Create a test dataset that is 2 GB in size. This dataset is used in the following example.
Upload the test dataset to the OSS bucket that you created.
You can use the ossutil tool provided by OSS to upload data. For more information, see Install ossutil.
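The following commands are a minimal sketch of this step, assuming ossutil is installed and configured with your credentials. The file name test-data.bin is a hypothetical placeholder, and oss://<oss_bucket>/<bucket_dir> must be replaced with your own bucket and directory.

# Generate a 2 GB file of zeros to serve as the test dataset.
dd if=/dev/zero of=test-data.bin bs=1M count=2048

# Upload the file to the OSS bucket with ossutil.
ossutil cp test-data.bin oss://<oss_bucket>/<bucket_dir>/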
Step 2: Create a dataset and a JindoRuntime
After you set up the ACK cluster and OSS bucket, you need to deploy the dataset and JindoRuntime. The deployment requires only a few minutes.
Create a file named secret.yaml based on the following content.
The file stores the fs.oss.accessKeyId and fs.oss.accessKeySecret that are used to access OSS.

apiVersion: v1
kind: Secret
metadata:
  name: access-key
stringData:
  fs.oss.accessKeyId: ****
  fs.oss.accessKeySecret: ****

Run the following command to deploy the Secret:
kubectl create -f secret.yaml

Create a file named dataset.yaml based on the following content.
The YAML file stores the following information:
Dataset: specifies the dataset that is stored in a remote datastore and the underlying file system (UFS) information.
JindoRuntime: enables JindoFS for data caching in the cluster.
apiVersion: data.fluid.io/v1alpha1
kind: Dataset
metadata:
  name: serverless-data
spec:
  mounts:
    - mountPoint: oss://<oss_bucket>/<bucket_dir>
      name: demo
      path: /
      options:
        fs.oss.endpoint: <oss_endpoint>
      encryptOptions:
        - name: fs.oss.accessKeyId
          valueFrom:
            secretKeyRef:
              name: access-key
              key: fs.oss.accessKeyId
        - name: fs.oss.accessKeySecret
          valueFrom:
            secretKeyRef:
              name: access-key
              key: fs.oss.accessKeySecret
---
apiVersion: data.fluid.io/v1alpha1
kind: JindoRuntime
metadata:
  name: serverless-data
spec:
  replicas: 1
  tieredstore:
    levels:
      - mediumtype: MEM
        volumeType: emptyDir
        path: /dev/shm
        quota: 5Gi
        high: "0.95"
        low: "0.7"

The following table describes some of the parameters that are specified in the preceding code block.
| Parameter | Description |
| --- | --- |
| mountPoint | The path to which the UFS is mounted, in the format oss://<oss_bucket>/<bucket_dir>. Do not include endpoint information in the path. Example: oss://mybucket/path/to/dir. If you use only one mount target, you can set path to /. |
| fs.oss.endpoint | The public or private endpoint of the OSS bucket. You can specify the private endpoint of the bucket to enhance data security. However, if you specify the private endpoint, make sure that your ACK cluster is deployed in the same region as the OSS bucket. For example, if your OSS bucket resides in the China (Hangzhou) region, the public endpoint of the bucket is oss-cn-hangzhou.aliyuncs.com and the private endpoint is oss-cn-hangzhou-internal.aliyuncs.com. |
| fs.oss.accessKeyId | The AccessKey ID that is used to access the bucket. |
| fs.oss.accessKeySecret | The AccessKey secret that is used to access the bucket. |
| replicas | The number of workers to be created in the JindoFS cluster. |
| mediumtype | The type of cache medium. Supported types are HDD, SSD, and MEM. For more information about the recommended configurations, see Policy 2: Select proper cache media. |
| volumeType | The volume type of the cache medium. Valid values: emptyDir and hostPath. Default value: hostPath. If you use memory or local system disks as the cache medium, we recommend the emptyDir type to avoid leaving residual cache data on the node and to ensure node availability. If you use local data disks as the cache medium, you can use the hostPath type and set path to the mount path of the data disk on the host. For more information about the recommended configurations, see Policy 2: Select proper cache media. |
| path | The path of the cache. You can specify only one path. |
| quota | The maximum size of the cache. For example, 100Gi indicates that the maximum cache size is 100 GiB. |
| high | The upper watermark of the storage usage. |
| low | The lower watermark of the storage usage. |
Important: The default access mode is read-only. If you want to use the read/write mode, see Configure the access mode of a dataset.
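For reference only, the following hedged sketch shows how the access mode is typically declared in the Dataset manifest. It assumes that your Fluid version supports the spec.accessModes field; see Configure the access mode of a dataset for the authoritative procedure.

apiVersion: data.fluid.io/v1alpha1
kind: Dataset
metadata:
  name: serverless-data
spec:
  # ReadOnlyMany is the default. ReadWriteMany enables read/write access.
  # Field availability depends on your Fluid version (assumption).
  accessModes:
    - ReadWriteMany
  mounts:
    - mountPoint: oss://<oss_bucket>/<bucket_dir>
      name: demo
      path: /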
Run the following command to deploy the dataset and JindoRuntime:
kubectl create -f dataset.yaml

Run the following command to check whether the dataset is deployed:
kubectl get dataset serverless-data

Expected output:
NAME              UFS TOTAL SIZE   CACHED   CACHE CAPACITY   CACHED PERCENTAGE   PHASE   AGE
serverless-data   1.16GiB          0.00B    5.00GiB          0.0%                Bound   2m8s

The PHASE field in the preceding output displays Bound, which indicates that the dataset is deployed.

Run the following command to check whether the JindoRuntime is deployed:
kubectl get jindo serverless-data

Expected output:
NAME              MASTER PHASE   WORKER PHASE   FUSE PHASE   AGE
serverless-data   Ready          Ready          Ready        2m51s

The FUSE PHASE field in the preceding output displays Ready, which indicates that the JindoRuntime is deployed.
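Optionally, you can also confirm that Fluid has created the PersistentVolumeClaim that serverless pods mount in Step 4. The resource name below is an assumption based on Fluid's convention of naming the PVC after the dataset in the same namespace:

# Expected: a PVC named serverless-data in the Bound status (assuming the default namespace).
kubectl get pvc serverless-data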
(Optional) Step 3: Prefetch data
Prefetching can greatly accelerate first-time data access. We recommend that you use this feature the first time you retrieve data.
Create a file named dataload.yaml based on the following content:
apiVersion: data.fluid.io/v1alpha1
kind: DataLoad
metadata:
  name: serverless-data-warmup
spec:
  dataset:
    name: serverless-data
    namespace: default
  loadMetadata: true

Run the following command to deploy the DataLoad:
kubectl create -f dataload.yaml

Run the following command to query the progress of data prefetching:
kubectl get dataload

Expected output:
NAME                     DATASET           PHASE      AGE     DURATION
serverless-data-warmup   serverless-data   Complete   2m49s   45s

The output shows that data prefetching took 45 seconds (45s).

Run the following command to query the caching result:
kubectl get dataset

Expected output:
NAME              UFS TOTAL SIZE   CACHED    CACHE CAPACITY   CACHED PERCENTAGE   PHASE   AGE
serverless-data   1.16GiB          1.16GiB   5.00GiB          100.0%              Bound   5m20s

The output shows that the value of CACHED PERCENTAGE is 0.0% before data is prefetched and 100.0% after data is prefetched.
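If you want to check the cache state from a script, the following one-liner is a sketch that assumes the Dataset status exposes a cacheStates.cachedPercentage field (true for recent Fluid releases, but verify against your version):

# Print only the cached percentage of the dataset, for example "100.0%".
kubectl get dataset serverless-data -o jsonpath='{.status.cacheStates.cachedPercentage}'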
Step 4: Use an Argo workflow to create containers to access OSS
Create a file named workflow.yaml and copy the following content to the file:
Deploy an application pod as an elastic container instance
apiVersion: apps/v1
kind: Deployment
metadata:
  name: model-serving
spec:
  selector:
    matchLabels:
      app: model-serving
  template:
    metadata:
      labels:
        app: model-serving
        alibabacloud.com/fluid-sidecar-target: eci
        alibabacloud.com/eci: "true"
    spec:
      containers:
        - image: fluidcloudnative/serving
          name: serving
          ports:
            - name: http1
              containerPort: 8080
          env:
            - name: TARGET
              value: "World"
          volumeMounts:
            - mountPath: /data
              name: data
      volumes:
        - name: data
          persistentVolumeClaim:
            claimName: serverless-data

Add the alibabacloud.com/fluid-sidecar-target=eci label to the application pod to indicate that it will run as an elastic container instance. When the application pod is created, Fluid automatically converts it to a format compatible with elastic container instances. No user intervention is required.

Create an ACS application pod
Important: To access cached Fluid data in Alibaba Cloud Container Compute Service (ACS) application containers, make sure that ack-fluid v1.0.11 or later is installed in your cluster.
Accessing cached Fluid data in ACS application containers relies on advanced features of ACS pods. Submit a support ticket to enable this feature.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: model-serving
spec:
  selector:
    matchLabels:
      app: model-serving
  template:
    metadata:
      labels:
        app: model-serving
        alibabacloud.com/fluid-sidecar-target: acs
        alibabacloud.com/acs: "true"
        alibabacloud.com/compute-qos: default
        alibabacloud.com/compute-class: general-purpose
    spec:
      containers:
        - image: fluidcloudnative/serving
          name: serving
          ports:
            - name: http1
              containerPort: 8080
          env:
            - name: TARGET
              value: "World"
          volumeMounts:
            - mountPath: /data
              name: data
      volumes:
        - name: data
          persistentVolumeClaim:
            claimName: serverless-data

Add the alibabacloud.com/fluid-sidecar-target=acs label to the application pod to declare that it uses ACS compute resources. When the application pod is created, Fluid automatically adapts it to run in the ACS environment. No user intervention is required.

Run the following command to create an Argo workflow:
kubectl create -f workflow.yaml

Run the following command to print the container log:
kubectl logs serverless-workflow-g5knn-3271897614

Expected output:
real    0m1.948s
user    0m0.000s
sys     0m0.668s

The real field in the output shows that it took 1.948 seconds (0m1.948s) to copy the file in cache mode. The same copy takes 24.966 seconds (0m24.966s) in no cache mode. For more information, see Accelerate Argo workflows. The duration in no cache mode is therefore about 13 times the duration in cache mode.
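For context, output in this time(1) format is typically produced by a command such as the following. This is only a sketch: the path /data/test-data.bin is a hypothetical placeholder, and the actual command run by the fluidcloudnative/serving image may differ.

# Time a copy of the cached file out of the Fluid mount inside the container.
time cp /data/test-data.bin /tmp/test-data.bin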
Step 5: Clear data
After you test data access acceleration, clear the relevant data at the earliest opportunity.
Run the following command to delete the containers:
kubectl delete workflow serverless-workflow-g5knn

Run the following command to delete the dataset:
kubectl delete dataset serverless-data
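You may also want to remove the Secret that you created in Step 2. This assumes that you kept the secret.yaml file from that step:

# Delete the access-key Secret that stores the OSS credentials.
kubectl delete -f secret.yaml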