How to Accelerate Data Access from Jobs with Cache Mode - Container Service for Kubernetes

Fluid allows you to use JindoRuntime to accelerate access to data stored in Object Storage Service (OSS) in serverless cloud computing scenarios. Data access can be accelerated in either cache mode or no cache mode. This topic describes how to accelerate data access from Jobs in cache mode.

Prerequisites

A Container Service for Kubernetes (ACK) Pro cluster that does not run ContainerOS is created, and the cluster runs Kubernetes 1.18 or later. For more information, see Create an ACK Pro cluster.
Important
The ack-fluid component is currently not supported on ContainerOS.
The cloud-native AI suite is installed and the ack-fluid component is deployed.
Important
If you have already installed open source Fluid, uninstall Fluid and deploy the ack-fluid component.
- If you have not installed the cloud-native AI suite, enable Fluid acceleration when you install the suite. For more information, see Deploy the cloud-native AI suite.
- If you have already installed the cloud-native AI suite, go to the Cloud-native AI Suite page of the ACK console and deploy the ack-fluid component.
Virtual nodes are deployed in the ACK Pro cluster. For more information, see Schedule pods to elastic container instances through virtual nodes.
A kubectl client is connected to the ACK Pro cluster. For more information, see Connect to a cluster by using kubectl.
OSS is activated and a bucket is created. For more information, see Activate OSS and Create buckets.

Limits

This feature is mutually exclusive with the elastic scheduling feature of ACK. For more information about the elastic scheduling feature of ACK, see Configure priority-based resource scheduling.

Step 1: Upload the test dataset to the OSS bucket

Create a test dataset of 2 GB in size. In this example, the test dataset is used.
Upload the test dataset to the OSS bucket that you created.
You can use the ossutil tool provided by OSS to upload data. For more information, see Install ossutil.
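
The two actions in this step can be sketched as follows. This is a minimal sketch, assuming a Linux or macOS shell with `dd` available and ossutil already installed and configured. The file name `testfile.bin` is hypothetical, and `<oss_bucket>/<bucket_dir>` is the same placeholder that is used later in dataset.yaml.

```shell
# Create a file of the given size in MiB, filled with random bytes.
# Usage: make_test_file <path> <size_in_mib>
make_test_file() {
  dd if=/dev/urandom of="$1" bs=1048576 count="$2" 2>/dev/null
}

# Example (commented out so that nothing runs unintentionally):
# create a 2 GiB file and upload it to the bucket with ossutil.
# make_test_file testfile.bin 2048
# ossutil cp testfile.bin oss://<oss_bucket>/<bucket_dir>/
```

Random data is used so that the file cannot be compressed away by any layer in between. Any existing dataset of a similar size works equally well.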

Step 2: Create a dataset and a JindoRuntime

After you set up the ACK cluster and OSS bucket, you need to deploy the dataset and JindoRuntime. The deployment requires only a few minutes.

Create a file named secret.yaml based on the following content.
The file stores the fs.oss.accessKeyId and fs.oss.accessKeySecret that are used to access OSS.
```
apiVersion: v1
kind: Secret
metadata:
  name: access-key
stringData:
  fs.oss.accessKeyId: ****
  fs.oss.accessKeySecret: ****
```
Run the following command to deploy the Secret:
```
kubectl create -f secret.yaml
```
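
If you prefer not to write the AccessKey pair to a file, the same Secret can also be created directly from the command line. The following is a minimal sketch that uses `kubectl create secret generic`. The environment variable names `OSS_ACCESS_KEY_ID` and `OSS_ACCESS_KEY_SECRET` are assumptions chosen for this example and are not defined anywhere in this topic.

```shell
# Create the access-key Secret from two environment variables.
# ${VAR:?message} stops the script with an error if VAR is unset or empty.
create_access_key_secret() {
  kubectl create secret generic access-key \
    --from-literal=fs.oss.accessKeyId="${OSS_ACCESS_KEY_ID:?set OSS_ACCESS_KEY_ID}" \
    --from-literal=fs.oss.accessKeySecret="${OSS_ACCESS_KEY_SECRET:?set OSS_ACCESS_KEY_SECRET}"
}

# Example (commented out; requires a connected cluster):
# create_access_key_secret
```

Use either this command or secret.yaml, not both, because the second attempt fails when a Secret named access-key already exists.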

Create a file named dataset.yaml based on the following content.

The YAML file stores the following information:

Dataset: specifies the dataset that is stored in a remote datastore and the underlying file system (UFS) information.
JindoRuntime: enables JindoFS for data caching in the cluster.

```
apiVersion: data.fluid.io/v1alpha1
kind: Dataset
metadata:
  name: serverless-data
spec:
  mounts:
  - mountPoint: oss://<oss_bucket>/<bucket_dir>
    name: demo
    path: /
    options:
      fs.oss.endpoint: <oss_endpoint>
    encryptOptions:
      - name: fs.oss.accessKeyId
        valueFrom:
          secretKeyRef:
            name: access-key
            key: fs.oss.accessKeyId
      - name: fs.oss.accessKeySecret
        valueFrom:
          secretKeyRef:
            name: access-key
            key: fs.oss.accessKeySecret
---
apiVersion: data.fluid.io/v1alpha1
kind: JindoRuntime
metadata:
  name: serverless-data
spec:
  replicas: 1
  tieredstore:
    levels:
      - mediumtype: MEM
        volumeType: emptyDir
        path: /dev/shm
        quota: 5Gi
        high: "0.95"
        low: "0.7"
```

The following table describes some parameters that are specified in the preceding code block.

| Parameter | Description |
| --- | --- |
| `mountPoint` | The UFS path to mount. The format of the path is `oss://<oss_bucket>/<bucket_dir>`. Do not include endpoint information in the path. Example: `oss://mybucket/path/to/dir`. If you use only one mount target, you can set `path` to `/`. |
| `fs.oss.endpoint` | The public or private endpoint of the OSS bucket. You can specify the private endpoint of the bucket to enhance data security. However, if you specify the private endpoint, make sure that your ACK cluster is deployed in the same region as the OSS bucket. For example, if your OSS bucket is created in the China (Hangzhou) region, the public endpoint of the bucket is `oss-cn-hangzhou.aliyuncs.com` and the private endpoint is `oss-cn-hangzhou-internal.aliyuncs.com`. |
| `fs.oss.accessKeyId` | The AccessKey ID that is used to access the bucket. |
| `fs.oss.accessKeySecret` | The AccessKey secret that is used to access the bucket. |
| `replicas` | The number of workers to be created in the JindoFS cluster. |
| `mediumtype` | The type of cache medium. Supported types are `HDD`, `SSD`, and `MEM`. For more information about the recommended configurations of `mediumtype`, see Policy 2: Select proper cache media. |
| `volumeType` | The volume type of the cache medium. Valid values: `emptyDir` and `hostPath`. Default value: `hostPath`. If you use memory or local system disks as the cache medium, we recommend that you use the `emptyDir` type to avoid residual cache data on the node and ensure node availability. If you use local data disks as the cache medium, you can use the `hostPath` type and set `path` to the mount path of the data disk on the host. For more information about the recommended configurations of `volumeType`, see Policy 2: Select proper cache media. |
| `path` | The path of the cache. You can specify only one path. |
| `quota` | The maximum size of the cache. For example, `100Gi` indicates that the maximum size of the cache is 100 GiB. |
| `high` | The upper limit of cache storage usage, specified as a ratio. Example: `"0.95"`. |
| `low` | The lower limit of cache storage usage, specified as a ratio. Example: `"0.7"`. |
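
To make the `quota`, `high`, and `low` values in dataset.yaml more concrete, the following sketch converts the two limits into absolute sizes. It assumes that `high` and `low` are ratios of `quota`, which is an interpretation and is not stated explicitly in this topic. The function name `watermark_gib` is hypothetical.

```shell
# Print quota * ratio with two decimals.
# Usage: watermark_gib <quota_in_gib> <ratio>
watermark_gib() {
  awk -v quota="$1" -v ratio="$2" 'BEGIN { printf "%.2f\n", quota * ratio }'
}

watermark_gib 5 0.95   # upper limit for quota: 5Gi, high: "0.95" -> prints 4.75
watermark_gib 5 0.7    # lower limit for quota: 5Gi, low: "0.7"   -> prints 3.50
```

Under that assumption, the example configuration places the upper limit at 4.75 GiB and the lower limit at 3.50 GiB of the 5 GiB cache.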

Important

The default access mode is read-only mode. If you want to use the read/write mode, refer to Configure the access mode of a dataset.

Run the following command to deploy the dataset and JindoRuntime:
```
kubectl create -f dataset.yaml
```

Run the following command to check whether the dataset is deployed:

```
kubectl get dataset serverless-data
```

Expected output:

```
NAME              UFS TOTAL SIZE   CACHED   CACHE CAPACITY   CACHED PERCENTAGE   PHASE   AGE
serverless-data   1.16GiB          0.00B    5.00GiB          0.0%                Bound   2m8s
```

PHASE in the preceding output displays Bound, which indicates that the dataset is deployed.
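
Because the deployment can take a few minutes, a script may need to wait until the dataset is bound rather than check once. The following is a minimal polling sketch. It assumes that the value shown in the PHASE column is exposed at `.status.phase` of the Dataset resource. The function name and the default retry values are hypothetical.

```shell
# Poll a Dataset until its phase is Bound, or give up after a number of attempts.
# Usage: wait_for_dataset_bound <dataset_name> [max_attempts] [sleep_seconds]
wait_for_dataset_bound() {
  name="$1"
  attempts="${2:-30}"
  interval="${3:-10}"
  i=0
  while [ "$i" -lt "$attempts" ]; do
    # Assumption: the PHASE column is backed by .status.phase.
    phase="$(kubectl get dataset "$name" -o jsonpath='{.status.phase}' 2>/dev/null)"
    if [ "$phase" = "Bound" ]; then
      echo "dataset $name is Bound"
      return 0
    fi
    i=$((i + 1))
    sleep "$interval"
  done
  echo "dataset $name did not reach Bound (last phase: ${phase:-unknown})" >&2
  return 1
}

# Example (commented out; requires a connected cluster):
# wait_for_dataset_bound serverless-data
```

With the default values, the function waits up to about five minutes before it returns a non-zero exit code.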

Run the following command to check whether the JindoRuntime is deployed:
```
kubectl get jindo serverless-data
```
Expected output:
```
NAME              MASTER PHASE   WORKER PHASE   FUSE PHASE   AGE
serverless-data   Ready          Ready          Ready        2m51s
```
The FUSE PHASE column in the preceding output displays Ready, which indicates that the JindoRuntime is deployed.

(Optional) Step 3: Prefetch data

Prefetching can efficiently accelerate first-time data access. We recommend that you use this feature if this is the first time you retrieve data.

Create a file named dataload.yaml based on the following content:

```
apiVersion: data.fluid.io/v1alpha1
kind: DataLoad
metadata:
  name: serverless-data-warmup
spec:
  dataset:
    name: serverless-data
    namespace: default
  loadMetadata: true
```

Run the following command to deploy the DataLoad:
```
kubectl create -f dataload.yaml
```

Run the following command to query the progress of data prefetching:

```
kubectl get dataload
```

Expected output:

```
NAME                     DATASET           PHASE      AGE     DURATION
serverless-data-warmup   serverless-data   Complete   2m49s   45s
```

The output shows that the duration of data prefetching is 45 seconds.

Run the following command to query the caching result:

```
kubectl get dataset
```

Expected output:

```
NAME              UFS TOTAL SIZE   CACHED    CACHE CAPACITY   CACHED PERCENTAGE   PHASE   AGE
serverless-data   1.16GiB          1.16GiB   5.00GiB          100.0%              Bound   5m20s
```

The value of CACHED PERCENTAGE is 0.0% before data is prefetched and 100.0% after data is prefetched. After prefetching, the value of CACHED equals the value of UFS TOTAL SIZE (1.16GiB).
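
To read the caching result in a script, the percentage can be extracted from the table that `kubectl get dataset` prints. The following is a minimal sketch that relies on the column layout shown in this topic, in which the fifth whitespace-separated field of a data row is CACHED PERCENTAGE. The function name `cached_percentage` is hypothetical, and the field position is an assumption that breaks if a column is empty or the layout changes.

```shell
# Read `kubectl get dataset` output on stdin and print the
# CACHED PERCENTAGE value of the named dataset.
# Usage: kubectl get dataset | cached_percentage <dataset_name>
cached_percentage() {
  awk -v name="$1" '$1 == name { print $5 }'
}

# Example (commented out; requires a connected cluster):
# kubectl get dataset | cached_percentage serverless-data
```

The header row is skipped implicitly because its first field is NAME rather than the dataset name.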

Step 4: Use a Job to create containers to access OSS

You can create containers to test data access accelerated by JindoFS, or submit machine learning jobs to use relevant features. This section describes how to use a Job to create containers to access the data stored in OSS.

Create a file named job.yaml based on the following content:

Deploy an application pod as an elastic container instance

```
apiVersion: apps/v1
kind: Deployment
metadata:
  name: model-serving
spec:
  selector:
    matchLabels:
      app: model-serving
  template:
    metadata:
      labels:
        app: model-serving
        alibabacloud.com/fluid-sidecar-target: eci
        alibabacloud.com/eci: "true"
    spec:
      containers:
        - image: fluidcloudnative/serving
          name: serving
          ports:
            - name: http1
              containerPort: 8080
          env:
            - name: TARGET
              value: "World"
          volumeMounts:
            - mountPath: /data
              name: data
      volumes:
        - name: data
          persistentVolumeClaim:
            claimName: serverless-data
```

Add the alibabacloud.com/fluid-sidecar-target=eci label to the application pod to indicate that it will run as an elastic container instance. When the application pod is created, Fluid automatically converts it to a format compatible with elastic container instances, requiring no user intervention.

Create an ACS application pod

Important

To access cached Fluid data in Alibaba Cloud Container Compute Service (ACS) application containers, ensure that you are using ack-fluid v1.0.11 or later in your cluster.
Accessing cached Fluid data in ACS application containers relies on advanced features of ACS pods. Submit a support ticket to enable this feature.

```
apiVersion: apps/v1
kind: Deployment
metadata:
  name: model-serving
spec:
  selector:
    matchLabels:
      app: model-serving
  template:
    metadata:
      labels:
        app: model-serving
        alibabacloud.com/fluid-sidecar-target: acs
        alibabacloud.com/acs: "true"
        alibabacloud.com/compute-qos: default
        alibabacloud.com/compute-class: general-purpose
    spec:
      containers:
        - image: fluidcloudnative/serving
          name: serving
          ports:
            - name: http1
              containerPort: 8080
          env:
            - name: TARGET
              value: "World"
          volumeMounts:
            - mountPath: /data
              name: data
      volumes:
        - name: data
          persistentVolumeClaim:
            claimName: serverless-data
```

Add the alibabacloud.com/fluid-sidecar-target=acs label to the application pod to declare that it will use ACS compute resources. When the application pod is created, Fluid automatically adapts it to run in the ACS environment, requiring no user intervention.
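
The commands in the rest of this step refer to a Job named demo-app with a container named demo, whereas the manifests above define a Deployment named model-serving. The following is a minimal sketch of a Job manifest that would match those commands. It is not the manifest that produced the output shown in this topic. It reuses the labels and the PersistentVolumeClaim from the elastic container instance example. The image placeholder `<your_image>` and the `time cp -r /data/ /tmp` command are assumptions inferred from the `real`, `user`, and `sys` lines in the expected log output.

```shell
# Write a hypothetical job.yaml. Replace <your_image> with any image
# that provides /bin/bash before you deploy the file.
cat > job.yaml <<'EOF'
apiVersion: batch/v1
kind: Job
metadata:
  name: demo-app
spec:
  backoffLimit: 4
  template:
    metadata:
      labels:
        alibabacloud.com/fluid-sidecar-target: eci
        alibabacloud.com/eci: "true"
    spec:
      restartPolicy: Never
      containers:
        - name: demo
          image: <your_image>
          command: ["/bin/bash", "-c"]
          # Assumption: time a recursive copy of the cached data.
          args: ["time cp -r /data/ /tmp"]
          volumeMounts:
            - mountPath: /data
              name: data
      volumes:
        - name: data
          persistentVolumeClaim:
            claimName: serverless-data
EOF
```

To target ACS instead, replace the two labels with the four `alibabacloud.com/*` labels from the ACS example.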

Run the following command to deploy the Job:
```
kubectl create -f job.yaml
```
Run the following command to print the container log. Replace demo-app--1-7zqdm with the name of the pod that the Job created in your cluster:
```
kubectl logs demo-app--1-7zqdm -c demo
```
Expected output:
```
real    0m1.760s
user    0m0.002s
sys     0m0.740s
```
The real field in the output shows that it took 1.76 seconds (0m1.760s) to copy the Serving file. In the Accelerate Jobs topic, the same copy took 23.644 seconds (0m23.644s) in no cache mode, which is about 13 times as long as in cache mode.
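
The comparison is a simple ratio of the two `real` durations. The following sketch reproduces the arithmetic with the figures quoted above. The function name `slowdown_ratio` is hypothetical.

```shell
# Print how many times longer the slow run took than the fast run,
# rounded to one decimal place.
# Usage: slowdown_ratio <slow_seconds> <fast_seconds>
slowdown_ratio() {
  awk -v slow="$1" -v fast="$2" 'BEGIN { printf "%.1f\n", slow / fast }'
}

slowdown_ratio 23.644 1.760   # prints 13.4
```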

Step 5: Clear data

After you test data access acceleration, delete the resources that you created as soon as you no longer need them.

Run the following command to delete the Job and the containers that it created:
```
kubectl delete job demo-app
```
Run the following command to delete the dataset:
```
kubectl delete dataset serverless-data
```