JindoRuntime is the execution engine of JindoFS, which is developed by the Alibaba Cloud E-MapReduce (EMR) team. JindoRuntime is implemented in C++, provides dataset management and caching, and supports Object Storage Service (OSS). Alibaba Cloud provides cloud service-level support for JindoFS. Fluid enables the observability, auto scaling, and portability of datasets by managing and scheduling JindoRuntime. This topic describes how to use JindoFS to accelerate access to OSS.
Prerequisites
A Container Service for Kubernetes (ACK) Pro cluster that does not use ContainerOS is created, and the Kubernetes version of the cluster is 1.18 or later. For more information, see Create an ACK Pro cluster.
Important
The ack-fluid component is currently not supported on ContainerOS.
The cloud-native AI suite is installed and the ack-fluid component is deployed.
Important
If you have already installed open source Fluid, uninstall Fluid and deploy the ack-fluid component.
If you have not installed the cloud-native AI suite, enable Fluid acceleration when you install the suite. For more information, see Deploy the cloud-native AI suite.
If you have already installed the cloud-native AI suite, go to the Cloud-native AI Suite page of the ACK console and deploy the ack-fluid component.
A kubectl client is connected to the ACK Pro cluster. For more information, see Connect to a cluster by using kubectl. A quick connectivity check is shown after this list.
OSS is activated. For more information, see Activate OSS.
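For example, you can run the following command to quickly verify that kubectl can reach the cluster. If the nodes of your ACK Pro cluster are returned, the connection works:
kubectl get nodes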
Background information
After you set up the ACK cluster and the OSS bucket, you need to deploy JindoRuntime. The deployment requires about 10 minutes.
Step 1: Upload data to OSS
Run the following command to download a test dataset to an Elastic Compute Service (ECS) instance:
wget https://archive.apache.org/dist/spark/spark-3.0.1/spark-3.0.1-bin-hadoop2.7.tgz
Upload the test dataset to the OSS bucket.
Important
This example describes how to upload a test dataset to OSS from an ECS instance that runs the Alibaba Cloud Linux 3.2104 LTS 64-bit operating system. If you use other operating systems, see ossutil and Overview.
Install ossutil.
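If you have not configured ossutil, you can configure the endpoint and AccessKey pair that ossutil uses. The following command is a minimal example; <oss_endpoint>, <AccessKey_ID>, and <AccessKey_Secret> are placeholders that you must replace with your own values:
ossutil64 config -e <oss_endpoint> -i <AccessKey_ID> -k <AccessKey_Secret>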
Create a bucket named examplebucket.
Run the following command to create a bucket named examplebucket:
ossutil64 mb oss://examplebucket
If the command succeeds, the bucket named examplebucket is created.
Upload the test dataset to examplebucket:
ossutil64 cp spark-3.0.1-bin-hadoop2.7.tgz oss://examplebucket
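To confirm that the upload succeeded, you can list the objects in the bucket, for example:
ossutil64 ls oss://examplebucket
The output should contain spark-3.0.1-bin-hadoop2.7.tgz.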
Step 2: Create a dataset and a JindoRuntime
Before you create a dataset, create a file named mySecret.yaml in the root directory of the ECS instance:
apiVersion: v1
kind: Secret
metadata:
  name: mysecret
stringData:
  fs.oss.accessKeyId: xxx
  fs.oss.accessKeySecret: xxx
Specify fs.oss.accessKeyId and fs.oss.accessKeySecret as the AccessKey ID and AccessKey secret that are used to access OSS in Step 1.
Run the following command to create the Secret. Storing the AccessKey pair in a Secret prevents it from appearing in plaintext in the Dataset configuration:
kubectl create -f mySecret.yaml
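You can verify that the Secret exists before you reference it in the dataset, for example:
kubectl get secret mysecret
The output should show a Secret named mysecret with two data entries.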
Create a file named resource.yaml by using the following YAML template. The template defines a Dataset that describes the dataset stored in OSS and a JindoRuntime that starts a JindoFS cluster to provide caching:
apiVersion: data.fluid.io/v1alpha1
kind: Dataset
metadata:
  name: hadoop
spec:
  mounts:
  - mountPoint: oss://<oss_bucket>/<bucket_dir>
    options:
      fs.oss.endpoint: <oss_endpoint>
    name: hadoop
    path: "/"
    encryptOptions:
    - name: fs.oss.accessKeyId
      valueFrom:
        secretKeyRef:
          name: mysecret
          key: fs.oss.accessKeyId
    - name: fs.oss.accessKeySecret
      valueFrom:
        secretKeyRef:
          name: mysecret
          key: fs.oss.accessKeySecret
---
apiVersion: data.fluid.io/v1alpha1
kind: JindoRuntime
metadata:
  name: hadoop
spec:
  replicas: 2
  tieredstore:
    levels:
    - mediumtype: MEM
      path: /dev/shm
      volumeType: emptyDir
      quota: 2Gi
      high: "0.99"
      low: "0.95"
The following table describes the parameters in the YAML template.
Parameter | Description
mountPoint | The path of the under file system (UFS) to mount, in the format oss://<oss_bucket>/<bucket_dir>. Do not include the endpoint in the path.
fs.oss.endpoint | The public or private endpoint of the OSS bucket. For more information, see Regions and endpoints.
replicas | The number of workers in the JindoFS cluster.
mediumtype | The cache type that is used when you create the JindoRuntime. Valid values: HDD, SSD, and MEM.
path | The storage path. You can specify only one path. If you set mediumtype to MEM, you must specify a local path to store data such as logs.
quota | The maximum size of cached data, such as 2Gi in this example.
high | The upper limit of the storage usage.
low | The lower limit of the storage usage.
Run the following command to create a dataset and a JindoRuntime:
kubectl create -f resource.yaml
Run the following command to check whether the dataset is deployed:
kubectl get dataset hadoop
Expected output:
NAME UFS TOTAL SIZE CACHED CACHE CAPACITY CACHED PERCENTAGE PHASE AGE
hadoop 210MiB 0.00B 4.00GiB 0.0% Bound 1h
Run the following command to check whether the JindoRuntime is deployed:
kubectl get jindoruntime hadoop
Expected output:
NAME MASTER PHASE WORKER PHASE FUSE PHASE AGE
hadoop Ready Ready Ready 4m45s
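You can also check the pods that are launched for the JindoFS cluster. The exact pod names depend on the Fluid and JindoRuntime versions, but they typically include master and worker components that are prefixed with the runtime name hadoop:
kubectl get pods | grep hadoop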
Run the following command to check whether the persistent volume (PV) and persistent volume claim (PVC) are created:
kubectl get pv,pvc
Expected output:
NAME CAPACITY ACCESS MODES RECLAIM POLICY STATUS CLAIM STORAGECLASS REASON AGE
persistentvolume/hadoop 100Gi RWX Retain Bound default/hadoop 52m
NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS AGE
persistentvolumeclaim/hadoop Bound hadoop 100Gi RWX 52m
The preceding outputs indicate that the dataset and JindoRuntime are created.
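If the dataset or the JindoRuntime does not reach the expected state, you can inspect the details and events of the resources, for example:
kubectl describe dataset hadoop
kubectl describe jindoruntime hadoop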
Step 3: Create applications to test data acceleration
You can deploy an application in a container to test the data acceleration of JindoFS, or submit machine learning jobs to use the relevant features. In this topic, an application is deployed in a container to access the same data multiple times, and the time consumed by each access is compared.
Create a file named app.yaml by using the following YAML template:
apiVersion: v1
kind: Pod
metadata:
  name: demo-app
spec:
  containers:
  - name: demo
    image: anolis-registry.cn-zhangjiakou.cr.aliyuncs.com/openanolis/nginx:1.14.1-8.6
    volumeMounts:
    - mountPath: /data
      name: hadoop
  volumes:
  - name: hadoop
    persistentVolumeClaim:
      claimName: hadoop
Run the following command to deploy the application:
kubectl create -f app.yaml
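Before you run commands in the container, you can confirm that the pod is running, for example:
kubectl get pod demo-app
The pod is ready when its STATUS is Running.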
Run the following commands to log on to the container and query the size of the specified file:
kubectl exec -it demo-app -- bash
du -sh /data/spark-3.0.1-bin-hadoop2.7.tgz
Expected output:
210M /data/spark-3.0.1-bin-hadoop2.7.tgz
Run the following command to query the time consumed to copy the file:
time cp /data/spark-3.0.1-bin-hadoop2.7.tgz /dev/null
Expected output:
real 0m18.386s
user 0m0.002s
sys 0m0.105s
The output indicates that 18 seconds are consumed to copy the file.
Run the following command to check the cached data of the dataset:
kubectl get dataset hadoop
Expected output:
NAME UFS TOTAL SIZE CACHED CACHE CAPACITY CACHED PERCENTAGE PHASE AGE
hadoop 210.00MiB 210.00MiB 4.00GiB 100.0% Bound 1h
The output indicates that 210 MiB of data is cached locally.
Run the following command to delete the current application and then create the same application:
Note
This step is performed to prevent other factors, such as the page cache, from affecting the result.
kubectl delete -f app.yaml && kubectl create -f app.yaml
Run the following commands to log on to the new container and query the time consumed to copy the file:
kubectl exec -it demo-app -- bash
time cp /data/spark-3.0.1-bin-hadoop2.7.tgz /dev/null
Expected output:
real 0m0.048s
user 0m0.001s
sys 0m0.046s
The output indicates that the copy takes only 48 milliseconds, which is more than 300 times faster than the first copy.
Note
This is because the file is cached by JindoFS.
Clear the environment
If you no longer need data acceleration, clear the environment.
Run the following command to delete the application:
kubectl delete pod demo-app
Run the following command to delete the dataset and JindoRuntime:
kubectl delete dataset hadoop
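The preceding commands do not delete the Secret that you created in Step 2. If you no longer need the Secret, you can also delete it:
kubectl delete secret mysecret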