JindoRuntime is the execution engine of JindoFS, which is developed by the Alibaba Cloud E-MapReduce (EMR) team. JindoRuntime is implemented in C++, provides dataset management and caching, and supports Object Storage Service (OSS). Alibaba Cloud provides cloud service-level support for JindoFS. Fluid enables the observability, auto scaling, and portability of datasets by managing and scheduling JindoRuntime. This topic describes how to use JindoFS to accelerate access to OSS.
Prerequisites
A Container Service for Kubernetes (ACK) Pro cluster that does not use ContainerOS is created, and the Kubernetes version of the cluster is 1.18 or later. For more information, see Create an ACK Pro cluster.
Important
The ack-fluid component is currently not supported on ContainerOS.
The cloud-native AI suite is installed and the ack-fluid component is deployed.
Important
If you have already installed open source Fluid, uninstall Fluid and deploy the ack-fluid component.
If you have not installed the cloud-native AI suite, enable Fluid acceleration when you install the suite. For more information, see Deploy the cloud-native AI suite.
If you have already installed the cloud-native AI suite, go to the Cloud-native AI Suite page of the ACK console and deploy the ack-fluid component.
A kubectl client is connected to the ACK Pro cluster. For more information, see Connect to a cluster by using kubectl. A quick connectivity check is shown after this list.
OSS is activated. For more information, see Activate OSS.
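For example, you can run the following command to quickly verify that kubectl can reach the cluster. If the nodes of your ACK Pro cluster are returned, the connection works:
kubectl get nodes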
Background information
After you set up the ACK cluster and the OSS bucket, you need to deploy JindoRuntime. The deployment requires about 10 minutes.
Step 1: Upload data to OSS
Run the following command to download a test dataset to an Elastic Compute Service (ECS) instance:
wget https://archive.apache.org/dist/spark/spark-3.0.1/spark-3.0.1-bin-hadoop2.7.tgz
Upload the test dataset to the OSS bucket.
Important
This example describes how to upload a test dataset to OSS from an ECS instance that runs the Alibaba Cloud Linux 3.2104 LTS 64-bit operating system. If you use other operating systems, see ossutil and Overview.
Install ossutil.
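If you have not configured ossutil, you can configure the endpoint and AccessKey pair that ossutil uses. The following command is a minimal example; <oss_endpoint>, <AccessKey_ID>, and <AccessKey_Secret> are placeholders that you must replace with your own values:
ossutil64 config -e <oss_endpoint> -i <AccessKey_ID> -k <AccessKey_Secret>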
Create a bucket named examplebucket.
Run the following command to create a bucket named examplebucket:
ossutil64 mb oss://examplebucket
If the command succeeds, the bucket named examplebucket is created.
Upload the test dataset to examplebucket:
ossutil64 cp spark-3.0.1-bin-hadoop2.7.tgz oss://examplebucket
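To confirm that the upload succeeded, you can list the objects in the bucket, for example:
ossutil64 ls oss://examplebucket
The output should contain spark-3.0.1-bin-hadoop2.7.tgz.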
Step 2: Create a dataset and a JindoRuntime
Before you create a dataset, create a file named mySecret.yaml in the root directory of the ECS instance:
apiVersion: v1
kind: Secret
metadata:
  name: mysecret
stringData:
  fs.oss.accessKeyId: xxx
  fs.oss.accessKeySecret: xxx
Specify fs.oss.accessKeyId and fs.oss.accessKeySecret as the AccessKey ID and AccessKey secret that are used to access OSS in Step 1.
Run the following command to create the Secret. Storing the AccessKey pair in a Secret prevents it from appearing in plaintext in the Dataset configuration:
kubectl create -f mySecret.yaml
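You can verify that the Secret exists before you reference it in the dataset, for example:
kubectl get secret mysecret
The output should show a Secret named mysecret with two data entries.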
Create a file named resource.yaml by using the following YAML template. The template defines a Dataset that describes the dataset stored in OSS and a JindoRuntime that starts a JindoFS cluster to provide caching:
apiVersion: data.fluid.io/v1alpha1
kind: Dataset
metadata:
  name: hadoop
spec:
  mounts:
  - mountPoint: oss://<oss_bucket>/<bucket_dir>
    options:
      fs.oss.endpoint: <oss_endpoint>
    name: hadoop
    path: "/"
    encryptOptions:
    - name: fs.oss.accessKeyId
      valueFrom:
        secretKeyRef:
          name: mysecret
          key: fs.oss.accessKeyId
    - name: fs.oss.accessKeySecret
      valueFrom:
        secretKeyRef:
          name: mysecret
          key: fs.oss.accessKeySecret
---
apiVersion: data.fluid.io/v1alpha1
kind: JindoRuntime
metadata:
  name: hadoop
spec:
  replicas: 2
  tieredstore:
    levels:
    - mediumtype: MEM
      path: /dev/shm
      volumeType: emptyDir
      quota: 2Gi
      high: "0.99"
      low: "0.95"
The following table describes the parameters in the YAML template.
Parameter | Description
mountPoint | The path of the under file system (UFS) to mount, in the format oss://<oss_bucket>/<bucket_dir>. Do not include the endpoint in the path.
fs.oss.endpoint | The public or private endpoint of the OSS bucket. For more information, see Regions and endpoints.
replicas | The number of workers in the JindoFS cluster.
mediumtype | The cache type that is used when you create the JindoRuntime. Valid values: HDD, SSD, and MEM.
path | The storage path. You can specify only one path. If you set mediumtype to MEM, you must specify a local path to store data such as logs.
quota | The maximum size of cached data, such as 2Gi in this example.
high | The upper limit of the storage usage.
low | The lower limit of the storage usage.
Run the following command to create a dataset and a JindoRuntime:
kubectl create -f resource.yaml
Run the following command to check whether the dataset is deployed:
kubectl get dataset hadoop
Expected output:
NAME UFS TOTAL SIZE CACHED CACHE CAPACITY CACHED PERCENTAGE PHASE AGE
hadoop 210MiB 0.00B 4.00GiB 0.0% Bound 1h
Run the following command to check whether the JindoRuntime is deployed:
kubectl get jindoruntime hadoop
Expected output:
NAME MASTER PHASE WORKER PHASE FUSE PHASE AGE
hadoop Ready Ready Ready 4m45s
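You can also check the pods that are launched for the JindoFS cluster. The exact pod names depend on the Fluid and JindoRuntime versions, but they typically include master and worker components that are prefixed with the runtime name hadoop:
kubectl get pods | grep hadoop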
Run the following command to check whether the persistent volume (PV) and persistent volume claim (PVC) are created:
kubectl get pv,pvc
Expected output:
NAME CAPACITY ACCESS MODES RECLAIM POLICY STATUS CLAIM STORAGECLASS REASON AGE
persistentvolume/hadoop 100Gi RWX Retain Bound default/hadoop 52m
NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS AGE
persistentvolumeclaim/hadoop Bound hadoop 100Gi RWX 52m
The preceding outputs indicate that the dataset and JindoRuntime are created.
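If the dataset or the JindoRuntime does not reach the expected state, you can inspect the details and events of the resources, for example:
kubectl describe dataset hadoop
kubectl describe jindoruntime hadoop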
Step 3: Create applications to test data acceleration
You can deploy an application in a container to test the data acceleration of JindoFS, or submit machine learning jobs to use the relevant features. In this topic, an application is deployed in a container to access the same data multiple times, and the time consumed by each access is compared.
Create a file named app.yaml by using the following YAML template:
apiVersion: v1
kind: Pod
metadata:
  name: demo-app
spec:
  containers:
  - name: demo
    image: anolis-registry.cn-zhangjiakou.cr.aliyuncs.com/openanolis/nginx:1.14.1-8.6
    volumeMounts:
    - mountPath: /data
      name: hadoop
  volumes:
  - name: hadoop
    persistentVolumeClaim:
      claimName: hadoop
Run the following command to deploy the application:
kubectl create -f app.yaml
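Before you run commands in the container, you can confirm that the pod is running, for example:
kubectl get pod demo-app
The pod is ready when its STATUS is Running.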
Run the following commands to log on to the container and query the size of the specified file:
kubectl exec -it demo-app -- bash
du -sh /data/spark-3.0.1-bin-hadoop2.7.tgz
Expected output:
210M /data/spark-3.0.1-bin-hadoop2.7.tgz
Run the following command to query the time consumed to copy the file:
time cp /data/spark-3.0.1-bin-hadoop2.7.tgz /dev/null
Expected output:
real 0m18.386s
user 0m0.002s
sys 0m0.105s
The output indicates that 18 seconds are consumed to copy the file.
Run the following command to check the cached data of the dataset:
kubectl get dataset hadoop
Expected output:
NAME UFS TOTAL SIZE CACHED CACHE CAPACITY CACHED PERCENTAGE PHASE AGE
hadoop 210.00MiB 210.00MiB 4.00GiB 100.0% Bound 1h
The output indicates that 210 MiB of data is cached locally.
Run the following command to delete the current application and then create the same application:
Note
This step is performed to prevent other factors, such as the page cache, from affecting the result.
kubectl delete -f app.yaml && kubectl create -f app.yaml
Run the following commands to log on to the new container and query the time consumed to copy the file:
kubectl exec -it demo-app -- bash
time cp /data/spark-3.0.1-bin-hadoop2.7.tgz /dev/null
Expected output:
real 0m0.048s
user 0m0.001s
sys 0m0.046s
The output indicates that the copy takes only 48 milliseconds, which is more than 300 times faster than the first copy.
Note
This is because the file is cached by JindoFS.
Clear the environment
If you no longer need data acceleration, clear the environment.
Run the following command to delete the application:
kubectl delete pod demo-app
Run the following command to delete the dataset and JindoRuntime:
kubectl delete dataset hadoop
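The preceding commands do not delete the Secret that you created in Step 2. If you no longer need the Secret, you can also delete it:
kubectl delete secret mysecret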