JindoRuntime is the execution engine of JindoFS, which is developed by the Alibaba Cloud E-MapReduce (EMR) team. JindoRuntime is implemented in C++ and provides dataset management and caching, with support for Object Storage Service (OSS). Alibaba Cloud provides cloud-service-level support for JindoFS. By managing and scheduling JindoRuntime, Fluid enables dataset observability, auto scaling, and portability. This topic describes how to use JindoFS to accelerate access to OSS.
Prerequisites
A Container Service for Kubernetes (ACK) Pro cluster that does not use ContainerOS is created, and the Kubernetes version of the cluster is 1.18 or later. For more information, see Create an ACK Pro cluster.
Important: The ack-fluid component is not supported on ContainerOS.
The cloud-native AI suite is installed and the ack-fluid component is deployed.
Important: If you have already installed open source Fluid, uninstall it before you deploy the ack-fluid component.
If you have not installed the cloud-native AI suite, enable Fluid acceleration when you install the suite. For more information, see Deploy the cloud-native AI suite.
If you have already installed the cloud-native AI suite, go to the Cloud-native AI Suite page of the ACK console and deploy the ack-fluid component.
A kubectl client is connected to the ACK Pro cluster. For more information, see Connect to a cluster by using kubectl.
OSS is activated. For more information, see Activate OSS.
Background information
After the ACK cluster and the OSS bucket are set up, you need to deploy JindoRuntime. The deployment takes about 10 minutes.
Step 1: Upload data to OSS
Run the following command to download a test dataset to an Elastic Compute Service (ECS) instance:
```shell
wget https://archive.apache.org/dist/spark/spark-3.0.1/spark-3.0.1-bin-hadoop2.7.tgz
```
Upload the test dataset to the OSS bucket.
Run the following command to create a bucket named `examplebucket`:

```shell
ossutil64 mb oss://examplebucket
```

If the following output is displayed, the bucket named `examplebucket` is created:

```
0.668238(s) elapsed
```

Run the following command to upload the test dataset to `examplebucket`:

```shell
ossutil64 cp spark-3.0.1-bin-hadoop2.7.tgz oss://examplebucket
```
Step 2: Create a dataset and a JindoRuntime
Before you create a dataset, create a file named `mySecret.yaml` in the root directory of the ECS instance:

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: mysecret
stringData:
  fs.oss.accessKeyId: xxx
  fs.oss.accessKeySecret: xxx
```

Set `fs.oss.accessKeyId` and `fs.oss.accessKeySecret` to the AccessKey ID and AccessKey secret that are used to access OSS in Step 1.

Run the following command to create the Secret. Storing the credentials in a Secret keeps them out of the plaintext Dataset configuration.

```shell
kubectl create -f mySecret.yaml
```
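Note that, by default, Kubernetes stores Secret values base64-encoded rather than encrypted; encryption at rest must be enabled separately in the cluster. A minimal sketch of what happens to a `stringData` entry (the value below is a made-up placeholder, not a real credential):

```python
import base64

# Hypothetical placeholder; a real Secret would hold your AccessKey ID.
plaintext = "LTAI5tEXAMPLExxxxxxxx"

# Kubernetes converts stringData entries to base64-encoded data entries.
encoded = base64.b64encode(plaintext.encode("utf-8")).decode("ascii")
decoded = base64.b64decode(encoded).decode("utf-8")

print(encoded)
# Encoding is reversible, so it hides the value from casual viewing only.
assert decoded == plaintext
```

This is why access to Secrets should be restricted with RBAC even though the values are not shown as plaintext in manifests.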
Create a file named `resource.yaml` by using the following YAML template. This template is used to perform the following operations:

- Create a Dataset that describes the dataset in remote storage and the underlying file system (UFS).
- Create a JindoRuntime that launches a JindoFS cluster for data caching.

```yaml
apiVersion: data.fluid.io/v1alpha1
kind: Dataset
metadata:
  name: hadoop
spec:
  mounts:
    - mountPoint: oss://<oss_bucket>/<bucket_dir>
      options:
        fs.oss.endpoint: <oss_endpoint>
      name: hadoop
      path: "/"
      encryptOptions:
        - name: fs.oss.accessKeyId
          valueFrom:
            secretKeyRef:
              name: mysecret
              key: fs.oss.accessKeyId
        - name: fs.oss.accessKeySecret
          valueFrom:
            secretKeyRef:
              name: mysecret
              key: fs.oss.accessKeySecret
---
apiVersion: data.fluid.io/v1alpha1
kind: JindoRuntime
metadata:
  name: hadoop
spec:
  replicas: 2
  tieredstore:
    levels:
      - mediumtype: MEM
        path: /dev/shm
        volumeType: emptyDir
        quota: 2Gi
        high: "0.99"
        low: "0.95"
```
The following table describes the parameters in the YAML template.

| Parameter | Description |
| --- | --- |
| `mountPoint` | `oss://<oss_bucket>/<bucket_dir>` specifies the path of the UFS that is mounted. Do not include the endpoint in the path. |
| `fs.oss.endpoint` | The public or internal endpoint of the OSS bucket. For more information, see Regions and endpoints. |
| `replicas` | The number of workers in the JindoFS cluster. |
| `mediumtype` | The cache type used when you create the JindoRuntime template. Valid values: HDD, SSD, and MEM. |
| `path` | The storage path. You can specify only one path. If you set `mediumtype` to MEM, you must specify a local path to store data such as logs. |
| `quota` | The maximum size of the cached data per worker, for example, 2Gi. |
| `high` | The upper limit of the storage usage. |
| `low` | The lower limit of the storage usage. |
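To see how these sizing parameters combine, here is a quick sanity check in plain arithmetic (not a Fluid API), assuming `high` and `low` act as usage watermarks relative to each worker's quota: with `replicas: 2` and `quota: 2Gi`, the total cache capacity is 4 GiB, which matches the `CACHE CAPACITY` column reported by `kubectl get dataset` later in this topic.

```python
# Sanity check of the tiered-store sizing in resource.yaml (plain arithmetic).
replicas = 2            # JindoRuntime spec.replicas
quota_gib = 2           # per-worker cache quota (2Gi)
high, low = 0.99, 0.95  # watermarks, as fractions of the per-worker quota

total_capacity_gib = replicas * quota_gib
print(f"total cache capacity: {total_capacity_gib} GiB")

# Assuming watermark semantics: when usage rises above `high`, cached data
# is evicted until usage falls below `low`. Per worker, in GiB:
print(f"eviction band per worker: {low * quota_gib:.2f}-{high * quota_gib:.2f} GiB")
```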
Run the following command to create the dataset and the JindoRuntime:

```shell
kubectl create -f resource.yaml
```

Run the following command to check whether the dataset is deployed:

```shell
kubectl get dataset hadoop
```

Expected output:

```
NAME     UFS TOTAL SIZE   CACHED   CACHE CAPACITY   CACHED PERCENTAGE   PHASE   AGE
hadoop   210MiB           0.00B    4.00GiB          0.0%                Bound   1h
```
Run the following command to check whether the JindoRuntime is deployed:

```shell
kubectl get jindoruntime hadoop
```

Expected output:

```
NAME     MASTER PHASE   WORKER PHASE   FUSE PHASE   AGE
hadoop   Ready          Ready          Ready        4m45s
```

Run the following command to check whether the persistent volume (PV) and persistent volume claim (PVC) are created:

```shell
kubectl get pv,pvc
```

Expected output:

```
NAME                      CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS   CLAIM            STORAGECLASS   REASON   AGE
persistentvolume/hadoop   100Gi      RWX            Retain           Bound    default/hadoop                           52m

NAME                           STATUS   VOLUME   CAPACITY   ACCESS MODES   STORAGECLASS   AGE
persistentvolumeclaim/hadoop   Bound    hadoop   100Gi      RWX            52m
```
The preceding outputs indicate that the dataset and JindoRuntime are created.
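If you want to script this verification instead of reading the output by hand, one option is to parse the columns of `kubectl get dataset`. The sketch below works on a captured output string; in a real script you would obtain the value via `subprocess`, or more robustly query a single field with `kubectl get dataset hadoop -o jsonpath='{.status.phase}'`.

```python
# Parse a captured `kubectl get dataset hadoop` output and check the phase.
captured = """\
NAME     UFS TOTAL SIZE   CACHED   CACHE CAPACITY   CACHED PERCENTAGE   PHASE   AGE
hadoop   210MiB           0.00B    4.00GiB          0.0%                Bound   1h
"""

# Column order follows the header: NAME, UFS TOTAL SIZE, CACHED,
# CACHE CAPACITY, CACHED PERCENTAGE, PHASE, AGE.
name, ufs_total, cached, capacity, pct, phase, age = (
    captured.strip().splitlines()[1].split()
)

print(f"dataset {name}: phase={phase}, cached {cached} of {ufs_total}")
assert phase == "Bound"  # the dataset is successfully bound to the runtime
```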
Step 3: Create applications to test data acceleration
You can deploy an application in a container to test the data acceleration of JindoFS, or submit machine learning jobs that consume the dataset. In this topic, an application is deployed in a container and reads the same file multiple times so that the access times can be compared.
Create a file named `app.yaml` by using the following YAML template:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: demo-app
spec:
  containers:
    - name: demo
      image: anolis-registry.cn-zhangjiakou.cr.aliyuncs.com/openanolis/nginx:1.14.1-8.6
      volumeMounts:
        - mountPath: /data
          name: hadoop
  volumes:
    - name: hadoop
      persistentVolumeClaim:
        claimName: hadoop
```
Run the following command to deploy the application:

```shell
kubectl create -f app.yaml
```

Run the following commands to log on to the container and query the size of the specified file:

```shell
kubectl exec -it demo-app -- bash
du -sh /data/spark-3.0.1-bin-hadoop2.7.tgz
```

Expected output:

```
210M    /data/spark-3.0.1-bin-hadoop2.7.tgz
```

Run the following command in the container to query the time consumed to copy the file:

```shell
time cp /data/spark-3.0.1-bin-hadoop2.7.tgz /dev/null
```

Expected output:

```
real    0m18.386s
user    0m0.002s
sys     0m0.105s
```

The output indicates that about 18 seconds are consumed to copy the file.
Run the following command to check the cached data of the dataset:

```shell
kubectl get dataset hadoop
```

Expected output:

```
NAME     UFS TOTAL SIZE   CACHED      CACHE CAPACITY   CACHED PERCENTAGE   PHASE   AGE
hadoop   210.00MiB        210.00MiB   4.00GiB          100.0%              Bound   1h
```

The output indicates that all 210 MiB of data is cached locally.
Run the following command to delete the current application and then create the same application:

Note: This step is performed to prevent other factors, such as the page cache, from affecting the results.

```shell
kubectl delete -f app.yaml && kubectl create -f app.yaml
```

Run the following commands to query the time consumed to copy the file again:

```shell
kubectl exec -it demo-app -- bash
time cp /data/spark-3.0.1-bin-hadoop2.7.tgz /dev/null
```

Expected output:

```
real    0m0.048s
user    0m0.001s
sys     0m0.046s
```

The output indicates that the copy now takes only about 48 milliseconds, more than 300 times faster than the first copy.

Note: The speedup is achieved because the file is served from the JindoFS cache.
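The speedup figure can be checked with simple arithmetic on the two `real` times reported above:

```python
# Compare the cold (uncached) and warm (cached) copy times from the two runs.
cold_s = 18.386   # first copy: data read from OSS (real 0m18.386s)
warm_s = 0.048    # second copy: data served from the JindoFS cache (real 0m0.048s)

speedup = cold_s / warm_s
print(f"speedup: {speedup:.0f}x")  # roughly 383x, i.e. more than 300 times faster
assert speedup > 300
```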
Clear the environment
If you no longer use data acceleration, clear the environment.
Run the following command to delete the application:

```shell
kubectl delete pod demo-app
```

Run the following command to delete the dataset and the JindoRuntime:

```shell
kubectl delete dataset hadoop
```