Fluid is an open source, Kubernetes-native distributed dataset orchestrator and accelerator for data-intensive applications in cloud-native scenarios, such as big data and AI applications. JindoRuntime is the execution engine of JindoFS, which is developed by the Alibaba Cloud E-MapReduce (EMR) team. JindoRuntime is implemented in C++, provides dataset management and caching, and supports Object Storage Service (OSS). Fluid enables the observability, auto scaling, and portability of datasets by managing and scheduling JindoRuntime. This topic describes how to use Fluid in a registered cluster to accelerate access to OSS objects.
How it works
The following figure shows how Fluid is used to accelerate access to OSS objects.
Prerequisites
An external cluster is registered with Container Service for Kubernetes (ACK) through a registered cluster. For more information, see Create a registered cluster.
A kubectl client is connected to the registered cluster. For more information, see Obtain the kubeconfig file of a cluster and use kubectl to connect to the cluster.
OSS is activated and a bucket is created. For more information, see Activate OSS and Create a bucket.
Step 1: Install ack-fluid
Use onectl
Install onectl on your on-premises machine. For more information, see Use onectl to manage registered clusters.
Run the following command to install ack-fluid:
onectl addon install ack-fluid --set pullImageByVPCNetwork=false
pullImageByVPCNetwork: Optional. This parameter specifies whether to pull the component image through a virtual private cloud (VPC).
Expected output:
Addon ack-fluid, version **** installed.
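To confirm that the installation succeeded, you can list the Fluid control plane pods. The following check assumes that Fluid is installed in the default fluid-system namespace; adjust the namespace if your installation differs.
kubectl get pods -n fluid-system   # all Fluid pods should be in the Running state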
Use the console
Log on to the ACK console. In the left-side navigation pane, choose .
On the App Catalog tab, find and click ack-fluid.
In the upper-right part of the page, click Deploy.
In the Deploy panel, specify Cluster, keep the default settings for Namespace and Release Name, and then click Next.
Set Chart Version to the latest version, configure component parameters, and then click OK.
Step 2: Prepare data
Run the following command to download a test dataset:
wget https://archive.apache.org/dist/spark/spark-3.0.1/spark-3.0.1-bin-hadoop2.7.tgz
Upload the test dataset to the OSS bucket. You can use ossutil, a command-line tool provided by OSS, to upload the dataset. For more information, see Install ossutil.
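As a reference, the following ossutil command sketches the upload. It assumes that ossutil is already configured with valid credentials and that <oss_bucket> and <bucket_dir> match the mountPoint that you configure in Step 4.
ossutil cp spark-3.0.1-bin-hadoop2.7.tgz oss://<oss_bucket>/<bucket_dir>/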
Step 3: Add labels to nodes in the external Kubernetes cluster
Run the following command to add the demo-oss=true label to all nodes in the external Kubernetes cluster. The label constrains the nodes on which the JindoRuntime master, worker, and FUSE components can be deployed.
kubectl label node **** demo-oss=true
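To verify that the label is in place, you can list the nodes that carry it:
kubectl get nodes -l demo-oss=true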
Step 4: Create a Dataset CR and a JindoRuntime CR
Create a file named mySecret.yaml and add the following content to the file.
The file is used to store the fs.oss.accessKeyId and fs.oss.accessKeySecret of OSS. You must create this file before creating the Dataset CustomResource (CR).
apiVersion: v1
kind: Secret
metadata:
  name: mysecret
stringData:
  fs.oss.accessKeyId: ****
  fs.oss.accessKeySecret: ****
Run the following command to deploy the mySecret file to generate a Secret:
kubectl create -f mySecret.yaml
Storing the AccessKey pair in a Secret prevents the sensitive information from appearing in plaintext in the Dataset CR. Note that Kubernetes only base64-encodes Secret data by default; enable encryption at rest if you require stronger protection.
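You can verify that the Secret exists before referencing it in the Dataset CR:
kubectl get secret mysecret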
Create a file named resource.yaml and add the following content to the file. The file contains a Dataset CR and a JindoRuntime CR.
Dataset: describes the dataset stored in the bucket and the underlying file system (UFS).
JindoRuntime: launches a JindoFS cluster to provide caching services.
apiVersion: data.fluid.io/v1alpha1
kind: Dataset
metadata:
  name: hadoop
spec:
  mounts:
    - mountPoint: oss://<oss_bucket>/<bucket_dir>
      options:
        fs.oss.endpoint: <oss_endpoint>
      name: hadoop
      path: "/"
      encryptOptions:
        - name: fs.oss.accessKeyId
          valueFrom:
            secretKeyRef:
              name: mysecret
              key: fs.oss.accessKeyId
        - name: fs.oss.accessKeySecret
          valueFrom:
            secretKeyRef:
              name: mysecret
              key: fs.oss.accessKeySecret
---
apiVersion: data.fluid.io/v1alpha1
kind: JindoRuntime
metadata:
  name: hadoop
spec:
  # Make sure that the cache runtime runs only on the nodes in the external Kubernetes cluster.
  master:
    nodeSelector:
      demo-oss: "true"
  worker:
    nodeSelector:
      demo-oss: "true"
  fuse:
    nodeSelector:
      demo-oss: "true"
  replicas: 2
  tieredstore:
    levels:
      - mediumtype: HDD
        path: /mnt/disk1
        quota: 100G
        high: "0.99"
        low: "0.8"
The following describes the parameters in the preceding template.
Dataset
mountPoint: The path of the UFS to be mounted, in the format oss://<oss_bucket>/<bucket_dir>. You do not need to include the endpoint in the path.
fs.oss.endpoint: The public or private endpoint of the OSS bucket.
JindoRuntime
replicas: The number of workers in the JindoFS cluster.
mediumtype: The cache type. You can select HDD, SSD, or MEM when you create the JindoRuntime template.
path: The cache path. You can specify only one path. If you select MEM as the cache type, you must specify a local path to store logs.
quota: The maximum size of the cache. Unit: GB.
high: The upper limit (watermark) of the cache storage usage.
low: The lower limit (watermark) of the cache storage usage.
Run the following command to create a Dataset CR and a JindoRuntime CR:
kubectl create -f resource.yaml
Run the following command to query the deployment of the Dataset CR:
kubectl get dataset hadoop
Expected output:
NAME     UFS TOTAL SIZE   CACHED   CACHE CAPACITY   CACHED PERCENTAGE   PHASE   AGE
hadoop   210MiB           0.00B    100.00GiB        0.0%                Bound   1h
Run the following command to query the deployment of the JindoRuntime CR:
kubectl get jindoruntime hadoop
Expected output:
NAME     MASTER PHASE   WORKER PHASE   FUSE PHASE   AGE
hadoop   Ready          Ready          Ready        4m45s
Run the following command to query the status of the persistent volume (PV) and persistent volume claim (PVC):
kubectl get pv,pvc
Expected output:
NAME                      CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS   CLAIM            STORAGECLASS   REASON   AGE
persistentvolume/hadoop   100Gi      RWX            Retain           Bound    default/hadoop                           52m

NAME                           STATUS   VOLUME   CAPACITY   ACCESS MODES   STORAGECLASS   AGE
persistentvolumeclaim/hadoop   Bound    hadoop   100Gi      RWX                           52m
The output indicates that the Dataset and JindoRuntime CRs are created.
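If you want to check which nodes the cache components are scheduled on, you can list the related pods. The exact pod names vary with the Fluid version; they typically contain the runtime name hadoop.
kubectl get pods -o wide | grep hadoop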
Step 5: Create a containerized application to verify the acceleration service
You can create a containerized application or submit a machine learning job to verify the JindoFS acceleration service. This section describes how to create a containerized application to access the same dataset multiple times and then compare the time consumption to verify the acceleration service.
Create a file named app.yaml and add the following content to the file:
apiVersion: v1
kind: Pod
metadata:
  name: demo-app
spec:
  containers:
    - name: demo
      image: fluidcloudnative/serving
      volumeMounts:
        - mountPath: /data
          name: hadoop
  volumes:
    - name: hadoop
      persistentVolumeClaim:
        claimName: hadoop
Run the following command to create a containerized application:
kubectl create -f app.yaml
Run the following command to query the size of the file to be accessed:
kubectl exec -it demo-app -- bash
du -sh /data/spark-3.0.1-bin-hadoop2.7.tgz
Expected output:
209.7M /data/spark-3.0.1-bin-hadoop2.7.tgz
Run the following command to query the time required to copy the file:
time cp /data/spark-3.0.1-bin-hadoop2.7.tgz /test
Expected output:
real    1m2.374s
user    0m0.000s
sys     0m0.256s
The output indicates that it takes 62 seconds to copy the file.
Run the following command to query the cache information of the dataset:
kubectl get dataset hadoop
Expected output:
NAME     UFS TOTAL SIZE   CACHED      CACHE CAPACITY   CACHED PERCENTAGE   PHASE   AGE
hadoop   209.74MiB        209.74MiB   100.00GiB        100.0%              Bound   1h
The output indicates that 209.7 MiB of data is cached.
Run the following command to delete the current application and then create the same application.
Note: This operation helps eliminate the impact of other factors, such as the page cache, on the verification result.
kubectl delete -f app.yaml && kubectl create -f app.yaml
Run the following command to query the time required to copy the file:
kubectl exec -it demo-app -- bash
time cp /data/spark-3.0.1-bin-hadoop2.7.tgz /test
Expected output:
real    0m3.454s
user    0m0.000s
sys     0m0.268s
The output indicates that it takes about 3.5 seconds to copy the file, roughly one-eighteenth of the time consumed by the first attempt. This is because the file has been cached by JindoFS, and accessing a cached file is much faster.
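As an alternative to warming the cache by reading the data once, Fluid also provides a DataLoad CR that prefetches a dataset into the cache. The following is a minimal sketch, assuming that the installed ack-fluid version supports DataLoad and that the hadoop Dataset resides in the default namespace:
apiVersion: data.fluid.io/v1alpha1
kind: DataLoad
metadata:
  name: hadoop-dataload
spec:
  dataset:
    # Refers to the Dataset CR created in Step 4.
    name: hadoop
    namespace: default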
(Optional) Step 6: Clear the environment
If you no longer need the acceleration service, run the following commands to clear the environment.
Run the following commands to delete the application and the JindoRuntime:
kubectl delete -f app.yaml
kubectl delete jindoruntime hadoop
Run the following command to delete the dataset:
kubectl delete dataset hadoop
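If you no longer need the AccessKey Secret created in Step 4, you can also delete it:
kubectl delete -f mySecret.yaml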