Fluid is an open-source, Kubernetes-native, distributed dataset orchestration and acceleration engine for data-intensive applications in cloud-native scenarios, such as big data and AI applications. In edge computing scenarios, the dataset acceleration engine of Fluid significantly speeds up access to Object Storage Service (OSS) files from edge nodes. This topic describes how to use the data acceleration feature of Fluid in a Container Service for Kubernetes (ACK) Edge cluster.
Prerequisites
An ACK Edge cluster that runs Kubernetes 1.18 or later is created. For more information, see Create an ACK Edge cluster in the console.
An edge node pool is created with edge nodes added to it. For more information, see Create an edge node pool and Add edge nodes.
The cloud-native AI suite is installed and the ack-fluid component is deployed.
Important: If you have installed open-source Fluid, you must uninstall it before you install the ack-fluid component.
If the cloud-native AI suite is not deployed: Enable Fluid under Data Access Acceleration when you deploy the suite.
If the cloud-native AI suite is deployed: On the Cloud-native AI Suite page of the ACK Console, deploy ack-fluid.
A kubectl client is connected to the ACK Edge cluster. For more information, see Obtain the kubeconfig file of a cluster and use kubectl to connect to the cluster. A quick connectivity check is shown after this list.
Object Storage Service (OSS) is activated. For more information, see Activate OSS.
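Before you start, you can verify that kubectl can reach the cluster and that the edge nodes are registered; a minimal check:
kubectl get nodes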
Step 1: Upload data to OSS
Run the following command to download a test dataset to an Elastic Compute Service (ECS) instance:
wget https://archive.apache.org/dist/spark/spark-3.0.1/spark-3.0.1-bin-hadoop2.7.tgz
Upload the test dataset to the OSS bucket.
Run the following command to create a bucket named examplebucket:
ossutil64 mb oss://examplebucket
If the following output is displayed, the bucket named examplebucket is created:
0.668238(s) elapsed
Run the following command to upload the test dataset to examplebucket:
ossutil64 cp spark-3.0.1-bin-hadoop2.7.tgz oss://examplebucket
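To confirm that the upload succeeded, you can list the objects in the bucket; for example:
ossutil64 ls oss://examplebucket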
Step 2: Create a dataset and a JindoRuntime
Before you create a dataset, create a file named mySecret.yaml in the root directory of the ECS instance:
apiVersion: v1
kind: Secret
metadata:
  name: mysecret
stringData:
  fs.oss.accessKeyId: xxx
  fs.oss.accessKeySecret: xxx
Set fs.oss.accessKeyId and fs.oss.accessKeySecret to the AccessKey ID and AccessKey secret that are used to access OSS in Step 1.
Run the following command to create the Secret. Storing the AccessKey pair in a Secret prevents it from appearing in plaintext in the Dataset manifest.
kubectl create -f mySecret.yaml
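You can check that the Secret was created; note that kubectl does not print the key values in plaintext:
kubectl get secret mysecret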
Create a file named resource.yaml with the following YAML template. This template is used to perform the following operations:
Create a Dataset that specifies information about the dataset in remote storage and the underlying file system (UFS).
Create a JindoRuntime that launches a JindoFS cluster for data caching.
apiVersion: data.fluid.io/v1alpha1
kind: Dataset
metadata:
  name: hadoop
spec:
  nodeAffinity:
    required:
      nodeSelectorTerms:
        - matchExpressions:
            - key: alibabacloud.com/nodepool-id
              operator: In
              values:
                - npxxxxxxxxxxxxxx
  mounts:
    - mountPoint: oss://<oss_bucket>/<bucket_dir>
      options:
        fs.oss.endpoint: <oss_endpoint>
      name: hadoop
      path: "/"
      encryptOptions:
        - name: fs.oss.accessKeyId
          valueFrom:
            secretKeyRef:
              name: mysecret
              key: fs.oss.accessKeyId
        - name: fs.oss.accessKeySecret
          valueFrom:
            secretKeyRef:
              name: mysecret
              key: fs.oss.accessKeySecret
---
apiVersion: data.fluid.io/v1alpha1
kind: JindoRuntime
metadata:
  name: hadoop
spec:
  nodeSelector:
    alibabacloud.com/nodepool-id: npxxxxxxxxxxxxxx
  replicas: 2
  tieredstore:
    levels:
      - mediumtype: MEM
        path: /dev/shm
        volumeType: emptyDir
        quota: 2Gi
        high: "0.99"
        low: "0.95"
Note:
In the ACK Edge cluster, you must deploy the Dataset and the JindoRuntime to the same node pool by using the nodeAffinity and nodeSelector fields to ensure network connectivity within the node pool.
Because both edge node management and OSS access go through the cloud-edge network, we recommend that you reserve sufficient network bandwidth to ensure the stability of the management channel.
The following list describes the parameters in the YAML template:
mountPoint: oss://<oss_bucket>/<bucket_dir> specifies the path of the UFS to mount. The path must point to a directory instead of a file and must not include the endpoint.
fs.oss.endpoint: The public or private endpoint of the OSS bucket. For more information, see Regions and endpoints.
replicas: The number of workers in the JindoFS cluster.
mediumtype: The cache type used when you create the JindoRuntime. Valid values: HDD, SSD, and MEM.
path: The storage path. You can specify only one path. If you set mediumtype to MEM, you must specify a local path to store data, such as logs.
quota: The maximum size of the cached data. Unit: GiB.
high: The upper limit of storage usage, expressed as a ratio of the cache capacity.
low: The lower limit of storage usage, expressed as a ratio of the cache capacity.
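If memory on the edge nodes is limited, you can cache on local disk instead of memory. The following is a sketch of an alternative tieredstore level; the path and quota are illustrative and should be adjusted to your nodes:
tieredstore:
  levels:
    - mediumtype: HDD
      path: /mnt/disk1
      quota: 100Gi
      high: "0.99"
      low: "0.95"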
Run the following command to create a dataset and a JindoRuntime:
kubectl create -f resource.yaml
Run the following command to check whether the dataset is deployed:
kubectl get dataset hadoop
Expected output:
NAME     UFS TOTAL SIZE   CACHED   CACHE CAPACITY   CACHED PERCENTAGE   PHASE   AGE
hadoop   210MiB           0.00B    4.00GiB          0.0%                Bound   1h
Run the following command to check whether the JindoRuntime is deployed:
kubectl get jindoruntime hadoop
Expected output:
NAME     MASTER PHASE   WORKER PHASE   FUSE PHASE   AGE
hadoop   Ready          Ready          Ready        4m45s
Run the following command to check whether the persistent volume (PV) and persistent volume claim (PVC) are created:
kubectl get pv,pvc
Expected output:
NAME                      CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS   CLAIM            STORAGECLASS   REASON   AGE
persistentvolume/hadoop   100Gi      RWX            Retain           Bound    default/hadoop                           52m

NAME                           STATUS   VOLUME   CAPACITY   ACCESS MODES   STORAGECLASS   AGE
persistentvolumeclaim/hadoop   Bound    hadoop   100Gi      RWX                           52m
The preceding outputs indicate that the dataset and JindoRuntime are created.
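If any phase in the preceding outputs is not Ready, you can inspect the JindoFS component pods directly; a quick check, assuming the pod names contain the runtime name hadoop:
kubectl get pods | grep hadoop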
Step 3: Create an application container to test data access acceleration
You can deploy an application in a container to test the data access acceleration of JindoFS. You can also submit machine learning jobs to use the relevant features. In this topic, an application is deployed in a container and reads the same data multiple times to compare the time consumed.
Create a file named app.yaml and copy the following YAML template to the file:
apiVersion: v1
kind: Pod
metadata:
  name: demo-app
spec:
  nodeSelector:
    alibabacloud.com/nodepool-id: npxxxxxxxxxxxxx
  containers:
    - name: demo
      image: anolis-registry.cn-zhangjiakou.cr.aliyuncs.com/openanolis/nginx:1.14.1-8.6
      volumeMounts:
        - mountPath: /data
          name: hadoop
  volumes:
    - name: hadoop
      persistentVolumeClaim:
        claimName: hadoop
Note: In the ACK Edge cluster, you must deploy the test pod to the node pool specified in Step 2 by using the nodeSelector field.
Run the following command to deploy the application:
kubectl create -f app.yaml
Run the following commands to open a shell in the pod and query the size of the specified file:
kubectl exec -it demo-app -- bash
du -sh /data/spark-3.0.1-bin-hadoop2.7.tgz
Expected output:
210M /data/spark-3.0.1-bin-hadoop2.7.tgz
Run the following command to query the time consumed to copy the file:
time cp /data/spark-3.0.1-bin-hadoop2.7.tgz /dev/null
Expected output:
real    0m18.386s
user    0m0.002s
sys     0m0.105s
The output indicates that it takes about 18 seconds to copy the file.
Run the following command to check the cached data of the dataset:
kubectl get dataset hadoop
Expected output:
NAME     UFS TOTAL SIZE   CACHED      CACHE CAPACITY   CACHED PERCENTAGE   PHASE   AGE
hadoop   210.00MiB        210.00MiB   4.00GiB          100.0%              Bound   1h
The output indicates that all 210 MiB of data is cached on the local storage of the edge nodes.
Run the following command to delete the current application and then create the same application:
Note: This step is performed to prevent other factors, such as the page cache, from affecting the results.
kubectl delete -f app.yaml && kubectl create -f app.yaml
Run the following commands to open a shell in the pod and query the time consumed to copy the file:
kubectl exec -it demo-app -- bash
time cp /data/spark-3.0.1-bin-hadoop2.7.tgz /dev/null
Expected output:
real    0m0.048s
user    0m0.001s
sys     0m0.046s
The output indicates that the file copy takes only 48 milliseconds, more than 300 times faster than the first copy.
Note: This is because the file is now cached by JindoFS.
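In addition to caching data on first access, Fluid can prewarm the cache before the application runs by using a DataLoad object. The following is a minimal sketch that assumes the Dataset named hadoop from Step 2; see the Fluid documentation for the full DataLoad specification:
apiVersion: data.fluid.io/v1alpha1
kind: DataLoad
metadata:
  name: hadoop-warmup
spec:
  dataset:
    name: hadoop
    namespace: default
After you create the object, for example with kubectl create -f dataload.yaml, the CACHED column of the dataset should reach 100% without the application reading the file first.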
Clear the environment
If you no longer use data acceleration, clear the environment.
Run the following command to delete the application:
kubectl delete pod demo-app
Run the following command to delete the dataset and JindoRuntime:
kubectl delete dataset hadoop
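If the Secret and the OSS test data were created only for this walkthrough, you can remove them as well; for example, assuming the mysecret Secret and the examplebucket bucket from the preceding steps:
kubectl delete secret mysecret
ossutil64 rm oss://examplebucket/spark-3.0.1-bin-hadoop2.7.tgz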