JindoRuntime is a Fluid runtime engine developed by the Alibaba Cloud E-MapReduce (EMR) team based on JindoFS. JindoFS is written in C++ and provides dataset management and caching for Fluid. JindoRuntime can cache data stored in persistent volumes (PVs) of Kubernetes clusters to accelerate data access. The PVs can be backed by any self-managed file system, such as CephFS. This topic describes how to use JindoRuntime to accelerate access to PVs.
Prerequisites
A Container Service for Kubernetes (ACK) Pro cluster that does not run ContainerOS is created, and the Kubernetes version of the cluster is 1.18 or later. For more information, see Create an ACK Pro cluster.
Important: The ack-fluid component is currently not supported on ContainerOS.
The cloud-native AI suite is installed and the ack-fluid component is deployed. The version of the ack-fluid component must be later than 1.0.6.
Important: If you have installed open source Fluid, you must uninstall Fluid before you can install the ack-fluid component.
If you have not installed the cloud-native AI suite, enable Fluid acceleration when you install the suite. For more information, see Deploy the cloud-native AI suite.
If you have installed the cloud-native AI suite, log on to the ACK console and deploy ack-fluid from the Cloud-native AI Suite page.
A kubectl client is connected to the ACK Pro cluster. For more information, see Connect to a cluster by using kubectl.
A PV and a persistent volume claim (PVC) that use the specified file system are created.
In Kubernetes clusters, different methods are used to create volumes for different file systems. To ensure the stability of the connection between a file system and a Kubernetes cluster, refer to the official documentation of the file system and complete the prerequisites.
Step 1: Query the PV and PVC
Run the following command to query the PV and PVC:
kubectl get pvc,pv
Expected output:
NAME                             STATUS   VOLUME    CAPACITY   ACCESS MODES   STORAGECLASS   AGE
persistentvolumeclaim/demo-pvc   Bound    demo-pv   5Gi        RWX                           19h

NAME                       CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS   CLAIM              STORAGECLASS   REASON   AGE
persistentvolume/demo-pv   30Gi       RWX            Retain           Bound    default/demo-pvc                           19h
The PV named demo-pv is 30 GiB in size and supports the ReadWriteMany (RWX) access mode. The PV is bound to a PVC named demo-pvc. The PV and the PVC can be used as expected.
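The binding check above can also be scripted instead of read from the table. The following is a minimal sketch that assumes the PVC lives in the current namespace. The helper names pvc_phase and pvc_is_bound are hypothetical. The sketch relies only on the standard kubectl JSONPath output option.

```shell
# Print the phase of a PVC (for example, Bound) by using a JSONPath query.
pvc_phase() {
  kubectl get pvc "$1" -o jsonpath='{.status.phase}'
}

# Succeed only if the PVC is in the Bound phase.
pvc_is_bound() {
  [ "$(pvc_phase "$1")" = "Bound" ]
}

# Usage (requires access to the cluster):
#   pvc_is_bound demo-pvc && echo "demo-pvc is ready"
```

The functions only wrap kubectl get, so they do not change any cluster state.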
Step 2: Create a Fluid Dataset object and a JindoRuntime object
Create a file named dataset.yaml and copy the following content to the file. The following configuration defines two Fluid resource objects: Dataset and JindoRuntime.
Dataset: specifies information about the PVC.
JindoRuntime: specifies the configuration of the JindoFS distributed cache system, including the number of workers and the maximum size of data that can be cached on each worker.
apiVersion: data.fluid.io/v1alpha1
kind: Dataset
metadata:
  name: pv-demo-dataset
spec:
  mounts:
    - mountPoint: pvc://demo-pvc
      name: data
      path: /
  accessModes:
    - ReadOnlyMany
---
apiVersion: data.fluid.io/v1alpha1
kind: JindoRuntime
metadata:
  name: pv-demo-dataset
spec:
  replicas: 2
  tieredstore:
    levels:
      - mediumtype: MEM
        volumeType: emptyDir
        path: /dev/shm
        quota: 10Gi
        high: "0.9"
        low: "0.8"
The following table describes the parameters.
Parameter: Description
mountPoint: The information about the data source to be mounted. When a PVC is specified as the data source, you can specify a path in the pvc://<pvc_name>/<path> format:
    pvc_name: the name of the PVC. The PVC and the Dataset object must belong to the same namespace.
    path: the subpath of the volume to be mounted. Make sure that the subpath exists. Otherwise, the volume fails to be mounted.
replicas: The number of workers for the JindoFS cache system. You can modify the number based on your requirements.
mediumtype: The cache type. Valid values: HDD, SSD, and MEM.
    For more information about the recommended configurations of mediumtype, see Policy 2: Select proper cache media.
volumeType: The volume type of the cache medium. Valid values: emptyDir and hostPath. Default value: hostPath.
    If you use memory or local system disks as the cache medium, we recommend that you use the emptyDir type to avoid residual cache data on the node and ensure node availability.
    If you use local data disks as the cache medium, you can use the hostPath type and configure path to specify the mount path of the data disk on the host.
    For more information about the recommended configurations of volumeType, see Policy 2: Select proper cache media.
path: The path where the workers store the cached data. For the best data access performance, we recommend that you use /dev/shm or a path that is mounted as a memory file system.
quota: The maximum size of data that can be cached on each worker. You can modify the size based on your business requirements.
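As a rough sanity check on the sample values: if each worker can cache up to quota, the total cache capacity should be about replicas multiplied by quota. The sketch below is only that arithmetic, and the function name total_cache_gib is hypothetical. With replicas: 2 and quota: 10Gi it yields 20 GiB, which is consistent with the CACHE CAPACITY value that kubectl get dataset reports for this example.

```shell
# Total cache capacity in GiB = number of workers x per-worker quota in GiB.
total_cache_gib() {
  replicas="$1"
  quota_gib="$2"
  echo $(( replicas * quota_gib ))
}

total_cache_gib 2 10   # replicas: 2, quota: 10Gi -> prints 20
```

If the dataset is larger than this total, it presumably cannot be fully cached, so size replicas and quota against the UFS TOTAL SIZE of the dataset.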
Run the following command to create a Dataset object and a JindoRuntime object:
kubectl create -f dataset.yaml
Run the following command to check whether the Dataset object is deployed:
kubectl get dataset pv-demo-dataset
Note: The system needs to pull an image the first time the JindoFS cache system starts. Pulling the image may take 2 to 3 minutes, depending on network conditions.
Expected output:
NAME              UFS TOTAL SIZE   CACHED   CACHE CAPACITY   CACHED PERCENTAGE   PHASE   AGE
pv-demo-dataset   10.96GiB         0.00B    20.00GiB         0.0%                Bound   2m13s
If the Dataset object is in the Bound state, the JindoFS cache system has been launched. Application pods can access the data defined in the Dataset object as expected.
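Because the first start can take a few minutes, a script may prefer to poll until the Dataset reaches the Bound state rather than rerun the query by hand. The following is a minimal sketch under two assumptions: the phase shown in the PHASE column is exposed at .status.phase, and the Dataset is in the current namespace. The function name wait_for_dataset_bound and its default timeout and interval are hypothetical.

```shell
# Poll the Dataset phase until it is Bound or the timeout (in seconds) expires.
# Arguments: <dataset name> [timeout seconds, default 300] [interval seconds, default 5]
wait_for_dataset_bound() {
  name="$1"
  timeout="${2:-300}"
  interval="${3:-5}"
  elapsed=0
  phase=""
  while [ "$elapsed" -lt "$timeout" ]; do
    # Assumption: the PHASE column maps to .status.phase of the Dataset.
    phase="$(kubectl get dataset "$name" -o jsonpath='{.status.phase}' 2>/dev/null)"
    if [ "$phase" = "Bound" ]; then
      echo "Dataset $name is Bound"
      return 0
    fi
    sleep "$interval"
    elapsed=$(( elapsed + interval ))
  done
  echo "Timed out waiting for Dataset $name (last phase: ${phase:-unknown})" >&2
  return 1
}

# Usage (requires access to the cluster):
#   wait_for_dataset_bound pv-demo-dataset 300 5
```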
(Optional) Step 3: Create a DataLoad object to prefetch data
First-time queries cannot hit the cache. Fluid allows you to create DataLoad objects to prefetch data to accelerate first-time queries.
Create a file named dataload.yaml and add the following content to the file:
apiVersion: data.fluid.io/v1alpha1
kind: DataLoad
metadata:
  name: dataset-warmup
spec:
  dataset:
    name: pv-demo-dataset
    namespace: default
  loadMetadata: true
  target:
    - path: /
      replicas: 1
The following table describes the parameters.
Parameter: Description
dataset.name: The name of the Dataset object to be prefetched.
dataset.namespace: The namespace to which the Dataset object belongs. The namespace must be the same as the namespace of the DataLoad object.
loadMetadata: Specifies whether to synchronize the metadata before prefetching. Set the value to true for JindoRuntime.
target[*].path: The path or file to be prefetched. The path must be relative to the mount point specified in the Dataset object.
    For example, if the data source in the Dataset object is pvc://my-pvc/mydata and you set path to /test, the /mydata/test path in the file system used by the PVC my-pvc is prefetched.
target[*].replicas: The number of workers created to cache the prefetched path or file.
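The path rule for target[*].path can be illustrated with plain string handling. The sketch below is not part of Fluid, and the function name resolve_prefetch_path is hypothetical. It splits a pvc://<pvc_name>/<path> mount point and appends the target path, reproducing the pvc://my-pvc/mydata plus /test example from the table.

```shell
# Resolve a DataLoad target path against a Dataset mountPoint of the form
# pvc://<pvc_name>/<path>. Prints "<pvc_name> <path inside the volume>".
resolve_prefetch_path() {
  mount_point="$1"                  # for example, pvc://my-pvc/mydata
  target="$2"                       # for example, /test
  rest="${mount_point#pvc://}"      # my-pvc/mydata
  pvc_name="${rest%%/*}"            # my-pvc
  case "$rest" in
    */*) base="/${rest#*/}" ;;      # /mydata
    *)   base="" ;;                 # the mountPoint has no subpath
  esac
  base="${base%/}"                  # drop a trailing slash, if any
  echo "$pvc_name ${base}${target}"
}

resolve_prefetch_path pvc://my-pvc/mydata /test   # prints: my-pvc /mydata/test
```

For the Dataset in this topic, pvc://demo-pvc with the target path / resolves to the root of the volume behind demo-pvc.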
Run the following command to create the DataLoad object:
kubectl create -f dataload.yaml
Run the following command to query the status of the DataLoad object:
kubectl get dataload dataset-warmup
Expected output:
NAME             DATASET           PHASE      AGE   DURATION
dataset-warmup   pv-demo-dataset   Complete   62s   12s
Run the following command to query the status of the Dataset object:
kubectl get dataset
Expected output:
NAME              UFS TOTAL SIZE   CACHED     CACHE CAPACITY   CACHED PERCENTAGE   PHASE   AGE
pv-demo-dataset   10.96GiB         10.96GiB   20.00GiB         100.0%              Bound   3m13s
After the prefetching process is complete, the size of the cached data (CACHED) equals the size of the dataset. This indicates that the entire dataset is cached and the percentage of data that is cached (CACHED PERCENTAGE) is 100%.
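In automation, the prefetch step can block until the DataLoad finishes instead of being queried repeatedly. The following sketch assumes kubectl 1.23 or later (which supports the --for=jsonpath condition of kubectl wait) and assumes that the PHASE column of the DataLoad is exposed at .status.phase. The function name wait_for_dataload and the default timeout are hypothetical.

```shell
# Block until the DataLoad reports the Complete phase, or fail after the timeout.
# Arguments: <dataload name> [timeout, default 10m]
wait_for_dataload() {
  kubectl wait "dataload/$1" \
    --for=jsonpath='{.status.phase}'=Complete \
    --timeout="${2:-10m}"
}

# Usage (requires access to the cluster):
#   wait_for_dataload dataset-warmup 10m
```

A non-zero exit status means that the DataLoad did not reach Complete within the timeout, in which case kubectl describe dataload can help with diagnosis.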
Step 4: Create application pods to access the data stored in the PV
Create a file named pod.yaml, add the following content to the file, and then set claimName to the name of the Dataset object created in Step 2.
apiVersion: v1
kind: Pod
metadata:
  name: nginx
spec:
  containers:
    - name: nginx
      image: anolis-registry.cn-zhangjiakou.cr.aliyuncs.com/openanolis/nginx:1.14.1-8.6
      command:
        - "bash"
        - "-c"
        - "sleep inf"
      volumeMounts:
        - mountPath: /data
          name: data-vol
  volumes:
    - name: data-vol
      persistentVolumeClaim:
        claimName: pv-demo-dataset # Specify the name of the Dataset object.
Run the following command to create the application pod:
kubectl create -f pod.yaml
Run the following command to open a shell in the pod and access the data:
kubectl exec -it nginx -- bash
Expected output:
# A file named demofile is stored in the /data path of the Nginx pod. The file is 11 GB in size.
ls -lh /data
total 11G
-rw-r----- 1 root root 11G Jul 22 2022 demofile
# Run the cat /data/demofile > /dev/null command to read the demofile file and write the file to /dev/null, which takes 11.004 seconds.
time cat /data/demofile > /dev/null
real 0m11.004s
user 0m0.065s
sys 0m3.089s
The entire dataset is cached in the JindoFS cache system. When a read hits the cache, the data is served directly from the cache instead of being fetched remotely from the file system. This shortens the data path and accelerates data access.
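To put the timing into perspective, the average read throughput can be estimated as the file size divided by the elapsed time. The sketch below is only that arithmetic, with the hypothetical function name throughput_gib_per_s. It uses the rounded 11G size reported by ls -lh and the real time reported by the time command, so the result is approximate.

```shell
# Average read throughput = size read / elapsed wall-clock time.
# Arguments: <size in GiB> <elapsed seconds>
throughput_gib_per_s() {
  awk -v size="$1" -v secs="$2" 'BEGIN { printf "%.2f\n", size / secs }'
}

throughput_gib_per_s 11 11.004   # 11G file, real 0m11.004s -> prints 1.00
```

In this example the cached read averages roughly 1 GiB per second.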