Fluid isolates resources in Kubernetes by namespace. You can use Fluid to regulate access to datasets from different computing jobs and to isolate data that belongs to different teams. Fluid also supports data access and cache sharing across namespaces: when multiple teams need to share the same data, you need to cache the data only once. This greatly improves data utilization efficiency and data management flexibility, and facilitates collaboration between R&D teams. This topic describes how to share datasets across namespaces by using Fluid.
How it works
Fluid supports ThinRuntime, which allows you to access various storage systems in a low-code way and reuse key Fluid capabilities, such as data orchestration and runtime-based data access. With ThinRuntime, Fluid allows you to associate a dataset in one namespace with a dataset in another namespace. This way, applications in different namespaces can share the same cache runtime.
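At its core, a cross-namespace reference is just a Dataset whose mountPoint uses the dataset:// protocol instead of a storage URL. The following minimal sketch illustrates the pattern; the namespaces team-a and team-b and the dataset name demo-dataset are placeholders, not resources created in this topic:

```yaml
# Sketch only: a reference dataset in namespace team-b that points to an
# existing dataset named demo-dataset in namespace team-a. No storage
# credentials or runtime are declared here; the referenced dataset's
# cache runtime is reused.
apiVersion: data.fluid.io/v1alpha1
kind: Dataset
metadata:
  name: demo-dataset-ref
  namespace: team-b
spec:
  mounts:
    - mountPoint: dataset://team-a/demo-dataset  # dataset://<namespace>/<dataset-name>
```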
Prerequisites
A Container Service for Kubernetes (ACK) Pro cluster that runs Kubernetes 1.18 or later is created, and the nodes of the cluster do not use ContainerOS. For more information, see Create an ACK Pro cluster.
Important: The ack-fluid component is not supported on ContainerOS.
The cloud-native AI suite is installed and the ack-fluid component is deployed.
Important: If you have installed open source Fluid, uninstall it before you deploy the ack-fluid component.
If you have not installed the cloud-native AI suite, enable Fluid acceleration when you install the suite. For more information, see Deploy the cloud-native AI suite.
If you have already installed the cloud-native AI suite, go to the Cloud-native AI Suite page of the ACK console and deploy the ack-fluid component.
A kubectl client is connected to the ACK Pro cluster. For more information, see Connect to a cluster by using kubectl.
Step 1: Upload the test dataset to the OSS bucket
Create a test dataset of 2 GB in size. The following examples use this test dataset.
Upload the test dataset to the OSS bucket that you created.
You can use the ossutil tool provided by OSS to upload data. For more information, see Install ossutil.
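With ossutil installed and configured, the upload can look like the following sketch. The local path, bucket name, directory, and endpoint are placeholders; replace them with your own values:

```shell
# Sketch only: recursively upload the local test dataset directory to the
# OSS bucket. <examplebucket> and <bucket_dir> are placeholders.
ossutil cp -r ./test-dataset oss://<examplebucket>/<bucket_dir>/ \
  -e oss-cn-beijing-internal.aliyuncs.com   # -e specifies the OSS endpoint
```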
Step 2: Create a shared dataset and a Runtime object
JindoRuntime
Create a namespace named share. In the following example, the shared dataset and runtime object are created in this namespace.

```shell
kubectl create ns share
```

Run the following command to create a Secret to store the AccessKey pair used to access the Object Storage Service (OSS) bucket:

```shell
kubectl apply -f - <<EOF
apiVersion: v1
kind: Secret
metadata:
  name: dataset-secret
  namespace: share
stringData:
  fs.oss.accessKeyId: <YourAccessKey ID>
  fs.oss.accessKeySecret: <YourAccessKey Secret>
EOF
```

In the preceding code, the fs.oss.accessKeyId parameter specifies the AccessKey ID and the fs.oss.accessKeySecret parameter specifies the AccessKey secret. For more information about how to obtain an AccessKey pair, see Obtain an AccessKey pair.

Create a file named shared-dataset.yaml and copy the following content to the file. The file is used to create a dataset and a JindoRuntime. For more information about how to configure a dataset and a JindoRuntime, see Use JindoFS to accelerate access to OSS.

```yaml
# Create a dataset that describes the dataset stored in the OSS bucket and the underlying file system (UFS).
apiVersion: data.fluid.io/v1alpha1
kind: Dataset
metadata:
  name: shared-dataset
  namespace: share
spec:
  mounts:
    - mountPoint: oss://<oss_bucket>/<bucket_dir> # Replace the value with the path of the file that you want to share in the OSS bucket.
      options:
        fs.oss.endpoint: <oss_endpoint> # Replace the value with the endpoint of the OSS bucket that you use.
      name: hadoop
      path: "/"
      encryptOptions:
        - name: fs.oss.accessKeyId
          valueFrom:
            secretKeyRef:
              name: dataset-secret
              key: fs.oss.accessKeyId
        - name: fs.oss.accessKeySecret
          valueFrom:
            secretKeyRef:
              name: dataset-secret
              key: fs.oss.accessKeySecret
---
# Create a JindoRuntime to enable JindoFS for data caching in the cluster.
apiVersion: data.fluid.io/v1alpha1
kind: JindoRuntime
metadata:
  name: shared-dataset
  namespace: share
spec:
  replicas: 1
  tieredstore:
    levels:
      - mediumtype: MEM
        path: /dev/shm
        quota: 4Gi
        high: "0.95"
        low: "0.7"
```

Run the following command to create the dataset and the JindoRuntime:
```shell
kubectl apply -f shared-dataset.yaml
```

Expected output:

```
dataset.data.fluid.io/shared-dataset created
jindoruntime.data.fluid.io/shared-dataset created
```

The output shows that the dataset and the JindoRuntime are created.
Wait a few minutes and run the following command to query the dataset and the JindoRuntime that are created:
```shell
kubectl get dataset,jindoruntime -n share
```

Expected output:

```
NAME                                   UFS TOTAL SIZE   CACHED   CACHE CAPACITY   CACHED PERCENTAGE   PHASE   AGE
dataset.data.fluid.io/shared-dataset   1.16GiB          0.00B    4.00GiB          0.0%                Bound   4m1s

NAME                                        MASTER PHASE   WORKER PHASE   FUSE PHASE   AGE
jindoruntime.data.fluid.io/shared-dataset   Ready          Ready          Ready        15m
```

The output shows that the dataset is associated with the JindoRuntime.
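At this point the cache is still empty (CACHED is 0.00B); data is cached on first access. If you want to prefetch the data instead, Fluid provides a DataLoad resource. The following is a hedged sketch based on the Fluid DataLoad API, not a step required by this topic:

```yaml
# Sketch only: prefetch the shared dataset into the cache so that the first
# read from other namespaces is already accelerated.
apiVersion: data.fluid.io/v1alpha1
kind: DataLoad
metadata:
  name: shared-dataset-warmup
  namespace: share
spec:
  dataset:
    name: shared-dataset
    namespace: share
```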
JuiceFSRuntime
Create a namespace named share. In the following example, the shared dataset and runtime object are created in this namespace.

```shell
kubectl create ns share
```

Run the following command to create a Secret to store the AccessKey pair used to access the OSS bucket:

```shell
kubectl apply -f - <<EOF
apiVersion: v1
kind: Secret
metadata:
  name: dataset-secret
  namespace: share
type: Opaque
stringData:
  token: <JUICEFS_VOLUME_TOKEN>
  access-key: <OSS_ACCESS_KEY>
  secret-key: <OSS_SECRET_KEY>
EOF
```

In the preceding code, the access-key parameter specifies the AccessKey ID and the secret-key parameter specifies the AccessKey secret. For more information about how to obtain an AccessKey pair, see Obtain an AccessKey pair.

Create a file named shared-dataset.yaml and copy the following content to the file. The file is used to create a dataset and a JuiceFSRuntime.

```yaml
# Create a dataset that describes the dataset stored in the OSS bucket and the UFS.
apiVersion: data.fluid.io/v1alpha1
kind: Dataset
metadata:
  name: shared-dataset
  namespace: share
spec:
  accessModes: ["ReadOnlyMany"]
  sharedEncryptOptions:
    - name: access-key
      valueFrom:
        secretKeyRef:
          name: dataset-secret
          key: access-key
    - name: secret-key
      valueFrom:
        secretKeyRef:
          name: dataset-secret
          key: secret-key
    - name: token
      valueFrom:
        secretKeyRef:
          name: dataset-secret
          key: token
  mounts:
    - name: <JUICEFS_VOLUME_NAME>
      mountPoint: juicefs:/// # The mount point of the file system is juicefs:///.
      options:
        bucket: https://<OSS_BUCKET_NAME>.oss-<REGION_ID>.aliyuncs.com # Replace the value with the endpoint of the OSS bucket that you use. Example: https://mybucket.oss-cn-beijing-internal.aliyuncs.com.
---
# Create a JuiceFSRuntime to enable JuiceFS for data caching in the cluster.
apiVersion: data.fluid.io/v1alpha1
kind: JuiceFSRuntime
metadata:
  name: shared-dataset
  namespace: share
spec:
  replicas: 1
  tieredstore:
    levels:
      - mediumtype: MEM
        path: /dev/shm
        quota: 1Gi
        high: "0.95"
        low: "0.7"
```

Run the following command to create the dataset and the JuiceFSRuntime:
```shell
kubectl apply -f shared-dataset.yaml
```

Expected output:

```
dataset.data.fluid.io/shared-dataset created
juicefsruntime.data.fluid.io/shared-dataset created
```

The output shows that the dataset and the JuiceFSRuntime are created.
Wait a few minutes and run the following command to query the dataset and the JuiceFSRuntime that are created:
```shell
kubectl get dataset,juicefsruntime -n share
```

Expected output:

```
NAME                                   UFS TOTAL SIZE   CACHED   CACHE CAPACITY   CACHED PERCENTAGE   PHASE   AGE
dataset.data.fluid.io/shared-dataset   2.32GiB          0.00B    4.00GiB          0.0%                Bound   3d16h

NAME                                          WORKER PHASE   FUSE PHASE   AGE
juicefsruntime.data.fluid.io/shared-dataset                                3m50s
```
Step 3: Create a reference dataset and a pod
Create a namespace named ref. In the following example, the ref-dataset dataset is created in this namespace.

```shell
kubectl create ns ref
```

Create a file named ref-dataset.yaml and copy the following content to the file. The file is used to create a dataset named ref-dataset in the ref namespace. The dataset can be used to access (reference) a dataset in a different namespace, which is the share namespace in this example.

Important: In Fluid, a dataset must be mounted to a unique path. In addition, the value of the mountPoint parameter of the dataset must be in the dataset:// format. If you specify the value of mountPoint in other formats, the dataset cannot be created and the fields in the spec section do not take effect.

```yaml
apiVersion: data.fluid.io/v1alpha1
kind: Dataset
metadata:
  name: ref-dataset
  namespace: ref
spec:
  mounts:
    - mountPoint: dataset://share/shared-dataset
```

The following list describes the value of the mountPoint parameter:

- dataset://: the protocol prefix, which indicates that a dataset is referenced.
- share: the namespace to which the referenced dataset belongs, which is share in this example.
- shared-dataset: the name of the referenced dataset.
Run the following command to deploy the resources defined in the ref-dataset.yaml file in the cluster:

```shell
kubectl apply -f ref-dataset.yaml
```

Create a file named app.yaml and copy the following content to the file. The file is used to create a pod in the ref namespace, and the pod uses the ref-dataset dataset.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: nginx
  namespace: ref
spec:
  containers:
    - name: nginx
      image: anolis-registry.cn-zhangjiakou.cr.aliyuncs.com/openanolis/nginx:1.14.1-8.6
      command:
        - "bash"
        - "-c"
        - "sleep inf"
      volumeMounts:
        - mountPath: /data
          name: ref-data
  volumes:
    - name: ref-data
      persistentVolumeClaim:
        claimName: ref-dataset
```

Run the following command to deploy the resources defined in the app.yaml file in the cluster:

```shell
kubectl apply -f app.yaml
```

Run the following command to query the pod in the ref namespace:

```shell
kubectl get pods -n ref -o wide
```

If the pod in the output is in the Running state, the pod is created.
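Note that the pod's claimName refers to a PersistentVolumeClaim that Fluid creates automatically for the reference dataset; you do not create it yourself. If you want to confirm that the claim exists before you create the pod, a check like the following can help (a sketch; the exact volume naming may vary across Fluid versions):

```shell
# Sketch only: verify that Fluid created a PVC named after the reference
# dataset in the ref namespace. The PVC should be in the Bound state.
kubectl get pvc ref-dataset -n ref
```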
Step 4: Test data sharing and caching
Run the following command to query the pods in the share and ref namespaces:

```shell
kubectl get pods -n share
kubectl get pods -n ref
```

Expected output:

```
# The following list shows the pods in the share namespace.
NAME                                READY   STATUS    RESTARTS   AGE
shared-dataset-jindofs-fuse-ftkb5   1/1     Running   0          44s
shared-dataset-jindofs-master-0     1/1     Running   0          9m13s
shared-dataset-jindofs-worker-0     1/1     Running   0          9m13s

# The following list shows the pods in the ref namespace.
NAME    READY   STATUS    RESTARTS   AGE
nginx   1/1     Running   0          118s
```

The output shows that three dataset-related pods run in the share namespace, while only one pod named nginx runs in the ref namespace. No dataset-related pod runs in the ref namespace.

Test data sharing and caching.
Run the following command to log on to the nginx pod:

```shell
kubectl exec nginx -n ref -it -- sh
```

Test data sharing.

Run the following command to query the size of the file in the /data directory. The ref-dataset dataset is mounted on the /data path of the pod.

```shell
du -sh /data/wwm_uncased_L-24_H-1024_A-16.zip
```

Expected output:

```
1.3G    /data/wwm_uncased_L-24_H-1024_A-16.zip
```

The output shows that the nginx pod in the ref namespace can access the wwm_uncased_L-24_H-1024_A-16.zip file that belongs to the dataset in the share namespace.

Test data caching.
Note: The following test results are for reference only. The actual read latency varies based on actual conditions.

```shell
# The read latency when the file is read for the first time.
sh-4.4# time cat /data/wwm_uncased_L-24_H-1024_A-16.zip > /dev/null

real    0m1.166s
user    0m0.007s
sys     0m1.154s

# Read the file again to test whether the read latency is reduced after the file is cached.
sh-4.4# time cat /data/wwm_uncased_L-24_H-1024_A-16.zip > /dev/null

real    0m0.289s
user    0m0.011s
sys     0m0.274s
```

The output shows that the first read takes 1.166 seconds, while the second read takes only 0.289 seconds. This indicates that the file is served from the cache.