Fluid isolates resources in Kubernetes by namespaces. You can use Fluid to regulate access to datasets from different computing jobs and isolate data that belongs to different teams. In addition, Fluid supports data access and cache sharing across namespaces. With Fluid, you need to cache your data only once when you need to share data among multiple teams. This greatly improves data utilization efficiency and data management flexibility, and facilitates collaboration between R&D teams. This topic describes how to share datasets across namespaces by using Fluid.
How it works
Fluid supports ThinRuntime, which allows you to access various storage systems in a low-code way and reuse the key capabilities of Fluid, such as data orchestration and data access through the runtime platform. With ThinRuntime, Fluid allows you to associate a dataset in a namespace with a dataset in another namespace. This way, you can share the same cache runtime among applications in different namespaces.
Prerequisites
A Container Service for Kubernetes (ACK) Pro cluster that runs Kubernetes 1.18 or later is created, and the cluster nodes do not use ContainerOS. For more information, see Create an ACK Pro cluster.
Important: The ack-fluid component is not supported on ContainerOS.
The cloud-native AI suite is installed and the ack-fluid component is deployed.
Important: If you have already installed open source Fluid, uninstall it before you deploy the ack-fluid component.
If you have not installed the cloud-native AI suite, enable Fluid acceleration when you install the suite. For more information, see Deploy the cloud-native AI suite.
If you have already installed the cloud-native AI suite, go to the Cloud-native AI Suite page of the ACK console and deploy the ack-fluid component.
A kubectl client is connected to the ACK Pro cluster. For more information, see Connect to a cluster by using kubectl.
Step 1: Upload the test dataset to the OSS bucket
Create a test dataset that is about 2 GB in size. The following steps use this test dataset as an example.
Upload the test dataset to the OSS bucket that you created.
You can use the ossutil tool provided by OSS to upload data. For more information, see Install ossutil.
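The upload can be sketched with the ossutil cp command. The bucket name, path, and file name below are placeholders for illustration; replace them with the actual bucket and test file that you use.

```shell
# Upload a local test file to an OSS bucket with ossutil.
# <examplebucket> and the local file name are placeholders; substitute your own values.
ossutil cp ./wwm_uncased_L-24_H-1024_A-16.zip oss://<examplebucket>/<bucket_dir>/
```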
Step 2: Create a shared dataset and a Runtime object
JindoRuntime
Create a namespace named share. In the following example, the shared dataset and Runtime object are created in this namespace.
kubectl create ns share
Run the following command to create a Secret to store the AccessKey pair used to access the Object Storage Service (OSS) bucket:
kubectl apply -f - <<EOF
apiVersion: v1
kind: Secret
metadata:
  name: dataset-secret
  namespace: share
stringData:
  fs.oss.accessKeyId: <YourAccessKey ID>
  fs.oss.accessKeySecret: <YourAccessKey Secret>
EOF
In the preceding code, the fs.oss.accessKeyId parameter specifies the AccessKey ID and the fs.oss.accessKeySecret parameter specifies the AccessKey secret. For more information about how to obtain an AccessKey pair, see Obtain an AccessKey pair.
Create a file named shared-dataset.yaml and copy the following content to the file. The file is used to create a dataset and a JindoRuntime. For more information about how to configure a dataset and a JindoRuntime, see Use JindoFS to accelerate access to OSS.
# Create a dataset that describes the dataset stored in the OSS bucket and the underlying file system (UFS).
apiVersion: data.fluid.io/v1alpha1
kind: Dataset
metadata:
  name: shared-dataset
  namespace: share
spec:
  mounts:
    - mountPoint: oss://<oss_bucket>/<bucket_dir> # Replace the value with the path of the file that you want to share in the OSS bucket.
      options:
        fs.oss.endpoint: <oss_endpoint> # Replace the value with the endpoint of the OSS bucket that you use.
      name: hadoop
      path: "/"
      encryptOptions:
        - name: fs.oss.accessKeyId
          valueFrom:
            secretKeyRef:
              name: dataset-secret
              key: fs.oss.accessKeyId
        - name: fs.oss.accessKeySecret
          valueFrom:
            secretKeyRef:
              name: dataset-secret
              key: fs.oss.accessKeySecret
---
# Create a JindoRuntime to enable JindoFS for data caching in the cluster.
apiVersion: data.fluid.io/v1alpha1
kind: JindoRuntime
metadata:
  name: shared-dataset
  namespace: share
spec:
  replicas: 1
  tieredstore:
    levels:
      - mediumtype: MEM
        path: /dev/shm
        quota: 4Gi
        high: "0.95"
        low: "0.7"
Run the following command to create the dataset and the JindoRuntime:
kubectl apply -f shared-dataset.yaml
Expected output:
dataset.data.fluid.io/shared-dataset created
jindoruntime.data.fluid.io/shared-dataset created
The output shows that the dataset and the JindoRuntime are created.
Wait a few minutes and run the following command to query the dataset and the JindoRuntime that are created:
kubectl get dataset,jindoruntime -n share
Expected output:
NAME                                        UFS TOTAL SIZE   CACHED   CACHE CAPACITY   CACHED PERCENTAGE   PHASE   AGE
dataset.data.fluid.io/shared-dataset        1.16GiB          0.00B    4.00GiB          0.0%                Bound   4m1s

NAME                                        MASTER PHASE   WORKER PHASE   FUSE PHASE   AGE
jindoruntime.data.fluid.io/shared-dataset   Ready          Ready          Ready        15m
The output shows that the dataset is associated with the JindoRuntime.
JuiceFSRuntime
Create a namespace named share. In the following example, the shared dataset and Runtime object are created in this namespace.
kubectl create ns share
Run the following command to create a Secret to store the JuiceFS volume token and the AccessKey pair used to access the OSS bucket:
kubectl apply -f - <<EOF
apiVersion: v1
kind: Secret
metadata:
  name: dataset-secret
  namespace: share
type: Opaque
stringData:
  token: <JUICEFS_VOLUME_TOKEN>
  access-key: <OSS_ACCESS_KEY>
  secret-key: <OSS_SECRET_KEY>
EOF
In the preceding code, the token parameter specifies the token of the JuiceFS volume, the access-key parameter specifies the AccessKey ID, and the secret-key parameter specifies the AccessKey secret. For more information about how to obtain an AccessKey pair, see Obtain an AccessKey pair.
Create a file named shared-dataset.yaml and copy the following content to the file. The file is used to create a dataset and a JuiceFSRuntime.
# Create a dataset that describes the dataset stored in the OSS bucket and the UFS.
apiVersion: data.fluid.io/v1alpha1
kind: Dataset
metadata:
  name: shared-dataset
  namespace: share
spec:
  accessModes: ["ReadOnlyMany"]
  sharedEncryptOptions:
    - name: access-key
      valueFrom:
        secretKeyRef:
          name: dataset-secret
          key: access-key
    - name: secret-key
      valueFrom:
        secretKeyRef:
          name: dataset-secret
          key: secret-key
    - name: token
      valueFrom:
        secretKeyRef:
          name: dataset-secret
          key: token
  mounts:
    - name: <JUICEFS_VOLUME_NAME>
      mountPoint: juicefs:/// # The mount point of the file system is juicefs:///.
      options:
        bucket: https://<OSS_BUCKET_NAME>.oss-<REGION_ID>.aliyuncs.com # Replace the value with the endpoint of the OSS bucket that you use. Example: https://mybucket.oss-cn-beijing-internal.aliyuncs.com.
---
# Create a JuiceFSRuntime to enable JuiceFS for data caching in the cluster.
apiVersion: data.fluid.io/v1alpha1
kind: JuiceFSRuntime
metadata:
  name: shared-dataset
  namespace: share
spec:
  replicas: 1
  tieredstore:
    levels:
      - mediumtype: MEM
        path: /dev/shm
        quota: 1Gi
        high: "0.95"
        low: "0.7"
Run the following command to create the dataset and the JuiceFSRuntime:
kubectl apply -f shared-dataset.yaml
Expected output:
dataset.data.fluid.io/shared-dataset created
juicefsruntime.data.fluid.io/shared-dataset created
The output shows that the dataset and the JuiceFSRuntime are created.
Wait a few minutes and run the following command to query the dataset and the JuiceFSRuntime that are created:
kubectl get dataset,juicefsruntime -n share
Expected output:
NAME                                          UFS TOTAL SIZE   CACHED   CACHE CAPACITY   CACHED PERCENTAGE   PHASE   AGE
dataset.data.fluid.io/shared-dataset          2.32GiB          0.00B    4.00GiB          0.0%                Bound   3d16h

NAME                                          WORKER PHASE   FUSE PHASE   AGE
juicefsruntime.data.fluid.io/shared-dataset                               3m50s
Step 3: Create a reference dataset and a pod
Create a namespace named ref. In the following example, the ref-dataset dataset is created in this namespace.
kubectl create ns ref
Create a file named ref-dataset.yaml and copy the following content to the file. The file is used to create a dataset named ref-dataset in the ref namespace. The dataset references a dataset in a different namespace, which is the share namespace in this example.
Important: In Fluid, a dataset must be mounted to a unique path. In addition, the value of the mountPoint parameter of the dataset must be in the dataset:// format. If you specify the value of mountPoint in other formats, the dataset cannot be created and the fields in the spec section do not take effect.
apiVersion: data.fluid.io/v1alpha1
kind: Dataset
metadata:
  name: ref-dataset
  namespace: ref
spec:
  mounts:
    - mountPoint: dataset://share/shared-dataset
The value of the mountPoint parameter consists of the following parts:
dataset://: the protocol prefix, which indicates that a dataset is referenced.
share: the namespace to which the referenced dataset belongs, which is share in this example.
shared-dataset: the name of the referenced dataset.
Run the following command to deploy the resources defined in the ref-dataset.yaml file in the cluster:
kubectl apply -f ref-dataset.yaml
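Before you create the pod, you can check whether the reference dataset was created. The following command is a quick sketch; the exact output depends on your cluster, but the dataset typically reaches the Bound phase once it is associated with the shared runtime.

```shell
# Query the reference dataset in the ref namespace and check its PHASE column.
kubectl get dataset ref-dataset -n ref
```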
Create a file named app.yaml and copy the following content to the file. The file is used to create a pod in the ref namespace, and the pod uses the ref-dataset dataset.
apiVersion: v1
kind: Pod
metadata:
  name: nginx
  namespace: ref
spec:
  containers:
    - name: nginx
      image: anolis-registry.cn-zhangjiakou.cr.aliyuncs.com/openanolis/nginx:1.14.1-8.6
      command:
        - "bash"
        - "-c"
        - "sleep inf"
      volumeMounts:
        - mountPath: /data
          name: ref-data
  volumes:
    - name: ref-data
      persistentVolumeClaim:
        claimName: ref-dataset
Run the following command to deploy the resources defined in the app.yaml file in the cluster:
kubectl apply -f app.yaml
Run the following command to query the pod in the ref namespace:
kubectl get pods -n ref -o wide
If the pod in the output is in the Running state, the pod is created.
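The pod mounts a PersistentVolumeClaim named ref-dataset, which Fluid creates automatically for the reference dataset rather than you creating it by hand. You can confirm this with the following command; the exact volume names in the output follow Fluid's naming convention and vary by cluster.

```shell
# List the PVCs in the ref namespace; a PVC named ref-dataset should exist and be Bound.
kubectl get pvc -n ref
```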
Step 4: Test data sharing and caching
Run the following command to query the pods in the share and ref namespaces:
kubectl get pods -n share
kubectl get pods -n ref
Expected output:
# The following output shows the pods in the share namespace.
NAME                                READY   STATUS    RESTARTS   AGE
shared-dataset-jindofs-fuse-ftkb5   1/1     Running   0          44s
shared-dataset-jindofs-master-0     1/1     Running   0          9m13s
shared-dataset-jindofs-worker-0     1/1     Running   0          9m13s
# The following output shows the pods in the ref namespace.
NAME    READY   STATUS    RESTARTS   AGE
nginx   1/1     Running   0          118s
The output shows that three dataset-related pods run in the share namespace, whereas only the nginx pod runs in the ref namespace. No dataset-related pod runs in the ref namespace.
Test data sharing and caching.
Run the following command to log on to the nginx pod:
kubectl exec nginx -n ref -it -- sh
Test data sharing.
Run the following command to query the size of the test file in the /data directory. The ref-dataset dataset is mounted on the /data path of the pod.
du -sh /data/wwm_uncased_L-24_H-1024_A-16.zip
Expected output:
1.3G /data/wwm_uncased_L-24_H-1024_A-16.zip
The output shows that the nginx pod in the ref namespace can access the wwm_uncased_L-24_H-1024_A-16.zip file that belongs to the share namespace.
Test data caching.
Note: The following test results are for reference only. The actual read latency varies based on the actual conditions.
# The read latency when the file is read for the first time.
sh-4.4# time cat /data/wwm_uncased_L-24_H-1024_A-16.zip > /dev/null

real    0m1.166s
user    0m0.007s
sys     0m1.154s

# Read the file again to test whether the read latency is reduced after the file is cached.
sh-4.4# time cat /data/wwm_uncased_L-24_H-1024_A-16.zip > /dev/null

real    0m0.289s
user    0m0.011s
sys     0m0.274s
The output shows that 1.166 seconds are required to read the file for the first time and the time required to read the file for the second time is reduced to 0.289 seconds. This indicates that the file is cached.