Fluid isolates resources in Kubernetes by namespaces. You can use Fluid to regulate access to datasets from different computing jobs and isolate data that belongs to different teams. In addition, Fluid supports data access and cache sharing across namespaces. With Fluid, you need to cache your data only once when you need to share data among multiple teams. This greatly improves data utilization efficiency and data management flexibility, and facilitates collaboration between R&D teams. This topic describes how to share datasets across namespaces by using Fluid.
How it works
Fluid supports ThinRuntime, which allows you to access various storage systems in a low-code way and reuse the key capabilities of Fluid, such as data orchestration and data access through the runtime platform. With ThinRuntime, Fluid allows you to associate a dataset in a namespace with a dataset in another namespace. This way, you can share the same cache runtime among applications in different namespaces.
Prerequisites
A Container Service for Kubernetes (ACK) Pro cluster that runs Kubernetes 1.18 or later is created, and the cluster nodes do not use ContainerOS. For more information, see Create an ACK Pro cluster.
Important: The ack-fluid component is not supported on ContainerOS.
The cloud-native AI suite is installed and the ack-fluid component is deployed.
Important: If you have already installed open source Fluid, uninstall it before you deploy the ack-fluid component.
If you have not installed the cloud-native AI suite, enable Fluid acceleration when you install the suite. For more information, see Deploy the cloud-native AI suite.
If you have already installed the cloud-native AI suite, go to the Cloud-native AI Suite page of the ACK console and deploy the ack-fluid component.
A kubectl client is connected to the ACK Pro cluster. For more information, see Connect to a cluster by using kubectl.
Step 1: Upload the test dataset to the OSS bucket
Create a test dataset that is about 2 GB in size. The following steps use this test dataset as an example.
Upload the test dataset to the OSS bucket that you created.
You can use the ossutil tool provided by OSS to upload data. For more information, see Install ossutil.
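The upload can be sketched with the ossutil cp command. The bucket name, path, and file name below are placeholders for illustration; replace them with the actual bucket and test file that you use.

```shell
# Upload a local test file to an OSS bucket with ossutil.
# <examplebucket> and the local file name are placeholders; substitute your own values.
ossutil cp ./wwm_uncased_L-24_H-1024_A-16.zip oss://<examplebucket>/<bucket_dir>/
```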
Step 2: Create a shared dataset and a Runtime object
JindoRuntime
Create a namespace named share. In the following example, the shared dataset and Runtime object are created in this namespace.
kubectl create ns share
Run the following command to create a Secret to store the AccessKey pair used to access the Object Storage Service (OSS) bucket:
kubectl apply -f - <<EOF
apiVersion: v1
kind: Secret
metadata:
  name: dataset-secret
  namespace: share
stringData:
  fs.oss.accessKeyId: <YourAccessKey ID>
  fs.oss.accessKeySecret: <YourAccessKey Secret>
EOF
In the preceding code, the fs.oss.accessKeyId parameter specifies the AccessKey ID and the fs.oss.accessKeySecret parameter specifies the AccessKey secret. For more information about how to obtain an AccessKey pair, see Obtain an AccessKey pair.
Create a file named shared-dataset.yaml and copy the following content to the file. The file is used to create a dataset and a JindoRuntime. For more information about how to configure a dataset and a JindoRuntime, see Use JindoFS to accelerate access to OSS.
# Create a dataset that describes the dataset stored in the OSS bucket and the underlying file system (UFS).
apiVersion: data.fluid.io/v1alpha1
kind: Dataset
metadata:
  name: shared-dataset
  namespace: share
spec:
  mounts:
    - mountPoint: oss://<oss_bucket>/<bucket_dir> # Replace the value with the path of the file that you want to share in the OSS bucket.
      options:
        fs.oss.endpoint: <oss_endpoint> # Replace the value with the endpoint of the OSS bucket that you use.
      name: hadoop
      path: "/"
      encryptOptions:
        - name: fs.oss.accessKeyId
          valueFrom:
            secretKeyRef:
              name: dataset-secret
              key: fs.oss.accessKeyId
        - name: fs.oss.accessKeySecret
          valueFrom:
            secretKeyRef:
              name: dataset-secret
              key: fs.oss.accessKeySecret
---
# Create a JindoRuntime to enable JindoFS for data caching in the cluster.
apiVersion: data.fluid.io/v1alpha1
kind: JindoRuntime
metadata:
  name: shared-dataset
  namespace: share
spec:
  replicas: 1
  tieredstore:
    levels:
      - mediumtype: MEM
        path: /dev/shm
        quota: 4Gi
        high: "0.95"
        low: "0.7"
Run the following command to create the dataset and the JindoRuntime:
kubectl apply -f shared-dataset.yaml
Expected output:
dataset.data.fluid.io/shared-dataset created
jindoruntime.data.fluid.io/shared-dataset created
The output shows that the dataset and the JindoRuntime are created.
Wait a few minutes and run the following command to query the dataset and the JindoRuntime that are created:
kubectl get dataset,jindoruntime -n share
Expected output:
NAME                                        UFS TOTAL SIZE   CACHED   CACHE CAPACITY   CACHED PERCENTAGE   PHASE   AGE
dataset.data.fluid.io/shared-dataset        1.16GiB          0.00B    4.00GiB          0.0%                Bound   4m1s

NAME                                        MASTER PHASE   WORKER PHASE   FUSE PHASE   AGE
jindoruntime.data.fluid.io/shared-dataset   Ready          Ready          Ready        15m
The output shows that the dataset is associated with the JindoRuntime.
JuiceFSRuntime
Create a namespace named share. In the following example, the shared dataset and Runtime object are created in this namespace.
kubectl create ns share
Run the following command to create a Secret to store the JuiceFS volume token and the AccessKey pair used to access the OSS bucket:
kubectl apply -f - <<EOF
apiVersion: v1
kind: Secret
metadata:
  name: dataset-secret
  namespace: share
type: Opaque
stringData:
  token: <JUICEFS_VOLUME_TOKEN>
  access-key: <OSS_ACCESS_KEY>
  secret-key: <OSS_SECRET_KEY>
EOF
In the preceding code, the token parameter specifies the token of the JuiceFS volume, the access-key parameter specifies the AccessKey ID, and the secret-key parameter specifies the AccessKey secret. For more information about how to obtain an AccessKey pair, see Obtain an AccessKey pair.
Create a file named shared-dataset.yaml and copy the following content to the file. The file is used to create a dataset and a JuiceFSRuntime.
# Create a dataset that describes the dataset stored in the OSS bucket and the UFS.
apiVersion: data.fluid.io/v1alpha1
kind: Dataset
metadata:
  name: shared-dataset
  namespace: share
spec:
  accessModes: ["ReadOnlyMany"]
  sharedEncryptOptions:
    - name: access-key
      valueFrom:
        secretKeyRef:
          name: dataset-secret
          key: access-key
    - name: secret-key
      valueFrom:
        secretKeyRef:
          name: dataset-secret
          key: secret-key
    - name: token
      valueFrom:
        secretKeyRef:
          name: dataset-secret
          key: token
  mounts:
    - name: <JUICEFS_VOLUME_NAME>
      mountPoint: juicefs:/// # The mount point of the file system is juicefs:///.
      options:
        bucket: https://<OSS_BUCKET_NAME>.oss-<REGION_ID>.aliyuncs.com # Replace the value with the endpoint of the OSS bucket that you use. Example: https://mybucket.oss-cn-beijing-internal.aliyuncs.com.
---
# Create a JuiceFSRuntime to enable JuiceFS for data caching in the cluster.
apiVersion: data.fluid.io/v1alpha1
kind: JuiceFSRuntime
metadata:
  name: shared-dataset
  namespace: share
spec:
  replicas: 1
  tieredstore:
    levels:
      - mediumtype: MEM
        path: /dev/shm
        quota: 1Gi
        high: "0.95"
        low: "0.7"
Run the following command to create the dataset and the JuiceFSRuntime:
kubectl apply -f shared-dataset.yaml
Expected output:
dataset.data.fluid.io/shared-dataset created
juicefsruntime.data.fluid.io/shared-dataset created
The output shows that the dataset and the JuiceFSRuntime are created.
Wait a few minutes and run the following command to query the dataset and the JuiceFSRuntime that are created:
kubectl get dataset,juicefsruntime -n share
Expected output:
NAME                                          UFS TOTAL SIZE   CACHED   CACHE CAPACITY   CACHED PERCENTAGE   PHASE   AGE
dataset.data.fluid.io/shared-dataset          2.32GiB          0.00B    4.00GiB          0.0%                Bound   3d16h

NAME                                          WORKER PHASE   FUSE PHASE   AGE
juicefsruntime.data.fluid.io/shared-dataset                               3m50s
Step 3: Create a reference dataset and a pod
Create a namespace named ref. In the following example, the ref-dataset dataset is created in this namespace.
kubectl create ns ref
Create a file named ref-dataset.yaml and copy the following content to the file. The file is used to create a dataset named ref-dataset in the ref namespace. The dataset references a dataset in a different namespace, which is the share namespace in this example.
Important: In Fluid, a dataset must be mounted to a unique path. In addition, the value of the mountPoint parameter of the dataset must be in the dataset:// format. If you specify the value of mountPoint in other formats, the dataset cannot be created and the fields in the spec section do not take effect.
apiVersion: data.fluid.io/v1alpha1
kind: Dataset
metadata:
  name: ref-dataset
  namespace: ref
spec:
  mounts:
    - mountPoint: dataset://share/shared-dataset
The value of the mountPoint parameter consists of the following parts:
dataset://: the protocol prefix, which indicates that a dataset is referenced.
share: the namespace to which the referenced dataset belongs, which is share in this example.
shared-dataset: the name of the referenced dataset.
Run the following command to deploy the resources defined in the ref-dataset.yaml file in the cluster:
kubectl apply -f ref-dataset.yaml
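Before you create the pod, you can check whether the reference dataset was created. The following command is a quick sketch; the exact output depends on your cluster, but the dataset typically reaches the Bound phase once it is associated with the shared runtime.

```shell
# Query the reference dataset in the ref namespace and check its PHASE column.
kubectl get dataset ref-dataset -n ref
```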
Create a file named app.yaml and copy the following content to the file. The file is used to create a pod in the ref namespace, and the pod uses the ref-dataset dataset.
apiVersion: v1
kind: Pod
metadata:
  name: nginx
  namespace: ref
spec:
  containers:
    - name: nginx
      image: anolis-registry.cn-zhangjiakou.cr.aliyuncs.com/openanolis/nginx:1.14.1-8.6
      command:
        - "bash"
        - "-c"
        - "sleep inf"
      volumeMounts:
        - mountPath: /data
          name: ref-data
  volumes:
    - name: ref-data
      persistentVolumeClaim:
        claimName: ref-dataset
Run the following command to deploy the resources defined in the app.yaml file in the cluster:
kubectl apply -f app.yaml
Run the following command to query the pod in the ref namespace:
kubectl get pods -n ref -o wide
If the pod in the output is in the Running state, the pod is created.
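The pod mounts a PersistentVolumeClaim named ref-dataset, which Fluid creates automatically for the reference dataset rather than you creating it by hand. You can confirm this with the following command; the exact volume names in the output follow Fluid's naming convention and vary by cluster.

```shell
# List the PVCs in the ref namespace; a PVC named ref-dataset should exist and be Bound.
kubectl get pvc -n ref
```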
Step 4: Test data sharing and caching
Run the following command to query the pods in the share and ref namespaces:
kubectl get pods -n share
kubectl get pods -n ref
Expected output:
# The following output shows the pods in the share namespace.
NAME                                READY   STATUS    RESTARTS   AGE
shared-dataset-jindofs-fuse-ftkb5   1/1     Running   0          44s
shared-dataset-jindofs-master-0     1/1     Running   0          9m13s
shared-dataset-jindofs-worker-0     1/1     Running   0          9m13s
# The following output shows the pods in the ref namespace.
NAME    READY   STATUS    RESTARTS   AGE
nginx   1/1     Running   0          118s
The output shows that three dataset-related pods run in the share namespace, whereas only the nginx pod runs in the ref namespace. No dataset-related pod runs in the ref namespace.
Test data sharing and caching.
Run the following command to log on to the nginx pod:
kubectl exec nginx -n ref -it -- sh
Test data sharing.
Run the following command to query the size of the test file in the /data directory. The ref-dataset dataset is mounted on the /data path of the pod.
du -sh /data/wwm_uncased_L-24_H-1024_A-16.zip
Expected output:
1.3G /data/wwm_uncased_L-24_H-1024_A-16.zip
The output shows that the nginx pod in the ref namespace can access the wwm_uncased_L-24_H-1024_A-16.zip file that belongs to the share namespace.
Test data caching.
Note: The following test results are for reference only. The actual read latency varies based on the actual conditions.
# The read latency when the file is read for the first time.
sh-4.4# time cat /data/wwm_uncased_L-24_H-1024_A-16.zip > /dev/null

real    0m1.166s
user    0m0.007s
sys     0m1.154s

# Read the file again to test whether the read latency is reduced after the file is cached.
sh-4.4# time cat /data/wwm_uncased_L-24_H-1024_A-16.zip > /dev/null

real    0m0.289s
user    0m0.011s
sys     0m0.274s
The output shows that 1.166 seconds are required to read the file for the first time and the time required to read the file for the second time is reduced to 0.289 seconds. This indicates that the file is cached.