
Container Service for Kubernetes: Share datasets across namespaces

Last Updated: Oct 21, 2024

Fluid isolates resources in Kubernetes by namespace. You can use Fluid to regulate how different computing jobs access datasets and to isolate data that belongs to different teams. Fluid also supports data access and cache sharing across namespaces. When multiple teams need to share the same data, you need to cache the data only once. This greatly improves data utilization efficiency and data management flexibility, and facilitates collaboration between R&D teams. This topic describes how to share datasets across namespaces by using Fluid.

How it works

Fluid supports ThinRuntime, which allows you to access various storage systems in a low-code way and reuse the key capabilities of Fluid, such as data orchestration and runtime-based data access. With ThinRuntime, Fluid allows you to associate a dataset in one namespace with a dataset in another namespace. This way, applications in different namespaces can share the same cache runtime.
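
For example, the following minimal manifest is a sketch that uses the namespace and dataset names from the walkthrough in this topic. It declares a dataset in the ref namespace that references the shared-dataset dataset in the share namespace, so pods in the ref namespace reuse the cache runtime that runs in the share namespace. Step 3 describes this configuration in detail.

    apiVersion: data.fluid.io/v1alpha1
    kind: Dataset
    metadata:
      name: ref-dataset        # The reference dataset in the consuming namespace.
      namespace: ref
    spec:
      mounts:
      - mountPoint: dataset://share/shared-dataset   # Format: dataset://<namespace>/<dataset-name>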

Prerequisites

  • A Container Service for Kubernetes (ACK) Pro cluster whose nodes do not run ContainerOS is created, and the Kubernetes version of the cluster is 1.18 or later. For more information, see Create an ACK Pro cluster.

    Important

    The ack-fluid component is not supported on ContainerOS.

  • The cloud-native AI suite is installed and the ack-fluid component is deployed.

    Important

    If you have already installed open source Fluid, uninstall it before you deploy the ack-fluid component.

    • If you have not installed the cloud-native AI suite, enable Fluid acceleration when you install the suite. For more information, see Deploy the cloud-native AI suite.

    • If you have already installed the cloud-native AI suite, go to the Cloud-native AI Suite page of the ACK console and deploy the ack-fluid component.

  • A kubectl client is connected to the ACK Pro cluster. For more information, see Connect to a cluster by using kubectl.

Step 1: Upload the test dataset to the OSS bucket

  1. Create a test dataset that is about 2 GB in size. The test dataset is used in this example.

  2. Upload the test dataset to the OSS bucket that you created.

    You can use the ossutil tool provided by OSS to upload data, as shown in the example below. For more information, see Install ossutil.
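
    The following commands are a minimal sketch. The generated file, the local path, and the <oss_bucket>/<bucket_dir> placeholders are examples; replace them with your own test data and bucket path.

    # Generate a test file of about 2 GB. Skip this step if you already have a test dataset.
    dd if=/dev/zero of=./test-dataset.bin bs=1M count=2048
    # Upload the file to the OSS bucket by using ossutil.
    ossutil cp ./test-dataset.bin oss://<oss_bucket>/<bucket_dir>/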

Step 2: Create a shared dataset and a Runtime object

JindoRuntime

  1. Create a namespace named share. In the following example, the shared dataset and runtime object are created in this namespace.

    kubectl create ns share
  2. Run the following command to create a Secret to store the AccessKey pair used to access the Object Storage Service (OSS) bucket:

    kubectl apply -f - <<EOF
    apiVersion: v1
    kind: Secret
    metadata:
      name: dataset-secret
      namespace: share
    stringData:
      fs.oss.accessKeyId: <YourAccessKey ID>
      fs.oss.accessKeySecret: <YourAccessKey Secret>
    EOF                                         

    In the preceding code, the fs.oss.accessKeyId parameter specifies the AccessKey ID and the fs.oss.accessKeySecret parameter specifies the AccessKey secret. For more information about how to obtain an AccessKey pair, see Obtain an AccessKey pair.

  3. Create a file named shared-dataset.yaml and copy the following content to the file. The file is used to create a dataset and a JindoRuntime. For more information about how to configure a dataset and a JindoRuntime, see Use JindoFS to accelerate access to OSS.

    # Create a dataset that describes the dataset stored in the OSS bucket and the underlying file system (UFS). 
    apiVersion: data.fluid.io/v1alpha1
    kind: Dataset
    metadata:
      name: shared-dataset
      namespace: share
    spec:
      mounts:
      - mountPoint: oss://<oss_bucket>/<bucket_dir> # Replace the value with the path of the file that you want to share in the OSS bucket. 
        options:
          fs.oss.endpoint: <oss_endpoint> # Replace the value with the endpoint of the OSS bucket that you use. 
        name: hadoop
        path: "/"
        encryptOptions:
          - name: fs.oss.accessKeyId
            valueFrom:
              secretKeyRef:
                name: dataset-secret
                key: fs.oss.accessKeyId
          - name: fs.oss.accessKeySecret
            valueFrom:
              secretKeyRef:
                name: dataset-secret
                key: fs.oss.accessKeySecret
    
    ---
    # Create a JindoRuntime to enable JindoFS for data caching in the cluster. 
    apiVersion: data.fluid.io/v1alpha1
    kind: JindoRuntime
    metadata:
      name: shared-dataset
      namespace: share
    spec:
      replicas: 1
      tieredstore:
        levels:
          - mediumtype: MEM
            path: /dev/shm
            quota: 4Gi
            high: "0.95"
            low: "0.7"
  4. Run the following command to create the dataset and the JindoRuntime:

    kubectl apply -f shared-dataset.yaml

    Expected output:

    dataset.data.fluid.io/shared-dataset created
    jindoruntime.data.fluid.io/shared-dataset created

    The output shows that the dataset and the JindoRuntime are created.

  5. Wait a few minutes and run the following command to query the dataset and the JindoRuntime that are created:

    kubectl get dataset,jindoruntime -n share

    Expected output:

    NAME                                   UFS TOTAL SIZE   CACHED   CACHE CAPACITY   CACHED PERCENTAGE   PHASE   AGE
    dataset.data.fluid.io/shared-dataset   1.16GiB          0.00B    4.00GiB          0.0%                Bound   4m1s
    
    NAME                                        MASTER PHASE   WORKER PHASE   FUSE PHASE   AGE
    jindoruntime.data.fluid.io/shared-dataset   Ready          Ready          Ready        15m

    The output shows that the dataset is associated with the JindoRuntime.

JuiceFSRuntime

  1. Create a namespace named share. In the following example, the shared dataset and runtime object are created in this namespace.

    kubectl create ns share
  2. Run the following command to create a Secret to store the AccessKey pair used to access the OSS bucket:

    kubectl apply -f - <<EOF
    apiVersion: v1
    kind: Secret
    metadata:
      name: dataset-secret
      namespace: share
    type: Opaque
    stringData:
      token: <JUICEFS_VOLUME_TOKEN>
      access-key: <OSS_ACCESS_KEY>
      secret-key: <OSS_SECRET_KEY>
    EOF                                         

    In the preceding code, the access-key parameter specifies the AccessKey ID and the secret-key parameter specifies the AccessKey secret. For more information about how to obtain an AccessKey pair, see Obtain an AccessKey pair.

  3. Create a file named shared-dataset.yaml and copy the following content to the file. The file is used to create a dataset and a JuiceFSRuntime.

    # Create a dataset that describes the dataset stored in the OSS bucket and the UFS. 
    apiVersion: data.fluid.io/v1alpha1
    kind: Dataset
    metadata:
      name: shared-dataset
      namespace: share
    spec:
      accessModes: ["ReadOnlyMany"]
      sharedEncryptOptions:
      - name: access-key
        valueFrom:
          secretKeyRef:
            name: dataset-secret
            key: access-key
      - name: secret-key
        valueFrom:
          secretKeyRef:
            name: dataset-secret
            key: secret-key
      - name: token
        valueFrom:
          secretKeyRef:
            name: dataset-secret
            key: token
      mounts:
      - name: <JUICEFS_VOLUME_NAME>
        mountPoint: juicefs:/// #  The mount point of the file system is juicefs:///. 
        options:
          bucket: https://<OSS_BUCKET_NAME>.oss-<REGION_ID>.aliyuncs.com # Replace the value with the endpoint of the OSS bucket that you use. Example: https://mybucket.oss-cn-beijing-internal.aliyuncs.com.
    ---
    # Create a JuiceFSRuntime to enable JuiceFS for data caching in the cluster. 
    apiVersion: data.fluid.io/v1alpha1
    kind: JuiceFSRuntime
    metadata:
      name: shared-dataset
      namespace: share
    spec:
      replicas: 1
      tieredstore:
        levels:
        - mediumtype: MEM
          path: /dev/shm
          quota: 1Gi
          high: "0.95"
          low: "0.7"
  4. Run the following command to create the dataset and the JuiceFSRuntime:

    kubectl apply -f shared-dataset.yaml

    Expected output:

    dataset.data.fluid.io/shared-dataset created
    juicefsruntime.data.fluid.io/shared-dataset created

    The output shows that the dataset and the JuiceFSRuntime are created.

  5. Wait a few minutes and run the following command to query the dataset and the JuiceFSRuntime that are created:

    kubectl get dataset,juicefsruntime -n share

    Expected output:

    NAME                                   UFS TOTAL SIZE   CACHED   CACHE CAPACITY   CACHED PERCENTAGE   PHASE   AGE
    dataset.data.fluid.io/shared-dataset   2.32GiB          0.00B    4.00GiB          0.0%                Bound   3d16h
    
    NAME                                          WORKER PHASE   FUSE PHASE   AGE
    juicefsruntime.data.fluid.io/shared-dataset                               3m50s
    

Step 3: Create a reference dataset and a pod

  1. Create a namespace named ref. In the following example, the ref-dataset dataset is created in this namespace.

    kubectl create ns ref
  2. Create a file named ref-dataset.yaml and copy the following content to the file. The file is used to create a dataset named ref-dataset in the ref namespace. This dataset references a dataset in another namespace, which is the share namespace in this example.

    Important

    In Fluid, a reference dataset supports only a single mount, and the value of the mountPoint parameter must use the dataset:// format. If you specify mountPoint in another format, the dataset cannot be created, and the other fields in the spec section do not take effect.

    apiVersion: data.fluid.io/v1alpha1
    kind: Dataset
    metadata:
      name: ref-dataset
      namespace: ref
    spec:
      mounts:
      - mountPoint: dataset://share/shared-dataset

    The following list describes the value of the mountPoint parameter:

    • dataset://: the protocol prefix, which indicates that a dataset is referenced.

    • share: the namespace to which the referenced dataset belongs, which is share in this example.

    • shared-dataset: the name of the referenced dataset.

  3. Run the following command to deploy the resources defined in the ref-dataset.yaml file in the cluster:

    kubectl apply -f ref-dataset.yaml        
  4. Create a file named app.yaml and copy the following content to the file. The file is used to create a pod in the ref namespace that uses the ref-dataset dataset.

    apiVersion: v1
    kind: Pod
    metadata:
      name: nginx
      namespace: ref
    spec:
      containers:
      - name: nginx
        image: anolis-registry.cn-zhangjiakou.cr.aliyuncs.com/openanolis/nginx:1.14.1-8.6
        command:
        - "bash"
        - "-c"
        - "sleep inf"
        volumeMounts:
        - mountPath: /data
          name: ref-data
      volumes:
      - name: ref-data
        persistentVolumeClaim:
          claimName: ref-dataset
  5. Run the following command to deploy the resources defined in the app.yaml file in the cluster:

    kubectl apply -f app.yaml
  6. Run the following command to query the pod in the ref namespace:

    kubectl get pods -n ref -o wide

    If the pod in the output is in the Running state, the pod is created.
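
    You can also verify that the PersistentVolumeClaim (PVC) mounted by the pod exists. The claim has the same name as the reference dataset (ref-dataset), which matches the claimName field in the pod specification above:

    kubectl get pvc ref-dataset -n ref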

Step 4: Test data sharing and caching

  1. Run the following command to query the pods in the share and ref namespaces:

    kubectl get pods -n share
    kubectl get pods -n ref

    Expected output:

    # The following list shows the pods in the share namespace. 
    NAME                                READY   STATUS    RESTARTS   AGE
    shared-dataset-jindofs-fuse-ftkb5   1/1     Running   0          44s
    shared-dataset-jindofs-master-0     1/1     Running   0          9m13s
    shared-dataset-jindofs-worker-0     1/1     Running   0          9m13s
    # The following list shows the pods in the ref namespace. 
    NAME    READY   STATUS    RESTARTS   AGE
    nginx   1/1     Running   0          118s

    The output shows that three dataset-related pods (the JindoFS master, worker, and FUSE pods) run in the share namespace, whereas only the nginx pod runs in the ref namespace. No dataset-related pods run in the ref namespace.

  2. Test data sharing and caching.

    1. Run the following command to log on to the nginx pod:

      kubectl exec nginx -n ref -it -- sh
    2. Test data sharing.

      Run the following command to query the size of the test file in the /data directory. The ref-dataset dataset is mounted to the /data path of the pod.

      du -sh /data/wwm_uncased_L-24_H-1024_A-16.zip

      Expected output:

      1.3G	/data/wwm_uncased_L-24_H-1024_A-16.zip

      The output shows that the nginx pod in the ref namespace can access the wwm_uncased_L-24_H-1024_A-16.zip file in the dataset that belongs to the share namespace.

    3. Test data caching.

      Note

      The following test results are for reference only. The actual read latency varies based on the actual conditions.

      # The read latency when the file is read for the first time. 
      sh-4.4# time cat /data/wwm_uncased_L-24_H-1024_A-16.zip > /dev/null
      real	0m1.166s
      user	0m0.007s
      sys	0m1.154s
      
      # Read the file again to test whether the read latency is reduced after the file is cached.
      sh-4.4# time cat /data/wwm_uncased_L-24_H-1024_A-16.zip > /dev/null
      real	0m0.289s
      user	0m0.011s
      sys	0m0.274s

      The output shows that the first read of the file takes 1.166 seconds, and the second read takes only 0.289 seconds. This indicates that the file is served from the cache on the second read.