By Biran
Cloud-native architectures let you run workloads such as AI and big data on the cloud and benefit from elastic computing resources. However, the separation of computing and storage introduces data access latency and high bandwidth overhead when data is pulled from remote storage. In GPU deep learning training in particular, iterative remote reads of large volumes of training data can significantly slow down GPU computing efficiency.
On the other hand, Kubernetes only provides standard interfaces for accessing and managing heterogeneous storage services (the Container Storage Interface, or CSI). It does not define how applications use and manage data in container clusters. When running training tasks, data scientists need to manage Dataset versions, control access permissions, preprocess Datasets, accelerate heterogeneous data reads, and more. There is no standard solution for this in Kubernetes; it is one of the important capabilities missing from the cloud-native container community.
Fluid abstracts the process of using data in computing tasks and proposes the concept of the elastic Dataset, which is implemented as a first-class citizen in Kubernetes. Around this elastic Dataset, Fluid builds a data orchestration and acceleration system that provides capabilities such as Dataset management (CRUD operations), permission control, and access acceleration.
Fluid has two core concepts: Dataset and Runtime.
By default (and in its earliest mode), Fluid supports one Dataset exclusive to one runtime, which can be understood as a Dataset accelerated by a dedicated cache cluster. The cache system can be customized and optimized for the characteristics of the Dataset, such as single file size, file count, and the number of clients. This mode offers the best performance and stability, and Datasets do not interfere with each other. However, it wastes hardware resources, because a separate cache system must be deployed for each Dataset, and maintaining and managing multiple cache runtimes is complex. This is essentially a single-tenant architecture, suitable for scenarios with high requirements on data access throughput and latency.
As the use of Fluid deepened, different needs emerged. For example, users create data-intensive jobs in several different namespaces, and these jobs all access the same Dataset. In other words, multiple data scientists share the same Dataset, and each data scientist has an independent namespace for submitting jobs. Redeploying the cache system and warming up the cache for each namespace causes data redundancy and job startup latency.
As a result, community users are willing to trade some performance to save resources and simplify O&M, and they need to access Datasets across namespaces. The cross-namespace requirement calls for a multi-tenant architecture: the cluster administrator points the runtime at the root directory of the storage, and multiple data scientists create Datasets in different namespaces that share the same runtime. Furthermore, administrators can configure subdirectories and different read/write permissions for data scientists in different namespaces.
There is no silver bullet among architectural choices, only trade-offs. This article uses AlluxioRuntime as an example to explain how to share a runtime with Fluid.
Imagine that User A preheats the Dataset spark in the Kubernetes namespace development, and User B accesses the same Dataset spark from another namespace, production. Fluid lets User B reuse the data already cached in development without a second preheating, which simplifies usage: the data is preheated once, and users in different namespaces all benefit.
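At its core, the sharing mechanism comes down to two Dataset manifests, which the walkthrough below builds step by step: a regular Dataset backed by an AlluxioRuntime in development, and a referencing Dataset in production whose mountPoint uses the dataset:// scheme. An abridged sketch (the full manifests follow in the steps below):
# Dataset in development: mounts the real storage and is paired with an AlluxioRuntime
apiVersion: data.fluid.io/v1alpha1
kind: Dataset
metadata:
  name: spark
  namespace: development
spec:
  mounts:
    - mountPoint: https://mirrors.bit.edu.cn/apache/spark/
      name: spark
---
# Dataset in production: references the Dataset above instead of mounting storage directly
apiVersion: data.fluid.io/v1alpha1
kind: Dataset
metadata:
  name: spark
  namespace: production
spec:
  mounts:
    - mountPoint: dataset://development/spark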
1. Before you run the sample, install Fluid (currently, this feature is only available in the master branch) by referring to the installation documentation. Check that the Fluid components are running properly:
$ kubectl get pod -n fluid-system
NAME READY STATUS RESTARTS AGE
csi-nodeplugin-fluid-mwx59 2/2 Running 0 5m46s
csi-nodeplugin-fluid-tcbfd 2/2 Running 0 5m46s
csi-nodeplugin-fluid-zwm8t 2/2 Running 0 5m46s
dataset-controller-5c7557c4c5-q58bb 1/1 Running 0 5m46s
fluid-webhook-67fb7dffd6-h8ksp 1/1 Running 0 5m46s
fluidapp-controller-59b4fcfcb7-b8tx5 1/1 Running 0 5m46s
2. Create a namespace development:
$ kubectl create ns development
3. Create a Dataset and AlluxioRuntime in the namespace development:
$ cat<<EOF >dataset.yaml
apiVersion: data.fluid.io/v1alpha1
kind: Dataset
metadata:
  name: spark
  namespace: development
spec:
  mounts:
    - mountPoint: https://mirrors.bit.edu.cn/apache/spark/
      name: spark
      path: "/"
---
apiVersion: data.fluid.io/v1alpha1
kind: AlluxioRuntime
metadata:
  name: spark
  namespace: development
spec:
  replicas: 1
  tieredstore:
    levels:
      - mediumtype: MEM
        path: /dev/shm
        quota: 4Gi
        high: "0.95"
        low: "0.7"
EOF
$ kubectl create -f dataset.yaml
4. View the status of the Dataset:
$ kubectl get dataset -A
NAMESPACE NAME UFS TOTAL SIZE CACHED CACHE CAPACITY CACHED PERCENTAGE PHASE AGE
development spark 3.41GiB 0.00B 4.00GiB 0.0% Bound 2m54s
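Once the Dataset is Bound, Fluid provisions a PersistentVolumeClaim with the same name as the Dataset in its namespace; this is the claim that the pod in the next step mounts. As a quick sanity check (assuming default Fluid behavior), you can list it:
$ kubectl get pvc -n development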
5. Create a pod in the namespace development to access the Dataset:
$ cat<<EOF >app.yaml
apiVersion: v1
kind: Pod
metadata:
  name: nginx
  namespace: development
spec:
  containers:
    - name: nginx
      image: nginx
      volumeMounts:
        - mountPath: /data
          name: spark
  volumes:
    - name: spark
      persistentVolumeClaim:
        claimName: spark
EOF
$ kubectl create -f app.yaml
6. View the data the application can access through the Dataset, then copy it. Copying 1.4 GB of data (7 files) takes 3 minutes and 16 seconds:
$ kubectl exec -it -n development nginx -- ls -ltr /data
total 2
dr--r----- 1 root root 6 Dec 4 15:39 spark-3.1.3
dr--r----- 1 root root 7 Dec 4 15:39 spark-3.2.3
dr--r----- 1 root root 7 Dec 4 15:39 spark-3.3.1
$ kubectl exec -it -n development nginx -- bash
root@nginx:/# time cp -R /data/spark-3.3.1/* /tmp
real 3m16.761s
user 0m0.021s
sys 0m3.520s
root@nginx:/# du -sh /tmp/
1.4G /tmp/
root@nginx:/# du -sh /tmp/*
348K /tmp/SparkR_3.3.1.tar.gz
269M /tmp/pyspark-3.3.1.tar.gz
262M /tmp/spark-3.3.1-bin-hadoop2.tgz
293M /tmp/spark-3.3.1-bin-hadoop3-scala2.13.tgz
286M /tmp/spark-3.3.1-bin-hadoop3.tgz
201M /tmp/spark-3.3.1-bin-without-hadoop.tgz
28M /tmp/spark-3.3.1.tgz
7. Load a specified subdirectory of the Dataset using DataLoad:
$ cat<<EOF >dataload.yaml
apiVersion: data.fluid.io/v1alpha1
kind: DataLoad
metadata:
  name: spark
  namespace: development
spec:
  dataset:
    name: spark
    namespace: development
  target:
    - path: /spark-3.3.1
EOF
$ kubectl create -f dataload.yaml
8. View the DataLoad status:
$ kubectl get dataload -A
NAMESPACE NAME DATASET PHASE AGE DURATION
development spark spark Complete 5m47s 2m1s
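If the DataLoad stays in a non-Complete phase longer than expected, the standard kubectl inspection commands apply (shown here as a generic troubleshooting step, not output from this walkthrough):
$ kubectl describe dataload spark -n development
$ kubectl get dataload spark -n development -o yaml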
9. Check the cache effect: you can see that 38.4% of the data has been cached:
$ kubectl get dataset -n development
NAME UFS TOTAL SIZE CACHED CACHE CAPACITY CACHED PERCENTAGE PHASE AGE
spark 3.41GiB 1.31GiB 4.00GiB 38.4% Bound 79m
10. Copying the same 1.4 GB of data again takes only about 0.8 seconds, roughly 225 times faster than before (3m16.761s vs. 0.872s):
$ kubectl exec -it -n development nginx -- bash
root@nginx:/# time cp -R /data/spark-3.3.1/* /tmp
real 0m0.872s
user 0m0.009s
sys 0m0.859s
root@nginx:/# du -sh /tmp/
1.4G /tmp/
root@nginx:/# du -sh /tmp/*
348K /tmp/SparkR_3.3.1.tar.gz
269M /tmp/pyspark-3.3.1.tar.gz
262M /tmp/spark-3.3.1-bin-hadoop2.tgz
293M /tmp/spark-3.3.1-bin-hadoop3-scala2.13.tgz
286M /tmp/spark-3.3.1-bin-hadoop3.tgz
201M /tmp/spark-3.3.1-bin-without-hadoop.tgz
28M /tmp/spark-3.3.1.tgz
11. Create a production namespace:
$ kubectl create ns production
12. In the production namespace, create a Dataset that refers to the spark Dataset. The mountPoint format is dataset://${namespace of the initial dataset}/${name of the initial dataset}; in this example, it is dataset://development/spark.
Note: A referencing Dataset currently supports only one mount, and its format must be dataset:// (Dataset creation fails when dataset:// is mixed with other formats). Other fields in the spec are invalid.
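For example, per the note above, a manifest like the following hypothetical one mixes a dataset:// mount with a regular mount and would therefore be rejected:
apiVersion: data.fluid.io/v1alpha1
kind: Dataset
metadata:
  name: spark-invalid        # hypothetical name, for illustration only
  namespace: production
spec:
  mounts:
    - mountPoint: dataset://development/spark
    - mountPoint: https://mirrors.bit.edu.cn/apache/spark/    # mixing formats: creation fails
      name: spark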
$ cat<<EOF >spark-production.yaml
apiVersion: data.fluid.io/v1alpha1
kind: Dataset
metadata:
  name: spark
  namespace: production
spec:
  mounts:
    - mountPoint: dataset://development/spark
EOF
$ kubectl create -f spark-production.yaml
13. View the Dataset: the spark Dataset in the production namespace already shows the cached data:
$ kubectl get dataset -n production
NAME UFS TOTAL SIZE CACHED CACHE CAPACITY CACHED PERCENTAGE PHASE AGE
spark 3.41GiB 1.31GiB 4.00GiB 38.4% Bound 14h
14. In the production namespace, create a pod:
$ cat<<EOF >app-production.yaml
apiVersion: v1
kind: Pod
metadata:
  name: nginx
  namespace: production
spec:
  containers:
    - name: nginx
      image: nginx
      volumeMounts:
        - mountPath: /data
          name: spark
  volumes:
    - name: spark
      persistentVolumeClaim:
        claimName: spark
EOF
$ kubectl create -f app-production.yaml
15. It takes only 0.878 seconds to access the data in the production namespace, since the cache preheated in development is reused without a second warm-up:
$ kubectl exec -it -n production nginx -- ls -ltr /data
total 2
dr--r----- 1 root root 6 Dec 4 15:39 spark-3.1.3
dr--r----- 1 root root 7 Dec 4 15:39 spark-3.2.3
dr--r----- 1 root root 7 Dec 4 15:39 spark-3.3.1
$ kubectl exec -it -n production nginx -- bash
root@nginx:/# ls -ltr /tmp/
total 0
root@nginx:/# time cp -R /data/spark-3.3.1/* /tmp
real 0m0.878s
user 0m0.014s
sys 0m0.851s
root@nginx:/# du -sh /tmp
1.4G /tmp
root@nginx:/# du -sh /tmp/*
348K /tmp/SparkR_3.3.1.tar.gz
269M /tmp/pyspark-3.3.1.tar.gz
262M /tmp/spark-3.3.1-bin-hadoop2.tgz
293M /tmp/spark-3.3.1-bin-hadoop3-scala2.13.tgz
286M /tmp/spark-3.3.1-bin-hadoop3.tgz
201M /tmp/spark-3.3.1-bin-without-hadoop.tgz
28M /tmp/spark-3.3.1.tgz
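When you are done, you can tear down the example. A reasonable order (an assumption based on the reference relationship, not a documented requirement) is to delete the referencing resources in production first, then the original ones in development:
$ kubectl delete -f app-production.yaml
$ kubectl delete -f spark-production.yaml
$ kubectl delete -f app.yaml
$ kubectl delete -f dataload.yaml
$ kubectl delete -f dataset.yaml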
The preceding example shows how to use Fluid to share a Dataset across namespaces. In a follow-up article, Fluid will support cross-namespace Dataset access on serverless Kubernetes with exactly the same user experience. We will also introduce the SubDataset capability, which lets you expose a subdirectory of a Dataset as its own Dataset while reusing the same cache. Stay tuned.