Container Service for Kubernetes: Use Fluid to accelerate access to OSS objects

Last Updated: Oct 18, 2024

Fluid is an open source, Kubernetes-native distributed dataset orchestrator and accelerator for data-intensive applications in cloud-native scenarios, such as big data applications and AI applications. JindoRuntime is the execution engine of JindoFS developed by the Alibaba Cloud E-MapReduce (EMR) team. JindoRuntime is based on C++ and provides dataset management and caching. JindoRuntime also supports Object Storage Service (OSS). Fluid enables the observability, auto scaling, and portability of datasets by managing and scheduling JindoRuntime. This topic describes how to use Fluid in a registered cluster to accelerate access to OSS objects.

How it works

The following figure shows how Fluid is used to accelerate access to OSS objects.

[Figure: Accessing OSS through Fluid]

Prerequisites

Step 1: Install ack-fluid

Use onectl

  1. Install onectl on your on-premises machine. For more information, see Use onectl to manage registered clusters.

  2. Run the following command to install ack-fluid:

    onectl addon install ack-fluid --set pullImageByVPCNetwork=false

    pullImageByVPCNetwork is optional. It specifies whether to pull the component image through a virtual private cloud (VPC).

    Expected output:

    Addon ack-fluid, version **** installed.

Use the console

  1. Log on to the ACK console. In the left-side navigation pane, choose Marketplace > Marketplace.

  2. On the App Catalog tab, find and click ack-fluid.

  3. In the upper-right part of the page, click Deploy.

  4. In the Deploy panel, specify Cluster, keep the default settings for Namespace and Release Name, and then click Next.

  5. Set Chart Version to the latest version, configure component parameters, and then click OK.

Step 2: Prepare data

  1. Run the following command to download a test dataset:

    wget https://archive.apache.org/dist/spark/spark-3.0.1/spark-3.0.1-bin-hadoop2.7.tgz
  2. Upload the test dataset to your OSS bucket. You can use ossutil, the command-line client provided by OSS, to upload the dataset. For more information, see Install ossutil.
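
As a rough sketch of the upload step, the helper below wraps the ossutil cp command. It assumes that ossutil is already installed and configured with credentials that can write to the bucket. The function name upload_dataset and the OSSUTIL override variable are illustrative and are not part of ossutil; the bucket and directory arguments stand for the <oss_bucket> and <bucket_dir> placeholders used in Step 4.

```shell
# OSSUTIL can be overridden (for example with "echo") to preview the command
# without touching the bucket. This variable is a convention of this sketch.
OSSUTIL="${OSSUTIL:-ossutil}"

# upload_dataset FILE BUCKET DIR: copy a local file to oss://BUCKET/DIR/.
upload_dataset() {
  $OSSUTIL cp "$1" "oss://$2/$3/"
}

# Usage (replace the placeholders with your own bucket and directory):
#   upload_dataset spark-3.0.1-bin-hadoop2.7.tgz <oss_bucket> <bucket_dir>
```

Running the helper with OSSUTIL=echo prints the command that would be executed, which can help catch a mistyped bucket path before any data is transferred.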

Step 3: Add labels to nodes in the external Kubernetes cluster

Run the following command for each node in the external Kubernetes cluster to add the demo-oss=true label. Replace **** with the node name. The JindoRuntime CR created in Step 4 uses this label as a node selector, so the master, worker, and FUSE components of JindoRuntime are deployed only on labeled nodes.

kubectl label node **** demo-oss=true
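
If the cluster has several nodes, the command above can be repeated in a loop. The following is a minimal sketch; the function name label_nodes, the KUBECTL override variable, and the node names in the usage comment are hypothetical. The --overwrite flag is added so that re-running the helper does not fail on nodes that already carry the label.

```shell
# KUBECTL can be overridden (for example with "echo") for a dry run.
# This variable is a convention of this sketch, not a kubectl feature.
KUBECTL="${KUBECTL:-kubectl}"

# label_nodes LABEL NODE...: apply LABEL to each named node.
label_nodes() {
  label="$1"
  shift
  for node in "$@"; do
    $KUBECTL label node "$node" "$label" --overwrite
  done
}

# Usage (node names are hypothetical):
#   label_nodes demo-oss=true node-1 node-2
# Check which nodes carry the label afterwards:
#   kubectl get nodes -l demo-oss=true
```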

Step 4: Create a Dataset CR and a JindoRuntime CR

  1. Create a file named mySecret.yaml and add the following content to the file.

    The file is used to store the fs.oss.accessKeyId and fs.oss.accessKeySecret of OSS. You must create this file before creating the Dataset CustomResource (CR).

    apiVersion: v1
    kind: Secret
    metadata:
      name: mysecret
    stringData:
      fs.oss.accessKeyId: ****
      fs.oss.accessKeySecret: ****
  2. Run the following command to create the Secret from mySecret.yaml:

    kubectl create -f mySecret.yaml

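    The Dataset CR reads the AccessKey pair from this Secret, so the credentials do not appear in plaintext in resource.yaml. Note that Kubernetes stores Secret values base64-encoded, which is an encoding and not encryption, unless encryption at rest is configured for the cluster.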

  3. Create a file named resource.yaml and add the following content to the file. The file contains a Dataset CR and a JindoRuntime CR.

    • Dataset: describes the dataset stored in the bucket and the underlying file system (UFS).

    • JindoRuntime: launches a JindoFS cluster to provide caching services.

      apiVersion: data.fluid.io/v1alpha1
      kind: Dataset
      metadata:
        name: hadoop
      spec:
        mounts:
          - mountPoint: oss://<oss_bucket>/<bucket_dir>
            options:
              fs.oss.endpoint: <oss_endpoint>
            name: hadoop
            path: "/"
            encryptOptions:
              - name: fs.oss.accessKeyId
                valueFrom:
                  secretKeyRef:
                    name: mysecret
                    key: fs.oss.accessKeyId
              - name: fs.oss.accessKeySecret
                valueFrom:
                  secretKeyRef:
                    name: mysecret
                    key: fs.oss.accessKeySecret
      ---
      apiVersion: data.fluid.io/v1alpha1
      kind: JindoRuntime
      metadata:
        name: hadoop
      spec:
        # Make sure that the cache runtime runs only on the nodes in the external Kubernetes cluster. 
        master:
          nodeSelector:
            demo-oss: "true"
        worker:
          nodeSelector:
            demo-oss: "true"
        fuse:
          nodeSelector:
            demo-oss: "true"
        replicas: 2
        tieredstore:
          levels:
            - mediumtype: HDD
              path: /mnt/disk1
              quota: 100G
              high: "0.99"
              low: "0.8"

      | Resource     | Parameter       | Description |
      | ------------ | --------------- | ----------- |
      | Dataset      | mountPoint      | oss://<oss_bucket>/<bucket_dir> specifies the path of the UFS to be mounted. You do not need to include the endpoint in the path. |
      | Dataset      | fs.oss.endpoint | The public or private endpoint of the OSS bucket. |
      | JindoRuntime | replicas        | The number of workers in the JindoFS cluster. |
      | JindoRuntime | mediumtype      | The type of cache. You can select HDD, SSD, or MEM for JindoFS when you create the JindoRuntime template. |
      | JindoRuntime | path            | The cache path. You can specify only one path. If you select MEM as the cache type, you need to specify a local path to store logs. |
      | JindoRuntime | quota           | The maximum size of the cache. Unit: GB. |
      | JindoRuntime | high            | The upper limit of the storage capacity. |
      | JindoRuntime | low             | The lower limit of the storage capacity. |

  4. Run the following command to create a Dataset CR and a JindoRuntime CR:

    kubectl create -f resource.yaml
  5. Run the following command to query the deployment of the Dataset CR:

    kubectl get dataset hadoop

    Expected output:

    NAME     UFS TOTAL SIZE   CACHED   CACHE CAPACITY   CACHED PERCENTAGE   PHASE   AGE
    hadoop        210MiB       0.00B    100.00GiB              0.0%          Bound   1h
  6. Run the following command to query the deployment of the JindoRuntime CR:

    kubectl get jindoruntime hadoop

    Expected output:

    NAME     MASTER PHASE   WORKER PHASE   FUSE PHASE   AGE
    hadoop   Ready          Ready          Ready        4m45s
  7. Run the following command to query the status of the persistent volume (PV) and persistent volume claim (PVC):

    kubectl get pv,pvc

    Expected output:

    NAME                      CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS   CLAIM            STORAGECLASS   REASON   AGE
    persistentvolume/hadoop   100Gi      RWX            Retain           Bound    default/hadoop                           52m
    
    NAME                           STATUS   VOLUME   CAPACITY   ACCESS MODES   STORAGECLASS   AGE
    persistentvolumeclaim/hadoop   Bound    hadoop   100Gi      RWX                           52m

    The output indicates that the Dataset and JindoRuntime CRs are created.
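
The checks above can also be scripted, for example in a setup job that must wait until the Dataset is usable. The sketch below polls kubectl get dataset and reads the PHASE column. It assumes the column layout shown in the expected output, where PHASE is the second-to-last column; the function names dataset_phase and wait_bound and the KUBECTL override variable are hypothetical.

```shell
# KUBECTL can be overridden for testing. This is a convention of this sketch.
KUBECTL="${KUBECTL:-kubectl}"

# dataset_phase: read `kubectl get dataset <name>` output on stdin and print
# the PHASE value of the first data row (assumed to be the second-to-last
# column, as in the expected output shown above).
dataset_phase() {
  awk 'NR == 2 { print $(NF - 1) }'
}

# wait_bound NAME [TRIES] [INTERVAL]: poll until the Dataset reports Bound.
# Returns 0 once the phase is Bound, or 1 after TRIES unsuccessful attempts.
wait_bound() {
  name="$1"
  tries="${2:-30}"
  interval="${3:-10}"
  while [ "$tries" -gt 0 ]; do
    phase="$($KUBECTL get dataset "$name" | dataset_phase)"
    if [ "$phase" = "Bound" ]; then
      return 0
    fi
    tries=$((tries - 1))
    sleep "$interval"
  done
  return 1
}

# Usage:
#   wait_bound hadoop && echo "Dataset hadoop is Bound"
```

Parsing the table output keeps the sketch tied to what this topic shows. If the column layout differs in your Fluid version, adjust dataset_phase accordingly.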

Step 5: Create a containerized application to verify the acceleration service

You can create a containerized application or submit a machine learning job to verify the JindoFS acceleration service. This section describes how to create a containerized application to access the same dataset multiple times and then compare the time consumption to verify the acceleration service.

  1. Create a file named app.yaml and add the following content to the file:

    apiVersion: v1
    kind: Pod
    metadata:
      name: demo-app
    spec:
      containers:
        - name: demo
          image: fluidcloudnative/serving
          volumeMounts:
            - mountPath: /data
              name: hadoop
      volumes:
        - name: hadoop
          persistentVolumeClaim:
            claimName: hadoop
  2. Run the following command to create a containerized application:

    kubectl create -f app.yaml
  3. Run the following commands to open a shell in the application container and query the size of the file to be accessed:

    kubectl exec -it demo-app -- bash
    du -sh /data/spark-3.0.1-bin-hadoop2.7.tgz

    Expected output:

    209.7M    /data/spark-3.0.1-bin-hadoop2.7.tgz
  4. Run the following command to query the time required to copy the file:

    time cp /data/spark-3.0.1-bin-hadoop2.7.tgz /test

    Expected output:

    real    1m2.374s
    user    0m0.000s
    sys     0m0.256s

    The output indicates that it takes 62 seconds to copy the file.

  5. Run the following command to query the cache information of the dataset:

    kubectl get dataset hadoop

    Expected output:

    NAME     UFS TOTAL SIZE   CACHED      CACHE CAPACITY   CACHED PERCENTAGE   PHASE   AGE
    hadoop   209.74MiB        209.74MiB   100.00GiB        100.0%              Bound   1h

    The output indicates that 209.7 MiB of data is cached.

  6. Run the following command to delete the current application and then create the same application.

    Note

    This operation helps eliminate the impact of other factors, such as page cache, on the verification result.

    kubectl delete -f app.yaml && kubectl create -f app.yaml
  7. Run the following commands to open a shell in the new application container and query the time required to copy the file again:

    kubectl exec -it demo-app -- bash
    time cp /data/spark-3.0.1-bin-hadoop2.7.tgz /test

    Expected output:

    real    0m3.454s
    user    0m0.000s
    sys     0m0.268s

    The output indicates that it takes about 3 seconds to copy the file, roughly one eighteenth of the 62 seconds required for the first copy. The second copy is faster because the file is now served from the JindoFS cache instead of being fetched from OSS.
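
The ratio between the two runs can be worked out directly from the real values reported by time. The following sketch converts both durations to seconds and divides them; the function names to_seconds and speedup are illustrative.

```shell
# to_seconds DURATION: convert a duration such as "1m2.374s" (the format
# printed by the shell's `time`) to seconds.
to_seconds() {
  printf '%s\n' "$1" | awk -F'[ms]' '{ printf "%.3f\n", $1 * 60 + $2 }'
}

# speedup COLD WARM: ratio of the first (uncached) copy time to the second.
speedup() {
  awk -v cold="$(to_seconds "$1")" -v warm="$(to_seconds "$2")" \
    'BEGIN { printf "%.1f\n", cold / warm }'
}

# Values taken from the two expected outputs above.
speedup 1m2.374s 0m3.454s   # prints 18.1
```

With the timings from this example, 62.374 seconds divided by 3.454 seconds gives a speedup of about 18 times. Your own numbers will vary with network bandwidth, disk type, and file size.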

(Optional) Step 6: Clear the environment

If you no longer need the acceleration service, run the following commands to clear the environment.

  1. Run the following commands to delete the application and the JindoRuntime:

    kubectl delete -f app.yaml
    kubectl delete jindoruntime hadoop
  2. Run the following command to delete the dataset:

    kubectl delete dataset hadoop
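
The cleanup can also be collected into one helper, sketched below. It deletes the application pod first so that nothing still mounts the volume, then the JindoRuntime and the Dataset. Deleting the mysecret Secret is an extra step that this topic does not require; it is included on the assumption that the credentials are no longer needed. The function name cleanup and the KUBECTL override variable are hypothetical.

```shell
# KUBECTL can be overridden (for example with "echo") for a dry run.
# This variable is a convention of this sketch, not a kubectl feature.
KUBECTL="${KUBECTL:-kubectl}"

# cleanup: remove the resources created in Steps 4 and 5.
# --ignore-not-found lets the helper be re-run safely after a partial cleanup.
cleanup() {
  $KUBECTL delete pod demo-app --ignore-not-found
  $KUBECTL delete jindoruntime hadoop --ignore-not-found
  $KUBECTL delete dataset hadoop --ignore-not-found
  # Assumption: the Secret from Step 4 is no longer needed either.
  $KUBECTL delete secret mysecret --ignore-not-found
}

# Usage:
#   cleanup
```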