
Object Storage Service: Connect non-EMR clusters to OSS-HDFS

Last Updated: Aug 28, 2024

OSS-HDFS (JindoFS) is fully compatible with Hadoop Distributed File System (HDFS) API operations and supports directory-level operations. JindoSDK allows Apache Hadoop-based computing and analysis applications, such as MapReduce, Hive, Spark, and Flink, to access OSS-HDFS. This topic describes how to deploy JindoSDK on an Elastic Compute Service (ECS) instance and perform common operations related to OSS-HDFS.

Note

If you use an Alibaba Cloud E-MapReduce (EMR) cluster, connect the EMR cluster to OSS-HDFS by using the methods described in Connect EMR clusters to OSS-HDFS.

Prerequisites

  • By default, an Alibaba Cloud account has the permissions required to connect non-EMR clusters to OSS-HDFS and perform common operations related to OSS-HDFS. If you want to use a RAM user instead, the RAM user must be granted the required permissions. For more information, see Grant a RAM user permissions to connect non-EMR clusters to OSS-HDFS.

Procedure

  1. Connect to an ECS instance. For more information, see Connect to an instance.

  2. Download the JindoSDK package. To download JindoSDK, visit GitHub.

  3. Run the following command to decompress the JindoSDK package:

    The following sample code provides an example on how to decompress a JindoSDK package named jindosdk-x.x.x-linux.tar.gz. If you use another version of JindoSDK, replace the package name with the name of the downloaded package.

    tar zxvf jindosdk-x.x.x-linux.tar.gz
    Note

    x.x.x indicates the version number of the JindoSDK package.

  4. Configure environment variables.

    1. Configure JINDOSDK_HOME.

      The following sample code provides an example in which the package is decompressed to the /usr/lib/jindosdk-x.x.x-linux directory:

      export JINDOSDK_HOME=/usr/lib/jindosdk-x.x.x-linux
    2. Configure HADOOP_CLASSPATH.

      export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:${JINDOSDK_HOME}/lib/*
      Important

      Specify the installation directory of the package and configure the environment variables on all required nodes. A sketch of making the environment variables persistent is provided after this procedure.

  5. Configure the implementation class of OSS-HDFS and specify the AccessKey pair that you want to use to access the bucket.

    1. Run the following command to open the Hadoop configuration file named core-site.xml:

      vim /usr/local/hadoop/etc/hadoop/core-site.xml
    2. Configure the JindoSDK implementation classes for OSS-HDFS in the core-site.xml file.

      <configuration>
          <property>
              <name>fs.AbstractFileSystem.oss.impl</name>
              <value>com.aliyun.jindodata.oss.JindoOSS</value>
          </property>
      
          <property>
              <name>fs.oss.impl</name>
              <value>com.aliyun.jindodata.oss.JindoOssFileSystem</value>
          </property>
      </configuration>
    3. In the core-site.xml file, configure the AccessKey pair of the Alibaba Cloud account or the RAM user that has the required permissions.

      For more information about the permissions that a RAM user must have in this scenario, see Grant a RAM user permissions to connect non-EMR clusters to OSS-HDFS.

      <configuration>
          <property>
              <name>fs.oss.accessKeyId</name>
              <value>xxx</value>
          </property>
      
          <property>
              <name>fs.oss.accessKeySecret</name>
              <value>xxx</value>
          </property>
      </configuration>
  6. Specify the endpoint of OSS-HDFS.

    You must specify the endpoint of OSS-HDFS if you want to use OSS-HDFS to access OSS buckets. We recommend that you configure the access path in the following format: oss://<Bucket>.<Endpoint>/<Object>. Example: oss://examplebucket.cn-hangzhou.oss-dls.aliyuncs.com/exampleobject.txt. After you configure the access path, JindoSDK calls the corresponding OSS-HDFS operation based on the specified endpoint in the access path.

    You can also configure the endpoint of OSS-HDFS by using other methods. The endpoints that are configured by using different methods have different priorities. For more information, see Appendix 1: Other methods used to configure the endpoint of OSS-HDFS.

  7. Run HDFS Shell commands to perform common operations that are related to OSS-HDFS.

    • Upload local files

      Run the following command to upload a local file named examplefile.txt from the local root directory to a bucket named examplebucket:

      hdfs dfs -put examplefile.txt oss://examplebucket.cn-hangzhou.oss-dls.aliyuncs.com/
    • Download objects

      Run the following command to download an object named exampleobject.txt from a bucket named examplebucket to the local /tmp directory on your computer:

      hdfs dfs -get oss://examplebucket.cn-hangzhou.oss-dls.aliyuncs.com/exampleobject.txt  /tmp/
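    • Other common operations

      The following commands provide a brief sketch of a few more common operations. The bucket name examplebucket, the endpoint cn-hangzhou.oss-dls.aliyuncs.com, the directory name dir, and the object name exampleobject.txt are the same placeholders used in the preceding examples; replace them with your actual values.

      hdfs dfs -ls oss://examplebucket.cn-hangzhou.oss-dls.aliyuncs.com/
      hdfs dfs -mkdir oss://examplebucket.cn-hangzhou.oss-dls.aliyuncs.com/dir/
      hdfs dfs -rm oss://examplebucket.cn-hangzhou.oss-dls.aliyuncs.com/dir/exampleobject.txt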

    For more information, see Use Hadoop Shell commands to access OSS-HDFS.
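
Note

The export commands in step 4 take effect only in the current shell session. The following is a minimal sketch of making them persistent, assuming a bash environment, the /usr/lib/jindosdk-x.x.x-linux installation directory used in the preceding example, and that Hadoop commands are run from interactive shells. For Hadoop daemons, adding the same variables to hadoop-env.sh may be more appropriate.

echo 'export JINDOSDK_HOME=/usr/lib/jindosdk-x.x.x-linux' >> ~/.bashrc
echo 'export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:${JINDOSDK_HOME}/lib/*' >> ~/.bashrc
source ~/.bashrc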

Appendix 1: Other methods used to configure the endpoint of OSS-HDFS

Apart from the preceding method used to configure the endpoint in the access path, you can use the following methods to configure the endpoint:

  • Use bucket-level endpoints

    If you use the access path in the oss://<Bucket>/<Object> format, no endpoint is configured. In this case, you can configure a bucket-level endpoint in the core-site.xml file to point to the endpoint of OSS-HDFS.

    <configuration>
        <property>
            <!-- In this example, examplebucket is used as the name of the bucket for which OSS-HDFS is enabled. Specify your actual bucket name.   -->
            <name>fs.oss.bucket.examplebucket.endpoint</name>
            <!-- In this example, the endpoint of the China (Hangzhou) region is used. Specify your actual endpoint.   -->
            <value>cn-hangzhou.oss-dls.aliyuncs.com</value>
        </property>
    </configuration>
  • Use the default OSS endpoint

    If you use the access path in the oss://<Bucket>/<Object> format and do not configure a bucket-level endpoint, the default OSS endpoint is used to access OSS-HDFS. Add the following configuration to the Hadoop configuration file named core-site.xml to specify the default OSS endpoint:

    <configuration>
        <property>
            <name>fs.oss.endpoint</name>
            <!-- In this example, the endpoint of the China (Hangzhou) region is used. Specify your actual endpoint.   -->
            <value>cn-hangzhou.oss-dls.aliyuncs.com</value>
        </property>
    </configuration>
Note

The following endpoints that are configured by using different methods are arranged in descending order of priority: the endpoint specified in the access path > the bucket-level endpoint > the default OSS endpoint.
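
For example, the following snippet in the core-site.xml file configures both a bucket-level endpoint and the default OSS endpoint. It is a hypothetical combination that reuses the placeholder bucket name and the China (Hangzhou) endpoint from the preceding examples: requests to examplebucket use the bucket-level endpoint, and requests to buckets that do not have a bucket-level endpoint fall back to the default OSS endpoint.

<configuration>
    <property>
        <!-- Bucket-level endpoint: takes precedence over the default OSS endpoint for examplebucket. -->
        <name>fs.oss.bucket.examplebucket.endpoint</name>
        <value>cn-hangzhou.oss-dls.aliyuncs.com</value>
    </property>

    <property>
        <!-- Default OSS endpoint: used for buckets that do not have a bucket-level endpoint. -->
        <name>fs.oss.endpoint</name>
        <value>cn-hangzhou.oss-dls.aliyuncs.com</value>
    </property>
</configuration>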

Appendix 2: Performance tuning

You can add the following configuration items to the core-site.xml file based on your requirements. Only JindoSDK 4.0 and later support these configuration items.

<configuration>

    <property>
          <!-- Specify the directories to which the client writes temporary files. You can configure multiple directories that are separated by commas (,). Read and write permissions must be granted in environments that involve multiple users. -->
        <name>fs.oss.tmp.data.dirs</name>
        <value>/tmp/</value>
    </property>

    <property>
          <!-- Specify the number of retries for failed access to OSS. -->
        <name>fs.oss.retry.count</name>
        <value>5</value>
    </property>

    <property>
          <!-- Specify the timeout period of OSS access requests. Unit: milliseconds. -->
        <name>fs.oss.timeout.millisecond</name>
        <value>30000</value>
    </property>

    <property>
          <!-- Specify the timeout period of OSS connections. Unit: milliseconds. -->
        <name>fs.oss.connection.timeout.millisecond</name>
        <value>3000</value>
    </property>

    <property>
          <!-- Specify the number of concurrent threads that can be used to upload a single object to OSS. -->
        <name>fs.oss.upload.thread.concurrency</name>
        <value>5</value>
    </property>

    <property>
          <!-- Specify the number of concurrent tasks that can be initiated to upload objects to OSS. -->
        <name>fs.oss.upload.queue.size</name>
        <value>5</value>
    </property>

    <property>
          <!-- Specify the maximum number of concurrent tasks that are initiated to upload objects to OSS in a process. -->
        <name>fs.oss.upload.max.pending.tasks.per.stream</name>
        <value>16</value>
    </property>

    <property>
          <!-- Specify the number of concurrent tasks that can be initiated to download objects from OSS. -->
        <name>fs.oss.download.queue.size</name>
        <value>5</value>
    </property>

    <property>
          <!-- Specify the maximum number of concurrent tasks that can be initiated to download objects from OSS in a process. -->
        <name>fs.oss.download.thread.concurrency</name>
        <value>16</value>
    </property>

    <property>
          <!-- Specify the size of the buffer that can be used to prefetch data from OSS. -->
        <name>fs.oss.read.readahead.buffer.size</name>
        <value>1048576</value>
    </property>

    <property>
          <!-- Specify the number of buffers that can be used to prefetch data from OSS at the same time. -->
        <name>fs.oss.read.readahead.buffer.count</name>
        <value>4</value>
    </property>

</configuration>
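
If you want to test one of these values before adding it to the core-site.xml file, you can override a configuration item for a single command by using the generic -D option of the Hadoop shell. The following is a minimal sketch that reuses the placeholder bucket, endpoint, and object name from the preceding examples and overrides two of the tuning items listed above:

hdfs dfs -D fs.oss.read.readahead.buffer.size=4194304 \
         -D fs.oss.download.thread.concurrency=16 \
         -get oss://examplebucket.cn-hangzhou.oss-dls.aliyuncs.com/exampleobject.txt /tmp/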