
Object Storage Service: Connect non-EMR clusters to OSS-HDFS

Last Updated: Aug 28, 2024

OSS-HDFS (JindoFS) is fully compatible with Hadoop Distributed File System (HDFS) API operations and supports directory-level operations. JindoSDK allows Apache Hadoop-based computing and analysis applications, such as MapReduce, Hive, Spark, and Flink, to access OSS-HDFS. This topic describes how to deploy JindoSDK on an Elastic Compute Service (ECS) instance and perform common operations related to OSS-HDFS.

Note

If you use an Alibaba Cloud E-MapReduce (EMR) cluster, connect the EMR cluster to OSS-HDFS by using the methods described in Connect EMR clusters to OSS-HDFS.

Prerequisites

  • By default, an Alibaba Cloud account has the permissions required to connect non-EMR clusters to OSS-HDFS and perform common operations related to OSS-HDFS. If you want to use a RAM user instead, the RAM user must be granted the required permissions. For more information, see Grant a RAM user permissions to connect non-EMR clusters to OSS-HDFS.

Procedure

  1. Connect to an ECS instance. For more information, see Connect to an instance.

  2. Download the JindoSDK package. To download JindoSDK, visit GitHub.

  3. Run the following command to decompress the JindoSDK package:

    The following sample code provides an example on how to decompress a JindoSDK package named jindosdk-x.x.x-linux.tar.gz. If you use another version of JindoSDK, replace the package name with the name of the downloaded package.

    tar zxvf jindosdk-x.x.x-linux.tar.gz
    Note

    x.x.x indicates the version number of the JindoSDK package.

  4. Configure environment variables.

    1. Configure JINDOSDK_HOME.

      The following sample code provides an example in which the package is decompressed to the /usr/lib/jindosdk-x.x.x-linux directory:

      export JINDOSDK_HOME=/usr/lib/jindosdk-x.x.x-linux
    2. Configure HADOOP_CLASSPATH.

      export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:${JINDOSDK_HOME}/lib/*
      Important

      Specify the installation directory of the package and configure the environment variables on all required nodes. A sketch of making the environment variables persistent is provided after this procedure.

  5. Configure the implementation class of OSS-HDFS and specify the AccessKey pair that you want to use to access the bucket.

    1. Run the following command to open the Hadoop configuration file named core-site.xml:

      vim /usr/local/hadoop/etc/hadoop/core-site.xml
    2. Configure the JindoSDK implementation classes for OSS-HDFS in the core-site.xml file.

      <configuration>
          <property>
              <name>fs.AbstractFileSystem.oss.impl</name>
              <value>com.aliyun.jindodata.oss.JindoOSS</value>
          </property>
      
          <property>
              <name>fs.oss.impl</name>
              <value>com.aliyun.jindodata.oss.JindoOssFileSystem</value>
          </property>
      </configuration>
    3. In the core-site.xml file, configure the AccessKey pair of the Alibaba Cloud account or the RAM user that has the required permissions.

      For more information about the permissions that a RAM user must have in this scenario, see Grant a RAM user permissions to connect non-EMR clusters to OSS-HDFS.

      <configuration>
          <property>
              <name>fs.oss.accessKeyId</name>
              <value>xxx</value>
          </property>
      
          <property>
              <name>fs.oss.accessKeySecret</name>
              <value>xxx</value>
          </property>
      </configuration>
  6. Specify the endpoint of OSS-HDFS.

    You must specify the endpoint of OSS-HDFS if you want to use OSS-HDFS to access OSS buckets. We recommend that you configure the access path in the following format: oss://<Bucket>.<Endpoint>/<Object>. Example: oss://examplebucket.cn-hangzhou.oss-dls.aliyuncs.com/exampleobject.txt. After you configure the access path, JindoSDK calls the corresponding OSS-HDFS operation based on the specified endpoint in the access path.

    You can also configure the endpoint of OSS-HDFS by using other methods. The endpoints that are configured by using different methods have different priorities. For more information, see Appendix 1: Other methods used to configure the endpoint of OSS-HDFS.

  7. Run HDFS Shell commands to perform common operations that are related to OSS-HDFS.

    • Upload local files

      Run the following command to upload a local file named examplefile.txt from the local root directory to a bucket named examplebucket:

      hdfs dfs -put examplefile.txt oss://examplebucket.cn-hangzhou.oss-dls.aliyuncs.com/
    • Download objects

      Run the following command to download an object named exampleobject.txt from a bucket named examplebucket to the local /tmp directory on your computer:

      hdfs dfs -get oss://examplebucket.cn-hangzhou.oss-dls.aliyuncs.com/exampleobject.txt  /tmp/
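    • Other common operations

      The following commands provide a brief sketch of a few more common operations. The bucket name examplebucket, the endpoint cn-hangzhou.oss-dls.aliyuncs.com, the directory name dir, and the object name exampleobject.txt are the same placeholders used in the preceding examples; replace them with your actual values.

      hdfs dfs -ls oss://examplebucket.cn-hangzhou.oss-dls.aliyuncs.com/
      hdfs dfs -mkdir oss://examplebucket.cn-hangzhou.oss-dls.aliyuncs.com/dir/
      hdfs dfs -rm oss://examplebucket.cn-hangzhou.oss-dls.aliyuncs.com/dir/exampleobject.txt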

    For more information, see Use Hadoop Shell commands to access OSS-HDFS.
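
Note

The export commands in step 4 take effect only in the current shell session. The following is a minimal sketch of making them persistent, assuming a bash environment, the /usr/lib/jindosdk-x.x.x-linux installation directory used in the preceding example, and that Hadoop commands are run from interactive shells. For Hadoop daemons, adding the same variables to hadoop-env.sh may be more appropriate.

echo 'export JINDOSDK_HOME=/usr/lib/jindosdk-x.x.x-linux' >> ~/.bashrc
echo 'export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:${JINDOSDK_HOME}/lib/*' >> ~/.bashrc
source ~/.bashrc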

Appendix 1: Other methods used to configure the endpoint of OSS-HDFS

Apart from the preceding method used to configure the endpoint in the access path, you can use the following methods to configure the endpoint:

  • Use bucket-level endpoints

    If you use the access path in the oss://<Bucket>/<Object> format, no endpoint is configured. In this case, you can configure a bucket-level endpoint in the core-site.xml file to point to the endpoint of OSS-HDFS.

    <configuration>
        <property>
            <!-- In this example, examplebucket is used as the name of the bucket for which OSS-HDFS is enabled. Specify your actual bucket name.   -->
            <name>fs.oss.bucket.examplebucket.endpoint</name>
            <!-- In this example, the endpoint of the China (Hangzhou) region is used. Specify your actual endpoint.   -->
            <value>cn-hangzhou.oss-dls.aliyuncs.com</value>
        </property>
    </configuration>
  • Use the default OSS endpoint

    If you use the access path in the oss://<Bucket>/<Object> format and do not configure a bucket-level endpoint, the default OSS endpoint is used to access OSS-HDFS. Add the following configuration to the Hadoop configuration file named core-site.xml to specify the default OSS endpoint:

    <configuration>
        <property>
            <name>fs.oss.endpoint</name>
            <!-- In this example, the endpoint of the China (Hangzhou) region is used. Specify your actual endpoint.   -->
            <value>cn-hangzhou.oss-dls.aliyuncs.com</value>
        </property>
    </configuration>
Note

The following endpoints that are configured by using different methods are arranged in descending order of priority: the endpoint specified in the access path > the bucket-level endpoint > the default OSS endpoint.
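
For example, the following snippet in the core-site.xml file configures both a bucket-level endpoint and the default OSS endpoint. It is a hypothetical combination that reuses the placeholder bucket name and the China (Hangzhou) endpoint from the preceding examples: requests to examplebucket use the bucket-level endpoint, and requests to buckets that do not have a bucket-level endpoint fall back to the default OSS endpoint.

<configuration>
    <property>
        <!-- Bucket-level endpoint: takes precedence over the default OSS endpoint for examplebucket. -->
        <name>fs.oss.bucket.examplebucket.endpoint</name>
        <value>cn-hangzhou.oss-dls.aliyuncs.com</value>
    </property>

    <property>
        <!-- Default OSS endpoint: used for buckets that do not have a bucket-level endpoint. -->
        <name>fs.oss.endpoint</name>
        <value>cn-hangzhou.oss-dls.aliyuncs.com</value>
    </property>
</configuration>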

Appendix 2: Performance tuning

You can add the following configuration items to the core-site.xml file based on your requirements. Only JindoSDK 4.0 and later support these configuration items.

<configuration>

    <property>
          <!-- Specify the directories to which the client writes temporary files. You can configure multiple directories that are separated by commas (,). Read and write permissions must be granted in environments that involve multiple users. -->
        <name>fs.oss.tmp.data.dirs</name>
        <value>/tmp/</value>
    </property>

    <property>
          <!-- Specify the number of retries for failed access to OSS. -->
        <name>fs.oss.retry.count</name>
        <value>5</value>
    </property>

    <property>
          <!-- Specify the timeout period of OSS access requests. Unit: milliseconds. -->
        <name>fs.oss.timeout.millisecond</name>
        <value>30000</value>
    </property>

    <property>
          <!-- Specify the timeout period of OSS connections. Unit: milliseconds. -->
        <name>fs.oss.connection.timeout.millisecond</name>
        <value>3000</value>
    </property>

    <property>
          <!-- Specify the number of concurrent threads that can be used to upload a single object to OSS. -->
        <name>fs.oss.upload.thread.concurrency</name>
        <value>5</value>
    </property>

    <property>
          <!-- Specify the number of concurrent tasks that can be initiated to upload objects to OSS. -->
        <name>fs.oss.upload.queue.size</name>
        <value>5</value>
    </property>

    <property>
          <!-- Specify the maximum number of concurrent tasks that are initiated to upload objects to OSS in a process. -->
        <name>fs.oss.upload.max.pending.tasks.per.stream</name>
        <value>16</value>
    </property>

    <property>
          <!-- Specify the number of concurrent tasks that can be initiated to download objects from OSS. -->
        <name>fs.oss.download.queue.size</name>
        <value>5</value>
    </property>

    <property>
          <!-- Specify the maximum number of concurrent tasks that can be initiated to download objects from OSS in a process. -->
        <name>fs.oss.download.thread.concurrency</name>
        <value>16</value>
    </property>

    <property>
          <!-- Specify the size of the buffer that can be used to prefetch data from OSS. -->
        <name>fs.oss.read.readahead.buffer.size</name>
        <value>1048576</value>
    </property>

    <property>
          <!-- Specify the number of buffers that can be used to prefetch data from OSS at the same time. -->
        <name>fs.oss.read.readahead.buffer.count</name>
        <value>4</value>
    </property>

</configuration>
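
If you want to test one of these values before adding it to the core-site.xml file, you can override a configuration item for a single command by using the generic -D option of the Hadoop shell. The following is a minimal sketch that reuses the placeholder bucket, endpoint, and object name from the preceding examples and overrides two of the tuning items listed above:

hdfs dfs -D fs.oss.read.readahead.buffer.size=4194304 \
         -D fs.oss.download.thread.concurrency=16 \
         -get oss://examplebucket.cn-hangzhou.oss-dls.aliyuncs.com/exampleobject.txt /tmp/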