Hortonworks Data Platform (HDP) is a big data platform released by Hortonworks that consists of open source components such as Hadoop, Hive, and HBase. HDP 3.0.1 includes Hadoop 3.1.1, which supports Object Storage Service (OSS). However, earlier versions of HDP do not support OSS. This topic uses HDP 2.6.1.0 as an example to describe how to configure HDP 2.6 to read and write OSS data.
Prerequisites
An HDP 2.6.1.0 cluster is created. If you do not have an HDP 2.6.1.0 cluster, you can use one of the following methods to create one:
- Use Ambari to create an HDP 2.6.1.0 cluster.
- If Ambari is not available, you can manually create an HDP 2.6.1.0 cluster.
Procedure
- Download the HDP 2.6.1.0 package that supports OSS.
- Run the following command to decompress the downloaded package:
sudo tar -xvf hadoop-oss-hdp-2.6.1.0-129.tar
Sample success response:
hadoop-oss-hdp-2.6.1.0-129/
hadoop-oss-hdp-2.6.1.0-129/aliyun-java-sdk-ram-3.0.0.jar
hadoop-oss-hdp-2.6.1.0-129/aliyun-java-sdk-core-3.4.0.jar
hadoop-oss-hdp-2.6.1.0-129/aliyun-java-sdk-ecs-4.2.0.jar
hadoop-oss-hdp-2.6.1.0-129/aliyun-java-sdk-sts-3.0.0.jar
hadoop-oss-hdp-2.6.1.0-129/jdom-1.1.jar
hadoop-oss-hdp-2.6.1.0-129/aliyun-sdk-oss-3.4.1.jar
hadoop-oss-hdp-2.6.1.0-129/hadoop-aliyun-2.7.3.2.6.1.0-129.jar
- Move the JAR packages to the required directories. Note: In this topic, all content enclosed by ${} indicates an environment variable. Modify the environment variables based on your actual environment.
- Move the hadoop-aliyun-2.7.3.2.6.1.0-129.jar package to the ${/usr/hdp/current}/hadoop-client/ directory. Run the following command to check whether the package is in place:
sudo ls -lh /usr/hdp/current/hadoop-client/hadoop-aliyun-2.7.3.2.6.1.0-129.jar
Sample success response:
-rw-r--r-- 1 root root 64K Oct 28 20:56 /usr/hdp/current/hadoop-client/hadoop-aliyun-2.7.3.2.6.1.0-129.jar
- Move the other JAR packages to the ${/usr/hdp/current}/hadoop-client/lib/ directory. Run the following command to check whether the packages are in place:
sudo ls -ltrh /usr/hdp/current/hadoop-client/lib
Sample success response:
total 27M
......
drwxr-xr-x 2 root root 4.0K Oct 28 20:10 ranger-hdfs-plugin-impl
drwxr-xr-x 2 root root 4.0K Oct 28 20:10 ranger-yarn-plugin-impl
drwxr-xr-x 2 root root 4.0K Oct 28 20:10 native
-rw-r--r-- 1 root root 114K Oct 28 20:56 aliyun-java-sdk-core-3.4.0.jar
-rw-r--r-- 1 root root 513K Oct 28 20:56 aliyun-sdk-oss-3.4.1.jar
-rw-r--r-- 1 root root  13K Oct 28 20:56 aliyun-java-sdk-sts-3.0.0.jar
-rw-r--r-- 1 root root 211K Oct 28 20:56 aliyun-java-sdk-ram-3.0.0.jar
-rw-r--r-- 1 root root 770K Oct 28 20:56 aliyun-java-sdk-ecs-4.2.0.jar
-rw-r--r-- 1 root root 150K Oct 28 20:56 jdom-1.1.jar
- Perform the preceding operations on all HDP nodes.
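The JAR-placement steps above can be sketched as a small helper script to run on each node. This is a sketch under assumptions: the function name `install_oss_jars` and its arguments are hypothetical, and the HDP client directory may differ in your environment.

```shell
#!/bin/sh
# Sketch: copy the OSS support JARs into the HDP client directories on one
# node. install_oss_jars is a hypothetical helper name; adjust the paths to
# match your environment.
install_oss_jars() {
    src_dir="$1"     # directory extracted from hadoop-oss-hdp-2.6.1.0-129.tar
    hdp_client="$2"  # typically /usr/hdp/current/hadoop-client

    # The hadoop-aliyun JAR goes directly into the client directory ...
    cp "$src_dir"/hadoop-aliyun-*.jar "$hdp_client"/
    # ... and the Aliyun SDK JARs plus jdom go into its lib/ subdirectory.
    cp "$src_dir"/aliyun-*.jar "$src_dir"/jdom-*.jar "$hdp_client"/lib/
}

# Example invocation (run on every HDP node):
# install_oss_jars ./hadoop-oss-hdp-2.6.1.0-129 /usr/hdp/current/hadoop-client
```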
- Use Ambari to add configurations. If your cluster is not managed by Ambari, modify core-site.xml directly. In this example, Ambari is used. The following table describes the configurations that you must add.
| Parameter | Description |
| --- | --- |
| fs.oss.endpoint | The endpoint of the region in which the bucket that you want to access is located. Example: oss-cn-zhangjiakou-internal.aliyuncs.com. |
| fs.oss.accessKeyId | The AccessKey ID used to access OSS. |
| fs.oss.accessKeySecret | The AccessKey secret used to access OSS. |
| fs.oss.impl | The class used to implement the OSS file system based on Hadoop. Set the value to org.apache.hadoop.fs.aliyun.oss.AliyunOSSFileSystem. |
| fs.oss.buffer.dir | The directory used to store temporary files. We recommend that you set this parameter to /tmp/oss. |
| fs.oss.connection.secure.enabled | Specifies whether to enable HTTPS. Performance may be affected when HTTPS is enabled. We recommend that you set this parameter to false. |
| fs.oss.connection.maximum | The maximum number of connections to OSS. We recommend that you set this parameter to 2048. |
For more information about other parameters, see the Hadoop-Aliyun module documentation.
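If you modify core-site.xml directly instead of using Ambari, the parameters in the table above can be added as properties. The following is a minimal sketch; the endpoint and AccessKey values are placeholders that you must replace with your own:

```xml
<!-- core-site.xml: OSS-related properties (placeholder values) -->
<property>
  <name>fs.oss.endpoint</name>
  <value>oss-cn-zhangjiakou-internal.aliyuncs.com</value>
</property>
<property>
  <name>fs.oss.accessKeyId</name>
  <value>YOUR_ACCESS_KEY_ID</value>
</property>
<property>
  <name>fs.oss.accessKeySecret</name>
  <value>YOUR_ACCESS_KEY_SECRET</value>
</property>
<property>
  <name>fs.oss.impl</name>
  <value>org.apache.hadoop.fs.aliyun.oss.AliyunOSSFileSystem</value>
</property>
<property>
  <name>fs.oss.buffer.dir</name>
  <value>/tmp/oss</value>
</property>
<property>
  <name>fs.oss.connection.secure.enabled</name>
  <value>false</value>
</property>
<property>
  <name>fs.oss.connection.maximum</name>
  <value>2048</value>
</property>
```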
- Restart the cluster as prompted by Ambari.
- Test whether data can be read from and written to OSS.
- Run the following command to test whether data can be read from OSS:
sudo hadoop fs -ls oss://${your-bucket-name}/
- Run the following command to test whether data can be written to OSS:
sudo hadoop fs -mkdir oss://${your-bucket-name}/hadoop-test
If data can be read from and written to OSS, the configurations are successful. Otherwise, check whether the configurations are correct.
- To run MapReduce jobs, run the following commands to add the OSS support JAR packages to the hdfs://hdp-master:8020/hdp/apps/2.6.1.0-129/mapreduce/mapreduce.tar.gz package. Note: In this example, MapReduce jobs are used. To run jobs of other types, perform similar operations on the corresponding package. For example, to run TEZ jobs, update the hdfs://hdp-master:8020/hdp/apps/2.6.1.0-129/tez/tez.tar.gz package in the same way.
sudo su hdfs
cd
hadoop fs -copyToLocal /hdp/apps/2.6.1.0-129/mapreduce/mapreduce.tar.gz
hadoop fs -rm /hdp/apps/2.6.1.0-129/mapreduce/mapreduce.tar.gz
cp mapreduce.tar.gz mapreduce.tar.gz.bak
tar zxf mapreduce.tar.gz
cp /usr/hdp/current/hadoop-client/hadoop-aliyun-2.7.3.2.6.1.0-129.jar hadoop/share/hadoop/tools/lib/
cp /usr/hdp/current/hadoop-client/lib/aliyun-* hadoop/share/hadoop/tools/lib/
cp /usr/hdp/current/hadoop-client/lib/jdom-1.1.jar hadoop/share/hadoop/tools/lib/
tar zcf mapreduce.tar.gz hadoop
hadoop fs -copyFromLocal mapreduce.tar.gz /hdp/apps/2.6.1.0-129/mapreduce/
Verify the configurations
You can test TeraGen and TeraSort to check whether the configurations take effect.
- Run the following command to test TeraGen:
sudo hadoop jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-mapreduce-examples.jar teragen -Dmapred.map.tasks=100 10995116 oss://{bucket-name}/1G-input
Sample success response:
18/10/28 21:32:38 INFO client.RMProxy: Connecting to ResourceManager at cdh-master/192.168.0.161:8050
18/10/28 21:32:38 INFO client.AHSProxy: Connecting to Application History server at cdh-master/192.168.0.161:10200
18/10/28 21:32:38 INFO aliyun.oss: [Server]Unable to execute HTTP request: Not Found
[ErrorCode]: NoSuchKey
[RequestId]: 5BD5BA7641FCE369BC1D052C
[HostId]: null
18/10/28 21:32:38 INFO aliyun.oss: [Server]Unable to execute HTTP request: Not Found
[ErrorCode]: NoSuchKey
[RequestId]: 5BD5BA7641FCE369BC1D052F
[HostId]: null
18/10/28 21:32:39 INFO terasort.TeraSort: Generating 10995116 using 100
18/10/28 21:32:39 INFO mapreduce.JobSubmitter: number of splits:100
18/10/28 21:32:39 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1540728986531_0005
18/10/28 21:32:39 INFO impl.YarnClientImpl: Submitted application application_1540728986531_0005
18/10/28 21:32:39 INFO mapreduce.Job: The url to track the job: http://cdh-master:8088/proxy/application_1540728986531_0005/
18/10/28 21:32:39 INFO mapreduce.Job: Running job: job_1540728986531_0005
18/10/28 21:32:49 INFO mapreduce.Job: Job job_1540728986531_0005 running in uber mode : false
18/10/28 21:32:49 INFO mapreduce.Job:  map 0% reduce 0%
18/10/28 21:32:55 INFO mapreduce.Job:  map 1% reduce 0%
18/10/28 21:32:57 INFO mapreduce.Job:  map 2% reduce 0%
18/10/28 21:32:58 INFO mapreduce.Job:  map 4% reduce 0%
...
18/10/28 21:34:40 INFO mapreduce.Job:  map 99% reduce 0%
18/10/28 21:34:42 INFO mapreduce.Job:  map 100% reduce 0%
18/10/28 21:35:15 INFO mapreduce.Job: Job job_1540728986531_0005 completed successfully
18/10/28 21:35:15 INFO mapreduce.Job: Counters: 36
...
- Run the following command to test TeraSort:
sudo hadoop jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-mapreduce-examples.jar terasort -Dmapred.map.tasks=100 oss://{bucket-name}/1G-input oss://{bucket-name}/1G-output
Sample success response:
18/10/28 21:39:00 INFO terasort.TeraSort: starting
...
18/10/28 21:39:02 INFO mapreduce.JobSubmitter: number of splits:100
18/10/28 21:39:02 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1540728986531_0006
18/10/28 21:39:02 INFO impl.YarnClientImpl: Submitted application application_1540728986531_0006
18/10/28 21:39:02 INFO mapreduce.Job: The url to track the job: http://cdh-master:8088/proxy/application_1540728986531_0006/
18/10/28 21:39:02 INFO mapreduce.Job: Running job: job_1540728986531_0006
18/10/28 21:39:09 INFO mapreduce.Job: Job job_1540728986531_0006 running in uber mode : false
18/10/28 21:39:09 INFO mapreduce.Job:  map 0% reduce 0%
18/10/28 21:39:17 INFO mapreduce.Job:  map 1% reduce 0%
18/10/28 21:39:19 INFO mapreduce.Job:  map 2% reduce 0%
18/10/28 21:39:20 INFO mapreduce.Job:  map 3% reduce 0%
...
18/10/28 21:42:50 INFO mapreduce.Job:  map 100% reduce 75%
18/10/28 21:42:53 INFO mapreduce.Job:  map 100% reduce 80%
18/10/28 21:42:56 INFO mapreduce.Job:  map 100% reduce 86%
18/10/28 21:42:59 INFO mapreduce.Job:  map 100% reduce 92%
18/10/28 21:43:02 INFO mapreduce.Job:  map 100% reduce 98%
18/10/28 21:43:05 INFO mapreduce.Job:  map 100% reduce 100%
18/10/28 21:43:56 INFO mapreduce.Job: Job job_1540728986531_0006 completed successfully
18/10/28 21:43:56 INFO mapreduce.Job: Counters: 54
...
If the tests are successful, the configurations take effect.