This topic describes how to import data from an Object Storage Service (OSS) bucket to a database that is powered by the Lindorm file engine.
Before you begin
Activate the file engine for your Lindorm instance. For more information, see Activate the file engine service.
Create a Hadoop cluster. We recommend that you use Hadoop 2.7.3 or later. In this example, Apache Hadoop 2.7.3 is used. You must modify the Hadoop configuration. For more information, see Use an open source HDFS client to access the file engine.
Install Java Development Kit (JDK) on all the nodes of the Hadoop cluster. The JDK version must be 1.8 or later.
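To confirm the JDK version on each node, you can run a check similar to the following sketch. The host names node1, node2, and node3 are hypothetical placeholders for your own cluster nodes.
# Print the JDK version reported by each node (host names are placeholders).
for host in node1 node2 node3; do
    echo "== $host =="
    ssh "$host" 'java -version' 2>&1 | head -n 1
done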
Install the OSS client JindoFS SDK on all the nodes of the Hadoop cluster. For more information about JindoFS SDK, see JindoFS SDK.
Download jindofs-sdk.jar and copy it to the HDFS library directory of Hadoop:
cp ./jindofs-sdk-*.jar ${HADOOP_HOME}/share/hadoop/hdfs/lib/
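Before you continue, you can check that the JAR is in place on the node; the exact file name varies with the SDK version that you downloaded.
ls ${HADOOP_HOME}/share/hadoop/hdfs/lib/ | grep jindofs-sdk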
Create a JindoFS SDK configuration file for each node of the Hadoop cluster.
Add the following environment variable to the /etc/profile file.
export B2SDK_CONF_DIR=/etc/jindofs-sdk-conf
Create a JindoFS SDK configuration file named /etc/jindofs-sdk-conf/bigboot.cfg.
[bigboot]
logger.dir=/tmp/bigboot-log

[bigboot-client]
client.oss.retry=5
client.oss.upload.threads=4
client.oss.upload.queue.size=5
client.oss.upload.max.parallelism=16
client.oss.timeout.millisecond=30000
client.oss.connection.timeout.millisecond=4000
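The following sketch shows one way to create the directory and file on the current node and then copy them to the remaining nodes. The host names node2 and node3 are hypothetical placeholders.
# Create the configuration directory on the current node (run as root or with sudo).
mkdir -p /etc/jindofs-sdk-conf
# After writing bigboot.cfg with the content above, distribute it to the other nodes.
for host in node2 node3; do
    ssh "$host" 'mkdir -p /etc/jindofs-sdk-conf'
    scp /etc/jindofs-sdk-conf/bigboot.cfg "$host":/etc/jindofs-sdk-conf/
done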
Load the environment variable so that it takes effect:
source /etc/profile
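To confirm that the variable is set in the current shell, print it:
echo $B2SDK_CONF_DIR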
Verify that your OSS bucket can be accessed in the Hadoop cluster.
${HADOOP_HOME}/bin/hadoop fs -ls oss://<accessKeyId>:<accessKeySecret>@<bucket-name>.<endpoint>/
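As a hypothetical filled-in example, with a bucket named examplebucket in the China (Hangzhou) region, the command would look similar to the following. All credential and bucket values are placeholders for your own.
${HADOOP_HOME}/bin/hadoop fs -ls oss://LTAI****:yourAccessKeySecret@examplebucket.oss-cn-hangzhou.aliyuncs.com/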
Migrate data from an OSS bucket to a Lindorm file database
Determine the size of the data that needs to be migrated.
${HADOOP_HOME}/bin/hadoop fs -du -s -h oss://<accessKeyId>:<accessKeySecret>@<bucket-name>.<endpoint>/test_data
Use the Hadoop distributed copy (DistCp) tool to start a MapReduce task to migrate the data to the file database.
${HADOOP_HOME}/bin/hadoop distcp \
oss://<accessKeyId>:<accessKeySecret>@<bucket-name>.<endpoint>/test_data.txt \
hdfs://${Instance ID}/
Replace ${Instance ID} with your Lindorm instance ID.
Configure the parameters based on the description in the following table.
Parameter: accessKeyId, accessKeySecret
Description: The AccessKey pair that is required when you call the OSS API. For information about how to obtain your AccessKey pair, see Create an AccessKey pair.

Parameter: bucket-name.endpoint
Description: The access address of the OSS bucket. The address consists of the bucket name and the endpoint that corresponds to the region where the bucket is deployed.
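As a hypothetical worked example that reuses the placeholder values above and assumes a Lindorm instance ID of ld-example, the full command might look like the following. The -m option is a standard DistCp flag that caps the number of concurrent map tasks; you can tune it to the size of your data set.
${HADOOP_HOME}/bin/hadoop distcp \
-m 20 \
oss://LTAI****:yourAccessKeySecret@examplebucket.oss-cn-hangzhou.aliyuncs.com/test_data.txt \
hdfs://ld-example/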
View the migration result after the task is complete.
If the output is similar to the following example, the data has been migrated:
20/09/29 12:23:59 INFO mapreduce.Job:  map 100% reduce 0%
20/09/29 12:23:59 INFO mapreduce.Job: Job job_1601195105349_0015 completed successfully
20/09/29 12:23:59 INFO mapreduce.Job: Counters: 38
    File System Counters
        FILE: Number of bytes read=0
        FILE: Number of bytes written=122343
        FILE: Number of read operations=0
        FILE: Number of large read operations=0
        FILE: Number of write operations=0
        HDFS: Number of bytes read=470
        HDFS: Number of bytes written=47047709
        HDFS: Number of read operations=15
        HDFS: Number of large read operations=0
        HDFS: Number of write operations=4
        OSS: Number of bytes read=0
        OSS: Number of bytes written=0
        OSS: Number of read operations=0
        OSS: Number of large read operations=0
        OSS: Number of write operations=0
    Job Counters
        Launched map tasks=1
        Other local map tasks=1
        Total time spent by all maps in occupied slots (ms)=5194
        Total time spent by all reduces in occupied slots (ms)=0
        Total time spent by all map tasks (ms)=5194
        Total vcore-milliseconds taken by all map tasks=5194
        Total megabyte-milliseconds taken by all map tasks=5318656
    Map-Reduce Framework
        Map input records=1
        Map output records=0
        Input split bytes=132
        Spilled Records=0
        Failed Shuffles=0
        Merged Map outputs=0
        GC time elapsed (ms)=64
        CPU time spent (ms)=2210
        Physical memory (bytes) snapshot=222294016
        Virtual memory (bytes) snapshot=2672074752
        Total committed heap usage (bytes)=110100480
    File Input Format Counters
        Bytes Read=338
    File Output Format Counters
        Bytes Written=0
    org.apache.hadoop.tools.mapred.CopyMapper$Counter
        BYTESCOPIED=47047709
        BYTESEXPECTED=47047709
        COPY=1
20/09/29 12:23:59 INFO common.AbstractJindoFileSystem: Read total statistics: oss read average -1 us, cache read average -1 us, read oss percent 0%
Verify the migration result.
Check the size of the data that was migrated to the Lindorm file database and confirm that it matches the size of the source data.
${HADOOP_HOME}/bin/hadoop fs -du -s -h hdfs://${Instance ID}/