This topic describes how to import data from an Object Storage Service (OSS) bucket to a database that is powered by the Lindorm file engine.
Before you begin
Activate the file engine for your Lindorm instance. For more information, see Activate the file engine service.
Create a Hadoop cluster. We recommend that you use Hadoop 2.7.3 or later. In this example, Apache Hadoop 2.7.3 is used. You must modify the Hadoop configuration. For more information, see Use an open source HDFS client to access the file engine.
Install Java Development Kit (JDK) on all the nodes of the Hadoop cluster. The JDK version must be 1.8 or later.
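To confirm the JDK version on each node, you can run a check similar to the following sketch. The host names node1, node2, and node3 are hypothetical placeholders for your own cluster nodes.
# Print the JDK version reported by each node (host names are placeholders).
for host in node1 node2 node3; do
    echo "== $host =="
    ssh "$host" 'java -version' 2>&1 | head -n 1
done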
Install the OSS client JindoFS SDK on all the nodes of the Hadoop cluster. For more information about JindoFS SDK, see JindoFS SDK.
Download jindofs-sdk.jar and copy it to the HDFS library directory of Hadoop:
cp ./jindofs-sdk-*.jar ${HADOOP_HOME}/share/hadoop/hdfs/lib/
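Before you continue, you can check that the JAR is in place on the node; the exact file name varies with the SDK version that you downloaded.
ls ${HADOOP_HOME}/share/hadoop/hdfs/lib/ | grep jindofs-sdk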
Create a JindoFS SDK configuration file for each node of the Hadoop cluster.
Add the following environment variable to the /etc/profile file.
export B2SDK_CONF_DIR=/etc/jindofs-sdk-conf
Create a JindoFS SDK configuration file named /etc/jindofs-sdk-conf/bigboot.cfg.
[bigboot]
logger.dir=/tmp/bigboot-log

[bigboot-client]
client.oss.retry=5
client.oss.upload.threads=4
client.oss.upload.queue.size=5
client.oss.upload.max.parallelism=16
client.oss.timeout.millisecond=30000
client.oss.connection.timeout.millisecond=4000
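The following sketch shows one way to create the directory and file on the current node and then copy them to the remaining nodes. The host names node2 and node3 are hypothetical placeholders.
# Create the configuration directory on the current node (run as root or with sudo).
mkdir -p /etc/jindofs-sdk-conf
# After writing bigboot.cfg with the content above, distribute it to the other nodes.
for host in node2 node3; do
    ssh "$host" 'mkdir -p /etc/jindofs-sdk-conf'
    scp /etc/jindofs-sdk-conf/bigboot.cfg "$host":/etc/jindofs-sdk-conf/
done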
Load the environment variable so that it takes effect:
source /etc/profile
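To confirm that the variable is set in the current shell, print it:
echo $B2SDK_CONF_DIR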
Verify that your OSS bucket can be accessed in the Hadoop cluster.
${HADOOP_HOME}/bin/hadoop fs -ls oss://<accessKeyId>:<accessKeySecret>@<bucket-name>.<endpoint>/
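As a hypothetical filled-in example, with a bucket named examplebucket in the China (Hangzhou) region, the command would look similar to the following. All credential and bucket values are placeholders for your own.
${HADOOP_HOME}/bin/hadoop fs -ls oss://LTAI****:yourAccessKeySecret@examplebucket.oss-cn-hangzhou.aliyuncs.com/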
Migrate data from an OSS bucket to a Lindorm file database
Determine the size of the data that needs to be migrated.
${HADOOP_HOME}/bin/hadoop fs -du -s -h oss://<accessKeyId>:<accessKeySecret>@<bucket-name>.<endpoint>/test_data
Use the Hadoop distributed copy (DistCp) tool to start a MapReduce task to migrate the data to the file database.
${HADOOP_HOME}/bin/hadoop distcp \
oss://<accessKeyId>:<accessKeySecret>@<bucket-name>.<endpoint>/test_data.txt \
hdfs://${Instance ID}/
Replace ${Instance ID} with your Lindorm instance ID.
Configure the parameters based on the description in the following table.
Parameter: accessKeyId, accessKeySecret
Description: The AccessKey pair that is required when you call the OSS API. For information about how to obtain your AccessKey pair, see Create an AccessKey pair.

Parameter: bucket-name.endpoint
Description: The access address of the OSS bucket. The address consists of the bucket name and the endpoint that corresponds to the region where the bucket is deployed.
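As a hypothetical worked example that reuses the placeholder values above and assumes a Lindorm instance ID of ld-example, the full command might look like the following. The -m option is a standard DistCp flag that caps the number of concurrent map tasks; you can tune it to the size of your data set.
${HADOOP_HOME}/bin/hadoop distcp \
-m 20 \
oss://LTAI****:yourAccessKeySecret@examplebucket.oss-cn-hangzhou.aliyuncs.com/test_data.txt \
hdfs://ld-example/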
View the migration result after the task is complete.
If the output is similar to the following example, the data has been migrated:
20/09/29 12:23:59 INFO mapreduce.Job:  map 100% reduce 0%
20/09/29 12:23:59 INFO mapreduce.Job: Job job_1601195105349_0015 completed successfully
20/09/29 12:23:59 INFO mapreduce.Job: Counters: 38
    File System Counters
        FILE: Number of bytes read=0
        FILE: Number of bytes written=122343
        FILE: Number of read operations=0
        FILE: Number of large read operations=0
        FILE: Number of write operations=0
        HDFS: Number of bytes read=470
        HDFS: Number of bytes written=47047709
        HDFS: Number of read operations=15
        HDFS: Number of large read operations=0
        HDFS: Number of write operations=4
        OSS: Number of bytes read=0
        OSS: Number of bytes written=0
        OSS: Number of read operations=0
        OSS: Number of large read operations=0
        OSS: Number of write operations=0
    Job Counters
        Launched map tasks=1
        Other local map tasks=1
        Total time spent by all maps in occupied slots (ms)=5194
        Total time spent by all reduces in occupied slots (ms)=0
        Total time spent by all map tasks (ms)=5194
        Total vcore-milliseconds taken by all map tasks=5194
        Total megabyte-milliseconds taken by all map tasks=5318656
    Map-Reduce Framework
        Map input records=1
        Map output records=0
        Input split bytes=132
        Spilled Records=0
        Failed Shuffles=0
        Merged Map outputs=0
        GC time elapsed (ms)=64
        CPU time spent (ms)=2210
        Physical memory (bytes) snapshot=222294016
        Virtual memory (bytes) snapshot=2672074752
        Total committed heap usage (bytes)=110100480
    File Input Format Counters
        Bytes Read=338
    File Output Format Counters
        Bytes Written=0
    org.apache.hadoop.tools.mapred.CopyMapper$Counter
        BYTESCOPIED=47047709
        BYTESEXPECTED=47047709
        COPY=1
20/09/29 12:23:59 INFO common.AbstractJindoFileSystem: Read total statistics: oss read average -1 us, cache read average -1 us, read oss percent 0%
Verify the migration result.
Check the size of the data that was migrated to the Lindorm file database and confirm that it matches the size of the source data.
${HADOOP_HOME}/bin/hadoop fs -du -s -h hdfs://${Instance ID}/