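This topic describes how to migrate data from Object Storage Service (OSS) to LindormDFS.
Preparations
Activate LindormDFS. For details, see the activation guide.
Set up a Hadoop cluster. We recommend Hadoop 2.7.3 or later; this topic uses Apache Hadoop 2.7.3. Modify the Hadoop configuration as described in "Access using an open source HDFS client".
Install a JDK on every node of the Hadoop cluster. JDK 1.8 or later is required.
Install the OSS client, JindoFS SDK, on the Hadoop cluster. For details about JindoFS SDK, see JindoFS SDK.
Download jindofs-sdk.jar, then copy it to the Hadoop HDFS library directory.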
cp ./jindofs-sdk-*.jar ${HADOOP_HOME}/share/hadoop/hdfs/lib/
Create the JindoFS SDK configuration file on every node of the Hadoop cluster.
Add the following environment variable to the /etc/profile file.
export B2SDK_CONF_DIR=/etc/jindofs-sdk-conf
Create the OSS storage tool configuration file /etc/jindofs-sdk-conf/bigboot.cfg.
[bigboot]
logger.dir=/tmp/bigboot-log
[bigboot-client]
client.oss.retry=5
client.oss.upload.threads=4
client.oss.upload.queue.size=5
client.oss.upload.max.parallelism=16
client.oss.timeout.millisecond=30000
client.oss.connection.timeout.millisecond=4000
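Because the same configuration file is needed on every node, it may help to script its creation. The following is a minimal sketch, not part of the JindoFS SDK: write_bigboot_cfg is a hypothetical helper name, and it simply writes the settings listed above into the directory passed as its first argument.

```shell
# Hypothetical helper (not provided by JindoFS SDK): write bigboot.cfg
# with the settings from this topic into the directory given as $1.
write_bigboot_cfg() {
  conf_dir="$1"
  mkdir -p "$conf_dir" || return 1
  cat > "$conf_dir/bigboot.cfg" <<'EOF'
[bigboot]
logger.dir=/tmp/bigboot-log
[bigboot-client]
client.oss.retry=5
client.oss.upload.threads=4
client.oss.upload.queue.size=5
client.oss.upload.max.parallelism=16
client.oss.timeout.millisecond=30000
client.oss.connection.timeout.millisecond=4000
EOF
}

# Example, run as a user who can write to /etc:
# write_bigboot_cfg /etc/jindofs-sdk-conf
```

The target directory should match the value of B2SDK_CONF_DIR set in /etc/profile.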
Load the environment variable so that it takes effect.
source /etc/profile
Verify that OSS can be accessed from the Hadoop cluster.
${HADOOP_HOME}/bin/hadoop fs -ls oss://<accessKeyId>:<accessKeySecret>@<bucket-name>.<endpoint>/
Migrate OSS data to LindormDFS
Check the size of the data to be migrated.
${HADOOP_HOME}/bin/hadoop fs -du -h oss://<accessKeyId>:<accessKeySecret>@<bucket-name>.<endpoint>/test_data
Start a Hadoop MapReduce job (DistCp) to migrate the test data to LindormDFS.
${HADOOP_HOME}/bin/hadoop distcp \
  oss://<accessKeyId>:<accessKeySecret>@<bucket-name>.<endpoint>/test_data.txt \
  hdfs://${实例Id}/
Replace ${实例Id} with the ID of your LindormDFS instance.
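The following table describes the parameters:
Parameter | Description
accessKeyId, accessKeySecret | The AccessKey pair used to call the OSS API. To obtain one, see Create an AccessKey.
bucket-name.endpoint | The OSS access domain name, made up of the bucket name and the endpoint of the region where the bucket resides.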
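The OSS URI used in the commands of this topic is assembled from these parameters. As an illustration only, the sketch below shows how the parts fit together; oss_uri is a hypothetical helper name, and the values in the usage comment are made-up examples rather than real credentials.

```shell
# Hypothetical helper: build an OSS URI in the form used by the commands in this topic.
# Arguments: accessKeyId accessKeySecret bucket-name endpoint [path]
oss_uri() {
  path="${5:-}"
  # Strip one leading slash so the URI has exactly one slash before the path.
  path="${path#/}"
  printf 'oss://%s:%s@%s.%s/%s\n' "$1" "$2" "$3" "$4" "$path"
}

# Example with made-up values:
# "${HADOOP_HOME}/bin/hadoop" fs -ls "$(oss_uri myKeyId myKeySecret mybucket oss-cn-hangzhou.aliyuncs.com test_data)"
```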
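After the job completes, check the migration result.
If the output contains information similar to the following, the migration succeeded.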
20/09/29 12:23:59 INFO mapreduce.Job: map 100% reduce 0%
20/09/29 12:23:59 INFO mapreduce.Job: Job job_1601195105349_0015 completed successfully
20/09/29 12:23:59 INFO mapreduce.Job: Counters: 38
    File System Counters
        FILE: Number of bytes read=0
        FILE: Number of bytes written=122343
        FILE: Number of read operations=0
        FILE: Number of large read operations=0
        FILE: Number of write operations=0
        HDFS: Number of bytes read=470
        HDFS: Number of bytes written=47047709
        HDFS: Number of read operations=15
        HDFS: Number of large read operations=0
        HDFS: Number of write operations=4
        OSS: Number of bytes read=0
        OSS: Number of bytes written=0
        OSS: Number of read operations=0
        OSS: Number of large read operations=0
        OSS: Number of write operations=0
    Job Counters
        Launched map tasks=1
        Other local map tasks=1
        Total time spent by all maps in occupied slots (ms)=5194
        Total time spent by all reduces in occupied slots (ms)=0
        Total time spent by all map tasks (ms)=5194
        Total vcore-milliseconds taken by all map tasks=5194
        Total megabyte-milliseconds taken by all map tasks=5318656
    Map-Reduce Framework
        Map input records=1
        Map output records=0
        Input split bytes=132
        Spilled Records=0
        Failed Shuffles=0
        Merged Map outputs=0
        GC time elapsed (ms)=64
        CPU time spent (ms)=2210
        Physical memory (bytes) snapshot=222294016
        Virtual memory (bytes) snapshot=2672074752
        Total committed heap usage (bytes)=110100480
    File Input Format Counters
        Bytes Read=338
    File Output Format Counters
        Bytes Written=0
    org.apache.hadoop.tools.mapred.CopyMapper$Counter
        BYTESCOPIED=47047709
        BYTESEXPECTED=47047709
        COPY=1
20/09/29 12:23:59 INFO common.AbstractJindoFileSystem: Read total statistics: oss read average -1 us, cache read average -1 us, read oss percent 0%
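In the sample output, the BYTESCOPIED and BYTESEXPECTED counters are equal, which is one quick signal that all expected bytes were copied. The sketch below assumes the DistCp output has been saved to a file (for example by redirecting stderr); distcp_bytes_match is a hypothetical helper name, not a Hadoop command.

```shell
# Hypothetical helper: succeed only if BYTESCOPIED equals BYTESEXPECTED
# in a saved DistCp log. $1 is the path of the log file.
distcp_bytes_match() {
  copied=$(sed -n 's/.*BYTESCOPIED=\([0-9][0-9]*\).*/\1/p' "$1" | head -n 1)
  expected=$(sed -n 's/.*BYTESEXPECTED=\([0-9][0-9]*\).*/\1/p' "$1" | head -n 1)
  # Fail if the counter is missing, or if the two values differ.
  [ -n "$copied" ] && [ "$copied" = "$expected" ]
}

# Example, assuming the job output was captured in distcp.log:
# distcp_bytes_match distcp.log && echo "all expected bytes were copied"
```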
Verify the migration result.
Check the size of the test data migrated to LindormDFS.
${HADOOP_HOME}/bin/hadoop fs -du -s -h hdfs://${实例Id}/
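To compare the source and destination sizes exactly, the byte counts can be compared instead of the human-readable values. The following is a rough sketch under two assumptions: the -h option is dropped so that hadoop fs -du -s prints the size in bytes in its first column, and the migrated file keeps the name test_data.txt under the destination root. du_bytes and same_size are hypothetical helper names.

```shell
# Hypothetical helper: read `hadoop fs -du -s <path>` output on stdin
# and print the byte count from the first column of the first line.
du_bytes() {
  awk 'NR == 1 { print $1 }'
}

# Hypothetical helper: succeed only if both byte counts are present and equal.
same_size() {
  [ -n "$1" ] && [ "$1" = "$2" ]
}

# Example usage, with the same placeholders as the commands above:
# src=$("${HADOOP_HOME}/bin/hadoop" fs -du -s "oss://<accessKeyId>:<accessKeySecret>@<bucket-name>.<endpoint>/test_data.txt" | du_bytes)
# dst=$("${HADOOP_HOME}/bin/hadoop" fs -du -s "hdfs://${实例Id}/test_data.txt" | du_bytes)
# same_size "$src" "$dst" && echo "sizes match"
```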