Migrate data from a self-managed HDFS cluster to LindormDFS

This topic describes how to migrate data from an open source Hadoop Distributed File System (HDFS) cluster to LindormDFS.

Background information

In some scenarios, you may need to migrate data from a self-managed Hadoop cluster to LindormDFS.

Scenarios

You can import data from a self-managed Hadoop cluster that runs on an Elastic Compute Service (ECS) instance to LindormDFS.

Preparations

Activate LindormDFS for your Lindorm instance. For more information, see Activate LindormDFS.
Modify the Hadoop configuration. For more information, see Use open source HDFS clients to connect to and use LindormDFS.
Check the connectivity between the self-managed Hadoop cluster and LindormDFS.
Run the following command on the self-managed Hadoop cluster to test the connectivity of the cluster:
```
hadoop fs -ls hdfs://${Instance ID}/
```
Replace ${Instance ID} with your Lindorm instance ID. If the command lists the files and directories in LindormDFS and does not return a connection error, the Hadoop cluster is connected to LindormDFS.
Prepare the migration tool
You can use the Apache Hadoop distributed copy (DistCp) tool to migrate full data or incremental data from a self-managed Hadoop cluster to LindormDFS. For more information about DistCp, see DistCp Guide.
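As an illustration of an incremental run, the following sketch uses the DistCp `-update` option, which copies only files that are missing at the target or that differ from the source. The paths reuse the example paths from this topic, `<instance-id>` is a placeholder for your Lindorm instance ID, and the `build_incremental_cmd` helper is a hypothetical name introduced here. The helper only prints the command so that you can review it before you run it.

```shell
# Placeholders: replace oldcluster and <instance-id> with your own values.
SRC="hdfs://oldcluster:8020/user/hive/warehouse"
DST="hdfs://<instance-id>/user/hive/warehouse"

# Hypothetical helper: prints a DistCp command for an incremental pass.
# -update copies only files that are missing at the target or differ from the source.
build_incremental_cmd() {
  printf 'hadoop distcp -update %s %s\n' "$1" "$2"
}

# Review the printed command, then run it on the self-managed Hadoop cluster.
build_incremental_cmd "$SRC" "$DST"
```

Note that `-update` also changes how DistCp maps source paths to target paths. Check the DistCp Guide before you run the command against an existing target directory.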

Migrate data from a Hadoop cluster

If the ECS instance on which the self-managed Hadoop cluster is deployed and LindormDFS are in the same virtual private cloud (VPC), you can migrate data to LindormDFS over the VPC. Run the following command to migrate data:

```
hadoop distcp -m 1000 -bandwidth 30 hdfs://oldcluster:8020/user/hive/warehouse hdfs://${Instance ID}/user/hive/warehouse
```

In the preceding command, oldcluster is the IP address or domain name of a NameNode in the self-managed Hadoop cluster, and 8020 is the NameNode port used in this example. Replace ${Instance ID} with your Lindorm instance ID.
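In DistCp, `-m` sets the maximum number of simultaneous copies (map tasks) and `-bandwidth` limits each map to the specified number of MB per second. The following sketch multiplies the two values from the command above to obtain a rough upper bound on the aggregate transfer rate. The function name `max_throughput_mb` is an illustrative assumption. The rate you actually reach is usually lower, because it depends on the number of files, cluster capacity, and network conditions.

```shell
MAPS=1000         # -m: maximum number of simultaneous copies (map tasks)
BANDWIDTH_MB=30   # -bandwidth: per-map limit in MB/s

# Upper bound on aggregate throughput in MB/s: maps x per-map bandwidth.
max_throughput_mb() {
  echo $(( $1 * $2 ))
}

max_throughput_mb "$MAPS" "$BANDWIDTH_MB"   # prints 30000
```

If the migration competes with production traffic, you can lower either value to reduce the load on the source cluster and the network.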

FAQ

The total migration time depends on the amount of data in the self-managed Hadoop cluster and on the transfer speed between that cluster and LindormDFS. If you need to migrate a large amount of data, we recommend that you first migrate a few directories and use the result to estimate how long the full migration will take. If you can migrate data only within specific time periods, split the data into smaller directories and migrate them in sequence.
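The estimate can be sketched as a simple linear extrapolation from a timed sample migration. The function name and the numbers below are made up for illustration. In practice you could take the sizes from `hadoop fs -du -s <path>` and measure how long the sample job runs. The result is only a rough guide, because throughput can vary with file sizes and cluster load.

```shell
# Extrapolate the total migration time from a timed sample migration.
# Arguments: sample size in bytes, sample duration in seconds, total size in bytes.
estimate_total_seconds() {
  sample_bytes=$1
  sample_seconds=$2
  total_bytes=$3
  echo $(( total_bytes * sample_seconds / sample_bytes ))
}

# Illustrative numbers: a 50 GiB sample took 600 seconds; the full data set is 2 TiB.
estimate_total_seconds 53687091200 600 2199023255552   # prints 24576 (about 6.8 hours)
```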
Make sure that your clients do not write to the data that is being migrated while a full migration is in progress. If data must still be written during this period, you can configure your clients to write to both the self-managed Hadoop cluster and LindormDFS. Alternatively, you can modify your client configuration so that clients write only to LindormDFS.
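If you split the data into smaller directories and migrate them in sequence, as suggested in this section, a loop like the following sketch can run one DistCp job per directory. The subdirectory names `db1.db` and `db2.db`, the `<instance-id>` placeholder, the `DRY_RUN` variable, and the function name are assumptions for illustration. By default the function only prints the commands. Set `DRY_RUN=0` to run them.

```shell
# Placeholders: replace oldcluster and <instance-id> with your own values.
SRC_ROOT="hdfs://oldcluster:8020/user/hive/warehouse"
DST_ROOT="hdfs://<instance-id>/user/hive/warehouse"

# Print (DRY_RUN=1, the default) or run one DistCp job per subdirectory, in order.
# In run mode, the loop stops at the first failed job so that you can investigate it.
migrate_in_sequence() {
  for dir in "$@"; do
    if [ "${DRY_RUN:-1}" = "1" ]; then
      echo hadoop distcp -m 1000 -bandwidth 30 "$SRC_ROOT/$dir" "$DST_ROOT/$dir"
    else
      hadoop distcp -m 1000 -bandwidth 30 "$SRC_ROOT/$dir" "$DST_ROOT/$dir" || return 1
    fi
  done
}

migrate_in_sequence db1.db db2.db
```

Running the jobs one at a time makes it easier to fit each job into an allowed time period and to rerun only the directory that failed.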