Hadoop DistCp (distributed copy) replicates data between Hadoop Distributed File System (HDFS) clusters or within a single cluster. It uses MapReduce to distribute the copy workload, track progress, and recover from failures during replication.
For the full list of options and advanced usage, see DistCp Guide in the Apache Hadoop documentation.
Choose between Hadoop DistCp and Jindo DistCp
E-MapReduce (EMR) provides two DistCp tools. Choose the one that matches your data source and destination.
| Tool | Description | When to use |
|---|---|---|
| Hadoop DistCp | Built-in open source Hadoop tool for distributed data replication. | Replicate data between HDFS clusters. |
| Jindo DistCp | JindoFS data migration tool. Supports Object Storage Service (OSS), OSS-HDFS, and Amazon S3-compatible data sources. | Migrate HDFS data to OSS or OSS-HDFS. Migrate Amazon S3 data to OSS or OSS-HDFS. |
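As a sketch, a Jindo DistCp migration from HDFS to OSS might look like the following. The jar path, namenode address, and bucket name are placeholders; the actual tool location and supported options depend on your EMR version, so check the Jindo DistCp documentation for your cluster.

```shell
# Hypothetical invocation: jar name and path vary by EMR version.
hadoop jar jindo-distcp-tool.jar \
  --src hdfs://namenode:8020/data \
  --dest oss://my-bucket/data \
  --parallelism 10
```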
Copy data between clusters
Prerequisites
Establish network connectivity between the source and destination HDFS clusters. For setup instructions, see E-MapReduce data migration solution.
Run a copy command
To replicate the /foo/bar directory on the nn1 cluster to /bar/foo on the nn2 cluster, run:
```shell
hadoop distcp hdfs://nn1:8020/foo/bar hdfs://nn2:8020/bar/foo
```

For more options and usage details, see DistCp Guide.
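A few standard DistCp options are worth knowing for repeated runs: `-update` copies only files that differ from the destination, `-delete` removes destination files that no longer exist in the source (valid only with `-update` or `-overwrite`), and `-m` caps the number of map tasks. For example:

```shell
# Incremental sync: copy changed files only, remove extras, limit to 20 maps
hadoop distcp -update -delete -m 20 \
  hdfs://nn1:8020/foo/bar hdfs://nn2:8020/bar/foo
```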
Troubleshooting
"ACLs not supported on at least one file system" error
DistCp returns the following error:
```
org.apache.hadoop.tools.CopyListing$AclsNotSupportedException: ACLs not supported for file system: hdfs://xx.xx.xx.xx:8020
```

To resolve this issue:
Determine whether the source cluster has access control lists (ACLs) to synchronize.
| Scenario | Action |
|---|---|
| Source cluster has ACLs to synchronize | Add the `-p` parameter after `distcp` to preserve permissions during synchronization. |
| Destination cluster does not support ACLs | Enable ACLs on the destination cluster by modifying its configuration, then restart the NameNode. |
| Source cluster does not support ACLs | Remove the `-a` parameter from the command. No ACLs need to be synchronized. |

Verify that `dfs.permissions.enabled` and `dfs.namenode.acls.enabled` match on both clusters. If the values differ, either set them to the same values on both clusters or skip ACL synchronization.
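To compare the ACL-related settings, you can read each value with `hdfs getconf` on a node of each cluster and check that the outputs match. The key names below are the standard HDFS configuration keys mentioned above.

```shell
# Run on both clusters and compare the output
hdfs getconf -confKey dfs.permissions.enabled
hdfs getconf -confKey dfs.namenode.acls.enabled
```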
Out-of-memory (OOM) error during DistCp
Open source DistCp stores the list of paths to replicate in client memory. When the file count is large (for example, one million files) or file names are long, an OOM error occurs.
Increase the client memory before running DistCp:
```shell
export HADOOP_CLIENT_OPTS="-Xmx1024m"
hadoop distcp /source /target
```