This topic describes how to use Jindo DistCp in specific scenarios.
Prerequisites
- An E-MapReduce (EMR) cluster of a required version is created. For more information, see Create a cluster.
- Java Development Kit (JDK) 1.8 is installed.
- jindo-distcp-<version>.jar is downloaded based on your Hadoop version. You can check the version as shown after this list.
- jindo-distcp-3.0.0.jar is downloaded for Hadoop 2.7 and later 2.X versions.
- jindo-distcp-3.0.0.jar is downloaded for Hadoop 3.X.
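To check which Hadoop version your cluster runs before choosing the JAR, you can run:
hadoop version
# The first output line reports the version, for example: Hadoop 2.8.5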
Use scenarios
- Scenario 1: What parameters are required for optimization if I import a large amount of HDFS data or a large number of files (millions or tens of millions) to OSS?
- Scenario 2: How do I verify data integrity after I use Jindo DistCp to import data to OSS?
- Scenario 3: What parameters are required to support resumable upload in the case of a DistCp task failure when I import HDFS data to OSS?
- Scenario 4: What parameters are required to process the files that may be newly generated when I use Jindo DistCp to import HDFS data to OSS?
- Scenario 5: What parameters are required to specify the YARN queue where a Jindo DistCp task resides and the available bandwidth allocated to the task?
- Scenario 6: What parameters are required when I write data to OSS IA or Archive storage?
- Scenario 7: What parameters are required to accelerate file transfer based on the proportion of small files and specific file sizes?
- Scenario 8: What parameters are required if Amazon S3 is used as a data source?
- Scenario 9: What parameters are required if I want to copy files to OSS and compress the copied files in the LZO or GZ format?
- Scenario 10: What parameters are required if I want to copy the files that meet specific rules or the files in some sub-directories of the same parent directory?
- Scenario 11: What parameters are required if I want to merge the files that meet specific rules to reduce the number of files?
- Scenario 12: What parameters are required if I want to delete original files after a copy operation?
- Scenario 13: What do I do if I do not want to specify the AccessKey pair and endpoint information of OSS in the CLI?
Scenario 1: What parameters are required for optimization if I import a large amount of HDFS data or a large number of files (millions or tens of millions) to OSS?
Before you run the command, make sure that the following conditions are met:
- You can read data from HDFS.
- The AccessKey ID, AccessKey secret, and endpoint of OSS are obtained. You can write data to the destination OSS bucket.
- The storage class of the OSS bucket is not Archive.
- You can submit MapReduce tasks.
- The JAR package of Jindo DistCp is downloaded.
Command example:
hadoop jar jindo-distcp-<version>.jar --src /data/incoming/hourly_table --dest oss://yang-hhht/hourly_table --ossKey yourkey --ossSecret yoursecret --ossEndPoint oss-cn-hangzhou.aliyuncs.com --parallelism 10
If you import a large number of files (millions or tens of millions), increase the --parallelism value to raise concurrency and add --enableBatch for optimization. Optimization command:
hadoop jar jindo-distcp-<version>.jar --src /data/incoming/hourly_table --dest oss://yang-hhht/hourly_table --ossKey yourkey --ossSecret yoursecret --ossEndPoint oss-cn-hangzhou.aliyuncs.com --parallelism 500 --enableBatch
Scenario 2: How do I verify data integrity after I use Jindo DistCp to import data to OSS?
- Jindo DistCp Counters
Check Distcp Counters in the counter information of the MapReduce task:
Distcp Counters
    Bytes Destination Copied=11010048000
    Bytes Source Read=11010048000
    Files Copied=1001
Shuffle Errors
    BAD_ID=0
    CONNECTION=0
    IO_ERROR=0
    WRONG_LENGTH=0
    WRONG_MAP=0
    WRONG_REDUCE=0
Where:
- Bytes Destination Copied: the total size, in bytes, of the files copied to the destination directory.
- Bytes Source Read: the total size, in bytes, of the files read from the source directory.
- Files Copied: the number of copied files.
The copy is complete when Bytes Source Read equals Bytes Destination Copied (see the log-checking sketch after this list).
- Jindo DistCp --diff
Use --diff to compare the files in the source and destination directories. The compared information includes file names and file sizes. Missing files and files that fail to be copied are recorded in a manifest file, which is generated in the directory where you perform the comparison. Add --diff to the command described in Scenario 1. Example:
hadoop jar jindo-distcp-<version>.jar --src /data/incoming/hourly_table --dest oss://yang-hhht/hourly_table --ossKey yourkey --ossSecret yoursecret --ossEndPoint oss-cn-hangzhou.aliyuncs.com --diff
If all files are copied, the following information is returned:
INFO distcp.JindoDistCp: distcp has been done completely
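A minimal sketch for checking the counters from the command line, assuming the console output of the DistCp task was redirected to a file named distcp.log (a hypothetical name):
# Print the Distcp Counters line and the three counter lines that follow it.
grep -A 3 'Distcp Counters' distcp.log
# The copy is complete when Bytes Source Read equals Bytes Destination Copied.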
Scenario 3: What parameters are required to support resumable upload in the case of a DistCp task failure when I import HDFS data to OSS?
- Add --diff to check whether all files are copied:
hadoop jar jindo-distcp-<version>.jar --src /data/incoming/hourly_table --dest oss://yang-hhht/hourly_table --ossKey yourkey --ossSecret yoursecret --ossEndPoint oss-cn-hangzhou.aliyuncs.com --diff
If all the files are copied, the following information is returned. Otherwise, a manifest file is generated. In this case, go to the next step.
INFO distcp.JindoDistCp: distcp has been done completely.
- Use --copyFromManifest and --previousManifest to copy the remaining files. Example:
hadoop jar jindo-distcp-<version>.jar --src /data/incoming/hourly_table --dest oss://yang-hhht/hourly_table --ossKey yourkey --ossSecret yoursecret --ossEndPoint oss-cn-hangzhou.aliyuncs.com --previousManifest=file:///opt/manifest-2020-04-17.gz --copyFromManifest --parallelism 20
file:///opt/manifest-2020-04-17.gz is the path where the manifest file is stored. A scripted version of this retry flow is sketched after these steps.
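The two steps above can be combined. A minimal sketch, assuming the --diff run writes its manifest to /opt/manifest-2020-04-17.gz as in the example above:
#!/bin/bash
# Compare the source and destination; a manifest is generated only if files are missing.
hadoop jar jindo-distcp-<version>.jar --src /data/incoming/hourly_table --dest oss://yang-hhht/hourly_table --ossKey yourkey --ossSecret yoursecret --ossEndPoint oss-cn-hangzhou.aliyuncs.com --diff
# If a manifest was produced, copy only the remaining files recorded in it.
if [ -f /opt/manifest-2020-04-17.gz ]; then
  hadoop jar jindo-distcp-<version>.jar --src /data/incoming/hourly_table --dest oss://yang-hhht/hourly_table --ossKey yourkey --ossSecret yoursecret --ossEndPoint oss-cn-hangzhou.aliyuncs.com --previousManifest=file:///opt/manifest-2020-04-17.gz --copyFromManifest --parallelism 20
fi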
Scenario 4: What parameters are required to process the files that may be newly generated when I use Jindo DistCp to import HDFS data to OSS?
- If no information was recorded about the files that were copied last time, generate a manifest file to record the copied files.
Add --outputManifest=manifest-2020-04-17.gz and --requirePreviousManifest=false to the command described in Scenario 1. Example:
hadoop jar jindo-distcp-<version>.jar --src /data/incoming/hourly_table --dest oss://yang-hhht/hourly_table --ossKey yourkey --ossSecret yoursecret --ossEndPoint oss-cn-hangzhou.aliyuncs.com --outputManifest=manifest-2020-04-17.gz --requirePreviousManifest=false --parallelism 20
Where:
- --outputManifest: generates a manifest file. The file name can be customized but must have the .gz extension, such as manifest-2020-04-17.gz. The file is stored in the directory specified by --dest.
- --requirePreviousManifest: specifies whether a previous manifest file is required.
- After the DistCp task is complete, copy the new files that may have been generated in the source directory.
Add --outputManifest=manifest-2020-04-18.gz and --previousManifest=oss://yang-hhht/hourly_table/manifest-2020-04-17.gz to the command described in Scenario 1. Example:
hadoop jar jindo-distcp-<version>.jar --src /data/incoming/hourly_table --dest oss://yang-hhht/hourly_table --ossKey yourkey --ossSecret yoursecret --ossEndPoint oss-cn-hangzhou.aliyuncs.com --outputManifest=manifest-2020-04-18.gz --previousManifest=oss://yang-hhht/hourly_table/manifest-2020-04-17.gz --parallelism 10
- Repeat Step 2 with updated manifest names to continuously copy incremental files (a looped sketch follows).
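A sketch of the repeated incremental copy with dated manifest names, assuming one run per day (the dates are illustrative):
# Copy only the files that are new since the previous run, and record a new manifest.
TODAY=$(date +%F)
YESTERDAY=$(date -d yesterday +%F)
hadoop jar jindo-distcp-<version>.jar --src /data/incoming/hourly_table --dest oss://yang-hhht/hourly_table --ossKey yourkey --ossSecret yoursecret --ossEndPoint oss-cn-hangzhou.aliyuncs.com --outputManifest=manifest-"$TODAY".gz --previousManifest=oss://yang-hhht/hourly_table/manifest-"$YESTERDAY".gz --parallelism 10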
Scenario 5: What parameters are required to specify the YARN queue where a Jindo DistCp task resides and the available bandwidth allocated to the task?
- --queue: the name of the YARN queue.
- --bandwidth: the size of the bandwidth allocated to the task, in MB.
Example:
hadoop jar jindo-distcp-<version>.jar --src /data/incoming/hourly_table --dest oss://yang-hhht/hourly_table --ossKey yourkey --ossSecret yoursecret --ossEndPoint oss-cn-hangzhou.aliyuncs.com --queue yarnqueue --bandwidth 6 --parallelism 10
Scenario 6: What parameters are required when I write data to OSS IA or Archive storage?
- To write data to OSS Archive storage, add --archive to the command described in Scenario 1. Example:
hadoop jar jindo-distcp-<version>.jar --src /data/incoming/hourly_table --dest oss://yang-hhht/hourly_table --ossKey yourkey --ossSecret yoursecret --ossEndPoint oss-cn-hangzhou.aliyuncs.com --archive --parallelism 20
- To write data to OSS Infrequent Access (IA) storage, add --ia to the command described in Scenario 1. Example:
hadoop jar jindo-distcp-<version>.jar --src /data/incoming/hourly_table --dest oss://yang-hhht/hourly_table --ossKey yourkey --ossSecret yoursecret --ossEndPoint oss-cn-hangzhou.aliyuncs.com --ia --parallelism 20
Scenario 7: What parameters are required to accelerate file transfer based on the proportion of small files and specific file sizes?
- A large number of small files and a few very large files
If most of the files that you want to copy are small but the individual large files carry a large data volume, the files to be copied are randomly allocated by default. If the job allocation plan is not optimized, a few large files and many small files may be allocated to the same copy process, which results in poor copy performance.
Add --enableDynamicPlan to the command described in Scenario 1 to enable the optimization. --enableDynamicPlan cannot be used together with --enableBalancePlan. Example:
hadoop jar jindo-distcp-<version>.jar --src /data/incoming/hourly_table --dest oss://yang-hhht/hourly_table --ossKey yourkey --ossSecret yoursecret --ossEndPoint oss-cn-hangzhou.aliyuncs.com --enableDynamicPlan --parallelism 10
The following figure shows the data copy before and after the plan is optimized.
- No significant differences in file sizes
If the sizes of the files that you want to copy do not differ significantly, use --enableBalancePlan to optimize the job allocation plan. Example:
hadoop jar jindo-distcp-<version>.jar --src /data/incoming/hourly_table --dest oss://yang-hhht/hourly_table --ossKey yourkey --ossSecret yoursecret --ossEndPoint oss-cn-hangzhou.aliyuncs.com --enableBalancePlan --parallelism 10
The following figure shows the data copy before and after the plan is optimized.
Scenario 8: What parameters are required if Amazon S3 is used as a data source?
- --s3Key: the AccessKey ID for Amazon S3.
- --s3Secret: the AccessKey secret for Amazon S3.
- --s3EndPoint: the endpoint for Amazon S3.
Example:
hadoop jar jindo-distcp-<version>.jar --src s3a://yourbucket/ --dest oss://yang-hhht/hourly_table --s3Key yourkey --s3Secret yoursecret --s3EndPoint s3-us-west-1.amazonaws.com --parallelism 10
Scenario 9: What parameters are required if I want to copy files to OSS and compress the copied files in the LZO or GZ format?
You can use --outputCodec to compress copied files in a format such as LZO or GZ to reduce the storage space that the files occupy. Add --outputCodec to the command described in Scenario 1. Example:
hadoop jar jindo-distcp-<version>.jar --src /data/incoming/hourly_table --dest oss://yang-hhht/hourly_table --ossKey yourkey --ossSecret yoursecret --ossEndPoint oss-cn-hangzhou.aliyuncs.com --outputCodec=gz --parallelism 10
The --outputCodec parameter also supports the following values:
- none: Jindo DistCp does not compress copied files. If the files are compressed, Jindo DistCp decompresses them.
- keep: Jindo DistCp copies files with no change to the compression.
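For example, a command that keeps the original compression of the source files unchanged:
hadoop jar jindo-distcp-<version>.jar --src /data/incoming/hourly_table --dest oss://yang-hhht/hourly_table --ossKey yourkey --ossSecret yoursecret --ossEndPoint oss-cn-hangzhou.aliyuncs.com --outputCodec=keep --parallelism 10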
Scenario 10: What parameters are required if I want to copy the files that meet specific rules or the files in some sub-directories of the same parent directory?
- If you want to copy the files that meet specific rules, add --srcPattern to the command described in Scenario 1. Example:
hadoop jar jindo-distcp-<version>.jar --src /data/incoming/hourly_table --dest oss://yang-hhht/hourly_table --ossKey yourkey --ossSecret yoursecret --ossEndPoint oss-cn-hangzhou.aliyuncs.com --srcPattern .*\.log --parallelism 10
--srcPattern: specifies a regular expression that filters files for the copy operation.
- If you want to copy the files stored in some sub-directories of the same parent directory, add --srcPrefixesFile to the command described in Scenario 1. Example:
hadoop jar jindo-distcp-<version>.jar --src /data/incoming/hourly_table --dest oss://yang-hhht/hourly_table --ossKey yourkey --ossSecret yoursecret --ossEndPoint oss-cn-hangzhou.aliyuncs.com --srcPrefixesFile file:///opt/folders.txt --parallelism 20
--srcPrefixesFile: enables Jindo DistCp to copy files in multiple folders under the same parent directory at a time (a sketch for generating folders.txt follows the file content below).
Content of the folders.txt file:
hdfs://emr-header-1.cluster-50466:9000/data/incoming/hourly_table/2017-02-01
hdfs://emr-header-1.cluster-50466:9000/data/incoming/hourly_table/2017-02-02
Scenario 11: What parameters are required if I want to merge the files that meet specific rules to reduce the number of files?
- --targetSize: the maximum size of a merged file, in MB.
- --groupBy: the merging rule, which is a regular expression.
Example (an illustration of the merging rule follows the command):
hadoop jar jindo-distcp-<version>.jar --src /data/incoming/hourly_table --dest oss://yang-hhht/hourly_table --ossKey yourkey --ossSecret yoursecret --ossEndPoint oss-cn-hangzhou.aliyuncs.com --targetSize=10 --groupBy='.*/([a-z]+).*.txt' --parallelism 20
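For illustration, assuming the source directory contains the following hypothetical files, the --groupBy rule above merges files whose first capture group is identical into output files of at most the --targetSize in MB:
# Hypothetical source files:
#   /data/incoming/hourly_table/2017-02-01/access.00.txt
#   /data/incoming/hourly_table/2017-02-01/access.01.txt
#   /data/incoming/hourly_table/2017-02-01/error.00.txt
# With --groupBy='.*/([a-z]+).*.txt', the capture group yields "access" and "error",
# so the two access files are merged into one file while the error file forms its own group.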
Scenario 12: What parameters are required if I want to delete original files after a copy operation?
Add --deleteOnSuccess to the command described in Scenario 1. Example:
hadoop jar jindo-distcp-<version>.jar --src /data/incoming/hourly_table --dest oss://yang-hhht/hourly_table --ossKey yourkey --ossSecret yoursecret --ossEndPoint oss-cn-hangzhou.aliyuncs.com --deleteOnSuccess --parallelism 10
Scenario 13: What do I do if I do not want to specify the AccessKey pair and endpoint information of OSS in the CLI?
- If you want to save the AccessKey ID, AccessKey secret, and endpoint of OSS, save
the following information into the core-site.xml file:
<configuration>
    <property>
        <name>fs.jfs.cache.oss-accessKeyId</name>
        <value>xxx</value>
    </property>
    <property>
        <name>fs.jfs.cache.oss-accessKeySecret</name>
        <value>xxx</value>
    </property>
    <property>
        <name>fs.jfs.cache.oss-endpoint</name>
        <value>oss-cn-xxx.aliyuncs.com</value>
    </property>
</configuration>
- If you want to save the AccessKey ID, AccessKey secret, and endpoint of Amazon S3,
save the following information into the core-site.xml file:
<configuration>
    <property>
        <name>fs.s3a.access.key</name>
        <value>xxx</value>
    </property>
    <property>
        <name>fs.s3a.secret.key</name>
        <value>xxx</value>
    </property>
    <property>
        <name>fs.s3.endpoint</name>
        <value>s3-us-west-1.amazonaws.com</value>
    </property>
</configuration>
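After the configuration takes effect, you can omit the AccessKey pair and endpoint from the CLI. A sketch based on the Scenario 1 command:
hadoop jar jindo-distcp-<version>.jar --src /data/incoming/hourly_table --dest oss://yang-hhht/hourly_table --parallelism 10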