Archive and unarchive data in SDK mode

JindoTable allows you to run the archiveTable and unarchiveTable commands in SDK mode to archive and unarchive data in Object Storage Service (OSS). The commands do not rely on the Jindo Namespace Service component of SmartData. This topic describes how to use the archiveTable and unarchiveTable commands.

Prerequisites

Java Development Kit (JDK) 8 is installed on your computer.
An E-MapReduce (EMR) cluster is created. For more information, see Create a cluster.
The partitioned table or non-partitioned table that you want to archive is stored in OSS. Only table data can be archived.

Background information

You can use the original archive and unarchive commands of JindoTable to archive or unarchive tables or partitions in OSS. However, these commands rely on the Jindo Namespace Service component of SmartData. The new commands archiveTable and unarchiveTable do not rely on the Jindo Namespace Service component.

The archiveTable and unarchiveTable commands have the following advantages over the archive and unarchive commands:

You can run the archiveTable and unarchiveTable commands even if the SmartData service is not deployed in your cluster. For example, you can run the commands on a self-managed cluster.
You can configure filter parameters in the archiveTable or unarchiveTable command to archive or unarchive a large number of partitions on multiple threads at the same time. If local multithreading cannot meet your business requirements, you can run MapReduce tasks on the entire cluster to archive or unarchive data.

For more information about the original archive and unarchive commands, see Use JindoTable.

Limits

The archiveTable and unarchiveTable commands are supported only in EMR V3.36.0 and later minor versions, and EMR V5.2.0 and later minor versions.
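
The version rule above can be checked in a script before the commands are called. The following is a minimal sketch and is not part of JindoTable. The function name `emr_supports_archive_table` is hypothetical, and the version string is assumed to be supplied by you, because this topic does not describe how to query it. The sketch reads "later minor versions" as later versions within the same major version and assumes that `sort -V` (GNU coreutils) is available.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeeds if the given EMR version is V3.36.0 or later
# in the 3.x series, or V5.2.0 or later in the 5.x series.
emr_supports_archive_table() {
  local version="${1#V}"   # accept "V3.36.0" as well as "3.36.0"
  local minimum
  case "${version%%.*}" in
    3) minimum="3.36.0" ;;
    5) minimum="5.2.0" ;;
    *) return 1 ;;         # other major versions are not listed in the limits
  esac
  # sort -V orders version strings; the minimum must sort first (or be equal).
  [ "$(printf '%s\n%s\n' "$minimum" "$version" | sort -V | head -n 1)" = "$minimum" ]
}

if emr_supports_archive_table "V3.36.0"; then
  echo "archiveTable and unarchiveTable are expected to be available"
fi
```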

archiveTable

You can use the archiveTable command to archive tables or partitions in OSS.

Log on to your cluster in SSH mode. For more information, see Log on to a cluster.

Run the following command to obtain help information:

jindo table -help archiveTable

The following information is returned:

  <dbName.tableName>      The table to archive.
  -a/-i                   storage policy, -a for Archive and -i for IA
                          (Infrequent Access).
  <condition>/-fullTable  A filter condition to determine which partitions should
                          be archived, supporting common operators (like '>'),
                          while -fullTable means that all partitions (or a whole
                          un-partitioned table) should be archived. One but only
                          one option must be specified among -c "<condition>" and
                          -fullTable.
  <before days>           Optional, saying that table/partitions should be
                          archived only when they are created (not updated or
                          modified) more than some days before from now.
  <parallelism>           The maximum concurrency when archiving partitions, 1 by
                          default.
  -mr/-mapReduce          Archive table/partitions using cluster-level MapReduce
                          job instead of local-level multi-thread.
  -e/-explain             If present, the command would not really archive data,
                          but only prints the table/partitions that would be
                          archived for given conditions.
        <working directory>: A directory to locate map-reduce temp files. Must not be a
  local file system directory. 'hdfs:///tmp/<current user>/jindotable-policy/' by
  default.

  <log directory>  A directory to locate log files, '/tmp/<current user>/' by
                   default.

archiveTable syntax:

-archiveTable -t <dbName.tableName> \
-a/-i \
[-c "<condition>" | -fullTable] \
[-b/-before <before days>] \
[-p/-parallel <parallelism>] \
[-mr/-mapReduce] \
[-e/-explain] \
[-w/-workingDir <working directory>] \
[-l/-logDir <log directory>]


Parameter	Description	Required
-t <dbName.tableName>	The name of the table that you want to archive. You must specify this parameter in the `Database name.Table name` format. Separate the database name and table name with a period (.). The table can be a partitioned table or a non-partitioned table.	Yes
-a/-i	The storage class to which data is archived. Specify one of the following options: `-a` for Archive, or `-i` for Infrequent Access (IA). If you specify `-i`, files whose storage class is already Archive are skipped.	Yes
-c "<condition>" | -fullTable	You must specify exactly one of `-c "<condition>"` and `-fullTable`. If you specify `-fullTable`, the entire partitioned or non-partitioned table is archived. If you specify `-c "<condition>"`, only the partitions that meet the filter condition are archived. Common operators, such as the greater-than sign (>), are supported. For example, if the partition key column is the ds column of the String type and you want to archive partitions whose ds values are greater than 'd', use `-c " ds > 'd' "`.	Yes
-b/-before <before days>	Only the tables or partitions that were created more than the specified number of days ago are archived. The creation time is used, not the time of the last update or modification.	No
-p/-parallel <parallelism>	The maximum number of partitions that are archived concurrently. Default value: 1.	No
-mr/-mapReduce	Archives data by using a cluster-level MapReduce job instead of local multithreading.	No
-e/-explain	The explain mode is used. In explain mode, the list of partitions to be archived is displayed, but no data is archived.	No
-w/-workingDir <working directory>	The working directory of the MapReduce job. This parameter takes effect only when you use a MapReduce job to archive data. The directory cannot be in a local file system, and you must have read and write permissions on it. The directory does not need to be empty. Temporary files are created in the directory while the MapReduce job runs and are automatically deleted after the job is completed. Default value: `hdfs:///tmp/<current user>/jindotable-policy/`.	No
-l/-logDir <log directory>	The directory in which log files are stored. Default value: `/tmp/<current user>/`.	No
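
The parameters above can be combined as in the following sketch, which only prints command lines for review and does not run them. It assumes that the syntax is passed to `jindo table` in the same way as the `-help` command shown earlier. The function name `archive_cmd` and the table `mydb.events` are hypothetical, and the `ds` column is taken from the filter example in the table.

```shell
#!/usr/bin/env bash
# Print a "jindo table -archiveTable" command line; nothing is executed.
# Arguments: <dbName.tableName> <a|i> <condition, or "" for -fullTable> [parallelism]
archive_cmd() {
  local table="$1" policy="$2" condition="$3" parallelism="${4:-1}" filter
  case "$policy" in
    a|i) ;;
    *) echo "policy must be 'a' (Archive) or 'i' (IA)" >&2; return 1 ;;
  esac
  # Exactly one of -c "<condition>" and -fullTable is emitted.
  if [ -n "$condition" ]; then
    filter="-c \"$condition\""
  else
    filter="-fullTable"
  fi
  printf 'jindo table -archiveTable -t %s -%s %s -p %s\n' \
    "$table" "$policy" "$filter" "$parallelism"
}

# Archive partitions whose ds value is greater than 'd', four at a time.
archive_cmd mydb.events a "ds > 'd'" 4
# prints: jindo table -archiveTable -t mydb.events -a -c "ds > 'd'" -p 4

# Same selection with -e appended: list the partitions without archiving them.
echo "$(archive_cmd mydb.events a "ds > 'd'" 4) -e"

# Move a whole table to IA with the default parallelism of 1.
archive_cmd mydb.events i ""
# prints: jindo table -archiveTable -t mydb.events -i -fullTable -p 1
```

Running the `-e` variant first is a cautious way to confirm which partitions a condition selects before any storage class is changed.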

unarchiveTable

The syntax of the unarchiveTable command is similar to the syntax of the archiveTable command. You can use the unarchiveTable command to unarchive tables or partitions in OSS.

Log on to your cluster in SSH mode. For more information, see Log on to a cluster.

Run the following command to obtain help information:

jindo table -help unarchiveTable

The following information is returned:

  <dbName.tableName>      The table to unarchive.
  -i                      unarchive to IA (Infrequent Access).
  -o                      restore to make archived data accessible temporarily.
  <condition>/-fullTable  A filter condition to determine which partitions should
                          be unarchived, supporting common operators (like '>'),
                          while -fullTable means that all partitions (or a whole
                          un-partitioned table) should be unarchived. One but
                          only one option must be specified among -c
                          "<condition>" and -fullTable.
  <before days>           Optional, saying that table/partitions should be
                          unarchived only when they are created (not updated or
                          modified) more than some days before from now.
  <parallelism>           The maximum concurrency when unarchiving partitions, 1
                          by default.
  -mr/-mapReduce          Unarchive table/partitions using cluster-level
                          MapReduce job instead of local-level multi-thread.
  -e/-explain             If present, the command would not really unarchive
                          data, but only prints the table/partitions that would
                          be unarchived for given conditions.
        <working directory>: A directory to locate map-reduce temp files. Must not be a
  local file system directory. 'hdfs:///tmp/<current user>/jindotable-policy/' by
  default.

  <log directory>  A directory to locate log files, '/tmp/<current user>/' by
                   default.

unarchiveTable syntax:

-unarchiveTable -t <dbName.tableName> \
[-i/-o] \
[-c "<condition>" | -fullTable] \
[-b/-before <before days>] \
[-p/-parallel <parallelism>] \
[-mr/-mapReduce] \
[-e/-explain] \
[-w/-workingDir <working directory>] \
[-l/-logDir <log directory>]

In the unarchiveTable command, the optional parameter -i/-o is used instead of the required parameter -a/-i. This is the only difference between the parameters of the unarchiveTable command and the archiveTable command.

Description of -i/-o:

If you specify neither -i nor -o, the storage class of the data that you want to unarchive is changed to Standard.
If you specify the -i option, the storage class of the data that you want to unarchive is changed to IA. Files whose storage class is Standard are skipped.
If you specify the -o option, data is only temporarily unarchived and its storage class is retained. Files whose storage class is Standard or IA are skipped. Files that have already been temporarily unarchived are also skipped so that they are not unarchived repeatedly.
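
The three cases above can be sketched as a small helper that prints the matching command line without running it. As with archiveTable, the sketch assumes that the syntax is passed to `jindo table`. The function name `unarchive_cmd`, the target labels `standard`, `ia`, and `temporary`, and the table `mydb.events` are hypothetical names used only for illustration.

```shell
#!/usr/bin/env bash
# Print a "jindo table -unarchiveTable" command line; nothing is executed.
# Arguments: <dbName.tableName> <standard|ia|temporary> <condition, or "" for -fullTable>
unarchive_cmd() {
  local table="$1" target="$2" condition="$3" option filter
  case "$target" in
    standard)  option="" ;;     # no option: storage class changes to Standard
    ia)        option="-i" ;;   # storage class changes to IA
    temporary) option="-o" ;;   # temporary unarchive, storage class retained
    *) echo "target must be standard, ia, or temporary" >&2; return 1 ;;
  esac
  if [ -n "$condition" ]; then
    filter="-c \"$condition\""
  else
    filter="-fullTable"
  fi
  printf 'jindo table -unarchiveTable -t %s%s %s\n' \
    "$table" "${option:+ $option}" "$filter"
}

unarchive_cmd mydb.events standard "ds > 'd'"
# prints: jindo table -unarchiveTable -t mydb.events -c "ds > 'd'"

unarchive_cmd mydb.events temporary ""
# prints: jindo table -unarchiveTable -t mydb.events -o -fullTable
```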