Use JindoTable to collect infrequent-access statistics of tables and partitions - E-MapReduce

JindoTable can be used to collect infrequent-access statistics of tables and partitions. It enables you to find the data that is not accessed recently based on the last access time of tables and partitions. This helps optimize data storage and reduce costs. For example, in data analytics scenarios, you can use this feature to move infrequently used partitions to lower-cost storage media.

Prerequisites

A cluster is created in E-MapReduce (EMR) V3.35.0 or later V3.X.X, or in EMR V4.9.0 or later V4.X.X. For more information, see Create a cluster.

Background information

In SmartData 3.5.X and later, JindoTable can be used to collect infrequent-access statistics of Hive, Spark, and Presto tables. This feature is disabled by default. For more information about how to enable this feature, see Enable the collection of infrequent-access statistics for Spark, Enable the collection of infrequent-access statistics for Hive, and Enable the collection of infrequent-access statistics for Presto.

Note The same hooks and listeners are used to collect frequent-access and infrequent-access statistics. Therefore, if you enable the collection of infrequent-access statistics, the collection of frequent-access statistics is also enabled. For more information about the collection of frequent-access statistics, see Use JindoTable to collect access frequency statistics on tables and partitions.

Limits

Data Lake Formation (DLF) metadata is not supported.
You must have the permissions to access the underlying metadata storage, such as MySQL or ApsaraDB RDS, from the IP addresses of the nodes on which the Hive CLI, HiveServer2, Spark SQL CLI, Spark Thrift Server, and Presto services reside.
You can use JindoTable to collect infrequent-access statistics only of Hive, Spark, and Presto tables.

Query infrequent-access statistics

You can run a command to query infrequent-access statistics.

Syntax
```
jindo table -leastUseStat -n <num> [-i/-ignoreNever]
```
num indicates the number of tables or partitions that you want to query. Set this parameter to a positive integer. -i/-ignoreNever is an optional parameter. If you specify this parameter, the tables or partitions that have never been accessed are filtered out.
Description
This command is used to query the tables or partitions that have not been accessed for the longest time.
Example: Query the first 20 tables or partitions that have not been accessed for the longest time.
```
jindo table -leastUseStat -n 20
```
The following figure shows the output.
- The first column lists tables in the Database name.Table name format.
- The second column lists partitions in the First partition key column=Column value/Second partition key column=Column value/... format. For a non-partitioned table, this column is empty.
- The third column displays the last access time of the tables or partitions in the yyyy-MM-dd HH:mm:ss format.
Note For a partitioned table, only partition-level access statistics are displayed. Table-level access statistics are not displayed.

For more information about how to use JindoTable, see Use JindoTable.

Enable the collection of infrequent-access statistics for Spark

Go to the Spark service page.
1. Log on to the Alibaba Cloud EMR console.
2. In the top navigation bar, select the region where your cluster resides and select a resource group based on your business requirements.
3. Click the Cluster Management tab.
4. On the Cluster Management page, find your cluster and click Details in the Actions column.
5. In the left-side navigation pane, choose Cluster Service > Spark.
On the Spark service page, click the Configure tab.
Search for the spark.sql.queryExecutionListeners parameter and check whether the value contains com.aliyun.emr.table.spark.SparkSQLQueryListener. If the value does not contain this listener, add the listener to the value. Separate multiple listeners with comma(,).
Add custom configurations.
1. In the Service Configuration section, click the spark-defaults tab.
2. Click Custom Configuration in the upper-right corner.
3. In the Add Configuration Item dialog box, add the spark.sql.query.update.access.time.enabled parameter and set it to true.
4. Click OK.
Save the configurations.
1. Click Save in the upper-right corner.
2. In the Confirm Changes dialog box, specify Description and click OK.
Restart all components.
1. In the upper-right corner of the Spark service page, choose Actions > Restart All Components.
2. In the Cluster Activities dialog box, specify Description and click OK.
3. In the Confirm message, click OK.

Enable the collection of infrequent-access statistics for Hive

Go to the Hive service page.
1. Log on to the Alibaba Cloud EMR console.
2. In the top navigation bar, select the region where your cluster resides and select a resource group based on your business requirements.
3. Click the Cluster Management tab.
4. On the Cluster Management page, find your cluster and click Details in the Actions column.
5. In the left-side navigation pane, choose Cluster Service > Hive.
On the Hive service page, click the Configure tab.
Search for the hive.exec.post.hooks parameter and check whether the value contains com.aliyun.emr.table.hive.HivePostHook. If the value does not contain this hook, add the hook to the value. Separate multiple hooks with comma(,).
Add custom configurations.
1. In the Service Configuration section, click the hive-site tab.
2. Click Custom Configuration in the upper-right corner.
3. In the Add Configuration Item dialog box, add the hive.hook.update.access.time.enabled parameter and set it to true.
4. Click OK.
Save the configurations.
1. Click Save in the upper-right corner.
2. In the Confirm Changes dialog box, specify Description and click OK.
Restart all components.
1. In the upper-right corner of the Spark service page, choose Actions > Restart All Components.
2. In the Cluster Activities dialog box, specify Description and click OK.
3. In the Confirm message, click OK.

Enable the collection of infrequent-access statistics for Presto

Go to the Presto service page.
1. Log on to the Alibaba Cloud EMR console.
2. In the top navigation bar, select the region where your cluster resides and select a resource group based on your business requirements.
3. Click the Cluster Management tab.
4. On the Cluster Management page, find your cluster and click Details in the Actions column.
5. In the left-side navigation pane, choose Cluster Service > Presto.
On the Presto service page, click the Configure tab.
Search for the event-listener.name parameter and check whether the value contains jindo-presto-post-event-listener. If the value does not contain this listener, add the listener to the value.
Add custom configurations.
1. In the Service Configuration section, click the event-listener.properties tab.
2. Click Custom Configuration in the upper-right corner.
3. In the Add Configuration Item dialog box, add the listener.update.access.time.enabled parameter and set it to true.
4. Click OK.
Save the configurations.
1. Click Save in the upper-right corner.
2. In the Confirm Changes dialog box, specify Description and click OK.
Restart all components.
1. In the upper-right corner of the Spark service page, choose Actions > Restart All Components.
2. In the Cluster Activities dialog box, specify Description and click OK.
3. In the Confirm message, click OK.