JindoTable can be used to collect infrequent-access statistics of tables and partitions.
It enables you to find the data that is not accessed recently based on the last access
time of tables and partitions. This helps optimize data storage and reduce costs.
For example, in data analytics scenarios, you can use this feature to move infrequently
used partitions to lower-cost storage media.
Prerequisites
A cluster is created in E-MapReduce (EMR) V3.35.0 or later V3.X.X, or in EMR V4.9.0
or later V4.X.X. For more information, see Create a cluster.
Limits
- Data Lake Formation (DLF) metadata is not supported.
- You must have the permissions to access the underlying metadata storage, such as MySQL
or ApsaraDB RDS, from the IP addresses of the nodes on which the Hive CLI, HiveServer2,
Spark SQL CLI, Spark Thrift Server, and Presto services reside.
- You can use JindoTable to collect infrequent-access statistics only of Hive, Spark,
and Presto tables.
Query infrequent-access statistics
You can run a command to query infrequent-access statistics.
- Syntax
jindo table -leastUseStat -n <num> [-i/-ignoreNever]
num indicates the number of tables or partitions that you want to query. Set this parameter
to a positive integer. -i/-ignoreNever is an optional parameter. If you specify this parameter, the tables or partitions
that have never been accessed are filtered out.
- Description
This command is used to query the tables or partitions that have not been accessed
for the longest time.
- Example: Query the first 20 tables or partitions that have not been accessed for the
longest time.
jindo table -leastUseStat -n 20
The following figure shows the output.
- The first column lists tables in the Database name.Table name format.
- The second column lists partitions in the First partition key column=Column value/Second
partition key column=Column value/... format. For a non-partitioned table, this column
is empty.
- The third column displays the last access time of the tables or partitions in the
yyyy-MM-dd HH:mm:ss format.
Note For a partitioned table, only partition-level access statistics are displayed. Table-level
access statistics are not displayed.
For more information about how to use JindoTable, see Use JindoTable.
Enable the collection of infrequent-access statistics for Spark
- Go to the Spark service page.
- Log on to the Alibaba Cloud EMR console.
- In the top navigation bar, select the region where your cluster resides and select a resource group based on your business requirements.
- Click the Cluster Management tab.
- On the Cluster Management page, find your cluster and click Details in the Actions column.
- In the left-side navigation pane, choose .
- On the Spark service page, click the Configure tab.
- Search for the spark.sql.queryExecutionListeners parameter and check whether the value contains com.aliyun.emr.table.spark.SparkSQLQueryListener. If the value does not contain this listener, add the listener to the value. Separate
multiple listeners with comma(,).
- Add custom configurations.
- In the Service Configuration section, click the spark-defaults tab.
- Click Custom Configuration in the upper-right corner.
- In the Add Configuration Item dialog box, add the spark.sql.query.update.access.time.enabled parameter and set it to true.
- Click OK.
- Save the configurations.
- Click Save in the upper-right corner.
- In the Confirm Changes dialog box, specify Description and click OK.
- Restart all components.
- In the upper-right corner of the Spark service page, choose .
- In the Cluster Activities dialog box, specify Description and click OK.
- In the Confirm message, click OK.
Enable the collection of infrequent-access statistics for Hive
- Go to the Hive service page.
- Log on to the Alibaba Cloud EMR console.
- In the top navigation bar, select the region where your cluster resides and select a resource group based on your business requirements.
- Click the Cluster Management tab.
- On the Cluster Management page, find your cluster and click Details in the Actions column.
- In the left-side navigation pane, choose .
- On the Hive service page, click the Configure tab.
- Search for the hive.exec.post.hooks parameter and check whether the value contains com.aliyun.emr.table.hive.HivePostHook. If the value does not contain this hook, add the hook to the value. Separate multiple
hooks with comma(,).
- Add custom configurations.
- In the Service Configuration section, click the hive-site tab.
- Click Custom Configuration in the upper-right corner.
- In the Add Configuration Item dialog box, add the hive.hook.update.access.time.enabled parameter and set it to true.
- Click OK.
- Save the configurations.
- Click Save in the upper-right corner.
- In the Confirm Changes dialog box, specify Description and click OK.
- Restart all components.
- In the upper-right corner of the Spark service page, choose .
- In the Cluster Activities dialog box, specify Description and click OK.
- In the Confirm message, click OK.
Enable the collection of infrequent-access statistics for Presto
- Go to the Presto service page.
- Log on to the Alibaba Cloud EMR console.
- In the top navigation bar, select the region where your cluster resides and select a resource group based on your business requirements.
- Click the Cluster Management tab.
- On the Cluster Management page, find your cluster and click Details in the Actions column.
- In the left-side navigation pane, choose .
- On the Presto service page, click the Configure tab.
- Search for the event-listener.name parameter and check whether the value contains jindo-presto-post-event-listener. If the value does not contain this listener, add the listener to the value.
- Add custom configurations.
- In the Service Configuration section, click the event-listener.properties tab.
- Click Custom Configuration in the upper-right corner.
- In the Add Configuration Item dialog box, add the listener.update.access.time.enabled parameter and set it to true.
- Click OK.
- Save the configurations.
- Click Save in the upper-right corner.
- In the Confirm Changes dialog box, specify Description and click OK.
- Restart all components.
- In the upper-right corner of the Spark service page, choose .
- In the Cluster Activities dialog box, specify Description and click OK.
- In the Confirm message, click OK.