JindoTable provides a native engine that accelerates queries on files. The engine speeds up Spark, Hive, and Presto queries against ORC and Parquet files. This feature is disabled by default.
Prerequisites
An E-MapReduce (EMR) cluster of V3.35.0 or a later V3.X.X version, or of V4.9.0 or a later V4.X.X version, is created, and the ORC or Parquet files are stored in JindoFS or Object Storage Service (OSS). For more information about how to create an EMR cluster, see Create a cluster.
Background information
The following table lists the supported Spark, Hive, and Presto engines and the file formats supported by each engine.
| Engine | ORC | Parquet |
| ------ | --- | ------- |
| Spark2 | Supported | Supported |
| Presto | Supported | Unsupported |
| Hive2 | Unsupported | Supported |
Limits
Data of the binary type is not supported.
Partitioned tables in which the values of partition key columns are stored in the data files are not supported.
You are not allowed to use spark.read.schema (userDefinedSchema) to define a schema, because the schema may be inconsistent with the schema of the existing files. See the example after this list.
Data of the date type must be in the YYYY-MM-DD format and range from 1400-01-01 to 9999-12-31.
Queries of columns that are case-sensitive in the same table cannot be accelerated. For example, queries of the ID and id columns of the same table cannot be accelerated.
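The following snippet illustrates the schema limit above. This is a hypothetical sketch: the schema and the oss://bucket/path location are placeholders.
import org.apache.spark.sql.types.{LongType, StructType}
// Not supported with query acceleration: a user-defined schema may
// conflict with the schema stored in the files themselves.
val userDefinedSchema = new StructType().add("id", LongType)  // hypothetical schema
val df = spark.read.schema(userDefinedSchema).parquet("oss://bucket/path")  // hypothetical path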
Improve the read performance of Spark
You can use the native engine when Spark reads data from ORC or Parquet files. Enable query acceleration for ORC or Parquet files in JindoTable by configuring global parameters or job-level parameters.
Note: Query acceleration consumes off-heap memory. We recommend that you add --conf spark.executor.memoryOverhead=4g to your Spark job to request additional resources for query acceleration.
Configure global parameters.
For more information, see Configure the global parameter of Spark.
Configure job-level parameters.
You can add Spark startup parameters when you run Spark Shell or Spark SQL jobs.
spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension,com.aliyun.emr.sql.JindoTableExtension
For more information about the configurations of jobs, see Configure a Spark Shell job or Configure a Spark SQL job.
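For example, the following command starts Spark SQL with the extension and the recommended memory overhead. This is a minimal sketch; adjust the memory value to your workload:
spark-sql \
  --conf spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension,com.aliyun.emr.sql.JindoTableExtension \
  --conf spark.executor.memoryOverhead=4g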
Check whether query acceleration is enabled.
Access the web UI of Spark History Server.
On the SQL tab of the Spark web UI, view the executed queries.
If JindoDataSourceV2Scan appears, query acceleration is enabled. Otherwise, check the configurations in Step 1.
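As an alternative to the web UI, you can run EXPLAIN on a query and check the physical plan; if acceleration is enabled, JindoDataSourceV2Scan should appear in the plan output. The table name sample_parquet_table below is hypothetical:
spark-sql> EXPLAIN SELECT COUNT(*) FROM sample_parquet_table;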
Improve the read performance of Presto
Presto provides a built-in catalog named hive-acc. You can use the hive-acc catalog to enable query acceleration.
Example:
presto --server https://emr-header-1.cluster-xxx:7778/ --catalog hive-acc --schema default
In this command, emr-header-1.cluster-xxx indicates the hostname of the emr-header-1 node.
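After you connect, queries that run through the hive-acc catalog are accelerated without further changes. The following query is a hypothetical example; sample_orc_table is a placeholder:
presto:default> SELECT COUNT(*) FROM sample_orc_table;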
Complex data types, such as MAP, STRUCT, and ARRAY, are not supported when you use this feature in Presto.
Improve the read performance of Hive
If stability is your priority when you schedule jobs, we recommend that you disable this feature in Hive.
EMR Hive V2.3.7 (EMR V3.35.0) integrates a plug-in provided by JindoTable that accelerates queries of Parquet files. To enable query acceleration, set hive.jindotable.native.enabled to true in your Hive job. Alternatively, add the custom parameter hive.jindotable.native.enabled, set it to true on the hive-site.xml tab of the Hive configuration page, and then restart Hive. The latter method is suitable for both Hive on MapReduce and Hive on Tez.
Example:
set hive.jindotable.native.enabled=true;
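If you use the hive-site.xml method instead, the custom parameter corresponds to the following entry. This is a sketch of the resulting configuration; in the EMR console, you add the parameter as a key-value pair rather than editing the file directly:
<property>
  <name>hive.jindotable.native.enabled</name>
  <value>true</value>
</property>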
Complex data types, such as MAP, STRUCT, and ARRAY, are not supported when you use this feature in Hive.
Configure the global parameter of Spark
Go to the Spark service page.
Log on to the Alibaba Cloud EMR console.
In the top navigation bar, select the region where your cluster resides and select a resource group based on your business requirements.
Click the Cluster Management tab.
On the Cluster Management page, find your cluster and click Details in the Actions column.
In the left-side navigation pane, choose Cluster Service > Spark.
On the Spark service page, click the Configure tab.
Find the spark.sql.extensions parameter and change its value to io.delta.sql.DeltaSparkSessionExtension,com.aliyun.emr.sql.JindoTableExtension.
Save the configurations.
Click Save in the upper-right corner of the Service Configuration section.
In the Confirm Changes dialog box, specify Description and click OK.
Restart ThriftServer.
In the upper-right corner, choose Actions > Restart ThriftServer.
In the Cluster Activities dialog box, specify Description and click OK.
In the Confirm message, click OK.