Apache Hudi is a table format that works on top of Object Storage Service (OSS) and supports the UPDATE, DELETE, and INSERT operations. AnalyticDB for MySQL is integrated with Hudi. This topic describes how to use Spark SQL to read and write Hudi external tables.
Prerequisites
An AnalyticDB for MySQL Data Lakehouse Edition cluster is created.
Note: The cluster must have more than 0 AnalyticDB compute units (ACUs) of reserved storage resources.
A job resource group is created for the AnalyticDB for MySQL Data Lakehouse Edition cluster. For more information, see Create a resource group.
A database account is created for the AnalyticDB for MySQL Data Lakehouse Edition cluster.
Step 1: Go to the SQL Development page
Log on to the AnalyticDB for MySQL console. In the upper-left corner of the console, select a region. In the left-side navigation pane, click Clusters. On the Data Lakehouse Edition tab, find the cluster that you want to manage and click the cluster ID.
In the left-side navigation pane, choose Job Development > SQL Development.
On the SQLConsole tab, select the Spark engine and the job resource group.
Step 2: Create a database and a Hudi external table
You can execute the following SQL statements in batch or interactive mode. For more information, see Spark SQL execution modes.
Execute the following statement to create a database. If a database already exists, you can skip this step.
CREATE DATABASE adb_external_db_hudi location 'oss://<bucket_name>/test/'; /* The location that is used to store the database. Replace the parameter value with your OSS path. */
Execute the following statement to create a Hudi external table:
CREATE TABLE adb_external_db_hudi.test_hudi_tbl (
  `id` int,
  `name` string,
  `age` int
) USING hudi
TBLPROPERTIES (primaryKey = 'id', preCombineField = 'age')
PARTITIONED BY (age)
LOCATION 'oss://<bucket_name>/test/table/'; /* The location that is used to store the external table. Replace the parameter value with your OSS path. */
Important: The buckets that are used to store the database and the external table must be the same.
The OSS path of the external table must be within the database path and contain at least one more directory level than the database path. For example, if the database location is oss://<bucket_name>/test/, the table location oss://<bucket_name>/test/table/ is valid.
When you create the external table, you must use the primaryKey parameter to specify a primary key. The preCombineField parameter is optional. If you do not specify this parameter and you perform an UPDATE operation, an error message is returned.
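For example, the following statement creates a variant of the table above without the preCombineField property. The table name and OSS path are hypothetical and used only for illustration. UPDATE operations on such a table return an error:
CREATE TABLE adb_external_db_hudi.test_hudi_tbl_no_precombine (
  `id` int,
  `name` string,
  `age` int
) USING hudi
TBLPROPERTIES (primaryKey = 'id')
PARTITIONED BY (age)
LOCATION 'oss://<bucket_name>/test/table_no_precombine/'; /* Hypothetical path for illustration. Replace the parameter value with your OSS path. */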
Step 3: Write data to the Hudi external table
You can execute the following SQL statements in batch or interactive mode. For more information, see Spark SQL execution modes.
INSERT
Use one of the following methods to write data to the Hudi external table:
Method 1: Execute the INSERT INTO statement
INSERT INTO adb_external_db_hudi.test_hudi_tbl values(1, 'lisa', 10),(2, 'jams', 10);
Method 2: Execute the INSERT OVERWRITE statement
INSERT OVERWRITE adb_external_db_hudi.test_hudi_tbl values (1, 'lisa', 10), (2, 'jams', 20);
Method 3: Execute the INSERT OVERWRITE statement to write data to a static partition
INSERT OVERWRITE adb_external_db_hudi.test_hudi_tbl partition(age=10) values(1, 'anna');
Method 4: Execute the INSERT OVERWRITE statement to write data to a dynamic partition
INSERT OVERWRITE adb_external_db_hudi.test_hudi_tbl partition (age) values (1, 'bom', 10);
UPDATE
Execute the following statement to update data. In this example, the value of the name column in the row whose id value is 2 is updated to box.
UPDATE adb_external_db_hudi.test_hudi_tbl SET name = 'box' where id = 2;
DELETE
Execute the following statement to delete data. In this example, the row whose id value is 1 is deleted.
DELETE FROM adb_external_db_hudi.test_hudi_tbl where id = 1;
Concurrency control
Hudi external tables use a lock provider-based concurrency control mechanism to prevent conflicts when DML operations are executed concurrently. Multiple concurrent writes are allowed as long as they modify different data ranges. To prevent write conflicts and ensure data correctness and consistency, make sure that concurrent writes do not modify overlapping entries. To enable concurrency control, you must configure the following parameters. For more information about the concurrency control mechanism of Hudi, see Concurrency Control.
If you use an open source Hudi JAR package, you cannot use the MdsBasedLockProvider class to implement concurrency control.
set hoodie.cleaner.policy.failed.writes=LAZY;
set hoodie.write.concurrency.mode=OPTIMISTIC_CONCURRENCY_CONTROL;
set hoodie.write.lock.provider=org.apache.hudi.sync.adb.MdsBasedLockProvider;
The following table describes the parameters.
Parameter | Value | Required | Description
hoodie.cleaner.policy.failed.writes | LAZY | Yes | The cleanup policy for dirty data that is generated by failed writes. A value of LAZY specifies that incomplete writes are not cleaned up immediately, but in batches after their heartbeats expire. This setting is suitable for scenarios that involve multiple concurrent writes.
hoodie.write.concurrency.mode | OPTIMISTIC_CONCURRENCY_CONTROL | Yes | The concurrency mode of write operations. A value of OPTIMISTIC_CONCURRENCY_CONTROL specifies that when multiple writes are performed on a Hudi external table, the system checks for conflicts before each write is committed. If a conflict is detected, the write fails.
hoodie.write.lock.provider | org.apache.hudi.sync.adb.MdsBasedLockProvider | Yes | The name of the lock provider class. You can specify a lock provider based on your business requirements. The lock provider class must be a subclass of org.apache.hudi.common.lock.LockProvider.
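For example, the following session enables optimistic concurrency control before writing to the example table from Step 2. This is a minimal sketch: the inserted values are illustrative, and each concurrent writer must set the same parameters before it writes.
set hoodie.cleaner.policy.failed.writes=LAZY;
set hoodie.write.concurrency.mode=OPTIMISTIC_CONCURRENCY_CONTROL;
set hoodie.write.lock.provider=org.apache.hudi.sync.adb.MdsBasedLockProvider;
INSERT INTO adb_external_db_hudi.test_hudi_tbl values (3, 'tom', 20); /* The values are illustrative. */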
Step 4: Query data
You can execute the following SQL statement in batch or interactive mode. For more information, see Spark SQL execution modes.
When you execute a Spark SQL statement, the system returns a message indicating whether the execution succeeded, but does not return the query results. To view the results, go to the Spark JAR Development page, and on the Applications tab, click Logs in the Actions column of your application. For more information, see the "View information about a Spark application" section of the Spark editor topic.
Execute the following statement to query the data of the Hudi external table:
SELECT * FROM adb_external_db_hudi.test_hudi_tbl;
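Because the example table is partitioned by the age column, you can also filter on the partition column so that only the matching partitions are read. For example:
SELECT * FROM adb_external_db_hudi.test_hudi_tbl WHERE age = 10;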