為了在您使用MySQL處理龐巨量資料集時最佳化寫入查詢效能,本文詳細介紹了DataWorksData Integration將資料寫入MySQL分區表的最佳實務。
前提條件
已具備MySQL資料庫,並在DataWorks中添加為資料來源。具體操作,請參見MySQL資料來源。
準備工作
建立源表,用於DataWorks讀取來源資料。
CREATE TABLE `user_tb1` ( `id` bigint(20) unsigned NOT NULL AUTO_INCREMENT COMMENT 'id', `name` varchar(100) COLLATE utf8mb4_unicode_ci DEFAULT NULL COMMENT '名字', `update_time` datetime DEFAULT NULL, PRIMARY KEY (`id`) ); INSERT INTO user_tb1 (name, update_time) VALUES ('u1', '2023-12-28 09:52:05'); INSERT INTO user_tb1 (name, update_time) VALUES ('u2', '2023-12-25 09:52:05'); INSERT INTO user_tb1 (name, update_time) VALUES ('u3', '2023-12-26 09:52:05'); INSERT INTO user_tb1 (name, update_time) VALUES ('u3', '2023-12-27 09:52:05'); INSERT INTO user_tb1 (name, update_time) VALUES ('u4', '2023-12-28 09:52:05'); INSERT INTO user_tb1 (name, update_time) VALUES ('u5', '2023-12-28 09:52:05'); INSERT INTO user_tb1 (name, update_time) VALUES ('u6', '2023-12-28 09:52:05'); INSERT INTO user_tb1 (name, update_time) VALUES ('u7', '2023-12-28 09:52:05');建立目標表,用於將資料寫入分區表。
本文以RANGE分區表和KEY分區表為例,需要分別建立如下表:
RANGE分區表
CREATE TABLE `pt_write_test_tb1` ( `id` bigint(20) unsigned NOT NULL AUTO_INCREMENT COMMENT 'id', `name` varchar(100) COMMENT '名字', `update_time` datetime COMMENT '更新時間', PRIMARY KEY (`id`,`update_time`), KEY `idx_update_time` (`update_time`) ) PARTITION BY RANGE COLUMNS(update_time) ( PARTITION pt20231216 VALUES LESS THAN ('20231216'), PARTITION pt20231217 VALUES LESS THAN ('20231217'), PARTITION pt20231218 VALUES LESS THAN ('20231218'), PARTITION pt20231219 VALUES LESS THAN ('20231219'), PARTITION pt20231220 VALUES LESS THAN ('20231220'), PARTITION pt20231221 VALUES LESS THAN ('20231221'), PARTITION pt20231222 VALUES LESS THAN ('20231222'), PARTITION pt20231223 VALUES LESS THAN ('20231223'), PARTITION pt20231224 VALUES LESS THAN ('20231224'), PARTITION pt20231225 VALUES LESS THAN ('20231225'), PARTITION pt20231226 VALUES LESS THAN ('20231226'), PARTITION pt20231227 VALUES LESS THAN ('20231227'), PARTITION pt20231228 VALUES LESS THAN ('20231228'), PARTITION pt20231229 VALUES LESS THAN ('20231229') );KEY分區表
CREATE TABLE `pt_write_test_tb2` ( `id` bigint(20) unsigned NOT NULL AUTO_INCREMENT COMMENT 'id', `name` varchar(100) COMMENT '名字', `update_time` datetime COMMENT '更新時間', PRIMARY KEY (`id`) ) PARTITION BY KEY(`id`) PARTITIONS 16;
分區表寫入
RANGE分區表寫入
本樣本通過日期update_time欄位作為RANGE分區條件,示範如何把資料寫入MySQL分區表。
配置離線同步任務。
配置要點:
資料來源的表配置為已建立的源表
user_tb1。資料去向的表配置為已建立的目標RANGE分區表
pt_write_test_tb1。
說明您也可以通過資料去向的匯入前準備語句配置項,讓同步任務在寫入資料之前,先建立當天的分區:
準備一個調度參數,例如
pt_date=$[yyyymmdd]。在匯入前準備語句中配置如下SQL:
ALTER TABLE `pt_write_test_tb1` ADD PARTITION (PARTITION pt${pt_date} VALUES LESS THAN ('${pt_date}'));
查詢目標表中被寫入分區表的資料。
查詢整表資料:
SELECT * FROM pt_write_test_tb1;返回樣本:
+----+------+---------------------+ | id | name | update_time | +----+------+---------------------+ | 2 | u2 | 2023-12-25 09:52:05 | | 3 | u3 | 2023-12-26 09:52:05 | | 4 | u3 | 2023-12-27 09:52:05 | | 1 | u1 | 2023-12-28 09:52:05 | | 5 | u4 | 2023-12-28 09:52:05 | | 6 | u5 | 2023-12-28 09:52:05 | | 7 | u6 | 2023-12-28 09:52:05 | | 8 | u7 | 2023-12-28 09:52:05 | +----+------+---------------------+ 8 rows in set (0.01 sec)查詢單分區:
SELECT * FROM pt_write_test_tb1 partition (pt20231229);返回樣本:
+----+------+---------------------+ | id | name | update_time | +----+------+---------------------+ | 1 | u1 | 2023-12-28 09:52:05 | | 5 | u4 | 2023-12-28 09:52:05 | | 6 | u5 | 2023-12-28 09:52:05 | | 7 | u6 | 2023-12-28 09:52:05 | | 8 | u7 | 2023-12-28 09:52:05 | +----+------+---------------------+ 5 rows in set (0.00 sec)
KEY分區表寫入
本樣本通過主鍵id欄位作為KEY分區條件,示範如何把資料寫入MySQL分區表。
配置離線同步任務。
配置要點:
資料來源的表配置為已建立的源表
user_tb1。資料去向的表配置為已建立的目標KEY分區表
pt_write_test_tb2。

查詢目標表中被寫入分區表的資料。
查詢整表資料:
select * from pt_write_test_tb2 order by id;返回樣本:
+----+------+---------------------+ | id | name | update_time | +----+------+---------------------+ | 1 | u1 | 2023-12-28 09:52:05 | | 2 | u2 | 2023-12-25 09:52:05 | | 3 | u3 | 2023-12-26 09:52:05 | | 4 | u3 | 2023-12-27 09:52:05 | | 5 | u4 | 2023-12-28 09:52:05 | | 6 | u5 | 2023-12-28 09:52:05 | | 7 | u6 | 2023-12-28 09:52:05 | | 8 | u7 | 2023-12-28 09:52:05 | +----+------+---------------------+ 8 rows in set (0.00 sec)查看各個分區的資料:
SELECT TABLE_NAME , PARTITION_NAME , TABLE_ROWS, UPDATE_TIME FROM information_schema.partitions WHERE TABLE_NAME = 'pt_write_test_tb2';返回樣本:
+-------------------+----------------+------------+---------------------+ | TABLE_NAME | PARTITION_NAME | TABLE_ROWS | UPDATE_TIME | +-------------------+----------------+------------+---------------------+ | pt_write_test_tb2 | p0 | 0 | NULL | | pt_write_test_tb2 | p1 | 0 | NULL | | pt_write_test_tb2 | p10 | 0 | NULL | | pt_write_test_tb2 | p11 | 1 | 2024-02-29 14:49:14 | | pt_write_test_tb2 | p12 | 0 | NULL | | pt_write_test_tb2 | p13 | 0 | NULL | | pt_write_test_tb2 | p14 | 1 | 2024-02-29 14:49:15 | | pt_write_test_tb2 | p15 | 1 | 2024-02-29 14:49:16 | | pt_write_test_tb2 | p2 | 1 | 2024-02-29 14:49:17 | | pt_write_test_tb2 | p3 | 0 | NULL | | pt_write_test_tb2 | p4 | 1 | 2024-02-29 14:49:14 | | pt_write_test_tb2 | p5 | 1 | 2024-02-29 14:49:15 | | pt_write_test_tb2 | p6 | 0 | NULL | | pt_write_test_tb2 | p7 | 0 | NULL | | pt_write_test_tb2 | p8 | 1 | 2024-02-29 14:49:16 | | pt_write_test_tb2 | p9 | 1 | 2024-02-29 14:49:17 | +-------------------+----------------+------------+---------------------+ 16 rows in set (0.01 sec)