If the size of your business data in MaxCompute exceeds 200 GB, your queries are complex, and results must be returned within seconds, you can import your business data from MaxCompute to Hologres internal tables for queries. This query method allows you to configure indexes and delivers higher query efficiency than queries on foreign tables. This topic describes how to import data from MaxCompute to Hologres and provides answers to some frequently asked questions.
Usage notes
When you import data from MaxCompute to Hologres by executing SQL statements, take note of the following items:
Partition fields in MaxCompute tables can be mapped to partition fields or regular fields in Hologres tables. Data can be imported from a partitioned MaxCompute table to a partitioned or non-partitioned Hologres table.
Hologres supports one level of partitioning. MaxCompute supports multiple levels of partitioning. When you import data from a MaxCompute table that involves multiple levels of partitioning to a partitioned Hologres table, you need to map only the first-level partition field of the MaxCompute table to the partition field of the Hologres table. Other partition fields in the MaxCompute table are mapped to regular fields in the Hologres table.
To overwrite existing data when you import data, you must use the INSERT ON CONFLICT (UPSERT) statement, as shown in the sample statement after these notes.
For more information about the data type mappings between MaxCompute and Hologres, see Data types.
After data in the MaxCompute table is updated, the data cached in Hologres is updated within 10 minutes. We recommend that you use the IMPORT FOREIGN SCHEMA statement to update the foreign table before you import data.
To import data from MaxCompute to Hologres, we recommend that you use SQL statements instead of the Data Integration service of DataWorks. This is because data import by executing SQL statements delivers better performance.
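The following statement is a minimal sketch of the overwrite behavior that is described in the preceding notes. It assumes a destination Hologres table named holo_table that has a primary key column named id, and a foreign table named foreign_table. These names are hypothetical and are used only for illustration.
-- Overwrite rows that have the same primary key value. This requires a primary key on the destination table.
INSERT INTO holo_table (id, col1, col2)
SELECT id, col1, col2
FROM foreign_table
ON CONFLICT (id) DO UPDATE
SET col1 = EXCLUDED.col1, col2 = EXCLUDED.col2;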
Import data from a MaxCompute non-partitioned table to a Hologres table for queries
Create a non-partitioned table in MaxCompute.
Create a non-partitioned table in MaxCompute. You can also use an existing non-partitioned table in MaxCompute.
In this example, the customer table in the MaxCompute public dataset named public_data is used. You can log on to the MaxCompute console and query the dataset. For more information, see Public dataset reference. The following sample code shows the DDL statement that is used to create the customer table and the SELECT statement that is used to query data from the table:
-- The DDL statement that is used to create the customer table.
CREATE TABLE IF NOT EXISTS public_data.customer(
  c_customer_sk BIGINT,
  c_customer_id STRING,
  c_current_cdemo_sk BIGINT,
  c_current_hdemo_sk BIGINT,
  c_current_addr_sk BIGINT,
  c_first_shipto_date_sk BIGINT,
  c_first_sales_date_sk BIGINT,
  c_salutation STRING,
  c_first_name STRING,
  c_last_name STRING,
  c_preferred_cust_flag STRING,
  c_birth_day BIGINT,
  c_birth_month BIGINT,
  c_birth_year BIGINT,
  c_birth_country STRING,
  c_login STRING,
  c_email_address STRING,
  c_last_review_date STRING,
  useless STRING
);

-- The SELECT statement that is used to query the customer table.
SELECT * FROM public_data.customer;
The following figure shows part of the data in the customer table.
Create a foreign table in Hologres.
Create a foreign table in Hologres that is mapped to the source table in MaxCompute. Sample statement:
CREATE FOREIGN TABLE foreign_customer (
  "c_customer_sk" int8,
  "c_customer_id" text,
  "c_current_cdemo_sk" int8,
  "c_current_hdemo_sk" int8,
  "c_current_addr_sk" int8,
  "c_first_shipto_date_sk" int8,
  "c_first_sales_date_sk" int8,
  "c_salutation" text,
  "c_first_name" text,
  "c_last_name" text,
  "c_preferred_cust_flag" text,
  "c_birth_day" int8,
  "c_birth_month" int8,
  "c_birth_year" int8,
  "c_birth_country" text,
  "c_login" text,
  "c_email_address" text,
  "c_last_review_date" text,
  "useless" text
)
SERVER odps_server
OPTIONS (project_name 'public_data', table_name 'customer');
The following list describes the parameters in the statement:
server: The server on which the foreign table resides. You can use the odps_server server that is created at the underlying layer of Hologres. For more information, see Postgres FDW.
project_name: The name of the MaxCompute project where the source table resides.
table_name: The name of the MaxCompute table that you want to query.
The data types of the fields in the foreign table must map to those in the MaxCompute table. For more information about the data type mappings, see Data type mappings between MaxCompute and Hologres.
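If you do not want to write the foreign table DDL by hand, or if the MaxCompute table has changed after the foreign table was created, you can use the IMPORT FOREIGN SCHEMA statement that is mentioned in the usage notes to create or refresh the foreign table. The following statement is a sketch based on the names used in this example; verify the option values against the IMPORT FOREIGN SCHEMA documentation for your Hologres version.
-- Create or refresh the foreign table that maps public_data.customer.
IMPORT FOREIGN SCHEMA public_data LIMIT TO (customer)
FROM SERVER odps_server INTO public
OPTIONS (if_table_exist 'update');
Note that a foreign table created this way keeps the name of the MaxCompute table (customer) instead of foreign_customer.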
Create a table in Hologres to store imported data.
Create a table in Hologres to store the data imported from the MaxCompute source table.
The following sample statements are for reference only. When you create a table, configure the table schema based on your business requirements and create appropriate indexes to deliver better query performance. For more information about table properties, see CREATE TABLE.
-- Create a column-oriented table to store the imported data.
BEGIN;
CREATE TABLE public.holo_customer (
  "c_customer_sk" int8,
  "c_customer_id" text,
  "c_current_cdemo_sk" int8,
  "c_current_hdemo_sk" int8,
  "c_current_addr_sk" int8,
  "c_first_shipto_date_sk" int8,
  "c_first_sales_date_sk" int8,
  "c_salutation" text,
  "c_first_name" text,
  "c_last_name" text,
  "c_preferred_cust_flag" text,
  "c_birth_day" int8,
  "c_birth_month" int8,
  "c_birth_year" int8,
  "c_birth_country" text,
  "c_login" text,
  "c_email_address" text,
  "c_last_review_date" text,
  "useless" text
);
CALL SET_TABLE_PROPERTY('public.holo_customer', 'orientation', 'column');
CALL SET_TABLE_PROPERTY('public.holo_customer', 'bitmap_columns', 'c_customer_id,c_salutation,c_first_name,c_last_name,c_preferred_cust_flag,c_birth_country,c_login,c_email_address,c_last_review_date,useless');
CALL SET_TABLE_PROPERTY('public.holo_customer', 'dictionary_encoding_columns', 'c_customer_id:auto,c_salutation:auto,c_first_name:auto,c_last_name:auto,c_preferred_cust_flag:auto,c_birth_country:auto,c_login:auto,c_email_address:auto,c_last_review_date:auto,useless:auto');
CALL SET_TABLE_PROPERTY('public.holo_customer', 'time_to_live_in_seconds', '3153600000');
CALL SET_TABLE_PROPERTY('public.holo_customer', 'storage_format', 'segment');
COMMIT;
Import data to the Hologres table.
Note: Hologres V2.1.17 and later support the Serverless Computing feature. This feature is suitable for scenarios in which you want to import a large amount of data offline, run large-scale extract, transform, and load (ETL) jobs, or query a large amount of data from foreign tables. Serverless Computing runs these tasks on additional serverless computing resources, which eliminates the need to reserve extra computing resources for your instance. This improves instance stability and reduces the occurrences of out of memory (OOM) errors. You are charged only for the serverless computing resources that your tasks use. For more information about the Serverless Computing feature, see Overview of Serverless Computing. For more information about how to use the feature, see User guide on Serverless Computing.
Execute the INSERT statement to import data from the MaxCompute source table to the Hologres table. You can import some or all of the fields. If you import only some of the fields, make sure that the fields in the INSERT statement are listed in the same order as the fields in the SELECT statement. Sample statements:
-- We recommend that you use the Serverless Computing feature to import a large amount of data offline and run ETL jobs.
SET hg_computing_resource = 'serverless';

-- Import some of the fields.
INSERT INTO holo_customer (c_customer_sk, c_customer_id, c_email_address, c_last_review_date, useless)
SELECT c_customer_sk, c_customer_id, c_email_address, c_last_review_date, useless
FROM foreign_customer;

-- Import all the fields.
INSERT INTO holo_customer SELECT * FROM foreign_customer;

-- Reset the configuration. This ensures that serverless computing resources are not used for subsequent SQL statements.
RESET hg_computing_resource;
Query the data in Hologres.
Query the data that is imported from the MaxCompute source table in Hologres. Sample statement:
SELECT * FROM holo_customer;
Import data from a partitioned table in MaxCompute to a Hologres table for queries
For more information, see Import data from a partitioned MaxCompute table.
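The following statements are a minimal sketch of this scenario, based on the mapping rules in the usage notes. The table and column names (holo_orders, foreign_orders, and the partition column ds) are hypothetical; see the linked topic for the complete procedure.
-- Create a partitioned Hologres table whose partition key maps the first-level MaxCompute partition column ds.
BEGIN;
CREATE TABLE public.holo_orders (
    order_id int8,
    amount   float8,
    ds       text
) PARTITION BY LIST (ds);
COMMIT;

-- Create the child partition that corresponds to one MaxCompute partition.
CREATE TABLE public.holo_orders_20240101 PARTITION OF public.holo_orders FOR VALUES IN ('20240101');

-- Import a single MaxCompute partition from the mapped foreign table into the child partition.
INSERT INTO public.holo_orders_20240101
SELECT order_id, amount, ds FROM foreign_orders WHERE ds = '20240101';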
Best practices for the INSERT OVERWRITE statement
For more information, see INSERT OVERWRITE.
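The following statements are a sketch of one common overwrite pattern: import the full data set into a temporary table and then swap the tables in a single transaction, so that queries always see a complete data set. It reuses the holo_customer and foreign_customer tables from this topic; verify the details, including whether CREATE TABLE ... (LIKE ...) is available in your Hologres version, against the linked topic.
-- 1. Create a temporary table with the same schema as the destination table.
--    If CREATE TABLE ... (LIKE ...) is not available in your Hologres version, re-create the table with the same DDL instead.
BEGIN;
CREATE TABLE public.holo_customer_tmp (LIKE public.holo_customer INCLUDING ALL);
COMMIT;

-- 2. Import the full data set into the temporary table.
INSERT INTO public.holo_customer_tmp SELECT * FROM foreign_customer;

-- 3. Swap the tables in one transaction.
BEGIN;
DROP TABLE public.holo_customer;
ALTER TABLE public.holo_customer_tmp RENAME TO holo_customer;
COMMIT;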
Synchronize data by using a visualization tool or periodic scheduling
You can synchronize a large amount of data to Hologres at a time by using a visualization tool or periodic scheduling.
To synchronize data from MaxCompute by using HoloWeb, perform the following steps:
Log on to the HoloWeb console. For more information, see Connect to HoloWeb and perform queries.
In the top navigation bar of the HoloWeb console, choose the menu entry for importing MaxCompute data. On the page that appears, configure the parameters. The following list describes the parameters in each section.
Note: In the SQL Statements section, the SQL statements that correspond to the visualized operations are automatically generated. You cannot modify the SQL statements in the SQL Statements section. However, you can copy the SQL statements to another location, modify them, and then execute the modified statements to synchronize data.
Instance
  Instance Name: The name of the instance.
Source MaxCompute Table
  Project Name: The name of the MaxCompute project.
  Schema Name: The name of the schema in which the MaxCompute table resides. If your MaxCompute project uses the two-layer model, this parameter is not displayed by default. If your MaxCompute project uses the three-layer model, you can select an authorized schema from the drop-down list.
  Table Name: The name of the MaxCompute table. Prefix-based fuzzy search is supported.
Destination Hologres Table
  Database Name: The name of the Hologres database to which the internal table belongs.
  Schema Name: The name of the schema in Hologres. The default value is public. You can select another authorized schema.
  Table Name: The name of the Hologres internal table. The name of the source MaxCompute table is automatically specified for this parameter. You can rename the table.
  Destination Table Description: The description of the Hologres internal table. You can enter a custom description.
Parameter Settings
  GUC Parameters: The GUC parameters. For more information, see GUC parameters.
Import Task
  Fields: The fields that you want to synchronize from the MaxCompute table. You can import some or all of the fields in the MaxCompute table.
Partition Configurations
  Partition Field: If you select a partition field, Hologres automatically creates a partitioned table as the destination table. Hologres supports one level of partitioning, whereas MaxCompute supports multiple levels of partitioning. When you import data from a MaxCompute table that has multiple levels of partitioning to a partitioned Hologres table, set only the first-level partition field of the MaxCompute table as the partition field of the destination table. The other partition fields in the MaxCompute table are mapped to regular fields in the destination table.
  Data Timestamp: If the MaxCompute table is partitioned by date, you can specify a date. The system automatically imports the data of the specified date to the destination table.
Property
  Storage Mode: The storage mode of the destination table. Valid values:
    Column-oriented Storage: This mode is applicable to various complex queries.
    Row-oriented Storage: This mode is applicable to point queries and scans based on primary keys.
    Row-column Storage: This mode is applicable to all scenarios that column-oriented storage and row-oriented storage support, and also to point queries that are not based on primary keys.
    Default value: Column-oriented Storage.
  Data Lifecycle: The lifecycle of the table data. If you do not configure this parameter, the table data is permanently stored. If the data is not updated within the specified period, the system deletes the data after the period expires.
  Binlog: Specifies whether to enable binary logging. For more information, see Subscribe to Hologres binary logs.
  Lifecycle of Binary Logs: The time to live (TTL) of binary logs. Unit: days. If you leave this parameter empty, the TTL is 30 days (2,592,000 seconds) by default.
  Distribution Column: Specifies whether a field is used as the distribution key. Hologres shuffles data to shards based on this field. Data entries that have the same distribution key value are distributed to the same shard. If you use the distribution key as a filter condition, query efficiency can be improved.
  Event Time Column: Specifies whether a field is used as the segment key. Hologres segments data based on this field. If the field is used in query conditions, Hologres can quickly locate data based on the field.
  Clustering Key: Specifies whether a field is used as the clustering key. Data in the table is sorted based on the clustering key. Hologres can use the clustering key to accelerate range queries and filter-based queries on the fields that constitute the clustering key.
  Dictionary Encoding Columns: Specifies whether the values of a field are used to create a dictionary mapping. Dictionary encoding converts string comparisons into numeric comparisons to accelerate queries such as those that contain GROUP BY and FILTER clauses. By default, all fields of the TEXT type are specified for this parameter.
  Bitmap Column: Specifies whether bitmap indexes are built on a field. Bitmap indexes help filter the data that meets query conditions. By default, all fields of the TEXT type are specified for this parameter.
Click Submit. After data is imported, you can query internal table data in Hologres.
HoloWeb does not support synchronization by using periodic scheduling. To synchronize a large amount of historical data or import data by using periodic scheduling, you must use the DataStudio service of DataWorks. For more information, see Use DataWorks to periodically import MaxCompute data.
FAQ
An out-of-memory (OOM) issue occurs when data is imported from MaxCompute to Hologres, and the returned error message indicates that the memory limit is exceeded. In most cases, the "Query executor exceeded total memory limitation xxxxx: yyyy bytes used" error message is reported. The following sections describe the possible causes of this error and the corresponding solutions.
Cause 1
Possible cause: The INSERT statement contains a SELECT statement, but the ANALYZE statement is not executed on some of the involved tables. Alternatively, the ANALYZE statement is executed, but the data is updated afterward, which makes the collected statistics inaccurate. As a result, the query optimizer chooses an invalid join order, and excessive memory overheads are generated.
Solution: Execute the ANALYZE statement on all involved internal and foreign tables to update their statistical metadata. This helps the query optimizer generate optimal execution plans and resolves the OOM issue.
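For example, before you import data into the holo_customer table that is created in this topic, you can update the statistics of both the destination internal table and the source foreign table:
-- Update the statistics of the destination internal table and the source foreign table.
ANALYZE holo_customer;
ANALYZE foreign_customer;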
Cause 2
Possible cause: The table from which data is imported contains a large number of columns, and each row contains a large amount of data. As a result, a large amount of data is read at a time, which causes excessive memory overheads.
Solution: Run the following SET command before the INSERT statement to reduce the number of rows that are read at a time. This reduces OOM issues.
-- Default value: 8192.
SET hg_experimental_query_batch_size = 1024;
INSERT INTO holo_table SELECT * FROM mc_table;
Cause 3
Possible cause: The concurrency of the data import is high, which consumes a large amount of CPU resources and affects queries on internal tables.
Solution: In Hologres versions earlier than V1.1, the import concurrency is specified by the hg_experimental_foreign_table_executor_max_dop parameter. The default value is the number of CPU cores of the instance. Specify a smaller value for this parameter to reduce memory usage during the import and resolve the OOM issue. This parameter takes effect on all jobs that are executed on the foreign table. Sample statements:
SET hg_experimental_foreign_table_executor_max_dop = 8;
INSERT INTO holo_table SELECT * FROM mc_table;
Cause 4
Possible cause: The concurrency of the data import is high, which consumes a large amount of CPU resources and affects queries on internal tables.
Solution: In Hologres V1.1 and later, the import concurrency is specified by the hg_foreign_table_executor_dml_max_dop parameter. The default value is 32. Specify a smaller value for this parameter to decrease the concurrency of DML statements in data import and export scenarios and prevent DML statements from occupying excessive resources. Sample statements:
SET hg_foreign_table_executor_dml_max_dop = 8;
INSERT INTO holo_table SELECT * FROM mc_table;