Both MaxCompute and HybridDB use a cluster architecture comprising multiple data nodes. To migrate data from MaxCompute to HybridDB efficiently and with high throughput, the data nodes themselves must actively push the data. This is possible because both MaxCompute and HybridDB can read and write data from and to Alibaba Cloud Object Storage Service (OSS), which makes OSS a natural intermediate store for the migration.
You also need a common data format for exchanging data on OSS. MaxCompute supports writing data in text format (TEXT/CSV), and HybridDB supports reading data in text format, so text is the exchange format used here.
Based on my experience and research, I will share some best practices for a successful data migration from MaxCompute to HybridDB.
Begin by creating an external table with the same structure as the MaxCompute data table. This table opens the data channel between MaxCompute and OSS. Use the following statement to create the table.
CREATE external TABLE `demo_oss_ext` (
id string COMMENT 'id',
data1 string COMMENT 'data1',
data2 string COMMENT 'data2'
)
partitioned by (ds string)
STORED BY 'com.alibabacloud.maxcompute.TextStorageHandler'
WITH SERDEPROPERTIES ('maxcompute.text.option.delimiter'='\t')
LOCATION 'oss://id:key@endpoint/bucketname/oss_dir/';
Key Parameters:
1. com.alibabacloud.maxcompute.TextStorageHandler defines the data format used to store data in OSS.
• Text Storage Handler is developed in Java and is the default option for data delivery.
• The default Text Storage Handler does not support the complete TEXT/CSV protocol. If you need the complete TEXT/CSV protocol, you can use an open-source Java CSV implementation.
2. Text Storage Handler supports two custom parameters:
• maxcompute.text.option.delimiter specifies the column delimiter.
• maxcompute.text.option.use.quote defines the quote characters.
• NULL column values are written as \N by default, and you cannot change this.
• Text Storage Handler does not allow escaping special characters. You can escape special characters only through a custom handler.
3. LOCATION specifies the OSS account and location to which you deliver the data. It includes the ID, key, endpoint, bucket, and directory.
Use the following SQL statement to migrate data from MaxCompute to OSS.
insert into demo_oss_ext select * from t_data;
Note:
• This operation runs in parallel, and the default size of each concurrent unit is 256 MB. To increase concurrency, set a smaller value with set maxcompute.sql.mapper.split.size=xxx;
• OSS flow control affects the data transfer from MaxCompute to OSS. The OSS network bandwidth for a single concurrent unit is limited to 100 MB/s.
• To increase the bandwidth further, contact the OSS administrator to lift the restriction.
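As a rough illustration of the first note, the split size can be lowered in the same session just before the export. This is only a sketch: the value 64 is an arbitrary example, and it assumes the setting is expressed in MB, as the 256 MB default suggests.

```sql
-- Assumption: the value is in MB; 64 is an illustrative choice, not a recommendation.
-- A smaller split size means more concurrent units for the same input.
set maxcompute.sql.mapper.split.size=64;

-- Same export statement as above, now run with the smaller split size.
insert into demo_oss_ext select * from t_data;
```

Because each concurrent unit is subject to the OSS bandwidth limit, more units only help until the overall OSS flow control becomes the bottleneck.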
HybridDB external table: oss_ext
Use the following statement to create the HybridDB external table.
CREATE READABLE EXTERNAL TABLE user_data_oss_ext (
id bigint,
data1 text,
data2 text
)
location('oss://endpoint
dir=data_oss_dir
id=ossid
key=osskey
bucket=bucketname')
FORMAT 'TEXT' (DELIMITER '\t' )
LOG ERRORS INTO error_track_table SEGMENT REJECT LIMIT 10;
Key Parameters:
1. LOCATION specifies all OSS-related parameters.
2. The file format must match that of the MaxCompute external table: FORMAT 'TEXT' (DELIMITER '\t').
3. Configure the import to skip bad rows:
• During the migration of heterogeneous data, some rows may fail validation, for example because of special characters or invalid encoding.
• LOG ERRORS INTO error_track_table writes the rows that raise errors into a table.
• SEGMENT REJECT LIMIT X sets the number of errors allowed on a single segment, or the allowed percentage of errors.
4. HybridDB also imports in parallel; the degree of parallelism equals the number of computing nodes.
5. Importing text/csv data in gzip format may improve performance by more than 100%, provided that MaxCompute supports exporting compressed data.
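Once an import through this external table has run, the rejected rows can be inspected in the error table. The query below is a sketch that assumes HybridDB uses the standard Greenplum error-table layout (columns such as cmdtime, relname, filename, linenum, errmsg, and rawdata); check the actual columns of error_track_table in your instance before relying on it.

```sql
-- Assumption: error_track_table has the Greenplum error-table columns.
-- Shows the most recent rejected rows and why they were rejected.
SELECT cmdtime, relname, filename, linenum, errmsg, rawdata
FROM error_track_table
ORDER BY cmdtime DESC
LIMIT 20;
```

If many rows share the same errmsg, the cause is usually a format mismatch, such as the delimiter or NULL marker, rather than individual bad records.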
The next step is to create a column-oriented, compressed local table in HybridDB. Use the following statement.
CREATE TABLE t_ao (
id bigint,
data1 text,
data2 text
)
with (
APPENDONLY=true, COMPRESSTYPE=zlib,
COMPRESSLEVEL=5, BLOCKSIZE=2097152,
ORIENTATION=COLUMN, CHECKSUM=true,
OIDS=false)
DISTRIBUTED BY (id);
Key Parameters:
1. If you do not need to modify the data on a large scale after importing it into HybridDB, I suggest using an append-only table that organizes data by column and compresses it.
• Use the following parameter settings: APPENDONLY=true COMPRESSTYPE=zlib COMPRESSLEVEL=5 ORIENTATION=COLUMN BLOCKSIZE=2097152
• HybridDB supports column compression, which offers a much better compression ratio than row compression. With COMPRESSLEVEL=5, the compressed data can easily shrink to 20% of the original size.
2. Use DISTRIBUTED BY (column) to distribute data evenly across the HybridDB computing nodes. Even data distribution is the key criterion when selecting a distribution column.
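One way to check whether a distribution column spreads data evenly is to count rows per segment after loading. The sketch below assumes HybridDB exposes the Greenplum system column gp_segment_id and that the local table is named t_ao, as in the import statement in this article.

```sql
-- Assumption: gp_segment_id is available, as in Greenplum.
-- Row counts per segment should be roughly equal for a good distribution column.
SELECT gp_segment_id, count(*) AS row_count
FROM t_ao
GROUP BY gp_segment_id
ORDER BY gp_segment_id;
```

A segment with far more rows than the others suggests a skewed distribution column, and a column with more distinct, evenly spread values may be a better choice.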
Use the following SQL statement to import data from OSS to HybridDB:
insert into t_ao select * from user_data_oss_ext;
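After the import, a simple sanity check is to compare row counts on both sides. This is a minimal sketch using the table names from this article; the two statements run on different systems.

```sql
-- Run on MaxCompute: number of rows in the source table.
select count(*) from t_data;

-- Run on HybridDB: number of rows that arrived in the local table.
select count(*) from t_ao;
```

If the counts differ, the rows rejected under SEGMENT REJECT LIMIT are the first place to look.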
Both HybridDB and PostgreSQL support reading and writing data from and to OSS
Similar to Amazon S3, OSS is a low-cost storage service that opens communication channels between all cloud products. It is also Alibaba Cloud's recommended cloud data channel.
Both PostgreSQL and HybridDB on the cloud currently support reading and writing OSS data sources.
• PostgreSQL + OSS: read and write an external data source through oss_fdw.
• HybridDB for PostgreSQL + OSS: import and export data in parallel through oss_ext.
I hope that the steps in this article help you successfully migrate data from MaxCompute to HybridDB. To learn more about data migration, see the following references:
1. PostgreSQL + OSS oss_fdw
2. HybridDB for PostgreSQL + OSS oss_ext
3. SLS supports delivery of CSV data to OSS
4. Formatting open-source Java data
5. Export data from MaxCompute to OSS
6. How to access OSS from MaxCompute