You can migrate data from an Amazon Redshift cluster to an AnalyticDB for PostgreSQL instance by exporting the data to Amazon S3, synchronizing the data to Alibaba Cloud Object Storage Service (OSS) by using Data Online Migration or the ossimport tool, and then importing the data from OSS into the instance.
Prerequisites
An Amazon Redshift cluster is created.
The Amazon S3 service that is used to export Amazon Redshift data is activated.
Alibaba Cloud Object Storage Service (OSS) is activated. For more information, see What is OSS?
An OSS bucket is created. For more information, see Create a bucket.
Note: To facilitate subsequent data import, we recommend that you create the OSS bucket in the same region as the AnalyticDB for PostgreSQL instance.
An AnalyticDB for PostgreSQL instance is created. For information about how to select instance specifications, see the "Specification selection" section of this topic.
Specification selection
This section describes how to select specifications for an AnalyticDB for PostgreSQL instance based on the specifications of the source Amazon Redshift cluster.
An Amazon Redshift cluster consists of a leader node and multiple compute nodes.
Leader node: corresponds to the coordinator node of an AnalyticDB for PostgreSQL instance. The leader node communicates with clients, parses queries, develops query execution plans, and performs database operations.
Compute node: corresponds to a group of compute nodes of an AnalyticDB for PostgreSQL instance. Each compute node in an Amazon Redshift cluster is partitioned into node slices. Each node slice corresponds to one compute node of an AnalyticDB for PostgreSQL instance and handles data storage and query computing.
When you create an AnalyticDB for PostgreSQL instance, you can select the node specifications of the instance based on specifications of each node slice in the source Amazon Redshift cluster.
Example
An Amazon Redshift cluster consists of four compute nodes. Each compute node consists of two node slices. The specifications of each node slice are 2 cores, 16 GB memory, and 1 TB storage.
When you create an AnalyticDB for PostgreSQL instance, you can set the number of compute nodes to 8 (4 compute nodes × 2 node slices per node), the node specifications to 2 cores and 16 GB memory, and the single-node storage capacity to 1,000 GB.
For information about how to create an AnalyticDB for PostgreSQL instance, see Create an instance.
We recommend that you set the storage type to Enhanced SSD (ESSD). ESSDs provide better I/O performance than ultra disks.
Procedure
Step 1: Export data from an Amazon Redshift cluster to an Amazon S3 bucket
You can execute the UNLOAD statement to export data. For more information, see UNLOAD.
Syntax:
UNLOAD ('select-statement')
TO 's3://object-path/name-prefix'
authorization
[ option [ ... ] ]
where option is
{ [ FORMAT [ AS ] ] CSV | PARQUET
| PARTITION BY ( column_name [, ... ] ) [ INCLUDE ]
| MANIFEST [ VERBOSE ]
| HEADER
| DELIMITER [ AS ] 'delimiter-char'
| FIXEDWIDTH [ AS ] 'fixedwidth-spec'
| ENCRYPTED [ AUTO ]
| BZIP2
| GZIP
| ZSTD
| ADDQUOTES
| NULL [ AS ] 'null-string'
| ESCAPE
| ALLOWOVERWRITE
| CLEANPATH
| PARALLEL [ { ON | TRUE } | { OFF | FALSE } ]
| MAXFILESIZE [AS] max-size [ MB | GB ]
| REGION [AS] 'Amazon-region' }
We recommend that you use the FORMAT AS PARQUET or CSV parameter to specify the format of the exported data.
To allow more files to be generated and improve data export efficiency, we recommend that you specify the PARALLEL ON parameter to export data in parallel.
To control the size of the generated files, we recommend that you specify the MAXFILESIZE parameter. We also recommend that the number of generated files be an integer multiple of the number of compute nodes of the AnalyticDB for PostgreSQL instance. This improves efficiency when the data is subsequently imported in parallel from OSS foreign tables into the AnalyticDB for PostgreSQL instance.
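For reference, the following UNLOAD statement is a minimal sketch of such an export. The table schema1.table1, the S3 bucket examplebucket, the object prefix, the IAM role ARN, and the MAXFILESIZE value are placeholders that you must replace with your own values.
UNLOAD ('SELECT * FROM schema1.table1')
TO 's3://examplebucket/unload/table1_'
IAM_ROLE 'arn:aws:iam::<account-id>:role/<unload-role>'
FORMAT AS CSV
PARALLEL ON
MAXFILESIZE 256 MB
ALLOWOVERWRITE;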
Step 2: Synchronize data from Amazon S3 to OSS
You can use Alibaba Cloud Data Online Migration or the ossimport tool to synchronize data from Amazon S3 to OSS.
Data Online Migration
Log on to the Data Transport console.
Create the source and destination data addresses for data synchronization.
In the left-side navigation pane, go to the data address management page. Then, click Create Data Address to create a source data address.
In the Create Data Address panel, configure the parameters that are described in the following table and click OK.
Parameter
Required
Description
Data Type
Yes
Select AWS S3.
Data Name
Yes
Enter a name. The name must be 3 to 63 characters in length and cannot contain special characters, except for hyphens (-) and underscores (_).
Endpoint
Yes
Enter the region where the source data address is located. For more information, see Regions and endpoints.
Bucket
Yes
The name of the AWS Simple Storage Service (AWS S3) bucket. The bucket name cannot contain invalid characters, such as spaces, line breaks, and tab characters.
Prefix
Yes
Migrate All Data: All data in the bucket is migrated. You do not need to specify a prefix.
Migrate Partial Data: Only the files in the directory that is specified by the prefix are migrated. The prefix must end with a forward slash (/). Example: data/to/oss/.
Access Key Id and Secret Access Key
Yes
The AccessKey pair that is used to migrate data. Delete the AccessKey pair after the migration job is complete.
Click Create Data Address again to create a destination data address.
In the Create Data Address panel, configure the parameters that are described in the following table and click OK.
Parameter
Required
Description
Data Type
Yes
Select OSS.
Data Name
Yes
Enter a name. The name must be 3 to 63 characters in length and cannot contain special characters, except for hyphens (-) and underscores (_).
Data Region
Yes
Select the region where the destination data address is located.
Enable Transfer Acceleration
No
If you want Data Online Migration to use the transfer acceleration feature provided by Object Storage Service (OSS), you must enable the feature for the bucket. The transfer acceleration feature takes effect within 30 minutes after it is enabled. After you enable the feature, wait for 30 minutes before you create a migration job.
Important: If you use the transfer acceleration feature to access your OSS buckets, you are charged additional fees for transfer acceleration. For more information, see Transfer acceleration.
OSS Endpoint
Yes
Select an endpoint based on the region where your data is stored. For more information, see Regions and endpoints.
http://oss-cn-endpoint.aliyuncs.com: uses an HTTP-based endpoint to access OSS over the Internet.
https://oss-cn-endpoint.aliyuncs.com: uses an HTTPS-based endpoint to access OSS over the Internet.
Note: If you want to migrate data from third-party cloud services to OSS, you must access OSS from a public endpoint over the Internet.
Access Key Id and Access Key Secret
Yes
The AccessKey pair that is used to migrate data. For more information, see Prepare for a migration job.
OSS Bucket
Yes
Select the bucket that stores the migrated data.
OSS Prefix
No
An OSS prefix cannot start with a forward slash (/) but must end with a forward slash (/). Example: data/to/oss/. If you want to migrate data to the root directory of the specified bucket, do not specify this parameter.
Important: If the name of a source file starts with a forward slash (/), you must specify an OSS prefix when you configure the destination data address. Otherwise, the migration job fails. For example, if the name of a file to be migrated is /test/test.png and you specify the OSS prefix oss/, the file name changes from /test/test.png to oss//test/test.png after the migration job is complete.
Create an online migration job.
In the left-side navigation pane, go to the migration job management page. Then, click Create Job.
In the Create Job panel, read the terms of the migration service and select I agree to the above terms and activate the Data Transport service. Then, click Next.
In the Fee Reminder message, click Yes, Go Ahead.
In the Create Job panel, configure the parameters that are described in the following table and click Next.
Parameter
Required
Description
Job Name
Yes
The name of the migration job. The name must be 3 to 63 characters in length and can contain only lowercase letters, digits, and hyphens (-). It cannot start or end with a hyphen (-).
Source Data Address
Yes
The source data address that you created.
Destination Data Address
Yes
The destination data address that you created.
Specified Directory
No
The directories in which the files and subdirectories are migrated or not migrated. Valid values:
Do not filter: migrates all data from the source data address.
Exclude: does not migrate the files or subdirectories in the specified directories.
For example, if you want to migrate all directories in the root_dir/ directory except root_dir/bad_sub1/ and root_dir/bad_sub2/, select this method and specify bad_sub1/ and bad_sub2/.
Contain: migrates only the files and subdirectories in the specified directories.
For example, if you want to migrate only root_dir/good_sub1/ and root_dir/good_sub2/ in the root_dir/ directory, select this method and specify good_sub1/ and good_sub2/.
Note: A directory name cannot start with a forward slash (/) or a backslash (\), and cannot contain double forward slashes (//), double periods (..), or double quotation marks ("). The character string that consists of all the specified directory names cannot exceed 10 KB in length.
A directory name must end with a forward slash (/). Example: docs/.
You can specify up to 20 directories that you want to include or exclude.
Migration Type
Yes
The data migration mode. Valid values:
Full: The system migrates all data from the source data address to the destination data address. You must configure the Start Time Point of File parameter. The system migrates all data whose last modification time is later than the specified time. After all data is migrated, the migration job ends.
If you change the data at the source data address after the migration job is complete, you can submit another job to migrate all data. In this case, the system migrates only the changed data.
Incremental: The system runs a migration job based on the values that you specified for the Migration Interval and Migration Times parameters.
You must configure the Start Time Point of File parameter. During the first migration, the system migrates all data whose last modification time is later than the specified time. After the first migration is complete, incremental migration is performed based on the value that you specified for the Migration Interval parameter. For each incremental migration, the system migrates only the data that is created or modified after the previous migration starts and before the current migration starts.
If you set the Migration Times parameter to N, full migration is performed once and then incremental migration is performed (N - 1) times.
For example, the current time is 08:00, March 10, 2019 and you set the Migration Interval parameter to 1, the Migration Times parameter to 5, and the Start Time Point of File parameter to 08:00, March 5, 2019. When the first migration starts, the system migrates all files whose last modification time is between 08:00, March 5, 2019 and 08:00, March 10, 2019. In this example, the first migration requires one hour to complete and the second migration starts at 10:00, March 10, 2019. The duration from 08:00 to 10:00 consists of one hour for the first migration process and one hour for the migration interval. During the second migration, the system migrates the files whose last modification time is between 08:00, March 10, 2019 and 10:00, March 10, 2019. The migration job consists of one full migration and four incremental migrations.
Important: Before full or incremental migration starts, the system compares the files at the source data address with the files at the destination data address. If a file at the destination data address has the same name as a file at the source data address, the file at the destination data address is overwritten when one of the following conditions is met:
The Content-Type values of the files at the source data address and the destination data address are different.
The last modification time of the file at the source data address is later than that of the file at the destination data address.
The size of the file at the source data address is different from that of the file at the destination data address.
Start Time Point of File
Yes
The time to filter data to be migrated. The system migrates the data that is created or modified after the specified time. Valid values:
All: migrates all files.
Assign: migrates only the files that are created or modified after the specified time.
For example, if you set the time to 08:00:00, November 1, 2018, only the files that are created or modified after 08:00:00, November 1, 2018 are migrated.
Migration Interval
This parameter is required if you set the Migration Type parameter to Incremental.
The default value is 1, and the maximum value is 24. Unit: hours.
Migration Times
This parameter is required if you set the Migration Type parameter to Incremental.
The default value is 1, and the maximum value is 30.
File Overwrite Method
Yes
The method that is used to process a file at the source data address that has the same name as a file at the destination data address. Valid values:
LastModified: If a file at the source data address has the same name as a file at the destination data address, the system compares the last modification time that is indicated by the LastModified property of the files.
If the last modification time of the file at the source data address is later than that of the file at the destination data address, the file at the source data address is migrated and the file at the destination data address is overwritten.
If the last modification time of the file at the source data address is earlier than that of the file at the destination data address, the file at the source data address is not migrated and the file at the destination data address is retained.
If the two files have the same last modification time, the system checks the values of the Size and Content-Type properties of the files.
If the two files have the same Size and Content-Type values, the file at the source data address is not migrated. If the two files have different values of at least one property, the file at the source data address is migrated and the file at the destination data address is overwritten.
Condition: If a file at the source data address has the same name as a file at the destination data address, the system compares the values of the LastModified, Size, and Content-Type properties of the two files.
If all property values of the two files are the same, the file at the source data address is not migrated.
If the two files have different values of at least one property, the file at the source data address is migrated and the file at the destination data address is overwritten.
All: If a file at the source data address has the same name as a file at the destination data address, the system performs no comparison and directly overwrites the file at the destination address with the file at the source data address.
No: If a file at the source data address has the same name as a file at the destination data address, the system performs no comparison and directly skips the file at the source data address during the migration.
In the Performance step, go to the Data Prediction section and configure the Data Size and File Count parameters.
Note: To ensure that the migration job is successful, we recommend that you estimate the amount of data that you want to migrate as accurately as possible. For more information, see Prepare for a migration job.
(Optional) In the Performance step, go to the Flow Control section, specify the Time Range and Max Flow(MB/s) parameters, and then click Add.
Note: To ensure business continuity, we recommend that you configure the Time Range and Max Flow(MB/s) parameters based on traffic fluctuations.
Click Create and wait until the migration job is complete.
ossimport
Download and install ossimport in standalone mode. For more information, see Overview.
The ossimport tool in standalone mode has the following file structure:
ossimport
├── bin
│   └── ossimport2.jar    # The JAR package that contains the Master, Worker, TaskTracker, and Console modules.
├── conf
│   ├── local_job.cfg     # The job configuration file.
│   └── sys.properties    # The configuration file that contains system parameters.
├── console.bat           # The Windows command-line utility that is used to run tasks in a distributed manner.
├── console.sh            # The Linux command-line utility that is used to run tasks in a distributed manner.
├── import.bat            # The script that automatically imports files based on the conf/local_job.cfg configuration file in Windows. The configuration file contains parameters that specify data migration operations, such as start, migration, verification, and retry.
├── import.sh             # The script that automatically imports files based on the conf/local_job.cfg configuration file in Linux. The configuration file contains parameters that specify data migration operations, such as start, migration, verification, and retry.
├── logs                  # The directory that contains logs.
└── README.md             # The file that describes ossimport. We recommend that you read the file before you use ossimport.
Configure ossimport in standalone mode.
You need to modify only the following parameters in the conf/local_job.cfg configuration file:
srcType=s3
srcAccessKey=<Your AWS access key>
srcSecretKey=<Your AWS secret key>
srcDomain=<The domain name that corresponds to the region in which the Amazon S3 bucket resides>
srcBucket=<The name of the Amazon S3 bucket>
destAccessKey=<Your Alibaba Cloud AccessKey ID>
destSecretKey=<Your Alibaba Cloud AccessKey secret>
destDomain=<The endpoint that corresponds to the region in which the OSS bucket resides>
destBucket=<The name of the OSS bucket>
destPrefix=
isSkipExistFile=true
For more information, see Overview.
Run ossimport to synchronize data to OSS. For information about how to use ossimport in standalone mode, see Standalone deployment.
Step 3: Create a destination table
Create a destination table in the AnalyticDB for PostgreSQL instance to load data from Amazon Redshift. The destination table must have the same schema as the source table. For more information, see CREATE TABLE.
Convert the DDL statements of database objects such as schemas, tables, functions, and views from the Amazon Redshift syntax to the AnalyticDB for PostgreSQL syntax. For more information, see the "Convert DDL statements" section of this topic.
Step 4: Import data from OSS to the AnalyticDB for PostgreSQL instance
You can use the COPY statement or an OSS foreign table to import data to AnalyticDB for PostgreSQL.
For information about how to use the COPY statement to import OSS data, see Use the COPY or UNLOAD statement to import or export data between OSS foreign tables and AnalyticDB for PostgreSQL tables.
For information about how to use an OSS foreign table to import OSS data, see the "Import OSS data to an AnalyticDB for PostgreSQL table" section of the Use OSS foreign tables for data lake analysis topic.
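For illustration, the following statements provide a minimal sketch of the foreign table approach. They assume that an OSS foreign table named oss_external_table.table1_ext has already been created over the exported files by following the linked topic, and that the destination table schema1.table1 exists.
-- Assumption: oss_external_table.table1_ext is an OSS foreign table that maps the exported files in the OSS bucket.
-- Import the OSS data into the local table.
INSERT INTO schema1.table1
SELECT * FROM oss_external_table.table1_ext;

-- Collect statistics on the destination table after the import.
ANALYZE schema1.table1;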
Convert DDL statements
The DDL statements of Amazon Redshift are slightly different from the DDL statements of AnalyticDB for PostgreSQL and must be converted before you migrate data.
CREATE SCHEMA
Execute the CREATE SCHEMA statement of AnalyticDB for PostgreSQL to create a schema. Example:
CREATE SCHEMA schema1 AUTHORIZATION xxxpoc;
GRANT ALL ON SCHEMA schema1 TO xxxpoc;
GRANT ALL ON SCHEMA schema1 TO public;
COMMENT ON SCHEMA model IS 'for xxx migration poc test';
CREATE SCHEMA oss_external_table AUTHORIZATION xxxpoc;
CREATE FUNCTION
AnalyticDB for PostgreSQL does not support specific SQL functions of Amazon Redshift. You must modify or rewrite the SQL functions. Examples:
Replace the CONVERT_TIMEZONE(a, b, c) function with timezone(b, timezone(a, c)).
Replace the GETDATE() function with current_timestamp(0)::timestamp.
A combined example of these two replacements is provided at the end of this list.
Replace or optimize user-defined functions (UDFs). The following sample code provides an example of a UDF of Amazon Redshift:
CREATE OR REPLACE FUNCTION public.f_jdate(dt timestamp without time zone)
RETURNS character varying AS '
    from datetime import timedelta, datetime
    if dt.hour < 4:
        d = timedelta(days=-1)
        dt = dt + d
    return str(dt.date())'
LANGUAGE plpythonu IMMUTABLE;
COMMIT;
Replace the preceding function with the following function:
to_char(a - interval '4 hour', 'yyyy-mm-dd')
Replace other Amazon Redshift functions.
For information about standard PostgreSQL functions, see Functions and Operators. Modify or rewrite the other SQL functions of Amazon Redshift that are not supported by AnalyticDB for PostgreSQL.
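The following query pair is a minimal example of the two built-in function replacements described above. The table t, its created_at column, and the UTC and Asia/Shanghai time zones are hypothetical placeholders.
-- Amazon Redshift
SELECT CONVERT_TIMEZONE('UTC', 'Asia/Shanghai', t.created_at), GETDATE()
FROM t;

-- Equivalent query in AnalyticDB for PostgreSQL
SELECT timezone('Asia/Shanghai', timezone('UTC', t.created_at)), current_timestamp(0)::timestamp
FROM t;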
CREATE TABLE
Modify the name of a table.
The name of an Amazon Redshift table can be up to 127 characters in length, whereas the name of an AnalyticDB for PostgreSQL table can be up to 63 characters in length. If the names of database objects such as tables, functions, and views are more than 63 characters in length, you must remove the excess characters.
Modify the compression algorithm.
Replace ENCODE XXX in a CREATE TABLE statement of Amazon Redshift with the following clause:
WITH (COMPRESSTYPE={ZLIB|ZSTD|RLE_TYPE|NONE})
For information about the compression algorithms supported by AnalyticDB for PostgreSQL, see the "Data compression" section of the Define storage models for tables topic.
Modify distribution keys.
Amazon Redshift supports three distribution keys. For more information, see Distribution styles. You must modify distribution keys based on the following rules:
EVEN distribution: Use DISTRIBUTED RANDOMLY.
KEY distribution: Use DISTRIBUTED BY (column, [ ... ]).
ALL distribution: Use DISTRIBUTED REPLICATED.
Modify sort keys.
Replace the [ COMPOUND | INTERLEAVED ] SORTKEY (column_name [, ...]) sort key clause of Amazon Redshift with the following clause:
ORDER BY (column, [ ... ])
Examples:
Example 1:
The following sample code provides an example of a CREATE TABLE statement of Amazon Redshift:
CREATE TABLE schema1.table1 (
    filed1 VARCHAR(100) ENCODE lzo,
    filed2 INTEGER DISTKEY,
    filed3 INTEGER,
    filed4 BIGINT ENCODE lzo,
    filed5 INTEGER
)
INTERLEAVED SORTKEY (
    filed1,
    filed2
);
Convert the preceding statement into the following CREATE TABLE statement of AnalyticDB for PostgreSQL:
CREATE TABLE schema1.table1 (
    filed1 VARCHAR(100),
    filed2 INTEGER,
    filed3 INTEGER,
    filed4 BIGINT,
    filed5 INTEGER
)
WITH (APPENDONLY=true, ORIENTATION=column, COMPRESSTYPE=zstd)
DISTRIBUTED BY (filed2)
ORDER BY (filed1, filed2);

-- Sort the table.
SORT schema1.table1;
MULTISORT schema1.table1;
Example 2:
The following sample code provides an example of a CREATE TABLE statement of Amazon Redshift, which includes the ENCODE and SORTKEY options:
CREATE TABLE schema2.table2 (
    filed1 VARCHAR(50) ENCODE lzo,
    filed2 VARCHAR(50) ENCODE lzo,
    filed3 VARCHAR(20) ENCODE lzo
)
DISTSTYLE EVEN
INTERLEAVED SORTKEY (
    filed1
);
Convert the preceding statement into the following CREATE TABLE statement of AnalyticDB for PostgreSQL:
CREATE TABLE schema2.table2 (
    filed1 VARCHAR(50),
    filed2 VARCHAR(50),
    filed3 VARCHAR(20)
)
WITH (APPENDONLY=true, ORIENTATION=column, COMPRESSTYPE=zstd)
DISTRIBUTED RANDOMLY
ORDER BY (filed1);

-- Sort the table.
SORT schema2.table2;
MULTISORT schema2.table2;
Note: For more information about sort keys, see Use sort keys and rough set indexes to accelerate queries in column-oriented tables.
CREATE VIEW
Convert a CREATE VIEW statement of Amazon Redshift into an SQL statement that complies with the AnalyticDB for PostgreSQL syntax. The conversion rules are similar to the conversion rules of CREATE TABLE statements.
AnalyticDB for PostgreSQL does not support the WITH NO SCHEMA BINDING clause. Remove the clause from the statement, as shown in the following example.
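The following statements are a minimal illustration of the conversion. The view name schema1.v_orders and the referenced columns are hypothetical.
-- Amazon Redshift
CREATE VIEW schema1.v_orders AS
SELECT filed1, filed2
FROM schema1.table1
WITH NO SCHEMA BINDING;

-- AnalyticDB for PostgreSQL: remove the WITH NO SCHEMA BINDING clause
CREATE VIEW schema1.v_orders AS
SELECT filed1, filed2
FROM schema1.table1;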