DataWorks provides Amazon Redshift Reader and Amazon Redshift Writer for you to read data from and write data to Amazon Redshift data sources. You can configure a synchronization task for an Amazon Redshift data source by using the codeless user interface (UI) or code editor. This topic describes the capabilities of synchronizing data from or to Amazon Redshift data sources.
Supported Amazon Redshift versions
The driver that is used for Amazon Redshift data sources is the Amazon Redshift JDBC 4.2-compatible driver (redshift-jdbc42), version 2.1.0.1. For information about the capabilities of the driver, see Configure a JDBC driver version for Amazon Redshift.
Data type mappings
For information about the mappings between Amazon Redshift data types and SQL data types and the mappings between Amazon Redshift data types and Java data types, see the official Amazon Redshift documentation. The following table lists the data type mappings.
Amazon Redshift data type | SQL data type | Java data type |
BIGINT | SQL_BIGINT | LONG |
BOOLEAN | SQL_BIT | Boolean |
CHAR | SQL_CHAR | STRING |
DATE | SQL_TYPE_DATE | java.sql.Date |
DECIMAL | SQL_NUMERIC | BigDecimal |
DOUBLE PRECISION | SQL_DOUBLE | Double |
GEOMETRY | SQL_LONGVARBINARY | byte[] |
INTEGER | SQL_INTEGER | INTEGER |
OID | SQL_BIGINT | LONG |
SUPER | SQL_LONGVARCHAR | STRING |
REAL | SQL_REAL | Float |
SMALLINT | SQL_SMALLINT | SHORT |
TEXT | SQL_VARCHAR | STRING |
TIME | SQL_TYPE_TIME | java.sql.Time |
TIMETZ | SQL_TYPE_TIME | java.sql.Time |
TIMESTAMP | SQL_TYPE_TIMESTAMP | java.sql.Timestamp |
TIMESTAMPTZ | SQL_TYPE_TIMESTAMP | java.sql.Timestamp |
VARCHAR | SQL_VARCHAR | STRING |
Establish a network connection for an Amazon Redshift data source
Before you run a synchronization task configured for an Amazon Redshift data source to synchronize data, you must establish a network connection between the data source and your exclusive resource group for Data Integration over an internal network. For more information, see Establish a network connection between a resource group and a data source.
Develop a synchronization task
For information about the entry point for configuring a synchronization task and the configuration procedure, see the following sections. For information about the parameter settings, view the infotip of each parameter on the configuration tab of the synchronization task.
Add a data source
Before you configure a synchronization task to synchronize data from or to a specific data source, you must add the data source to DataWorks. For more information, see Add and manage data sources. You must configure the following parameters when you add an Amazon Redshift data source to DataWorks:
JDBC URL: Specify a JDBC connection string that contains an IP address, a port number, a database name, and connection parameters. You can specify a public IP address or a private IP address. If you specify a public IP address, make sure that your exclusive resource group for Data Integration can access the host on which the Amazon Redshift data source is deployed.
Username: Specify the username of the Amazon Redshift database that you want to connect to.
Password: Specify the password of the Amazon Redshift database that you want to connect to.
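A JDBC URL for an Amazon Redshift database typically uses the following format. The cluster endpoint, port number, and database name in this example are placeholders that you must replace with the information about your own cluster.
jdbc:redshift://redshift-cluster-1.xxxxxx.us-west-2.redshift.amazonaws.com:5439/dev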
Configure a batch synchronization task to synchronize data of a single table
For more information about the configuration procedure, see Configure a batch synchronization task by using the codeless UI and Configure a batch synchronization task by using the code editor.
For information about all parameters that are configured and the code that is run when you use the code editor to configure a batch synchronization task, see Appendix: Code and parameters.
Appendix: Code and parameters
Configure a batch synchronization task by using the code editor
If you use the code editor to configure a batch synchronization task, you must configure parameters for the reader and writer of the related data source based on the format requirements in the code editor. For more information about the format requirements, see Configure a batch synchronization task by using the code editor. The following information describes the configuration details of parameters for the reader and writer in the code editor.
Code for Amazon Redshift Reader and parameters in code for Amazon Redshift Reader
Code for Amazon Redshift Reader
{
"stepType": "redshift"
"parameter":
{
"datasource":"redshift_datasource",
"table": "redshift_table_name",
"where": "xxx=3",
"splitPk": "id",
"column":
[
"id",
"table_id",
"table_no",
"table_name",
"table_status"
]
},
"name": "Reader",
"category": "reader"
}
Parameters in code for Amazon Redshift Reader
Parameter | Description | Required | Default value |
datasource | The name of the data source. It must be the same as the name of the added data source. You can add data sources by using the code editor. | Yes | No default value |
table | The name of the table from which you want to read data. | Yes | No default value |
column | The names of the columns from which you want to read data. Separate the names with commas (,), such as "column":["id","name","age"]. If you want to read data from all the columns in the source table, set this parameter to an asterisk (*), such as "column":["*"]. | Yes | No default value |
where | The WHERE clause. The batch synchronization task concatenates the values of the column, table, and where parameters to form an SQL statement and executes the SQL statement to read data from the source table. When you perform a test, you can set the where parameter to limit 10. To read the data that is generated on the current day, you can set the where parameter to gmt_create > $bizdate. | No | No default value |
splitPk | The field that is used for data sharding. If you configure this parameter, the source table is sharded based on the value of this parameter. Data Integration then runs parallel threads to read data. This way, data can be synchronized more efficiently. | No | No default value |
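For example, the following reader configuration is a sketch that combines the preceding parameters. The table name and column names are hypothetical. The where parameter limits the task to the data that is generated on the current day, and the splitPk parameter shards the source table by the id column so that parallel threads can read data.
{
"stepType": "redshift",
"parameter":
{
"datasource": "redshift_datasource",
"table": "orders",// A hypothetical source table.
"where": "gmt_create > $bizdate",// Read only the data that is generated on the current day.
"splitPk": "id",// Shard the source table by the id column.
"column":
[
"id",
"gmt_create",
"amount"
]
},
"name": "Reader",
"category": "reader"
}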
Code for Amazon Redshift Writer and parameters in code for Amazon Redshift Writer
Code for Amazon Redshift Writer
{
"stepType": "redshift",// The plug-in name.
"parameter":
{
"postSql":["delete from XXX;"],
"preSql":["delete from XXX;"],
"datasource":"redshift_datasource",// The name of the data source.
"table": "redshift_table_name",// The name of the table.
"writeMode": "insert",
"batchSize": 2048,
"column":
[
"id",
"table_id",
"table_no",
"table_name",
"table_status"
]
},
"name": "Writer",
"category": "writer"
}
Parameters in code for Amazon Redshift Writer
Parameter | Description | Required | Default value |
datasource | The name of the data source. It must be the same as the name of the added data source. You can add data sources by using the code editor. | Yes | No default value |
table | The name of the table to which you want to write data. | Yes | No default value |
column | The names of the columns to which you want to write data. Separate the names with commas (,), such as "column":["id","name","age"]. If you want to write data to all the columns in the destination table, set this parameter to an asterisk (*), such as "column":["*"]. | Yes | No default value |
preSql | The SQL statement that you want to execute before the batch synchronization task is run. For example, you can set this parameter to the SQL statement that is used to delete outdated data. You can execute only one SQL statement on the codeless UI and multiple SQL statements in the code editor. | No | No default value |
postSql | The SQL statement that you want to execute after the batch synchronization task is run. For example, you can set this parameter to the SQL statement that is used to add a timestamp. You can execute only one SQL statement on the codeless UI and multiple SQL statements in the code editor. | No | No default value |
batchSize | The maximum number of data records to write at a time. | No | 2048 |
writeMode | The write mode. Set the value to insert. | No | insert |
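For example, the following writer configuration is a sketch that combines the preceding parameters. The table name and column names are hypothetical. The preSql parameter deletes the existing data from the destination table before the task writes new data, and the batchSize parameter is increased to reduce the number of write requests.
{
"stepType": "redshift",
"parameter":
{
"preSql":["delete from orders_copy;"],// Clear the destination table before new data is written.
"datasource":"redshift_datasource",
"table": "orders_copy",// A hypothetical destination table.
"writeMode": "insert",
"batchSize": 4096,// Write a maximum of 4,096 data records at a time.
"column":
[
"id",
"gmt_create",
"amount"
]
},
"name": "Writer",
"category": "writer"
}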