DataWorks provides OSS Reader and OSS Writer, which read data from and write data to Object Storage Service (OSS) data sources. This topic describes the data synchronization capabilities that DataWorks supports for OSS data sources.
Data type mappings and limits
Batch data read
OSS Reader reads data from OSS and converts the data to a format that Data Integration can read. OSS stores only unstructured data. The following table lists the features that OSS Reader supports and does not support:
Supported | Not supported |
|
|
If data in OSS is stored as CSV files, the data must comply with the standard CSV format. For example, if the data in a column of a CSV file is enclosed in a pair of single quotation marks ('), you must replace this pair of single quotation marks with a pair of double quotation marks ("). Otherwise, the data in the CSV file may be incorrectly parsed.
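The quoting fix described above can be scripted before the file is uploaded. The following is a minimal sketch, assuming that the file is comma-delimited and that single quotation marks are its only quoting character. The function name `requote_csv` is hypothetical and is not part of DataWorks or OSS.

```python
import csv
import io


def requote_csv(text: str, delimiter: str = ",") -> str:
    """Re-emit CSV text that uses ' as the quote character with standard " quoting."""
    # Parse the input with the single quotation mark treated as the quote character.
    reader = csv.reader(io.StringIO(text, newline=""),
                        delimiter=delimiter, quotechar="'")
    out = io.StringIO(newline="")
    # Write with the standard double quotation mark; quote only fields that need it.
    writer = csv.writer(out, delimiter=delimiter, quotechar='"',
                        quoting=csv.QUOTE_MINIMAL, lineterminator="\n")
    for row in reader:
        writer.writerow(row)
    return out.getvalue()


# Hypothetical sample: the second field of the first data row contains a delimiter.
sample = "id,name\n1,'Smith, John'\n2,'plain'\n"
converted = requote_csv(sample)
print(converted)
```

Fields that contain the delimiter are enclosed in double quotation marks in the output, and fields that do not need quoting are written bare, which standard CSV parsers also accept.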
OSS is an unstructured data source that stores file-based data. Before you synchronize data from an OSS data source, check that the field structure of the files matches what you expect. If the field structure in an unstructured data source changes, you must confirm the field structure again when you configure the synchronization task. Otherwise, the synchronized data may be out of order.
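One way to perform this check is to compare the first row of a sample file against the column list that the task was configured with. The following is a minimal sketch, assuming that the file is CSV with a header row. The function name `check_header` and the column names are hypothetical.

```python
import csv
import io
from typing import List


def check_header(csv_text: str, expected: List[str], delimiter: str = ",") -> List[str]:
    """Return human-readable differences between the header row and the expected columns."""
    # Read only the first row; an empty file yields an empty header.
    first_row = next(csv.reader(io.StringIO(csv_text), delimiter=delimiter), [])
    problems = []
    if len(first_row) != len(expected):
        problems.append(
            f"column count {len(first_row)} != expected {len(expected)}")
    # Compare position by position, because column order matters for synchronization.
    for pos, (actual, wanted) in enumerate(zip(first_row, expected)):
        if actual != wanted:
            problems.append(
                f"column {pos}: found {actual!r}, expected {wanted!r}")
    return problems


expected_columns = ["id", "name", "created_at"]
unchanged = "id,name,created_at\n1,a,2024-01-01\n"
reordered = "id,created_at,name\n1,2024-01-01,a\n"

print(check_header(unchanged, expected_columns))
print(check_header(reordered, expected_columns))
```

An empty result means that the header still matches. Any reported difference is a signal to confirm the field structure again in the task configuration before the next run.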
Batch data write
OSS stores only unstructured data. Therefore, OSS Writer converts the data obtained from a reader to text files and writes the files to OSS. The following table lists the features that OSS Writer supports and does not support:
Supported | Not supported |
|
|
Category | Data Integration data type |
Integer | LONG |
String | STRING |
Floating point | DOUBLE |
Boolean | BOOLEAN |
Date and time | DATE |
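The categories in the preceding table can be expressed as a small lookup, which is useful when you generate column definitions from sample values. The following is an illustrative sketch only. The type names on the right come from the table, but the mapping from Python types to categories and the function name `data_integration_type` are assumptions made for this example.

```python
import datetime


def data_integration_type(value) -> str:
    """Return the Data Integration data type for the category that a Python value falls into."""
    # bool must be tested before int because bool is a subclass of int in Python.
    if isinstance(value, bool):
        return "BOOLEAN"
    if isinstance(value, int):
        return "LONG"
    if isinstance(value, float):
        return "DOUBLE"
    # datetime.datetime is a subclass of datetime.date, so one check covers both.
    if isinstance(value, datetime.date):
        return "DATE"
    if isinstance(value, str):
        return "STRING"
    raise TypeError(f"no category in the table for {type(value).__name__}")


sample_row = [42, "alice", 3.14, True, datetime.date(2024, 1, 1)]
print([data_integration_type(v) for v in sample_row])
```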
Real-time data write
Real-time data write is supported.
You can write data from OSS to Hudi 0.12.x in real time.
Add a data source
Before you develop a synchronization task in DataWorks, you must add the required data source to DataWorks by following the instructions in Add and manage data sources. When you add a data source, the infotips in the DataWorks console explain what each parameter means.
If you want to add an OSS data source across accounts, you must grant the required permissions to the account. For more information, see Authorize a RAM user in another Alibaba Cloud account by adding a bucket policy.
For information about how to use the RAM role-based authorization mode to add an OSS data source, see Use the RAM role-based authorization mode to add a data source.
Develop a data synchronization task
For the entry point and the procedure for configuring a synchronization task, see the following configuration guides.
Configure a batch synchronization task to synchronize data of a single table
For more information about the configuration procedure, see Configure a batch synchronization task by using the codeless UI and Configure a batch synchronization task by using the code editor.
For the full list of parameters and the script that is run when you use the code editor to configure a batch synchronization task, see Appendix: Code and parameters.
Configure a real-time synchronization task to synchronize data of a single table
For more information about the configuration procedure, see Create a real-time synchronization task to synchronize incremental data from a single table and Configure a real-time synchronization task in DataStudio.
Configure synchronization settings to implement real-time synchronization of full or incremental data in a database
For more information about the configuration procedure, see Configure a synchronization task in Data Integration.
FAQ
Appendix: Code and parameters
Configure a batch synchronization task by using the code editor
To configure a batch synchronization task by using the code editor, you must configure the related parameters in the script in compliance with the unified script format. For more information, see Configure a batch synchronization task by using the code editor. The following information describes the data source parameters that you must configure in the code editor.
Common code for OSS Reader
Code for OSS Reader: Read data from ORC or Parquet objects in OSS
Parameters in code for OSS Reader
Common code for OSS Writer
Code for OSS Writer: Write ORC or Parquet files to OSS
Parameters in code for OSS Writer
Appendix: Convert the data types of data in Parquet files
If you do not configure the parquetSchema parameter, DataWorks converts the data types of data in source Parquet files based on the conversion policy in the following table.
Data type after conversion | Parquet type | Parquet logical type |
CHAR / VARCHAR / STRING | BINARY | UTF8 |
BOOLEAN | BOOLEAN | N/A |
BINARY / VARBINARY | BINARY | N/A |
DECIMAL | FIXED_LEN_BYTE_ARRAY | DECIMAL |
TINYINT | INT32 | INT_8 |
SMALLINT | INT32 | INT_16 |
INT / INTEGER | INT32 | N/A |
BIGINT | INT64 | N/A |
FLOAT | FLOAT | N/A |
DOUBLE | DOUBLE | N/A |
DATE | INT32 | DATE |
TIME | INT32 | TIME_MILLIS |
TIMESTAMP / DATETIME | INT96 | N/A |
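The conversion policy in the preceding table can be written as a lookup keyed by the Parquet type and the Parquet logical type. The following is a minimal sketch that transcribes the table. The names `PARQUET_CONVERSION` and `converted_types` are hypothetical, `None` stands for N/A, and the sketch does not reproduce how DataWorks performs the conversion internally.

```python
from typing import Optional, Tuple

# Keys: (Parquet type, Parquet logical type). None stands for "N/A" in the table.
# Values: the type names listed in the "Data type after conversion" column.
PARQUET_CONVERSION = {
    ("BINARY", "UTF8"): ("CHAR", "VARCHAR", "STRING"),
    ("BOOLEAN", None): ("BOOLEAN",),
    ("BINARY", None): ("BINARY", "VARBINARY"),
    ("FIXED_LEN_BYTE_ARRAY", "DECIMAL"): ("DECIMAL",),
    ("INT32", "INT_8"): ("TINYINT",),
    ("INT32", "INT_16"): ("SMALLINT",),
    ("INT32", None): ("INT", "INTEGER"),
    ("INT64", None): ("BIGINT",),
    ("FLOAT", None): ("FLOAT",),
    ("DOUBLE", None): ("DOUBLE",),
    ("INT32", "DATE"): ("DATE",),
    ("INT32", "TIME_MILLIS"): ("TIME",),
    ("INT96", None): ("TIMESTAMP", "DATETIME"),
}


def converted_types(parquet_type: str, logical_type: Optional[str] = None) -> Tuple[str, ...]:
    """Look up the table row for a Parquet type and an optional logical type."""
    key = (parquet_type.upper(), logical_type.upper() if logical_type else None)
    if key not in PARQUET_CONVERSION:
        raise ValueError(f"combination not listed in the table: {key}")
    return PARQUET_CONVERSION[key]


print(converted_types("INT32", "DATE"))
print(converted_types("int96"))
```

The logical type is part of the key because the same Parquet type can appear in several rows. INT32, for example, appears in five rows that differ only in the logical type.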