Amazon Simple Storage Service (Amazon S3) is an object storage service built to store and retrieve any amount of data from anywhere. DataWorks Data Integration allows you to use Amazon S3 Reader to read data from Amazon S3 buckets. This topic describes the capabilities of synchronizing data from Amazon S3 data sources.
Supported Amazon S3 versions
Amazon S3 Reader uses the Amazon S3 SDK for Java, provided by Amazon, to read data from an Amazon S3 data source.
Limits
Amazon S3 stores unstructured data. The following table describes the features that are supported and not supported by Amazon S3 Reader in Data Integration.
Supported | Unsupported |
|
|
Develop a data synchronization task
For the entry point and the procedure for configuring a data synchronization task, see the following sections. For parameter settings, view the infotip of each parameter on the configuration tab of the task.
Add a data source
Before you configure a data synchronization task to synchronize data from or to a specific data source, you must add the data source to DataWorks. For more information, see Add and manage data sources.
Configure a batch synchronization task to synchronize data of a single table
For more information about the configuration procedure, see Configure a batch synchronization task by using the codeless UI and Configure a batch synchronization task by using the code editor.
For information about all parameters that are configured and the code that is run when you use the code editor to configure a batch synchronization task, see Appendix: Code and parameters.
Appendix: Code and parameters
Appendix: Configure a batch synchronization task by using the code editor
If you use the code editor to configure a batch synchronization task, you must configure parameters for the reader of the related data source based on the format requirements in the code editor. For more information about the format requirements, see Configure a batch synchronization task by using the code editor. The following information describes the configuration details of parameters for the reader in the code editor.
Code for Amazon S3 Reader
{
"type":"job",
"version":"2.0",// The version number.
"steps":[
{
"stepType":"s3",// The plug-in name.
"parameter":{
"nullFormat":"",// The string that represents a null pointer.
"compress":"",// The format in which objects are compressed.
"datasource":"",// The name of the data source.
"column":[// The names of the columns.
{
"index":0,// The ID of a column in the source object.
"type":"string"// The data type of the column.
},
{
"index":1,
"type":"long"
},
{
"index":2,
"type":"double"
},
{
"index":3,
"type":"boolean"
},
{
"format":"yyyy-MM-dd HH:mm:ss", // The time format.
"index":4,
"type":"date"
}
],
"skipHeader":"",// Specifies whether to skip the headers in a CSV-like object if the object has headers.
"encoding":"",// The encoding format.
"fieldDelimiter":",",// The column delimiter.
"fileFormat": "",// The format of the object.
"object":[]// The prefix of the object from which you want to read data.
},
"name":"Reader",
"category":"reader"
},
{
"stepType":"stream",
"parameter":{},
"name":"Writer",
"category":"writer"
}
],
"setting":{
"errorLimit":{
"record":""// The maximum number of dirty data records allowed.
},
"speed":{
"throttle":true,// Specifies whether to enable throttling. The value false indicates that throttling is disabled, and the value true indicates that throttling is enabled. The mbps parameter takes effect only when the throttle parameter is set to true.
"concurrent":1,// The maximum number of parallel threads.
"mbps":"12"// The maximum transmission rate. Unit: MB/s.
}
},
"order":{
"hops":[
{
"from":"Reader",
"to":"Writer"
}
]
}
}
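The sample above carries `//` comments, so it cannot be parsed as strict JSON as written. The following is a minimal Python sketch of one way to assemble the same structure and serialize it without comments. The helper name `build_s3_reader_job`, the data source name `my_s3_source`, and the `errorLimit` value `"0"` are illustrative assumptions, not values taken from DataWorks.

```python
import json


def build_s3_reader_job(datasource, objects, columns,
                        field_delimiter=",", encoding="utf-8",
                        concurrent=1, mbps=None):
    """Assemble a job dict shaped like the sample (hypothetical helper)."""
    # Throttling is enabled only when a rate is given, because mbps
    # takes effect only when throttle is true.
    speed = {"throttle": mbps is not None, "concurrent": concurrent}
    if mbps is not None:
        speed["mbps"] = str(mbps)

    reader = {
        "stepType": "s3",
        "parameter": {
            "datasource": datasource,
            "object": list(objects),
            "column": list(columns),
            "fieldDelimiter": field_delimiter,
            "encoding": encoding,
        },
        "name": "Reader",
        "category": "reader",
    }
    writer = {"stepType": "stream", "parameter": {},
              "name": "Writer", "category": "writer"}

    return {
        "type": "job",
        "version": "2.0",
        "steps": [reader, writer],
        # "0" is an assumed dirty-record limit; the sample leaves it empty.
        "setting": {"errorLimit": {"record": "0"}, "speed": speed},
        "order": {"hops": [{"from": "Reader", "to": "Writer"}]},
    }


job = build_s3_reader_job(
    datasource="my_s3_source",          # placeholder data source name
    objects=["test/ll.txt"],
    columns=[{"index": 0, "type": "string"},
             {"index": 1, "type": "long"}],
    mbps=12,
)
job_json = json.dumps(job, indent=2)
print(job_json)
```

Building the configuration as a dictionary and serializing it avoids comma and comment mistakes that are easy to make when editing the text by hand.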
Parameters in code for Amazon S3 Reader
Parameter | Description | Required | Default value |
datasource | The name of the data source. It must be the same as the name of the added data source. You can add data sources by using the code editor. | Yes | No default value |
object | The name of the Amazon S3 object. You can specify multiple objects from which Amazon S3 Reader reads data. For example, if a bucket contains the test folder in which the ll.txt object resides, the name of the object is test/ll.txt. | Yes | No default value |
column | The names of the columns from which you want to read data. The type parameter specifies the source data type. The index parameter specifies the ID of the column in the source object, starting from 0. The value parameter specifies the column value if the column is a constant column. Amazon S3 Reader does not read a constant column from the source. Instead, Amazon S3 Reader generates a constant column based on the value that you specify. By default, Amazon S3 Reader reads all data as strings. You can configure the column parameter in the following format:
Note In the column parameter, you must configure the type parameter and configure the index or value parameter. | Yes | *, which indicates that Amazon S3 Reader reads all data as strings. |
fieldDelimiter | The column delimiter that is used in the Amazon S3 object from which you want to read data. Note Amazon S3 Reader uses a column delimiter to read data. The default column delimiter is a comma (,). If you do not specify the column delimiter, the default column delimiter is used. If the delimiter is invisible, enter Unicode-encoded characters, such as \u001b and \u007c. | Yes | Comma (,). |
compress | The format in which objects are compressed. By default, this parameter is left empty, which means that objects are not compressed. Amazon S3 Reader supports the following compression formats: GZIP, BZIP2, and ZIP. | No | Do Not Compress |
encoding | The encoding format of the object from which you want to read data. | No | utf-8 |
nullFormat | The string that represents a null pointer. No standard strings can represent a null pointer in TXT objects. You can use this parameter to define a string that represents a null pointer. For example, if you set the | No | No default value |
skipHeader | Specifies whether to skip the headers in a CSV-like object. Valid values:
Note The skipHeader parameter is unavailable for compressed objects. | No | false |
csvReaderConfig | The configurations required to read CSV-like objects. The parameter value must match the MAP type. You can use a CSV object reader to read data from CSV objects. The CSV object reader supports multiple configurations. If no configuration is performed, the default settings are used. | No | No default value |
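The rule stated for the column parameter (each entry needs type, plus index or value) can be checked before a task is submitted. The following is a minimal Python sketch, assuming the column list is held as Python dictionaries. `validate_columns` is a hypothetical helper, not part of DataWorks, and the constant value `"alibaba"` is only a placeholder.

```python
def validate_columns(columns):
    """Return a list of problems in a column list; an empty list means OK."""
    problems = []
    for pos, col in enumerate(columns):
        # Every column entry must declare its data type.
        if "type" not in col:
            problems.append(f"column {pos}: missing 'type'")
        # A column is read from the source (index) or generated (value).
        if "index" not in col and "value" not in col:
            problems.append(f"column {pos}: needs 'index' or 'value'")
        # Source column IDs start from 0.
        if "index" in col:
            idx = col["index"]
            if not isinstance(idx, int) or isinstance(idx, bool) or idx < 0:
                problems.append(f"column {pos}: 'index' must be an integer >= 0")
    return problems


good = [
    {"index": 0, "type": "long"},           # read from the first source column
    {"value": "alibaba", "type": "string"},  # constant column, not read from S3
]
bad = [
    {"index": 0},        # no type
    {"type": "string"},  # neither index nor value
]

print(validate_columns(good))
print(validate_columns(bad))
```

The sketch deliberately does not reject an entry that sets both index and value, because the source text does not say how that case is handled.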