Amazon Simple Storage Service (Amazon S3) is an object storage service built to store and retrieve any amount of data from anywhere. DataWorks Data Integration allows you to use Amazon S3 Reader to read data from Amazon S3 buckets. This topic describes the capabilities of synchronizing data from Amazon S3 data sources.
Supported Amazon S3 versions
Amazon S3 Reader uses Amazon S3 SDK for Java provided by Amazon to read data from an Amazon S3 data source.
Limits
Amazon S3 stores unstructured data. The following table describes the features that are supported and not supported by Amazon S3 Reader in Data Integration.
Supported | Unsupported |
Reads data from TXT objects. The data in the TXT objects must be logical two-dimensional tables. | Uses parallel threads to read data from a single object. |
Reads data from CSV-like objects with custom delimiters. | Uses parallel threads to read data from a compressed object. |
Reads data in the ORC or Parquet format. | Reads data from an object that exceeds 100 GB in size. |
Reads data of various data types as strings, and supports constants and column pruning. | |
Supports recursive data read and object name-based filtering. | |
Supports object compression in the GZIP, BZIP2, and ZIP formats. Note: You cannot compress multiple objects into one package. | |
Uses parallel threads to read data from multiple objects at the same time. | |
Develop a data synchronization task
For the entry point and the procedure for configuring a data synchronization task, see the following sections. For parameter settings, see the infotip of each parameter on the configuration tab of the task.
Add a data source
Before you configure a data synchronization task to synchronize data from or to a specific data source, you must add the data source to DataWorks. For more information, see Add and manage data sources.
Configure a batch synchronization task to synchronize data of a single table
Appendix: Code and parameters
Appendix: Configure a batch synchronization task by using the code editor
If you use the code editor to configure a batch synchronization task, you must configure the reader parameters for the data source in the format that the code editor requires. For more information about the format requirements, see Configure a batch synchronization task by using the code editor. The following sections describe the reader parameters that you can configure in the code editor.
Code for Amazon S3 Reader
{
    "type": "job",
    "version": "2.0",
    "steps": [
        {
            "stepType": "s3",
            "parameter": {
                "nullFormat": "",
                "compress": "",
                "datasource": "",
                "column": [
                    {
                        "index": 0,
                        "type": "string"
                    },
                    {
                        "index": 1,
                        "type": "long"
                    },
                    {
                        "index": 2,
                        "type": "double"
                    },
                    {
                        "index": 3,
                        "type": "boolean"
                    },
                    {
                        "format": "yyyy-MM-dd HH:mm:ss",
                        "index": 4,
                        "type": "date"
                    }
                ],
                "skipHeader": "",
                "encoding": "",
                "fieldDelimiter": ",",
                "fileFormat": "",
                "object": []
            },
            "name": "Reader",
            "category": "reader"
        },
        {
            "stepType": "stream",
            "parameter": {},
            "name": "Writer",
            "category": "writer"
        }
    ],
    "setting": {
        "errorLimit": {
            "record": ""
        },
        "speed": {
            "throttle": true,
            "concurrent": 1,
            "mbps": "12"
        }
    },
    "order": {
        "hops": [
            {
                "from": "Reader",
                "to": "Writer"
            }
        ]
    }
}
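As a rough illustration of how the empty values in the template might be filled in, the following sketch shows only the reader step for two uncompressed, comma-delimited objects that each start with a header row. The data source name my_s3_source and the object test/mm.txt are hypothetical, and passing skipHeader as a JSON boolean is an assumption. Check the infotips in the code editor for the exact value types. The // annotations follow the commenting style of the column example in this topic and may need to be removed if strict JSON is required.

```json
{
    "stepType": "s3",
    "parameter": {
        "datasource": "my_s3_source", // Hypothetical name; must match a data source added to DataWorks.
        "object": [
            "test/ll.txt",
            "test/mm.txt" // Hypothetical second object that uses the same schema.
        ],
        "column": [
            { "index": 0, "type": "string" },
            { "index": 1, "type": "long" },
            { "index": 2, "type": "date", "format": "yyyy-MM-dd HH:mm:ss" }
        ],
        "fieldDelimiter": ",",
        "encoding": "utf-8",
        "skipHeader": true // Assumed to accept a boolean; the objects are uncompressed, so skipHeader applies.
    },
    "name": "Reader",
    "category": "reader"
}
```

Because two objects are listed, Amazon S3 Reader can read them in parallel, subject to the number of channels configured for the task.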
Parameters in code for Amazon S3 Reader
Parameter | Description | Required | Default value |
datasource | The name of the data source. It must be the same as the name of the added data source. You can add data sources by using the code editor. | Yes | No default value |
object | The name of the Amazon S3 object. You can specify multiple objects from which Amazon S3 Reader reads data. For example, if a bucket contains the test folder in which the ll.txt object resides, the name of the object is test/ll.txt. If you specify a single Amazon S3 object, Amazon S3 Reader uses a single thread to read data. If you specify multiple Amazon S3 objects, Amazon S3 Reader uses parallel threads to read data. The number of threads is determined by the number of channels. If you specify an object name that contains a wildcard, Amazon S3 Reader reads data from all objects whose names match. For example, if you set this parameter to abc*[0-9], Amazon S3 Reader reads data from objects abc0 to abc9. We recommend that you do not use wildcards because an out of memory (OOM) error may occur. Note: Data Integration treats all objects that are read in a synchronization task as a single table. Make sure that all objects that are read in a synchronization task use the same schema. Limit the number of objects stored in a folder. If a folder contains too many objects, an OOM error may occur. In this case, distribute the objects across different folders before you synchronize data. | Yes | No default value |
column | The columns from which you want to read data. The type parameter specifies the source data type. The index parameter specifies the position of the column in the source object, starting from 0. The value parameter specifies the column value if the column is a constant column. Amazon S3 Reader does not read a constant column from the source. Instead, Amazon S3 Reader generates the constant column based on the value that you specify. By default, Amazon S3 Reader reads all data as strings. To specify columns explicitly, configure the column parameter in the following format:
"column": [
    {
        "type": "long",
        "index": 0 // Read the first column of the source object as the long type.
    },
    {
        "type": "string",
        "value": "alibaba" // A constant column. In this code, the value is the constant alibaba.
    }
]
Note: For each column, you must configure the type parameter and either the index parameter or the value parameter. | Yes | *, which indicates that Amazon S3 Reader reads all data as strings. |
fieldDelimiter | The column delimiter that is used in the Amazon S3 object from which you want to read data. Note: If you do not specify a column delimiter, the default delimiter, a comma (,), is used. If the delimiter is an invisible character, enter its Unicode escape sequence, such as \u001b or \u007c. | Yes | Comma (,). |
compress | The format in which objects are compressed. By default, this parameter is left empty, which means that objects are not compressed. Amazon S3 Reader supports the following compression formats: GZIP, BZIP2, and ZIP. | No | Do Not Compress |
encoding | The encoding format of the object from which you want to read data. | No | utf-8 |
nullFormat | The string that represents a null value. TXT objects have no standard string that represents a null value, so you can use this parameter to define one. For example, if you set the nullFormat parameter to null, Amazon S3 Reader treats the source string null as a null value. | No | No default value |
skipHeader | Specifies whether to skip the header row in a CSV-like object. Valid values: true (skip the header) and false (do not skip the header). Note: The skipHeader parameter is unavailable for compressed objects. | No | false |
csvReaderConfig | The configurations required to read CSV-like objects. The parameter value must be of the MAP type. CSV objects are read by a CSV object reader that supports multiple configuration items. If you do not configure this parameter, the default settings are used. | No | No default value |
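To tie several of the parameters above together, the following sketch shows a parameter block for a single GZIP-compressed object whose columns are separated by an invisible delimiter. It is an illustration under stated assumptions, not a verified configuration: the data source name my_s3_source and the object name test/ll.txt.gz are hypothetical, and the lowercase spelling gzip for the compress value is assumed.

```json
{
    "parameter": {
        "datasource": "my_s3_source", // Hypothetical data source name.
        "object": ["test/ll.txt.gz"], // Hypothetical object; a single object is read by a single thread.
        "compress": "gzip", // Assumed spelling of the GZIP option.
        "fieldDelimiter": "\u001b", // Invisible delimiter written as a Unicode escape sequence.
        "encoding": "utf-8",
        "nullFormat": "null", // The source string null is treated as a null value.
        "column": [
            { "type": "long", "index": 0 },
            { "type": "string", "index": 1 },
            { "type": "string", "value": "alibaba" } // Constant column generated by the reader, not read from the object.
        ]
    }
}
```

The skipHeader parameter is omitted in this sketch because it is unavailable for compressed objects.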