Amazon S3 (Simple Storage Service) is an object storage service that lets you store and retrieve any amount of data from anywhere. DataWorks Data Integration supports reading data from and writing data to Amazon S3. This topic describes the features of the Amazon S3 data source.
Limitations
Batch Read
Amazon S3 is an unstructured data storage service. The following table describes the features that the Amazon S3 Reader in Data Integration supports and does not support.
Supported | Unsupported |
Batch Write
The Amazon S3 Writer converts data from the Data Synchronization protocol into text files and writes them to Amazon S3. Because Amazon S3 is an unstructured data storage service, the following table describes the features that the Amazon S3 Writer supports and does not support.
Supported | Unsupported |
Add a data source
Before you develop a synchronization task in DataWorks, add the required data source by following the instructions in Data Source Management. When you add the data source, you can view the parameter descriptions in the DataWorks console to understand the meaning of each parameter.
Develop a data synchronization task
For information about the entry point and the procedure for configuring a synchronization task, see the following configuration guides.
Configure a single-table batch synchronization task
For the procedure, see Configure a task in the codeless UI and Configure a task in the code editor.
For all parameters and a script example for configuring a task in Script Mode, see Appendix: Script examples and parameter descriptions later in this topic.
Appendix: Script examples and parameter descriptions
Configure a batch synchronization task by using the code editor
If you want to configure a batch synchronization task by using the code editor, you must configure the related parameters in the script based on the unified script format requirements. For more information, see Configure a task in the code editor. The following information describes the parameters that you must configure for data sources when you configure a batch synchronization task by using the code editor.
Reader script example
{
    "type":"job",
    "version":"2.0",// The version number.
    "steps":[
        {
            "stepType":"s3",// The plugin name.
            "parameter":{
                "nullFormat":"",// Defines the string that represents a null value.
                "compress":"",// The text compression type.
                "datasource":"",// The data source.
                "column":[// The fields.
                    {
                        "index":0,// The column index.
                        "type":"string"// The data type.
                    },
                    {
                        "index":1,
                        "type":"long"
                    },
                    {
                        "index":2,
                        "type":"double"
                    },
                    {
                        "index":3,
                        "type":"boolean"
                    },
                    {
                        "format":"yyyy-MM-dd HH:mm:ss",// The time format.
                        "index":4,
                        "type":"date"
                    }
                ],
                "skipHeader":"",// Skips the header of a CSV-like file.
                "encoding":"",// The encoding format.
                "fieldDelimiter":",",// The field delimiter.
                "fileFormat":"",// The text file type.
                "object":[]// The object prefix.
            },
            "name":"Reader",
            "category":"reader"
        },
        {
            "stepType":"stream",
            "parameter":{},
            "name":"Writer",
            "category":"writer"
        }
    ],
    "setting":{
        "errorLimit":{
            "record":""// The number of error records allowed.
        },
        "speed":{
            "throttle":true,// Specifies whether to enable rate limiting. If this parameter is set to false, the mbps parameter is ignored.
            "concurrent":1,// The number of concurrent jobs.
            "mbps":"12"// The maximum transfer rate in MB/s. Note: In this context, 1 mbps is equal to 1 MB/s.
        }
    },
    "order":{
        "hops":[
            {
                "from":"Reader",
                "to":"Writer"
            }
        ]
    }
}
Reader script parameters
Parameter | Description | Required | Default Value |
datasource | The name of the data source. In Script Mode, the value of this parameter must match the name of the data source that you added. | Yes | None |
object | The S3 object or objects to read. You can specify multiple objects. For example, if a bucket contains a folder named `test` with a file named `ll.txt`, set the object value to `test/ll.txt`. | Yes | None |
column | The list of fields to read. By default, all data is read as the string type. You can also specify detailed column information, in which `index` selects a source column by position and `value` supplies a constant. For an illustrative configuration, see the sketch after this table. Note: When you specify column information, the `type` parameter is required, and you must specify either `index` or `value`. | Yes | All columns are read as the STRING type. |
fieldDelimiter | The delimiter that is used to separate fields. Note: You must specify a field delimiter for the Amazon S3 Reader. If no field delimiter is specified, a comma (,) is used by default; this default value is also pre-filled in the UI. If the delimiter is not a visible character, use its Unicode representation, such as \u001b or \u007c. | Yes | , (comma) |
compress | The text compression type. By default, this parameter is empty, which indicates that no compression is used. The following compression types are supported: gzip, bzip2, and zip. | No | None |
encoding | The encoding format of the source files. | No | UTF-8 |
nullFormat | A text file cannot use a standard string to represent a null value. You can use the nullFormat parameter to define a string that represents a null value. For example, if you set nullFormat to `null`, the string `null` in the source data is treated as a null value. | No | None |
skipHeader | For CSV-like files, specifies whether to skip the header row. Note: You cannot use the skipHeader parameter for compressed files. | No | false |
csvReaderConfig | The advanced settings for reading CSV files. This parameter is of the map type. If this parameter is not configured, the default settings of CsvReader are used. | No | None |
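The following fragment is a minimal sketch of the two column styles described in the preceding table. It is illustrative rather than a complete job: the `test/` prefix and the constant value are placeholders, and the `"*"` shorthand for reading all columns as strings is an assumption based on the default behavior described above.
"parameter":{
    "object":["test/"],// Placeholder prefix: reads the objects under test/.
    // Default behavior: read every column as a string. The "*" shorthand is an assumption.
    // "column":["*"],
    // Detailed configuration: select source columns by index, or inject a constant with value.
    "column":[
        {"index":0,"type":"long"},// The first source column, parsed as a long.
        {"type":"string","value":"constant_value"}// A constant column; the value is a placeholder.
    ],
    "nullFormat":"null"// Treat the literal string "null" in the source data as a null value.
}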
Writer script example
{
    "type": "job",
    "version": "2.0",
    "steps": [
        {
            "stepType": "stream",
            "parameter": {},
            "name": "Reader",
            "category": "reader"
        },
        {
            "stepType": "s3",
            "category": "writer",
            "name": "Writer",
            "parameter": {
                "datasource": "datasource1",
                "object": "test/csv_file.csv",
                "fileFormat": "csv",
                "encoding": "utf8/gbk/...",
                "fieldDelimiter": ",",
                "lineDelimiter": "\n",
                "column": [
                    "0",
                    "1"
                ],
                "header": [
                    "col_bigint",
                    "col_tinyint"
                ],
                "writeMode": "truncate",
                "writeSingleObject": true
            }
        }
    ],
    "setting": {
        "errorLimit": {
            "record": "" // The number of error records allowed.
        },
        "speed": {
            "throttle": true, // Specifies whether to enable rate limiting. If this parameter is set to false, the mbps parameter is ignored.
            "concurrent": 1, // The number of concurrent jobs.
            "mbps": "12" // The maximum transfer rate in MB/s. Note: In this context, 1 mbps is equal to 1 MB/s.
        }
    },
    "order": {
        "hops": [
            {
                "from": "Reader",
                "to": "Writer"
            }
        ]
    }
}
Writer script parameters
Parameter | Description | Required | Default Value |
datasource | The name of the data source. In Script Mode, the value of this parameter must match the name of the data source that you added. | Yes | None |
object | The name of the destination object. | Yes | None |
fileFormat | The format of the output file. Valid values: csv and text. | Yes | text |
writeMode | The write mode. Valid values: truncate (clears objects whose names start with the specified object prefix before writing), append (writes new objects without clearing existing ones), and nonConflict (reports an error if objects whose names start with the specified prefix already exist). | Yes | append |
fieldDelimiter | The delimiter that is used to separate fields in the output file. | No | , (comma) |
lineDelimiter | The delimiter that is used to separate lines in the output file. | No | \n (newline character) |
compress | The compression type of the output files. By default, this parameter is empty, which indicates that no compression is used. | No | None |
nullFormat | A text file cannot use a standard string to represent a null value. You can use the nullFormat parameter to define a string that represents a null value. | No | None |
header | The header row to be written to the output file. Example: ["col_bigint", "col_tinyint"], as in the preceding script example. | No | None |
writeSingleObject | Specifies whether to write all data to a single object. Valid values: `true` and `false`. | No | false |
encoding | The encoding format of the output file. | No | UTF-8 |
column | The columns to write, specified by source column index. For example, `"column": ["0", "1"]` writes the first and second source columns, as in the preceding script example. | Yes | None |
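To make the writer parameters above concrete, the following fragment is a minimal, illustrative sketch rather than a complete job: the object name prefix is a placeholder, and the writeMode behavior follows the description in the preceding table.
"parameter": {
    "datasource": "datasource1",// The name of the data source added in DataWorks.
    "object": "test/out",// Placeholder object name prefix for the output files.
    "fileFormat": "text",
    "fieldDelimiter": ",",
    "lineDelimiter": "\n",
    "writeMode": "append",// Write new objects without clearing existing ones.
    "writeSingleObject": false,// Allow the data to be split across multiple objects.
    "nullFormat": "null",// Write null values as the literal string "null".
    "header": ["col_bigint", "col_tinyint"],// Header row, as in the preceding script example.
    "column": ["0", "1"]// Write the first and second source columns.
}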