Amazon S3 data source


Amazon Simple Storage Service (Amazon S3) is an object storage service built to store and retrieve any amount of data from anywhere. DataWorks Data Integration allows you to use Amazon S3 Reader to read data from Amazon S3 buckets. This topic describes the capabilities of synchronizing data from Amazon S3 data sources.

Supported Amazon S3 versions

Amazon S3 Reader uses Amazon S3 SDK for Java provided by Amazon to read data from an Amazon S3 data source.

Limits

Amazon S3 stores unstructured data. The following table describes the features that are supported and not supported by Amazon S3 Reader in Data Integration.

Supported

Reads data from TXT objects. The data in the TXT objects must be logical two-dimensional tables.
Reads data from CSV-like objects with custom delimiters.
Reads data in the ORC or Parquet format.
Reads data of various data types as strings, and supports constants and column pruning.
Supports recursive data read and object name-based filtering.
Supports object compression. The following compression formats are supported: GZIP, BZIP2, and ZIP. Note: Multiple objects cannot be compressed into one package.
Uses parallel threads to read data from multiple objects at the same time.

Unsupported

Uses parallel threads to read data from a single object.
Uses parallel threads to read data from a compressed object.
Reads data from an object that exceeds 100 GB in size.

Develop a data synchronization task

For information about the entry point for and the procedure of configuring a data synchronization task, see the following sections. For information about the parameter settings, view the infotip of each parameter on the configuration tab of the task.

Add a data source

Before you configure a data synchronization task to synchronize data from or to a specific data source, you must add the data source to DataWorks. For more information, see Add and manage data sources.

Configure a batch synchronization task to synchronize data of a single table

For more information about the configuration procedure, see Configure a batch synchronization task by using the codeless UI and Configure a batch synchronization task by using the code editor.
For information about all parameters that are configured and the code that is run when you use the code editor to configure a batch synchronization task, see Appendix: Code and parameters.

Appendix: Code and parameters

Appendix: Configure a batch synchronization task by using the code editor

If you use the code editor to configure a batch synchronization task, you must configure parameters for the reader of the related data source based on the format requirements in the code editor. For more information about the format requirements, see Configure a batch synchronization task by using the code editor. The following information describes the configuration details of parameters for the reader in the code editor.

Code for Amazon S3 Reader

{
    "type":"job",
    "version":"2.0",// The version number. 
    "steps":[
        {
            "stepType":"s3",// The plug-in name. 
            "parameter":{
                "nullFormat":"",// The string that represents a null value. 
                "compress":"",// The format in which objects are compressed. 
                "datasource":"",// The name of the data source. 
                "column":[// The names of the columns. 
                    {
                        "index":0,// The ID of a column in the source object. 
                        "type":"string"// The data type of the column. 
                    },
                    {
                        "index":1,
                        "type":"long"
                    },
                    {
                        "index":2,
                        "type":"double"
                    },
                    {
                        "index":3,
                        "type":"boolean"
                    },
                    {
                        "format":"yyyy-MM-dd HH:mm:ss", // The time format. 
                        "index":4,
                        "type":"date"
                    }
                ],
                "skipHeader":"",// Specifies whether to skip the headers in a CSV-like object if the object has headers. 
                "encoding":"",// The encoding format. 
                "fieldDelimiter":",",// The column delimiter. 
                "fileFormat": "",// The format of the object. 
                "object":[]// The prefix of the object from which you want to read data. 
            },
            "name":"Reader",
            "category":"reader"
        },
        {
            "stepType":"stream",
            "parameter":{},
            "name":"Writer",
            "category":"writer"
        }
    ],
    "setting":{
        "errorLimit":{
            "record":""// The maximum number of dirty data records allowed. 
        },
        "speed":{
            "throttle":true,// Specifies whether to enable throttling. The value false indicates that throttling is disabled, and the value true indicates that throttling is enabled. The mbps parameter takes effect only when the throttle parameter is set to true. 
            "concurrent":1,// The maximum number of parallel threads. 
            "mbps":"12"// The maximum transmission rate. Unit: MB/s. 
        }
    },
    "order":{
        "hops":[
            {
                "from":"Reader",
                "to":"Writer"
            }
        ]
    }
}
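
The template above contains // comments and empty placeholder values, so it is not strict JSON as shown. The following is a minimal sketch, not part of DataWorks, of assembling the same structure in Python and serializing it to valid JSON. The function name build_s3_reader_job and the data source name my_s3_source are hypothetical. The object name test/ll.txt and the setting values are taken from this topic, and the errorLimit value "0" is an assumption chosen for illustration.

```python
import json


def build_s3_reader_job(datasource, objects, columns, field_delimiter=","):
    """Return a job dict with the same layout as the Amazon S3 Reader template."""
    reader = {
        "stepType": "s3",  # plug-in name from the template
        "parameter": {
            "datasource": datasource,
            "object": list(objects),
            "column": list(columns),
            "fieldDelimiter": field_delimiter,
            "encoding": "utf-8",  # documented default
        },
        "name": "Reader",
        "category": "reader",
    }
    writer = {
        "stepType": "stream",
        "parameter": {},
        "name": "Writer",
        "category": "writer",
    }
    return {
        "type": "job",
        "version": "2.0",
        "steps": [reader, writer],
        "setting": {
            "errorLimit": {"record": "0"},  # assumption: allow no dirty records
            "speed": {"throttle": True, "concurrent": 1, "mbps": "12"},
        },
        "order": {"hops": [{"from": "Reader", "to": "Writer"}]},
    }


job = build_s3_reader_job(
    datasource="my_s3_source",  # hypothetical data source name
    objects=["test/ll.txt"],
    columns=[
        {"index": 0, "type": "string"},
        {"index": 1, "type": "long"},
    ],
)
script = json.dumps(job, indent=4)  # strict JSON, no comments
print(script)
```

Building the script from a dict avoids the missing or trailing commas that are easy to introduce when the template is edited by hand.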

Parameters in code for Amazon S3 Reader

Parameter	Description	Required	Default value
datasource	The name of the data source. It must be the same as the name of the added data source. You can add data sources by using the code editor.	Yes	No default value
object	The name of the Amazon S3 object from which you want to read data. You can specify multiple objects. For example, if a bucket contains the test folder in which the ll.txt object resides, the name of the object is test/ll.txt. If you specify a single object, Amazon S3 Reader uses a single thread to read data. If you specify multiple objects, Amazon S3 Reader uses parallel threads to read data, and the number of threads is determined by the number of channels. If you specify an object whose name contains a wildcard, Amazon S3 Reader reads data from all objects that match the name. For example, if you set this parameter to abc[0-9], Amazon S3 Reader reads data from objects abc0 to abc9. We recommend that you do not use wildcards because an out of memory (OOM) error may occur. Note: Data Integration treats all objects that are read in a synchronization task as a single table. Make sure that all objects that are read in a synchronization task use the same schema. Limit the number of objects stored in a folder. If a folder contains too many objects, an OOM error may occur. In this case, distribute the objects across different folders before you synchronize data.	Yes	No default value
column	The columns from which you want to read data. The type parameter specifies the source data type. The index parameter specifies the ID of the column in the source object, starting from 0. The value parameter specifies the column value if the column is a constant column. Amazon S3 Reader does not read a constant column from the source. Instead, it generates the constant column from the value that you specify. By default, Amazon S3 Reader reads all data as strings, which corresponds to the following format: `"column": ["*"]` To read specific columns, configure the column parameter in the following format: `"column": [ { "type": "long", "index": 0 // Reads the first column of the source object as the long type. }, { "type": "string", "value": "alibaba" // A constant column whose value is alibaba. } ]` Note: For each column, you must configure the type parameter and either the index or the value parameter.	Yes	*, which indicates that Amazon S3 Reader reads all data as strings.
fieldDelimiter	The column delimiter that is used in the Amazon S3 object from which you want to read data. Note: Amazon S3 Reader uses a column delimiter to split each row into columns. The default column delimiter is a comma (,), which is used if you do not specify a delimiter. If the delimiter is invisible, enter Unicode-encoded characters, such as \u001b and \u007c.	Yes	Comma (,).
compress	The format in which objects are compressed. By default, this parameter is left empty, which means that objects are not compressed. Amazon S3 Reader supports the following compression formats: GZIP, BZIP2, and ZIP.	No	Do Not Compress
encoding	The encoding format of the object from which you want to read data.	No	utf-8
nullFormat	The string that represents a null value. TXT objects have no standard string that represents a null value, so you can use this parameter to define one. For example, if you set the `nullFormat` parameter to `null`, Amazon S3 Reader treats the source string null as a null value.	No	No default value
skipHeader	Specifies whether to skip the headers in a CSV-like object. Valid values: true: Amazon S3 Reader skips the headers in a CSV-like object. false: Amazon S3 Reader does not skip the headers and reads them as data. Note: The skipHeader parameter is unavailable for compressed objects.	No	false
csvReaderConfig	The configurations required to read CSV-like objects. The parameter value must match the MAP type. You can use a CSV object reader to read data from CSV objects. The CSV object reader supports multiple configurations. If no configuration is performed, the default settings are used.	No	No default value
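
To make the interaction of the fieldDelimiter, column, and nullFormat parameters concrete, the following Python sketch applies them to a single delimited line. It only illustrates the semantics described in the table and is not how Amazon S3 Reader is implemented. The names read_record and CONVERTERS and the sample line are hypothetical, and the date type is left out because its format pattern does not map directly to Python.

```python
# Converters for a subset of the types used in the column parameter.
CONVERTERS = {
    "string": str,
    "long": int,
    "double": float,
    "boolean": lambda text: text.strip().lower() == "true",
}


def read_record(line, columns, field_delimiter=",", null_format=None):
    """Turn one delimited line into a list of typed values."""
    fields = line.split(field_delimiter)
    record = []
    for col in columns:
        if "value" in col:
            # Constant column: generated from the configured value,
            # not read from the source line.
            raw = col["value"]
        else:
            raw = fields[col["index"]]
            if null_format is not None and raw == null_format:
                record.append(None)  # source string matches nullFormat
                continue
        record.append(CONVERTERS[col["type"]](raw))
    return record


columns = [
    {"type": "long", "index": 0},
    {"type": "string", "index": 1},
    {"type": "double", "index": 2},
    {"type": "string", "value": "alibaba"},  # constant column
]

# The fourth source field (index 3) is not listed, so it is pruned.
record = read_record("1001,alice,null,true", columns, null_format="null")
print(record)  # [1001, 'alice', None, 'alibaba']
```

Each entry in columns carries type plus either index or value, which mirrors the rule in the column parameter description.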