
DataWorks:Azure Blob Storage data source

Last Updated:Jun 25, 2024

DataWorks provides Azure Blob Storage Reader to read data from files that are stored in Azure Blob Storage. You can use Azure Blob Storage Reader to access these files, parse the data that they contain, and then synchronize the data to a destination. This topic describes the capabilities of synchronizing data from Azure Blob Storage data sources.

Limits

Data type mappings

The following table describes the data types that are supported by Azure Blob Storage data sources.

| Data type | Description |
| --- | --- |
| STRING | Text. |
| LONG | Integer. |
| BYTES | Byte array. The text that is read is converted into a byte array. The encoding format is UTF-8. |
| BOOL | Boolean. |
| DOUBLE | Floating point. |
| DATE | Date and time. The following date and time formats are supported: yyyy-MM-dd HH:mm:ss, yyyy-MM-dd, and HH:mm:ss. |

Add a data source

Before you develop a synchronization task that uses an Azure Blob Storage data source, you must add the Azure Blob Storage data source to DataWorks. For information about how to add a data source, see Add and manage data sources. On the configuration tab of the data source, you can view infotips for the parameters that you must configure.

Develop a data synchronization task

Configure a batch synchronization task to synchronize data of a single table

Appendix: Code and parameters

Configure a batch synchronization task by using the code editor

If you use the code editor to configure a batch synchronization task, you must configure parameters for the reader of the related data source based on the format requirements in the code editor. For more information about the format requirements, see Configure a batch synchronization task by using the code editor. The following information describes the configuration details of parameters for the reader in the code editor.

Code for Azure Blob Storage Reader

{
  "type": "job",
  "version": "2.0",
  "steps": [
    {
      "stepType": "azureblob",
      "parameter": {
        "datasource": "",
        "object": ["f/z/1.csv"],
        "fileFormat": "csv",
        "encoding": "utf8/gbk/...",
        "fieldDelimiter": ",",
        "useMultiCharDelimiter": true,
        "lineDelimiter": "\n",
        "skipHeader": true,
        "compress": "zip/gzip",
        "column": [
          {
            "index": 0,
            "type": "long"
          },
          {
            "index": 1,
            "type": "boolean"
          },
          {
            "index": 2,
            "type": "double"
          },
          {
            "index": 3,
            "type": "string"
          },
          {
            "index": 4,
            "type": "date"
          }
        ]
      },
      "name": "Reader",
      "category": "reader"
    },
    {
      "stepType": "stream",
      "parameter": {},
      "name": "Writer",
      "category": "writer"
    }
  ],
  "setting": {
    "errorLimit": {
      "record": "0"
    },
    "speed": {
      "concurrent": 1
    }
  },
  "order": {
    "hops": [
      {
        "from": "Reader",
        "to": "Writer"
      }
    ]
  }
}

Parameters in code for Azure Blob Storage Reader

For each of the following parameters, the description is followed by whether the parameter is required and its default value.

datasource

The name of the data source. It must be identical to the name of the data source that you added. You can add data sources by using the code editor.

Required: Yes. Default value: none.

fileFormat

The format of the source file. Valid values: csv, text, parquet, and orc.

Required: Yes. Default value: none.

object

The path of the file from which you want to read data. This parameter is required only when the fileFormat parameter is set to csv or text.

Note

This parameter supports asterisks (*) as wildcards and arrays of paths. For example, if you want to synchronize data from the 1.csv and 2.csv files that are stored in the a/b path, you can set this parameter to a/b/*.csv.

Required: Yes, if the fileFormat parameter is set to csv or text. Default value: none.
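The wildcard and array forms of the object parameter can be sketched as follows. The a/b path and file names are illustrative, not values from your environment; the `//` comments follow the convention of the sample job above and must be removed from strict JSON.

```json
"object": ["a/b/*.csv"]              // Wildcard form: matches every .csv file in the a/b path.

"object": ["a/b/1.csv", "a/b/2.csv"] // Array form: lists each file explicitly.
```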

path

The path of the file from which you want to read data. This parameter is required only when the fileFormat parameter is set to parquet or orc.

Note

This parameter supports asterisks (*) as wildcards and arrays of paths. For example, if you want to synchronize data from the 1.orc and 2.orc files that are stored in the a/b path, you can set this parameter to a/b/*.orc.

Required: Yes, if the fileFormat parameter is set to parquet or orc. Default value: none.

column

The columns from which you want to read data. The type parameter specifies the data type of the source column. The index parameter specifies the ID of the column in the source file. Column IDs start from 0. The value parameter specifies the value of a constant column, which is generated automatically instead of being read from the source.

By default, the reader reads all data as strings based on the following configuration:

"column": ["*"]

You can also configure the column parameter in the following format:

"column": [
    {
        "type": "long",
        "index": 0 // The first column of the file is read. The column is of the LONG type.
    },
    {
        "type": "string",
        "value": "alibaba" // A constant column of the STRING type is generated by Azure Blob Storage Reader. The constant value of the column is alibaba.
    }
]
Note

For the column parameter, you must configure the type parameter and either the index or value parameter.

Required: Yes. Default value: "column": ["*"]

fieldDelimiter

The column delimiter that is used in the file from which you want to read data.

Note
  • A column delimiter must be defined for Azure Blob Storage Reader. If you do not specify one, the default column delimiter, a comma (,), is used.

  • If the delimiter is non-printable, enter a value encoded in Unicode, such as \u001b or \u007c.

Required: Yes. Default value: , (comma)
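For example, if the source file uses the non-printable ASCII ESC character as its column delimiter, you can specify it as a Unicode escape. This is a sketch; whether your file uses this delimiter depends on how it was produced.

```json
"fieldDelimiter": "\u001b" // ASCII ESC (0x1B), written as a Unicode escape.
```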

lineDelimiter

The row delimiter that is used in the file from which you want to read data.

Note

This parameter takes effect only when the fileFormat parameter is set to text.

Required: No. Default value: none.

compress

The format in which files are compressed. By default, this parameter is left empty, which indicates that files are not compressed. Supported compression formats: GZIP, BZIP2, and ZIP.

Required: No. Default value: none (no compression).

encoding

The encoding format of the file from which you want to read data.

Required: No. Default value: utf-8.

nullFormat

The string that represents a null value. No standard string represents a null value in text files, so you can use this parameter to define one.

  • If you specify nullFormat:"null", the reader treats the printable string null as a null value.

  • If you specify nullFormat:"\u0001", the reader treats the non-printable string \u0001 as a null value.

  • If you do not configure the nullFormat parameter, the reader does not convert any source data to null.

Required: No. Default value: none.
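As a sketch based on the sample job above, the following fragment makes the reader treat the literal string NULL as a null value. The NULL marker is an assumed convention of the source file, not a DataWorks default, and the `//` comment must be removed from strict JSON.

```json
"parameter": {
  "fileFormat": "csv",
  "fieldDelimiter": ",",
  "nullFormat": "NULL" // Cells that contain the string NULL are read as null values.
}
```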

skipHeader

Specifies whether to skip the header row of a CSV file. Valid values:

  • true: The reader skips the header row.

  • false: The reader reads the header row as data.

Note

The skipHeader parameter is unavailable for compressed files.

Required: No. Default value: false.

parquetSchema

The schema of the Parquet files that you want to read. If you set the fileFormat parameter to parquet, you must configure the parquetSchema parameter. Make sure that the entire script complies with the JSON syntax.

message MessageTypeName {
required dataType columnName;
......;
}

The parquetSchema parameter contains the following fields:

  • MessageTypeName: the name of the message type.

  • required: indicates that the column cannot be left empty. You can also specify optional based on your business requirements. We recommend that you specify optional for all columns.

  • dataType: Parquet files support data types such as BOOLEAN, INT32, INT64, INT96, FLOAT, DOUBLE, BINARY, and FIXED_LEN_BYTE_ARRAY. Set this field to BINARY if the column stores strings.

  • Each line, including the last one, must end with a semicolon (;).

Configuration example:

"parquetSchema": "message m { optional int32 minute_id; optional int32 dsp_id; optional int32 adx_pid; optional int64 req; optional int64 res; optional int64 suc; optional int64 imp; optional double revenue; }"

Required: No. Default value: none.

csvReaderConfig

The configurations required to read CSV files. The parameter value must be of the MAP type. A CSV file reader is used to read data from CSV files. If you do not configure this parameter, default settings are used.

Required: No. Default value: none.

maxRetryTimes

The maximum number of retries that are allowed if a file fails to be downloaded.

Note
  • You can set this parameter to 0 to disable the download retry feature.

  • This is an advanced parameter and is available only in the code editor.

Required: No. Default value: 0.

retryIntervalSeconds

The interval between two consecutive retries if a file fails to be downloaded. Unit: seconds.

Note

This is an advanced parameter and is available only in the code editor.

Required: No. Default value: 5.
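Putting the two retry parameters together, the following hedged fragment retries each failed download up to three times, waiting 10 seconds between attempts. Both values are illustrative, not recommendations, and the `//` comments must be removed from strict JSON.

```json
"parameter": {
  "maxRetryTimes": 3,        // Retry a failed download up to three times.
  "retryIntervalSeconds": 10 // Wait 10 seconds between consecutive retries.
}
```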