The Read CSV File component allows you to read CSV files from Object Storage Service (OSS), HTTP, and Hadoop Distributed File System (HDFS) data sources. This topic describes how to configure the Read CSV File component.
Limits
If you configure the Read CSV File component in the Machine Learning Platform for AI (PAI) console, only the computing resources of MaxCompute and Realtime Compute for Apache Flink are supported.
If you configure the Read CSV File component by using the PyAlink Script component, you must call the component in code. For more information, see PyAlink Script.
Prerequisites
(Optional) PAI is authorized to access OSS. This authorization is required only if you set the fileSource parameter to OSS. For more information, see Grant the permissions that are required to use Machine Learning Designer.
Configure the Read CSV File component
You can configure the Read CSV File component by using one of the following methods:
Method 1: PAI console
The following table describes the parameters that you must configure on the Visualized Modeling (Designer) page.
| Tab | Parameter | Description |
| --- | --- | --- |
| Parameter Setting | fileSource | The source of the CSV file. Valid values: OSS and OTHERS. |
| | ossFilePath or filePath | The path of the CSV file. |
| | Schema | The data type of each column. Specify this parameter in the colname0 coltype0[, colname1 coltype1[, ...]] format. Example: f0 string,f1 bigint,f2 double. For an illustration, see the sketch after this table. Important: The data type that you configure for each column must match the data in the CSV file that you want to read. Otherwise, the system fails to read the data in that column. |
| | fieldDelimiter | The field delimiter. By default, commas (,) are used. |
| | handleInvalidMethod | The method that is used to handle invalid data of the Tensor, Vector, or MTable type if data of these types fails to be parsed. These data types are defined by the Alink algorithm framework and have a fixed parsing format. Valid values: ERROR: An error is reported if invalid data is detected. SKIP: Invalid data is skipped. |
| | ignoreFirstLine | Specifies whether to skip the first row of data. Turn on this switch if the first row of the CSV file that you want to read is a table header. |
| | lenient | The method that is used to handle an input data record whose schema is inconsistent with the value of the Schema parameter, such as a mismatch in data types or in the number of columns. Valid values: True: Records that fail to be parsed are skipped. False: An error is reported when a record fails to be parsed. |
| | quoteString | The quote character. By default, double quotation marks (") are used. |
| | rowDelimiter | The row delimiter. By default, line feeds (\n) are used. |
| | skipBlankLine | Specifies whether to skip blank rows. |
| Execution Tuning | Number of Workers | The number of nodes. The value must be a positive integer in the range of 1 to 9999. This parameter must be used together with the Memory per worker parameter. |
| | Memory per worker | The memory size of each node. Unit: MB. The value must be a positive integer in the range of 1024 to 65536. |
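To make the mapping between a CSV file and these parameters concrete, the following sketch shows a hypothetical file and the console values that read it. The file content, path, and column names are illustrative assumptions, not part of the component.

```python
# Hypothetical CSV file. The first row is a table header, so the
# ignoreFirstLine switch must be turned on when this file is read.
csv_text = '''name,age,score
"Alice",30,91.5
"Bob",25,88.0
'''

# Write the sample file to a hypothetical local path.
with open('/tmp/example.csv', 'w') as f:
    f.write(csv_text)

# Matching console parameter values:
#   Schema:          name string, age bigint, score double
#   fieldDelimiter:  ,  (the default)
#   quoteString:     "  (the default)
#   ignoreFirstLine: turned on, because the first row is a header
```

Declaring the score column as bigint instead of double would cause parsing to fail for values such as 91.5, which is the type mismatch that the Important note in the table warns about.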
Method 2: PyAlink Script component
The following table describes the parameters that you can configure when you use the PyAlink Script component to configure the Read CSV File component. For more information about the PyAlink Script component, see PyAlink Script.
| Parameter | Required | Description | Default value |
| --- | --- | --- | --- |
| schemaStr | Yes | The data types of the columns in the CSV file. Specify this parameter in the colname0 coltype0[, colname1 coltype1[, ...]] format. Example: f0 string,f1 bigint,f2 double. | None |
| filePath | No | The path of the CSV file. | None |
| fieldDelimiter | No | The field delimiter. | Comma (,) |
| handleInvalidMethod | No | The method that is used to handle invalid data of the Tensor, Vector, or MTable type if data of these types fails to be parsed. These data types are defined by the Alink algorithm framework and have a fixed parsing format. Valid values: ERROR: An error is reported if invalid data is detected. SKIP: Invalid data is skipped. | ERROR |
| ignoreFirstLine | No | Specifies whether to skip the first row of data. Set this parameter to True if the first row of the CSV file that you want to read is a table header. | False |
| lenient | No | The method that is used to handle an input data record whose schema is inconsistent with the value of the schemaStr parameter, such as a mismatch in data types or in the number of columns. If you set this parameter to True, records that fail to be parsed are skipped. If you set this parameter to False, an error is reported when a record fails to be parsed. | False |
| quoteString | No | The quote character. | Double quotation mark (") |
| rowDelimiter | No | The row delimiter. | Line feed (\n) |
| skipBlankLine | No | Specifies whether to skip blank rows. | True |
Sample PyAlink script:
from pyalink.alink import *

filePath = 'https://alink-test-data.oss-cn-hangzhou.aliyuncs.com/iris.csv'
schema = 'sepal_length double, sepal_width double, petal_length double, petal_width double, category string'

# Create a CSV source with the file path, column schema, and field delimiter.
csvSource = CsvSourceBatchOp()\
    .setFilePath(filePath)\
    .setSchemaStr(schema)\
    .setFieldDelimiter(",")

# Trigger execution and collect the result as a pandas DataFrame.
df = BatchOperator.collectToDataframe(csvSource)
print(df)
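The optional parameters in the preceding table are configured through the corresponding setters of the operator. The following sketch is an assumption-laden example: the local file path is hypothetical, it presumes a headered CSV such as the one described in Method 1, and the setter names are assumed to follow Alink's usual set&lt;ParameterName&gt; convention for the parameters listed above.

```python
from pyalink.alink import *

# Hypothetical local CSV file with a header row and possibly blank
# or malformed lines.
filePath = '/tmp/example.csv'
schema = 'name string, age bigint, score double'

# Setter names are assumed to follow Alink's set<ParameterName>
# convention for the parameters in the table above.
csvSource = CsvSourceBatchOp()\
    .setFilePath(filePath)\
    .setSchemaStr(schema)\
    .setFieldDelimiter(",")\
    .setIgnoreFirstLine(True)\
    .setSkipBlankLine(True)\
    .setLenient(True)

# Collect the parsed rows. With lenient set to True, records that
# fail to parse are skipped instead of causing an error.
print(BatchOperator.collectToDataframe(csvSource))
```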