
ApsaraDB for SelectDB: File analysis

Last Updated: Jun 12, 2024

ApsaraDB for SelectDB provides table-valued functions (TVFs), which allow you to map file data in common remote storage such as Amazon Simple Storage Service (Amazon S3) and Hadoop Distributed File System (HDFS) to tables in ApsaraDB for SelectDB. This way, you can analyze the file data with ease.

Amazon S3 TVF

The Amazon S3 TVF, s3(), allows you to read and access the content of files in object storage systems that are compatible with Amazon S3 in the same way as you query a table in a relational database. The CSV, CSV_with_names, CSV_with_names_and_types, JSON, Parquet, and ORC file formats are supported.

Syntax

s3(
  "uri" = "..",
  "s3.access_key" = "...",
  "s3.secret_key" = "...",
  "s3.region" = "...",
  "format" = "csv",
  "keyn" = "valuen",
  ...
);

Parameters

Each parameter in the Amazon S3 TVF is a key-value pair in the "key" = "value" format. The following table describes the common parameters.

Parameter

Required

Valid value

Description

uri

Yes

String

The uniform resource identifier (URI) for accessing Amazon S3.

s3.access_key

Yes

String

The access key ID that is used to access Amazon S3.

s3.secret_key

Yes

String

The secret access key that is used to access Amazon S3.

s3.region

Yes

String

The region of Amazon S3. Default value: us-east-1.

s3.session_token

No

Note

This parameter is required if temporary session authentication is enabled.

String

The temporary session token that is used to access Amazon S3.

use_path_style

No

  • true

  • false

Specifies whether to use the path style to access Amazon S3. By default, Amazon S3 SDKs use the virtual-hosted style. However, the virtual-hosted style may not be enabled for or supported by some object storage systems. In this case, you can specify the use_path_style parameter to forcibly use the path style. For example, MinIO allows only the path style by default. To access MinIO, set the use_path_style parameter to true.

Default value: false.

Note

The following URI schemas are supported: http://, https://, and s3://.

  • If http:// or https:// is used, the system determines whether to use the path style to access Amazon S3 based on the value of the use_path_style parameter.

  • If s3:// is used, the virtual-hosted style is used to access Amazon S3.

  • If the specified URI does not exist or the matched files are all empty, the Amazon S3 TVF returns an empty result set.

format

Yes

  • csv

  • csv_with_names

  • csv_with_names_and_types

  • json

  • parquet

  • orc

The format of the file to be accessed in Amazon S3.

column_separator

No

String

The column delimiter. Default value: ,.

line_delimiter

No

String

The row delimiter. Default value: \n.

compress_type

No

  • unknown

  • plain

  • gz

  • lzo

  • bz2

  • lz4frame

  • deflate

The compression type of the file. Default value: unknown, which indicates that the compression type is automatically inferred based on the suffix of the URI.

read_json_by_line

No

  • true

  • false

Specifies whether to read JSON-formatted data by row. Default value: true.

num_as_string

No

  • true

  • false

Specifies whether to process numbers as strings. Default value: false.

fuzzy_parse

No

  • true

  • false

Specifies whether to enable fuzzy parsing to accelerate the import of JSON-formatted data. Default value: false.

jsonpaths

No

String

The fields to be extracted from JSON-formatted data.

Format: jsonpaths: [\"$.k2\", \"$.k1\"].

strip_outer_array

No

  • true

  • false

Specifies whether to present the JSON-formatted data as an array, in which each element is considered a row of data. Default value: false.

Format: strip_outer_array: true.

json_root

No

String

The root node of JSON-formatted data. ApsaraDB for SelectDB extracts and parses the elements of the root node that is specified by the json_root parameter. By default, this parameter is left empty.

Format: json_root: $.RECORDS.

path_partition_keys

No

String

The names of the partition key columns that are carried in the specified file path. For example, if the file path is /path/to/city=beijing/date="2023-07-09", set this parameter to city,date. In this case, ApsaraDB for SelectDB automatically reads the corresponding column names and column values from the path during data import.

Examples

  1. Read and access a CSV file in an object storage system compatible with Amazon S3.

    SELECT * FROM s3("uri" = "http://127.0.0.1:9312/test2/student1.csv",
                    "s3.access_key"= "minioadmin",
                    "s3.secret_key" = "minioadmin",
                    "format" = "csv",
                    "use_path_style" = "true") ORDER BY c1;  

    You can use this TVF together with the DESC FUNCTION statement to query the file schema.

    MySQL [(none)]> DESC FUNCTION s3("uri" = "http://127.0.0.1:9312/test2/student1.csv",
                     "s3.access_key"= "minioadmin",
                     "s3.secret_key" = "minioadmin",
                     "format" = "csv",
                     "use_path_style" = "true");
  2. You must use the virtual-hosted style to access Object Storage Service (OSS).

    SELECT * FROM s3(
        "uri" = "http://example-bucket.oss-cn-beijing.aliyuncs.com/your-folder/file.parquet",
        "s3.access_key"= "ak",
        "s3.secret_key" = "sk",
        "format" = "parquet",
        "use_path_style" = "false");
  3. If the use_path_style parameter is set to true, the path style is used to access Amazon S3.

    SELECT * FROM s3(
        "uri" = "https://endpoint/bucket/file/student.csv",
        "s3.access_key"= "ak",
        "s3.secret_key" = "sk",
        "format" = "csv",
        "use_path_style"="true");
  4. If the use_path_style parameter is set to false, the virtual-hosted style is used to access Amazon S3.

    SELECT * FROM s3(
        "uri" = "https://bucket.endpoint/bucket/file/student.csv",
        "s3.access_key"= "ak",
        "s3.secret_key" = "sk",
        "format" = "csv",
        "use_path_style"="false");

HDFS TVF

The HDFS TVF, hdfs(), allows you to read and access the content of files in HDFS in the same way as you query a table in a relational database. The CSV, CSV_with_names, CSV_with_names_and_types, JSON, Parquet, and ORC file formats are supported.

Syntax

hdfs(
  "uri" = "..",
  "fs.defaultFS" = "...",
  "hadoop.username" = "...",
  "format" = "csv",
  "keyn" = "valuen" 
  ...
);

Parameters

Each parameter in the HDFS TVF is a key-value pair in the "key" = "value" format. The following table describes the common parameters.

Parameter

Required

Valid value

Description

uri

Yes

String

The URI for accessing HDFS. If the specified URI does not exist or the matched files are all empty, the HDFS TVF returns an empty result set.

fs.defaultFS

Yes

String

The hostname and port number of HDFS.

hadoop.username

Yes

String

The username that is used to access HDFS. The value cannot be an empty string.

hadoop.security.authentication

No

  • Simple

  • Kerberos

The authentication method for HDFS. Valid values: Simple and Kerberos.

hadoop.kerberos.principal

No

String

The Kerberos principal if Kerberos authentication is enabled for HDFS.

hadoop.kerberos.keytab

No

String

The path of the Kerberos keytab file if Kerberos authentication is enabled for HDFS.

dfs.client.read.shortcircuit

No

  • true

  • false

Specifies whether to enable HDFS short-circuit local reads. The value is of the BOOLEAN type.

dfs.domain.socket.path

No

String

The path that points to a UNIX domain socket for the communication between the DataNode and the local HDFS client. If you specify the string "_PORT" in the path, the string is replaced by the TCP port of the DataNode.

dfs.nameservices

No

String

The logical names of the nameservices that provide services. This parameter corresponds to the dfs.nameservices field in the hdfs-site.xml file.

dfs.ha.namenodes.your-nameservices

No

Note

This parameter is required if Hadoop High Availability (HA) deployment is used.

String

The logical names of the NameNodes.

dfs.namenode.rpc-address.your-nameservices.your-namenode

No

Note

This parameter is required if Hadoop HA deployment is used.

String

The RPC address to which the NameNode listens.

dfs.client.failover.proxy.provider.your-nameservices

No

Note

This parameter is required if Hadoop HA deployment is used.

String

The implementation class of the failover proxy provider for client connections to the NameNode in the available state.

read_json_by_line

No

  • true

  • false

Specifies whether to read JSON-formatted data by row. Default value: true.

num_as_string

No

  • true

  • false

Specifies whether to process numbers as strings. Default value: false.

fuzzy_parse

No

  • true

  • false

Specifies whether to enable fuzzy parsing to accelerate the import of JSON-formatted data. Default value: false.

jsonpaths

No

String

The fields to be extracted from JSON-formatted data.

Format: jsonpaths: [\"$.k2\", \"$.k1\"].

strip_outer_array

No

  • true

  • false

Specifies whether to present the JSON-formatted data as an array, in which each element is considered a row of data. Default value: false.

Format: strip_outer_array: true.

json_root

No

String

The root node of JSON-formatted data. ApsaraDB for SelectDB extracts and parses the elements of the root node that is specified by the json_root parameter. By default, this parameter is left empty.

Format: json_root: $.RECORDS.

trim_double_quotes

No

  • true

  • false

Specifies whether to trim the outermost double quotation marks (") of each field in the CSV file. Default value: false.

skip_lines

No

[0-Integer.MaxValue]

The value is of the INTEGER type. Default value: 0. The number of rows to skip at the beginning of the CSV file. This parameter is ignored if the format parameter is set to csv_with_names or csv_with_names_and_types.

path_partition_keys

No

String

The names of the partition key columns that are carried in the specified file path. For example, if the file path is /path/to/city=beijing/date="2023-07-09", set this parameter to city,date. In this case, ApsaraDB for SelectDB automatically reads the corresponding column names and column values from the path during data import.

Examples

Read and access a CSV file in HDFS.

MySQL [(none)]> SELECT * FROM hdfs(
            "uri" = "hdfs://127.0.0.1:842/user/doris/csv_format_test/student.csv",
            "fs.defaultFS" = "hdfs://127.0.0.1:8424",
            "hadoop.username" = "doris",
            "format" = "csv");
            
-- Sample response   
+------+---------+------+
| c1   | c2      | c3   |
+------+---------+------+
| 1    | alice   | 18   |
| 2    | bob     | 20   |
| 3    | jack    | 24   |
| 4    | jackson | 19   |
| 5    | liming  | 18   |
+------+---------+------+

Read and access a CSV file in HDFS in high availability (HA) mode.

MySQL [(none)]> SELECT * FROM hdfs(
            "uri" = "hdfs://127.0.0.1:842/user/doris/csv_format_test/student.csv",
            "fs.defaultFS" = "hdfs://127.0.0.1:8424",
            "hadoop.username" = "doris",
            "format" = "csv",
            "dfs.nameservices" = "my_hdfs",
            "dfs.ha.namenodes.my_hdfs" = "nn1,nn2",
            "dfs.namenode.rpc-address.my_hdfs.nn1" = "nanmenode01:8020",
            "dfs.namenode.rpc-address.my_hdfs.nn2" = "nanmenode02:8020",
            "dfs.client.failover.proxy.provider.my_hdfs" = "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider");

-- Sample response
+------+---------+------+
| c1   | c2      | c3   |
+------+---------+------+
| 1    | alice   | 18   |
| 2    | bob     | 20   |
| 3    | jack    | 24   |
| 4    | jackson | 19   |
| 5    | liming  | 18   |
+------+---------+------+

You can use this TVF together with the DESC FUNCTION statement to query the file schema.

MySQL [(none)]> DESC FUNCTION hdfs(
            "uri" = "hdfs://127.0.0.1:8424/user/doris/csv_format_test/student_with_names.csv",
            "fs.defaultFS" = "hdfs://127.0.0.1:8424",
            "hadoop.username" = "doris",
            "format" = "csv_with_names");

Usage

This section describes how to use a TVF to analyze file data. In this example, the Amazon S3 TVF is used.

Automatically infer the column types of a file

ApsaraDB for SelectDB can analyze and infer the column types of files in the Parquet, ORC, CSV, and JSON formats.

> DESC FUNCTION s3 (
    "URI" = "http://127.0.0.1:9312/test2/test.snappy.parquet",
    "s3.access_key"= "ak",
    "s3.secret_key" = "sk",
    "format" = "parquet",
    "use_path_style"="true"
);

-- Sample response
+---------------+--------------+------+-------+---------+-------+
| Field         | Type         | Null | Key   | Default | Extra |
+---------------+--------------+------+-------+---------+-------+
| p_partkey     | INT          | Yes  | false | NULL    | NONE  |
| p_name        | TEXT         | Yes  | false | NULL    | NONE  |
| p_mfgr        | TEXT         | Yes  | false | NULL    | NONE  |
| p_brand       | TEXT         | Yes  | false | NULL    | NONE  |
| p_type        | TEXT         | Yes  | false | NULL    | NONE  |
| p_size        | INT          | Yes  | false | NULL    | NONE  |
| p_container   | TEXT         | Yes  | false | NULL    | NONE  |
| p_retailprice | DECIMAL(9,0) | Yes  | false | NULL    | NONE  |
| p_comment     | TEXT         | Yes  | false | NULL    | NONE  |
+---------------+--------------+------+-------+---------+-------+

The sample response shows that ApsaraDB for SelectDB can automatically infer the column types of a Parquet file based on the metadata of the file.

Read the schema of a CSV file

By default, ApsaraDB for SelectDB processes the types of all columns in CSV files as STRING. You can use the csv_schema parameter to separately specify the column names and column types of a CSV file. ApsaraDB for SelectDB reads the schema of the CSV file based on the specified column types. Format:

name1:type1;name2:type2;...

ApsaraDB for SelectDB returns NULL for columns whose specified type does not match the actual data or that are missing from the file. For example, NULL is returned if a column contains STRING values but you specify its type as INT, or if you specify five columns for a file that contains only four columns. The following table describes the column types that are supported.

+---------------+-----------------+
| Column type   | Mapping type    |
+---------------+-----------------+
| tinyint       | tinyint         |
| smallint      | smallint        |
| int           | int             |
| bigint        | bigint          |
| largeint      | largeint        |
| float         | float           |
| double        | double          |
| decimal(p,s)  | decimalv3(p,s)  |
| date          | datev2          |
| datetime      | datetimev2      |
| char          | string          |
| varchar       | string          |
| string        | string          |
| boolean       | boolean         |
+---------------+-----------------+

Example:

s3 (
    "URI" = "https://bucket1/inventory.dat",
    "s3.access_key"= "ak",
    "s3.secret_key" = "sk",
    "format" = "csv",
    "column_separator" = "|",
    "csv_schema" = "k1:int;k2:int;k3:int;k4:decimal(38,10)",
    "use_path_style"="true"
)
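
The fragment can then be used wherever a table is expected. For example, the following query is a sketch that reuses the same hypothetical file and column names:

SELECT k1, k4
FROM s3 (
    "uri" = "https://bucket1/inventory.dat",
    "s3.access_key"= "ak",
    "s3.secret_key" = "sk",
    "format" = "csv",
    "column_separator" = "|",
    "csv_schema" = "k1:int;k2:int;k3:int;k4:decimal(38,10)",
    "use_path_style" = "true"
)
WHERE k2 > 10;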

Query and analyze data

You can use the TVF together with an SQL statement to query and analyze a file.

SELECT * FROM s3(
    "URI" = "http://127.0.0.1:9312/test2/test.snappy.parquet",
    "s3.access_key"= "ak",
    "s3.secret_key" = "sk",
    "format" = "parquet",
    "use_path_style"="true")
LIMIT 5;

-- Sample response
+-----------+------------------------------------------+----------------+----------+-------------------------+--------+-------------+---------------+---------------------+
| p_partkey | p_name                                   | p_mfgr         | p_brand  | p_type                  | p_size | p_container | p_retailprice | p_comment           |
+-----------+------------------------------------------+----------------+----------+-------------------------+--------+-------------+---------------+---------------------+
|         1 | goldenrod lavender spring chocolate lace | Manufacturer#1 | Brand#13 | PROMO BURNISHED COPPER  |      7 | JUMBO PKG   |           901 | ly. slyly ironi     |
|         2 | blush thistle blue yellow saddle         | Manufacturer#1 | Brand#13 | LARGE BRUSHED BRASS     |      1 | LG CASE     |           902 | lar accounts amo    |
|         3 | spring green yellow purple cornsilk      | Manufacturer#4 | Brand#42 | STANDARD POLISHED BRASS |     21 | WRAP CASE   |           903 | egular deposits hag |
|         4 | cornflower chocolate smoke green pink    | Manufacturer#3 | Brand#34 | SMALL PLATED BRASS      |     14 | MED DRUM    |           904 | p furiously r       |
|         5 | forest brown coral puff cream            | Manufacturer#3 | Brand#32 | STANDARD POLISHED TIN   |     15 | SM PKG      |           905 |  wake carefully     |
+-----------+------------------------------------------+----------------+----------+-------------------------+--------+-------------+---------------+---------------------+

You can place the TVF anywhere that you place a table in the SQL statement, such as in the WITH or FROM clause of a common table expression (CTE). You can analyze the file as a standard table.
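
For example, the following statement is a minimal sketch that reuses the Parquet file from the preceding examples: it references the TVF in the WITH clause of a CTE and aggregates over the result as if it were a regular table.

WITH parts AS (
    SELECT * FROM s3(
        "uri" = "http://127.0.0.1:9312/test2/test.snappy.parquet",
        "s3.access_key"= "ak",
        "s3.secret_key" = "sk",
        "format" = "parquet",
        "use_path_style" = "true")
)
SELECT p_brand, count(*) AS part_count
FROM parts
GROUP BY p_brand;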

You can also create a logical view for the TVF by using the CREATE VIEW statement. This way, you can access the TVF and manage its permissions in the same way as you manage other views, and grant other users access to the TVF.

CREATE VIEW v1 AS 
SELECT * FROM s3(
    "URI" = "http://127.0.0.1:9312/test2/test.snappy.parquet",
    "s3.access_key"= "ak",
    "s3.secret_key" = "sk",
    "format" = "parquet",
    "use_path_style"="true");

DESC v1;

SELECT * FROM v1;

GRANT SELECT_PRIV ON db1.v1 TO user1;

Import data

You can use the TVF together with the INSERT INTO SELECT statement to import the data of a file to a table in ApsaraDB for SelectDB for faster analysis.

-- 1. Create an internal table in ApsaraDB for SelectDB.
CREATE TABLE IF NOT EXISTS test_table
(
    id int,
    name varchar(50),
    age int
)
DISTRIBUTED BY HASH(id) BUCKETS 4
PROPERTIES("replication_num" = "1");

-- 2. Insert data by using the Amazon S3 TVF.
INSERT INTO test_table (id,name,age)
SELECT cast(id as INT) as id, name, cast (age as INT) as age
FROM s3(
    "uri" = "http://127.0.0.1:9312/test2/test.snappy.parquet",
    "s3.access_key"= "ak",
    "s3.secret_key" = "sk",
    "format" = "parquet",
    "use_path_style" = "true");

Usage notes

  • If no file is matched for the URI specified by the Amazon S3 or HDFS TVF or all the matched files are empty, the Amazon S3 or HDFS TVF returns an empty result set. In this case, if you execute the DESC FUNCTION statement to query the schema of the file specified by the TVF, a dummy column __dummy_col is returned, which can be ignored. See the sketch after this list.

  • If the file format specified by the TVF is CSV and the read file is not empty but the first row of the file is empty, the "The first line is empty, can not parse column numbers" error message is reported.
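
The following sketch illustrates the first note. The URI is hypothetical and intentionally matches no file, so the DESC FUNCTION statement returns only the dummy column __dummy_col, which can be ignored.

DESC FUNCTION s3(
    "uri" = "http://127.0.0.1:9312/test2/no_such_file_*.csv",
    "s3.access_key" = "ak",
    "s3.secret_key" = "sk",
    "format" = "csv",
    "use_path_style" = "true");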