
ApsaraDB for SelectDB: File analysis

Last Updated: Jun 12, 2024

ApsaraDB for SelectDB provides table-valued functions (TVFs), which allow you to map file data in common remote storage such as Amazon Simple Storage Service (Amazon S3) and Hadoop Distributed File System (HDFS) to tables in ApsaraDB for SelectDB. This way, you can analyze the file data with ease.

Amazon S3 TVF

The Amazon S3 TVF, s3(), allows you to read and access the content of files in object storage systems that are compatible with Amazon S3 in the same way as you query a table in a relational database. The CSV, CSV_with_names, CSV_with_names_and_types, JSON, Parquet, and ORC file formats are supported.

Syntax

s3(
  "uri" = "..",
  "s3.access_key" = "...",
  "s3.secret_key" = "...",
  "s3.region" = "...",
  "format" = "csv",
  "keyn" = "valuen",
  ...
);

Parameters

Each parameter in the Amazon S3 TVF is a key-value pair in the "key" = "value" format. The following table describes the common parameters.

Parameter

Required

Valid value

Description

uri

Yes

String

The uniform resource identifier (URI) for accessing Amazon S3.

s3.access_key

Yes

String

The access key ID that is used to access Amazon S3.

s3.secret_key

Yes

String

The secret access key that is used to access Amazon S3.

s3.region

Yes

String

The region of Amazon S3. Default value: us-east-1.

s3.session_token

No

Note

This parameter is required if temporary session authentication is enabled.

String

The temporary session token that is used to access Amazon S3.

use_path_style

No

  • true

  • false

Specifies whether to use the path style to access Amazon S3. By default, Amazon S3 SDKs use the virtual-hosted style. However, the virtual-hosted style may not be enabled for or supported by some object storage systems. In this case, you can specify the use_path_style parameter to forcibly use the path style. For example, MinIO allows only the path style by default. To access MinIO, set the use_path_style parameter to true.

Default value: false.

Note

The following URI schemas are supported: http://, https://, and s3://.

  • If http:// or https:// is used, the system determines whether to use the path style to access Amazon S3 based on the value of the use_path_style parameter.

  • If s3:// is used, the virtual-hosted style is used to access Amazon S3.

  • If the specified URI does not exist or the matched files are all empty, the Amazon S3 TVF returns an empty result set.

format

Yes

  • csv

  • csv_with_names

  • csv_with_names_and_types

  • json

  • parquet

  • orc

The format of the file to be accessed in Amazon S3.

column_separator

No

String

The column delimiter. Default value: ,.

line_delimiter

No

String

The row delimiter. Default value: \n.

compress_type

No

  • unknown

  • plain

  • gz

  • lzo

  • bz2

  • lz4frame

  • deflate

The compression type of the file. Default value: unknown, which indicates that the compression type is automatically inferred based on the suffix of the URI.

read_json_by_line

No

  • true

  • false

Specifies whether to read JSON-formatted data by row. Default value: true.

num_as_string

No

  • true

  • false

Specifies whether to process numbers as strings. Default value: false.

fuzzy_parse

No

  • true

  • false

Specifies whether to enable fuzzy parsing to accelerate the import of JSON-formatted data. Default value: false.

jsonpaths

No

String

The fields to be extracted from JSON-formatted data.

Format: jsonpaths: [\"$.k2\", \"$.k1\"].

strip_outer_array

No

  • true

  • false

Specifies whether to present the JSON-formatted data as an array, in which each element is considered a row of data. Default value: false.

Format: strip_outer_array: true.

json_root

No

String

The root node of JSON-formatted data. ApsaraDB for SelectDB extracts and parses the elements of the root node that is specified by the json_root parameter. By default, this parameter is left empty.

Format: json_root: $.RECORDS.

path_partition_keys

No

String

The names of the partition key columns that are carried in the specified file path. For example, if the file path is /path/to/city=beijing/date="2023-07-09", set this parameter to city,date. In this case, ApsaraDB for SelectDB automatically reads the corresponding column names and column values from the path during data import.

Examples

  1. Read and access a CSV file in an object storage system compatible with Amazon S3.

    SELECT * FROM s3("uri" = "http://127.0.0.1:9312/test2/student1.csv",
                    "s3.access_key"= "minioadmin",
                    "s3.secret_key" = "minioadmin",
                    "format" = "csv",
                    "use_path_style" = "true") ORDER BY c1;  

    You can use this TVF together with the DESC FUNCTION statement to query the file schema.

    MySQL [(none)]> DESC FUNCTION s3("uri" = "http://127.0.0.1:9312/test2/student1.csv",
                     "s3.access_key"= "minioadmin",
                     "s3.secret_key" = "minioadmin",
                     "format" = "csv",
                     "use_path_style" = "true");
  2. You must use the virtual-hosted style to access Object Storage Service (OSS).

    SELECT * FROM s3(
        "uri" = "http://example-bucket.oss-cn-beijing.aliyuncs.com/your-folder/file.parquet",
        "s3.access_key"= "ak",
        "s3.secret_key" = "sk",
        "format" = "parquet",
        "use_path_style" = "false");
  3. If the use_path_style parameter is set to true, the path style is used to access Amazon S3.

    SELECT * FROM s3(
        "uri" = "https://endpoint/bucket/file/student.csv",
        "s3.access_key"= "ak",
        "s3.secret_key" = "sk",
        "format" = "csv",
        "use_path_style"="true");
  4. If the use_path_style parameter is set to false, the virtual-hosted style is used to access Amazon S3.

    SELECT * FROM s3(
        "uri" = "https://bucket.endpoint/bucket/file/student.csv",
        "s3.access_key"= "ak",
        "s3.secret_key" = "sk",
        "format" = "csv",
        "use_path_style"="false");

HDFS TVF

The HDFS TVF, hdfs(), allows you to read and access the content of files in HDFS in the same way as you query a table in a relational database. The CSV, CSV_with_names, CSV_with_names_and_types, JSON, Parquet, and ORC file formats are supported.

Syntax

hdfs(
  "uri" = "..",
  "fs.defaultFS" = "...",
  "hadoop.username" = "...",
  "format" = "csv",
  "keyn" = "valuen" 
  ...
);

Parameters

Each parameter in the HDFS TVF is a key-value pair in the "key" = "value" format. The following table describes the common parameters.

Parameter

Required

Valid value

Description

uri

Yes

String

The URI for accessing HDFS. If the specified URI does not exist or the matched files are all empty, the HDFS TVF returns an empty result set.

fs.defaultFS

Yes

String

The hostname and port number of HDFS.

hadoop.username

Yes

String

The username that is used to access HDFS. The value cannot be an empty string.

hadoop.security.authentication

No

  • Simple

  • Kerberos

The authentication method for HDFS. Valid values: Simple and Kerberos.

hadoop.kerberos.principal

No

String

The Kerberos principal if Kerberos authentication is enabled for HDFS.

hadoop.kerberos.keytab

No

String

The path of the Kerberos keytab file if Kerberos authentication is enabled for HDFS.

dfs.client.read.shortcircuit

No

  • true

  • false

Specifies whether to enable HDFS short-circuit local reads. The value is of the BOOLEAN type.

dfs.domain.socket.path

No

String

The path that points to a UNIX domain socket for the communication between the DataNode and the local HDFS client. If you specify the string "_PORT" in the path, the string is replaced by the TCP port of the DataNode.

dfs.nameservices

No

String

The logical names of the nameservices that provide services. This parameter corresponds to the dfs.nameservices field in the hdfs-site.xml file.

dfs.ha.namenodes.your-nameservices

No

Note

This parameter is required if Hadoop High Availability (HA) deployment is used.

String

The logical names of the NameNodes.

dfs.namenode.rpc-address.your-nameservices.your-namenode

No

Note

This parameter is required if Hadoop HA deployment is used.

String

The RPC address to which the NameNode listens.

dfs.client.failover.proxy.provider.your-nameservices

No

Note

This parameter is required if Hadoop HA deployment is used.

String

The implementation class of the failover proxy provider for client connections to the NameNode in the available state.

read_json_by_line

No

  • true

  • false

Specifies whether to read JSON-formatted data by row. Default value: true.

num_as_string

No

  • true

  • false

Specifies whether to process numbers as strings. Default value: false.

fuzzy_parse

No

  • true

  • false

Specifies whether to enable fuzzy parsing to accelerate the import of JSON-formatted data. Default value: false.

jsonpaths

No

String

The fields to be extracted from JSON-formatted data.

Format: jsonpaths: [\"$.k2\", \"$.k1\"].

strip_outer_array

No

  • true

  • false

Specifies whether to present the JSON-formatted data as an array, in which each element is considered a row of data. Default value: false.

Format: strip_outer_array: true.

json_root

No

String

The root node of JSON-formatted data. ApsaraDB for SelectDB extracts and parses the elements of the root node that is specified by the json_root parameter. By default, this parameter is left empty.

Format: json_root: $.RECORDS.

trim_double_quotes

No

  • true

  • false

Specifies whether to trim the outermost double quotation marks (") of each field in the CSV file. Default value: false.

skip_lines

No

[0-Integer.MaxValue]

The value is of the INTEGER type. Default value: 0. The number of rows to skip at the beginning of the CSV file. This parameter is ignored if the format parameter is set to csv_with_names or csv_with_names_and_types.

path_partition_keys

No

String

The names of the partition key columns that are carried in the specified file path. For example, if the file path is /path/to/city=beijing/date="2023-07-09", set this parameter to city,date. In this case, ApsaraDB for SelectDB automatically reads the corresponding column names and column values from the path during data import.

Examples

Read and access a CSV file in HDFS.

MySQL [(none)]> SELECT * FROM hdfs(
            "uri" = "hdfs://127.0.0.1:842/user/doris/csv_format_test/student.csv",
            "fs.defaultFS" = "hdfs://127.0.0.1:8424",
            "hadoop.username" = "doris",
            "format" = "csv");
            
-- Sample response   
+------+---------+------+
| c1   | c2      | c3   |
+------+---------+------+
| 1    | alice   | 18   |
| 2    | bob     | 20   |
| 3    | jack    | 24   |
| 4    | jackson | 19   |
| 5    | liming  | 18   |
+------+---------+------+

Read and access a CSV file in HDFS in high availability (HA) mode.

MySQL [(none)]> SELECT * FROM hdfs(
            "uri" = "hdfs://127.0.0.1:842/user/doris/csv_format_test/student.csv",
            "fs.defaultFS" = "hdfs://127.0.0.1:8424",
            "hadoop.username" = "doris",
            "format" = "csv",
            "dfs.nameservices" = "my_hdfs",
            "dfs.ha.namenodes.my_hdfs" = "nn1,nn2",
            "dfs.namenode.rpc-address.my_hdfs.nn1" = "nanmenode01:8020",
            "dfs.namenode.rpc-address.my_hdfs.nn2" = "nanmenode02:8020",
            "dfs.client.failover.proxy.provider.my_hdfs" = "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider");

-- Sample response
+------+---------+------+
| c1   | c2      | c3   |
+------+---------+------+
| 1    | alice   | 18   |
| 2    | bob     | 20   |
| 3    | jack    | 24   |
| 4    | jackson | 19   |
| 5    | liming  | 18   |
+------+---------+------+

You can use this TVF together with the DESC FUNCTION statement to query the file schema.

MySQL [(none)]> DESC FUNCTION hdfs(
            "uri" = "hdfs://127.0.0.1:8424/user/doris/csv_format_test/student_with_names.csv",
            "fs.defaultFS" = "hdfs://127.0.0.1:8424",
            "hadoop.username" = "doris",
            "format" = "csv_with_names");

Usage

This section describes how to use a TVF to analyze file data. In this example, the Amazon S3 TVF is used.

Automatically infer the column types of a file

ApsaraDB for SelectDB can analyze and infer the column types of files in the Parquet, ORC, CSV, and JSON formats.

> DESC FUNCTION s3 (
    "URI" = "http://127.0.0.1:9312/test2/test.snappy.parquet",
    "s3.access_key"= "ak",
    "s3.secret_key" = "sk",
    "format" = "parquet",
    "use_path_style"="true"
);

-- Sample response
+---------------+--------------+------+-------+---------+-------+
| Field         | Type         | Null | Key   | Default | Extra |
+---------------+--------------+------+-------+---------+-------+
| p_partkey     | INT          | Yes  | false | NULL    | NONE  |
| p_name        | TEXT         | Yes  | false | NULL    | NONE  |
| p_mfgr        | TEXT         | Yes  | false | NULL    | NONE  |
| p_brand       | TEXT         | Yes  | false | NULL    | NONE  |
| p_type        | TEXT         | Yes  | false | NULL    | NONE  |
| p_size        | INT          | Yes  | false | NULL    | NONE  |
| p_container   | TEXT         | Yes  | false | NULL    | NONE  |
| p_retailprice | DECIMAL(9,0) | Yes  | false | NULL    | NONE  |
| p_comment     | TEXT         | Yes  | false | NULL    | NONE  |
+---------------+--------------+------+-------+---------+-------+

The sample response shows that ApsaraDB for SelectDB can automatically infer the column types of a Parquet file based on the metadata of the file.

Read the schema of a CSV file

By default, ApsaraDB for SelectDB processes the types of all columns in CSV files as STRING. You can use the csv_schema parameter to separately specify the column names and column types of a CSV file. ApsaraDB for SelectDB reads the schema of the CSV file based on the specified column types. Format:

name1:type1;name2:type2;...

ApsaraDB for SelectDB returns NULL for columns whose specified type does not match the actual data or that are missing from the file. For example, NULL is returned if a column contains STRING values but you specify its type as INT, or if you specify five columns for a file that contains only four columns. The following table describes the column types that are supported.

+---------------+-----------------+
| Column type   | Mapping type    |
+---------------+-----------------+
| tinyint       | tinyint         |
| smallint      | smallint        |
| int           | int             |
| bigint        | bigint          |
| largeint      | largeint        |
| float         | float           |
| double        | double          |
| decimal(p,s)  | decimalv3(p,s)  |
| date          | datev2          |
| datetime      | datetimev2      |
| char          | string          |
| varchar       | string          |
| string        | string          |
| boolean       | boolean         |
+---------------+-----------------+

Example:

s3 (
    "URI" = "https://bucket1/inventory.dat",
    "s3.access_key"= "ak",
    "s3.secret_key" = "sk",
    "format" = "csv",
    "column_separator" = "|",
    "csv_schema" = "k1:int;k2:int;k3:int;k4:decimal(38,10)",
    "use_path_style"="true"
)
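
The fragment can then be used wherever a table is expected. For example, the following query is a sketch that reuses the same hypothetical file and column names:

SELECT k1, k4
FROM s3 (
    "uri" = "https://bucket1/inventory.dat",
    "s3.access_key"= "ak",
    "s3.secret_key" = "sk",
    "format" = "csv",
    "column_separator" = "|",
    "csv_schema" = "k1:int;k2:int;k3:int;k4:decimal(38,10)",
    "use_path_style" = "true"
)
WHERE k2 > 10;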

Query and analyze data

You can use the TVF together with an SQL statement to query and analyze a file.

SELECT * FROM s3(
    "URI" = "http://127.0.0.1:9312/test2/test.snappy.parquet",
    "s3.access_key"= "ak",
    "s3.secret_key" = "sk",
    "format" = "parquet",
    "use_path_style"="true")
LIMIT 5;

-- Sample response
+-----------+------------------------------------------+----------------+----------+-------------------------+--------+-------------+---------------+---------------------+
| p_partkey | p_name                                   | p_mfgr         | p_brand  | p_type                  | p_size | p_container | p_retailprice | p_comment           |
+-----------+------------------------------------------+----------------+----------+-------------------------+--------+-------------+---------------+---------------------+
|         1 | goldenrod lavender spring chocolate lace | Manufacturer#1 | Brand#13 | PROMO BURNISHED COPPER  |      7 | JUMBO PKG   |           901 | ly. slyly ironi     |
|         2 | blush thistle blue yellow saddle         | Manufacturer#1 | Brand#13 | LARGE BRUSHED BRASS     |      1 | LG CASE     |           902 | lar accounts amo    |
|         3 | spring green yellow purple cornsilk      | Manufacturer#4 | Brand#42 | STANDARD POLISHED BRASS |     21 | WRAP CASE   |           903 | egular deposits hag |
|         4 | cornflower chocolate smoke green pink    | Manufacturer#3 | Brand#34 | SMALL PLATED BRASS      |     14 | MED DRUM    |           904 | p furiously r       |
|         5 | forest brown coral puff cream            | Manufacturer#3 | Brand#32 | STANDARD POLISHED TIN   |     15 | SM PKG      |           905 |  wake carefully     |
+-----------+------------------------------------------+----------------+----------+-------------------------+--------+-------------+---------------+---------------------+

You can place the TVF anywhere that you place a table in the SQL statement, such as in the WITH or FROM clause of a common table expression (CTE). You can analyze the file as a standard table.
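
For example, the following statement is a minimal sketch that reuses the Parquet file from the preceding examples: it references the TVF in the WITH clause of a CTE and aggregates over the result as if it were a regular table.

WITH parts AS (
    SELECT * FROM s3(
        "uri" = "http://127.0.0.1:9312/test2/test.snappy.parquet",
        "s3.access_key"= "ak",
        "s3.secret_key" = "sk",
        "format" = "parquet",
        "use_path_style" = "true")
)
SELECT p_brand, count(*) AS part_count
FROM parts
GROUP BY p_brand;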

You can also create a logical view for the TVF by using the CREATE VIEW statement. This way, you can access the TVF and manage its permissions in the same way as you manage other views, and grant other users access to the TVF.

CREATE VIEW v1 AS 
SELECT * FROM s3(
    "URI" = "http://127.0.0.1:9312/test2/test.snappy.parquet",
    "s3.access_key"= "ak",
    "s3.secret_key" = "sk",
    "format" = "parquet",
    "use_path_style"="true");

DESC v1;

SELECT * FROM v1;

GRANT SELECT_PRIV ON db1.v1 TO user1;

Import data

You can use the TVF together with the INSERT INTO SELECT statement to import the data of a file to a table in ApsaraDB for SelectDB for faster analysis.

-- 1. Create an internal table in ApsaraDB for SelectDB.
CREATE TABLE IF NOT EXISTS test_table
(
    id int,
    name varchar(50),
    age int
)
DISTRIBUTED BY HASH(id) BUCKETS 4
PROPERTIES("replication_num" = "1");

-- 2. Insert data by using the Amazon S3 TVF.
INSERT INTO test_table (id,name,age)
SELECT cast(id as INT) as id, name, cast (age as INT) as age
FROM s3(
    "uri" = "http://127.0.0.1:9312/test2/test.snappy.parquet",
    "s3.access_key"= "ak",
    "s3.secret_key" = "sk",
    "format" = "parquet",
    "use_path_style" = "true");

Usage notes

  • If no file is matched for the URI specified by the Amazon S3 or HDFS TVF or all the matched files are empty, the Amazon S3 or HDFS TVF returns an empty result set. In this case, if you execute the DESC FUNCTION statement to query the schema of the file specified by the TVF, a dummy column __dummy_col is returned, which can be ignored. See the sketch after this list.

  • If the file format specified by the TVF is CSV and the read file is not empty but the first row of the file is empty, the "The first line is empty, can not parse column numbers" error message is reported.
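
The following sketch illustrates the first note. The URI is hypothetical and intentionally matches no file, so the DESC FUNCTION statement returns only the dummy column __dummy_col, which can be ignored.

DESC FUNCTION s3(
    "uri" = "http://127.0.0.1:9312/test2/no_such_file_*.csv",
    "s3.access_key" = "ak",
    "s3.secret_key" = "sk",
    "format" = "csv",
    "use_path_style" = "true");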