ApsaraDB for SelectDB provides table-valued functions (TVFs), which allow you to map file data in common remote storage such as Amazon Simple Storage Service (Amazon S3) and Hadoop Distributed File System (HDFS) to tables in ApsaraDB for SelectDB. This way, you can analyze the file data with ease.
Amazon S3 TVF
The Amazon S3 TVF, s3(), allows you to read and access the content of files in object storage systems that are compatible with Amazon S3 in the same way as you access the content of files in relational databases. The CSV, CSV_with_names, CSV_with_names_and_types, JSON, Parquet, and ORC file formats are supported.
Syntax
s3(
"uri" = "..",
"s3.access_key" = "...",
"s3.secret_key" = "...",
"s3.region" = "...",
"format" = "csv",
"keyn" = "valuen",
...
);
Parameters
Each parameter in the Amazon S3 TVF is a key-value pair in the "key" = "value" format. The following table describes the common parameters.
Parameter | Required | Valid value | Description |
uri | Yes | String | The uniform resource identifier (URI) for accessing Amazon S3. |
s3.access_key | Yes | String | The access key ID that is used to access Amazon S3. |
s3.secret_key | Yes | String | The secret access key that is used to access Amazon S3. |
s3.region | Yes | String | The region of Amazon S3. |
s3.session_token | No. Note: This parameter is required if temporary session authentication is enabled. | String | The temporary session token that is used to access Amazon S3. |
use_path_style | No | true or false | Specifies whether to use the path style to access Amazon S3. By default, Amazon S3 SDKs use the virtual-hosted style. However, the virtual-hosted style may not be enabled for or supported by some object storage systems. In this case, you can set this parameter to true to access Amazon S3 in the path style. Default value: false. |
format | Yes | csv, csv_with_names, csv_with_names_and_types, json, parquet, or orc | The format of the file to be accessed in Amazon S3. |
column_separator | No | String | The column delimiter. |
line_delimiter | No | String | The row delimiter. |
compress_type | No | String | The compression type of the file. Default value: unknown, which indicates that the compression type is automatically inferred based on the suffix of the URI. |
read_json_by_line | No | true or false | Specifies whether to read JSON-formatted data by row. Default value: true. |
num_as_string | No | true or false | Specifies whether to process numbers as strings. Default value: false. |
fuzzy_parse | No | true or false | Specifies whether to accelerate the parsing of JSON-formatted data to improve import efficiency. Default value: false. |
jsonpaths | No | String | The fields to be extracted from JSON-formatted data. |
strip_outer_array | No | true or false | Specifies whether the JSON-formatted data is presented as an array, in which each element is parsed as a row of data. |
json_root | No | String | The root node of JSON-formatted data. |
path_partition_keys | No | String | The names of the partition key columns that are carried in the specified file path. For example, if the file path is /path/to/city=beijing/date="2023-07-09", set this parameter to city,date. In this case, ApsaraDB for SelectDB automatically reads the corresponding column names and column values from the path during data import. |
Examples
Read and access a CSV file in an object storage system compatible with Amazon S3.
SELECT * FROM s3("uri" = "http://127.0.0.1:9312/test2/student1.csv", "s3.access_key"= "minioadmin", "s3.secret_key" = "minioadmin", "format" = "csv", "use_path_style" = "true") ORDER BY c1;
You can use this TVF together with the DESC FUNCTION statement to query the file schema.
MySQL [(none)]> DESC FUNCTION s3(
    "uri" = "http://127.0.0.1:9312/test2/student1.csv",
    "s3.access_key" = "minioadmin",
    "s3.secret_key" = "minioadmin",
    "format" = "csv",
    "use_path_style" = "true");
You must use the virtual-hosted style to access Object Storage Service (OSS).
SELECT * FROM s3( "uri" = "http://example-bucket.oss-cn-beijing.aliyuncs.com/your-folder/file.parquet", "s3.access_key"= "ak", "s3.secret_key" = "sk", "format" = "parquet", "use_path_style" = "false");
If the use_path_style parameter is set to true, the path style is used to access Amazon S3.
SELECT * FROM s3(
    "uri" = "https://endpoint/bucket/file/student.csv",
    "s3.access_key" = "ak",
    "s3.secret_key" = "sk",
    "format" = "csv",
    "use_path_style" = "true");
If the use_path_style parameter is set to false, the virtual-hosted style is used to access Amazon S3.
SELECT * FROM s3(
    "uri" = "https://bucket.endpoint/bucket/file/student.csv",
    "s3.access_key" = "ak",
    "s3.secret_key" = "sk",
    "format" = "csv",
    "use_path_style" = "false");
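The path_partition_keys parameter maps partition columns that are carried in the file path to columns in the query result. The following example is a minimal sketch of this usage; the path layout and credentials are hypothetical.
SELECT * FROM s3(
    "uri" = "http://127.0.0.1:9312/test2/city=beijing/date=2023-07-09/student1.csv",
    "s3.access_key" = "ak",
    "s3.secret_key" = "sk",
    "format" = "csv",
    "use_path_style" = "true",
    "path_partition_keys" = "city,date");
In this sketch, the city and date values that are carried in the path are read as additional columns of the result set.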
HDFS TVF
The HDFS TVF, hdfs(), allows you to read and access the content of files in HDFS in the same way as you access the content of files in relational databases. The CSV, CSV_with_names, CSV_with_names_and_types, JSON, Parquet, and ORC file formats are supported.
Syntax
hdfs(
"uri" = "..",
"fs.defaultFS" = "...",
"hadoop.username" = "...",
"format" = "csv",
"keyn" = "valuen"
...
);
Parameters
Each parameter in the HDFS TVF is a key-value pair in the "key" = "value" format. The following table describes the common parameters.
Parameter | Required | Valid value | Description |
uri | Yes | String | The URI for accessing HDFS. If the specified URI does not exist or the matched files are all empty, the HDFS TVF returns an empty result set. |
fs.defaultFS | Yes | String | The hostname and port number of HDFS. |
hadoop.username | Yes | String | The username that is used to access HDFS. The value cannot be an empty string. |
hadoop.security.authentication | No | simple or kerberos | The authentication method for HDFS. |
hadoop.kerberos.principal | No | String | The Kerberos principal if Kerberos authentication is enabled for HDFS. |
hadoop.kerberos.keytab | No | String | The path of the Kerberos keytab file if Kerberos authentication is enabled for HDFS. |
dfs.client.read.shortcircuit | No | true or false | Specifies whether to enable short-circuit local reads in HDFS. |
dfs.domain.socket.path | No | String | The path that points to a UNIX domain socket that is used for communication between the DataNode and the local HDFS client. If the string "_PORT" is specified in the path, the string is replaced by the TCP port of the DataNode. |
dfs.nameservices | No | String | The logical names of the nameservices that provide services. This parameter corresponds to the dfs.nameservices property in the hdfs-site.xml file. |
dfs.ha.namenodes.your-nameservices | No. Note: This parameter is required if Hadoop High Availability (HA) deployment is used. | String | The logical names of the NameNodes. |
dfs.namenode.rpc-address.your-nameservices.your-namenode | No. Note: This parameter is required if Hadoop HA deployment is used. | String | The RPC address to which the NameNode listens. |
dfs.client.failover.proxy.provider.your-nameservices | No. Note: This parameter is required if Hadoop HA deployment is used. | String | The implementation class of the failover proxy provider that clients use to connect to the active NameNode. |
read_json_by_line | No | true or false | Specifies whether to read JSON-formatted data by row. Default value: true. |
num_as_string | No | true or false | Specifies whether to process numbers as strings. Default value: false. |
fuzzy_parse | No | true or false | Specifies whether to accelerate the parsing of JSON-formatted data to improve import efficiency. Default value: false. |
jsonpaths | No | String | The fields to be extracted from JSON-formatted data. |
strip_outer_array | No | true or false | Specifies whether the JSON-formatted data is presented as an array, in which each element is parsed as a row of data. |
json_root | No | String | The root node of JSON-formatted data. |
trim_double_quotes | No | true or false | Specifies whether to trim the outermost double quotation marks (") of each field in the CSV file. Default value: false. |
skip_lines | No | An integer in the range of [0, Integer.MaxValue] | The number of rows to skip at the beginning of the CSV file. Default value: 0. This parameter becomes invalid if the format parameter is set to csv_with_names or csv_with_names_and_types. |
path_partition_keys | No | String | The names of the partition key columns that are carried in the specified file path. For example, if the file path is /path/to/city=beijing/date="2023-07-09", set this parameter to city,date. In this case, ApsaraDB for SelectDB automatically reads the corresponding column names and column values from the path during data import. |
Examples
Read and access a CSV file in HDFS.
MySQL [(none)]> SELECT * FROM hdfs(
"uri" = "hdfs://127.0.0.1:842/user/doris/csv_format_test/student.csv",
"fs.defaultFS" = "hdfs://127.0.0.1:8424",
"hadoop.username" = "doris",
"format" = "csv");
-- Sample response
+------+---------+------+
| c1 | c2 | c3 |
+------+---------+------+
| 1 | alice | 18 |
| 2 | bob | 20 |
| 3 | jack | 24 |
| 4 | jackson | 19 |
| 5 | liming | 18 |
+------+---------+------+
Read and access a CSV file in HDFS in high availability (HA) mode.
MySQL [(none)]> SELECT * FROM hdfs(
"uri" = "hdfs://127.0.0.1:842/user/doris/csv_format_test/student.csv",
"fs.defaultFS" = "hdfs://127.0.0.1:8424",
"hadoop.username" = "doris",
"format" = "csv",
"dfs.nameservices" = "my_hdfs",
"dfs.ha.namenodes.my_hdfs" = "nn1,nn2",
"dfs.namenode.rpc-address.my_hdfs.nn1" = "nanmenode01:8020",
"dfs.namenode.rpc-address.my_hdfs.nn2" = "nanmenode02:8020",
"dfs.client.failover.proxy.provider.my_hdfs" = "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider");
-- Sample response
+------+---------+------+
| c1 | c2 | c3 |
+------+---------+------+
| 1 | alice | 18 |
| 2 | bob | 20 |
| 3 | jack | 24 |
| 4 | jackson | 19 |
| 5 | liming | 18 |
+------+---------+------+
You can use this TVF together with the DESC FUNCTION statement to query the file schema.
MySQL [(none)]> DESC FUNCTION hdfs(
"uri" = "hdfs://127.0.0.1:8424/user/doris/csv_format_test/student_with_names.csv",
"fs.defaultFS" = "hdfs://127.0.0.1:8424",
"hadoop.username" = "doris",
"format" = "csv_with_names");
Usage
This section describes how to use a TVF to analyze file data. In this example, the Amazon S3 TVF is used.
Automatically infer the column types of a file
ApsaraDB for SelectDB can analyze and infer the column types of files in the Parquet, ORC, CSV, and JSON formats.
> DESC FUNCTION s3 (
"URI" = "http://127.0.0.1:9312/test2/test.snappy.parquet",
"s3.access_key"= "ak",
"s3.secret_key" = "sk",
"format" = "parquet",
"use_path_style"="true"
);
-- Sample response
+---------------+--------------+------+-------+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+---------------+--------------+------+-------+---------+-------+
| p_partkey | INT | Yes | false | NULL | NONE |
| p_name | TEXT | Yes | false | NULL | NONE |
| p_mfgr | TEXT | Yes | false | NULL | NONE |
| p_brand | TEXT | Yes | false | NULL | NONE |
| p_type | TEXT | Yes | false | NULL | NONE |
| p_size | INT | Yes | false | NULL | NONE |
| p_container | TEXT | Yes | false | NULL | NONE |
| p_retailprice | DECIMAL(9,0) | Yes | false | NULL | NONE |
| p_comment | TEXT | Yes | false | NULL | NONE |
+---------------+--------------+------+-------+---------+-------+
The sample response shows that ApsaraDB for SelectDB can automatically infer the column types of a Parquet file based on the metadata of the file.
Read the schema of a CSV file
By default, ApsaraDB for SelectDB processes the types of all columns in CSV files as STRING. You can use the csv_schema parameter to explicitly specify the column names and column types of a CSV file. ApsaraDB for SelectDB reads the schema of the CSV file based on the specified column types. Format:
name1:type1;name2:type2;...
ApsaraDB for SelectDB returns NULL for columns whose types do not match or that are missing from the file. For example, NULL is returned if a column contains STRING data but you specify the INT type for the column, or if you specify five columns for a file that contains only four columns. The following table describes the supported column types.
Column type | Mapping type |
tinyint | tinyint |
smallint | smallint |
int | int |
bigint | bigint |
largeint | largeint |
float | float |
double | double |
decimal(p,s) | decimalv3(p,s) |
date | datev2 |
datetime | datetimev2 |
char | string |
varchar | string |
string | string |
boolean | boolean |
Example:
s3 (
"URI" = "https://bucket1/inventory.dat",
"s3.access_key"= "ak",
"s3.secret_key" = "sk",
"format" = "csv",
"column_separator" = "|",
"csv_schema" = "k1:int;k2:int;k3:int;k4:decimal(38,10)",
"use_path_style"="true"
)
Query and analyze data
You can use the TVF together with an SQL statement to query and analyze a file.
SELECT * FROM s3(
"URI" = "http://127.0.0.1:9312/test2/test.snappy.parquet",
"s3.access_key"= "ak",
"s3.secret_key" = "sk",
"format" = "parquet",
"use_path_style"="true")
LIMIT 5;
-- Sample response
+-----------+------------------------------------------+----------------+----------+-------------------------+--------+-------------+---------------+---------------------+
| p_partkey | p_name | p_mfgr | p_brand | p_type | p_size | p_container | p_retailprice | p_comment |
+-----------+------------------------------------------+----------------+----------+-------------------------+--------+-------------+---------------+---------------------+
| 1 | goldenrod lavender spring chocolate lace | Manufacturer#1 | Brand#13 | PROMO BURNISHED COPPER | 7 | JUMBO PKG | 901 | ly. slyly ironi |
| 2 | blush thistle blue yellow saddle | Manufacturer#1 | Brand#13 | LARGE BRUSHED BRASS | 1 | LG CASE | 902 | lar accounts amo |
| 3 | spring green yellow purple cornsilk | Manufacturer#4 | Brand#42 | STANDARD POLISHED BRASS | 21 | WRAP CASE | 903 | egular deposits hag |
| 4 | cornflower chocolate smoke green pink | Manufacturer#3 | Brand#34 | SMALL PLATED BRASS | 14 | MED DRUM | 904 | p furiously r |
| 5 | forest brown coral puff cream | Manufacturer#3 | Brand#32 | STANDARD POLISHED TIN | 15 | SM PKG | 905 | wake carefully |
+-----------+------------------------------------------+----------------+----------+-------------------------+--------+-------------+---------------+---------------------+
You can place the TVF anywhere that you can place a table in an SQL statement, such as in the WITH clause of a common table expression (CTE) or in a FROM clause, and analyze the file as a standard table, as shown in the following example.
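The following sketch uses the Amazon S3 TVF in the WITH clause of a CTE and aggregates the file data as if it came from a standard table. The URI and credentials are hypothetical, and the column names follow the preceding Parquet example.
WITH parts AS (
    SELECT * FROM s3(
        "uri" = "http://127.0.0.1:9312/test2/test.snappy.parquet",
        "s3.access_key" = "ak",
        "s3.secret_key" = "sk",
        "format" = "parquet",
        "use_path_style" = "true")
)
SELECT p_brand, COUNT(*) AS part_cnt
FROM parts
GROUP BY p_brand
ORDER BY part_cnt DESC;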
You can also create a logical view for the TVF by using the CREATE VIEW statement. This way, you can access the TVF and manage its permissions in the same way as you manage other views, and allow other users to access the TVF.
CREATE VIEW v1 AS
SELECT * FROM s3(
"URI" = "http://127.0.0.1:9312/test2/test.snappy.parquet",
"s3.access_key"= "ak",
"s3.secret_key" = "sk",
"format" = "parquet",
"use_path_style"="true");
DESC v1;
SELECT * FROM v1;
GRANT SELECT_PRIV ON db1.v1 TO user1;
Import data
You can use the TVF together with the INSERT INTO SELECT statement to import the data of a file into a table in ApsaraDB for SelectDB for faster analysis.
-- 1. Create an internal table in ApsaraDB for SelectDB.
CREATE TABLE IF NOT EXISTS test_table
(
id int,
name varchar(50),
age int
)
DISTRIBUTED BY HASH(id) BUCKETS 4
PROPERTIES("replication_num" = "1");
-- 2. Insert data by using the Amazon S3 TVF.
INSERT INTO test_table (id,name,age)
SELECT cast(id as INT) as id, name, cast (age as INT) as age
FROM s3(
"uri" = "http://127.0.0.1:9312/test2/test.snappy.parquet",
"s3.access_key"= "ak",
"s3.secret_key" = "sk",
"format" = "parquet",
"use_path_style" = "true");
Usage notes
If no file matches the URI specified in the Amazon S3 or HDFS TVF, or all the matched files are empty, the TVF returns an empty result set. In this case, if you execute the DESC FUNCTION statement to query the schema of the file specified by the TVF, a dummy column named __dummy_col is returned, which can be ignored.
If the file format specified by the TVF is CSV and the file to be read is not empty but the first row of the file is empty, the "The first line is empty, can not parse column numbers" error message is reported.
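For example, the following sketch queries the schema of a URI that matches no files; the URI and credentials are hypothetical.
DESC FUNCTION s3(
    "uri" = "http://127.0.0.1:9312/test2/file_that_does_not_exist.csv",
    "s3.access_key" = "ak",
    "s3.secret_key" = "sk",
    "format" = "csv",
    "use_path_style" = "true");
-- The returned schema contains only the __dummy_col dummy column, which can be ignored.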