This topic describes how to use Object Storage Service (OSS) foreign tables to import OSS data for data analysis based on OSS foreign data wrapper (FDW).
Description
OSS FDW is developed based on PostgreSQL FDW. OSS FDW allows you to perform the following operations:
Import OSS data to row-oriented or column-oriented tables of AnalyticDB for PostgreSQL instances for accelerated data analysis.
Query and analyze large amounts of OSS data.
Join OSS foreign tables with AnalyticDB for PostgreSQL tables for data analysis.
OSS FDW supports the following data objects in a variety of business scenarios:
Uncompressed CSV, TEXT, JSON, and JSON Lines objects.
GZIP- and standard Snappy-compressed CSV and TEXT objects.
GZIP-compressed JSON and JSON Lines objects.
ORC binary objects. For information about data type mappings between ORC and AnalyticDB for PostgreSQL, see the "Data type mappings between ORC and AnalyticDB for PostgreSQL" section of the Data type mappings for OSS foreign tables topic.
Parquet binary objects. For information about data type mappings between Parquet and AnalyticDB for PostgreSQL, see the "Data type mappings between Parquet and AnalyticDB for PostgreSQL" section of the Data type mappings for OSS foreign tables topic.
Arvo binary objects. For information about data type mappings between Arvo and AnalyticDB for PostgreSQL, see the "Data type mappings between Avro and AnalyticDB for PostgreSQL" section of the Data type mappings for OSS foreign tables topic.
Preparations
Prepare OSS data
Prepare a sample file named example.csv.
Obtain the OSS bucket information
The following procedure demonstrates how to obtain the bucket name, object path, endpoint, and bucket domain name.
Log on to the OSS console.
In the left-side navigation pane, click Buckets.
On the Buckets page, click the name of a bucket.
You can obtain a bucket name on the Buckets page.
On the Object Management page, obtain an object path.
In the left-side navigation pane, click Overview.
In the Port section of the Overview page, obtain an endpoint and the corresponding bucket domain name.
We recommend that you use the endpoint of Access from ECS over the VPC (internal network) for data access.
Obtain the AccessKey ID and AccessKey secret
For information about how to obtain the AccessKey ID and AccessKey secret, see Create an AccessKey pair.
Create an OSS server
Execute a CREATE SERVER statement to create an OSS server. You must specify a name for the OSS server that you want to access. For more information about CREATE SERVER, see CREATE SERVER.
Syntax
CREATE SERVER server_name
FOREIGN DATA WRAPPER fdw_name
[ OPTIONS ( option 'value' [, ... ] ) ]
Parameters
Parameter | Data type | Required | Description |
server_name | STRING | Yes | The name of the OSS server. |
fdw_name | STRING | Yes | The name of the foreign data wrapper that manages the server, which is automatically set to oss_fdw. |
The following table describes the parameters involved in OPTIONS.
Parameter | Data type | Required | Description |
endpoint | STRING | Yes | The domain name of the bucket. For information about how to obtain the domain name, see the "Preparations" section of this topic. |
bucket | STRING | No | The name of the bucket that stores the data objects. For information about how to obtain the bucket name, see the "Preparations" section of this topic. Note
|
speed_limit | NUMERIC | No | The transmission threshold that triggers a timeout. Default value: 1024. Unit: bytes. If this parameter is specified, the speed_time parameter is required. Note By default, if less than 1,024 bytes of data is transmitted within 90 consecutive seconds, a timeout is triggered. For more information, see Error handling. |
speed_time | NUMERIC | No | The timeout period threshold. Default value: 90. Unit: seconds. If this parameter is specified, the speed_limit parameter is required. Note By default, if less than 1,024 bytes of data is transmitted within 90 consecutive seconds, a timeout is triggered. For more information, see Error handling. |
connect_timeout | NUMERIC | No | The timeout period for connections. Default value: 10. Unit: seconds. |
dns_cache_timeout | NUMERIC | No | The timeout period for DNS resolution. Default value: 60. Unit: seconds. |
Example
CREATE SERVER oss_serv
FOREIGN DATA WRAPPER oss_fdw
OPTIONS (
endpoint 'oss-cn-********.aliyuncs.com',
bucket 'adb-pg'
);
You can also execute an ALTER SERVER statement to modify the configurations of an OSS server. For more information, see ALTER SERVER.
Examples of modifying the configurations of an OSS server:
Modify a parameter.
ALTER SERVER oss_serv OPTIONS(SET endpoint 'oss-cn-********.aliyuncs.com');
Add a parameter.
ALTER SERVER oss_serv OPTIONS(ADD connect_timeout '20');
Remove a parameter.
ALTER SERVER oss_serv OPTIONS(DROP connect_timeout);
You can also execute a DROP SERVER statement to delete an OSS server. For more information, see DROP SERVER.
Create a user mapping to the OSS server
After you create an OSS server, you must create a user that accesses the OSS server. You can execute a CREATE USER MAPPING statement to create a user mapping between an AnalyticDB for PostgreSQL database user and the user that accesses the OSS server. For more information, see CREATE USER MAPPING.
Syntax
CREATE USER MAPPING FOR { username | USER | CURRENT_USER | PUBLIC }
SERVER servername
[ OPTIONS ( option 'value' [, ... ] ) ]
Parameters
Parameter | Data type | Required | Description |
username | STRING | Yes, one of four parameters | The database username of the AnalyticDB for PostgreSQL instance to be mapped. |
USER | STRING | The current database username of the AnalyticDB for PostgreSQL instance to be mapped. | |
CURRENT_USER | STRING | ||
PUBLIC | STRING | Creates a public mapping that can match all database usernames of the AnalyticDB for PostgreSQL instance, including usernames to be created later. | |
servername | STRING | Yes | The name of the OSS server. |
The following table describes the parameters involved in OPTIONS.
Parameter | Data type | Required | Description |
id | STRING | Yes | The AccessKey ID that is used to access the OSS bucket. For information about how to obtain an AccessKey ID, see Create an AccessKey pair. |
key | STRING | Yes | The AccessKey secret that is used to access the OSS bucket. For information about how to obtain an AccessKey secret, see Create an AccessKey pair. |
Example
CREATE USER MAPPING FOR PUBLIC
SERVER oss_serv
OPTIONS (
id 'LTAI5t7Ge***************',
key 'FikziJd2La*******************'
);
You can also execute a DROP USER MAPPING statement to delete a user. For more information, see DROP USER MAPPING.
Create an OSS foreign table
After you create an OSS server and a user to access the server, you can execute a CREATE FOREIGN TABLE statement to create an OSS foreign table. For more information, see CREATE FOREIGN TABLE.
Syntax
CREATE FOREIGN TABLE [ IF NOT EXISTS ] table_name ( [
column_name data_type [ OPTIONS ( option 'value' [, ... ] ) ] [ COLLATE collation ] [ column_constraint [ ... ] ]
[, ... ]
] )
SERVER server_name
[ OPTIONS ( option 'value' [, ... ] ) ]
Parameters
Parameter | Data type | Required | Description |
table_name | STRING | Yes | The name of the OSS foreign table. |
column_name | STRING | Yes | The name of the column. |
data_type | STRING | Yes | The data type of the column. |
The following table describes the parameters involved in OPTIONS.
Parameter | Data type | Required | Description |
filepath | STRING | Yes, one of three parameters | The object name that contains the OSS object path. If the filepath parameter is specified, only the specified object is selected. |
prefix | STRING | The prefix of the object path. Only the specified prefix is matched and regular expressions are not supported. If the prefix parameter is specified, all OSS objects stored in the object path whose name starts with the prefix are selected. Examples:
| |
dir | STRING | The OSS directory in which data objects are stored. The OSS directory must end with a forward slash (/). Example: test/mydir/. If the dir parameter is specified, all objects stored in the directory are selected, excluding its subdirectories and objects in the subdirectories. | |
bucket | STRING | No | The name of the bucket that stores data objects. For information about how to obtain the bucket name, see the "Preparations" section of this topic. Note
|
format | STRING | Yes | The object format. Valid values:
|
filetype | STRING | No | The object type. Valid values:
Note
|
log_errors | BOOLEAN | No | Specifies whether to record errors in log files. Default value: false. For more information, see the "Fault tolerance" section of this topic. Note This parameter takes effect only for CSV and TEXT objects. |
segment_reject_limit | NUMERIC | No | The number of abort errors. A value that includes a percent sign (%) indicates the percentage of error rows. A value without percent signs (%) indicates the number of error rows. Examples:
Note This parameter takes effect only for CSV and TEXT objects. |
header | BOOLEAN | No | Specifies whether to include the header row for fields in the source object. Valid values:
Note This parameter takes effect only for CSV objects. |
delimiter | STRING | No | The delimiter between fields. It can be set only to a single-byte character.
Note This parameter takes effect only for CSV and TEXT objects. |
quote | STRING | No | The quotation mark that encloses fields. It can be set only to a single-byte character. Default value: double quotation mark ("). Note This parameter takes effect only for CSV objects. |
escape | STRING | No | The string that matches the quote parameter. It can be set only to a single-byte character. Default value: double quotation mark ("). Note This parameter takes effect only for CSV objects. |
null | STRING | No | The representation of NULL strings in objects.
Note This parameter takes effect only for CSV and TEXT objects. |
encoding | STRING | No | The encoding format of data objects. Default value: encoding format of the client. Note This parameter takes effect only for CSV and TEXT objects. |
force_not_null | BOOLEAN | No | Specifies whether the field value cannot be an empty string. Valid values:
Note This parameter takes effect only for CSV and TEXT objects. |
force_null | BOOLEAN | No | The method used to process an empty string. Valid values:
Note This parameter takes effect only for CSV and TEXT objects. |
Example
CREATE FOREIGN TABLE ossexample (
date text,
time text,
open float,
high float,
low float,
volume int
) SERVER oss_serv OPTIONS (dir 'dir_oss_adb/', format 'csv');
After you create an OSS foreign table, you can use one of the following methods to check whether the OSS objects that the foreign table matches meet the expectation:
Method 1
EXPLAIN VERBOSE SELECT * FROM <Name of the OSS foreign table>;
Method 2
SELECT * FROM get_oss_table_meta('<Name of the OSS foreign table>');
You can also execute a DROP FOREIGN TABLE statement to delete an OSS foreign table. For more information, see DROP FOREIGN TABLE.
Query and analyze OSS data
You can query the data of OSS foreign tables in the same way as that for AnalyticDB for PostgreSQL tables. The following query methods can be used:
Query data by specifying a key-value pair.
SELECT * FROM ossexample WHERE volume = 5;
Query data by using an aggregate function.
SELECT count(*) FROM ossexample WHERE volume = 5;
Query data by specifying columns and using GROUP BY and LIMIT clauses.
SELECT low, sum(volume) FROM ossexample GROUP BY low ORDER BY low limit 5;
Join the OSS foreign table with an AnalyticDB for PostgreSQL table for data analysis
Create an AnalyticDB for PostgreSQL table named example for join analysis and insert data into the table.
CREATE TABLE example (id int, volume int); INSERT INTO example VALUES(1,1), (2,3), (4,5);
Join the OSS foreign table ossexample with the AnalyticDB for PostgreSQL table example for data query.
SELECT example.volume, min(high), max(low) FROM ossexample, example WHERE ossexample.volume = example.volume GROUP BY(example.volume) ORDER BY example.volume;
Fault tolerance
OSS FDW uses the log_errors and segment_reject_limit parameters to provide the fault tolerance feature so that the scanning of OSS foreign tables is not interrupted by errors in raw data.
For more information about the log_errors and segment_reject_limit parameters, see the "Create an OSS foreign table" section of this topic.
Create an OSS foreign table that supports fault tolerance.
CREATE FOREIGN TABLE oss_error_sales (id int, value float8, x text) SERVER oss_serv OPTIONS (log_errors 'true', -- Record the information of error rows. segment_reject_limit '10', -- Specify the threshold to stop scanning. If the number of error rows exceeds 10, the scanning stops. dir 'error_sales/', -- Specify the OSS object directory that the foreign table matches. format 'csv', -- Specify CSV as the format to parse objects. encoding 'utf8'); -- Specify the encoding format.
Query the logs of error rows.
SELECT * FROM gp_read_error_log('oss_error_sales');
Delete the logs of error rows.
SELECT gp_truncate_error_log('oss_error_sales');
FAQ
Q: Can data stored in OSS also be deleted if I delete data from an OSS foreign table?
A: No, data stored in OSS cannot be deleted if you delete data from an OSS foreign table.