All Products
Search
Document Center

AnalyticDB:Use OSS foreign tables for data lake analysis

Last Updated:Sep 08, 2024

This topic describes how to use Object Storage Service (OSS) foreign tables to import OSS data for data analysis based on OSS foreign data wrapper (FDW).

Description

OSS FDW is developed based on PostgreSQL FDW. OSS FDW allows you to perform the following operations:

  • Import OSS data to row-oriented or column-oriented tables of AnalyticDB for PostgreSQL instances for accelerated data analysis.

  • Query and analyze large amounts of OSS data.

  • Join OSS foreign tables with AnalyticDB for PostgreSQL tables for data analysis.

OSS FDW supports the following data objects in a variety of business scenarios:

Preparations

Prepare OSS data

Prepare a sample file named example.csv.

Obtain the OSS bucket information

The following procedure demonstrates how to obtain the bucket name, object path, endpoint, and bucket domain name.

  1. Log on to the OSS console.

  2. In the left-side navigation pane, click Buckets.

  3. On the Buckets page, click the name of a bucket.

    You can obtain a bucket name on the Buckets page.

  4. On the Object Management page, obtain an object path.

  5. In the left-side navigation pane, click Overview.

  6. In the Port section of the Overview page, obtain an endpoint and the corresponding bucket domain name.

    We recommend that you use the endpoint of Access from ECS over the VPC (internal network) for data access.

Obtain the AccessKey ID and AccessKey secret

For information about how to obtain the AccessKey ID and AccessKey secret, see Create an AccessKey pair.

Create an OSS server

Execute a CREATE SERVER statement to create an OSS server. You must specify a name for the OSS server that you want to access. For more information about CREATE SERVER, see CREATE SERVER.

Syntax

CREATE SERVER server_name
    FOREIGN DATA WRAPPER fdw_name
    [ OPTIONS ( option 'value' [, ... ] ) ]

Parameters

Parameter

Data type

Required

Description

server_name

STRING

Yes

The name of the OSS server.

fdw_name

STRING

Yes

The name of the foreign data wrapper that manages the server, which is automatically set to oss_fdw.

The following table describes the parameters involved in OPTIONS.

Parameter

Data type

Required

Description

endpoint

STRING

Yes

The domain name of the bucket. For information about how to obtain the domain name, see the "Preparations" section of this topic.

bucket

STRING

No

The name of the bucket that stores the data objects. For information about how to obtain the bucket name, see the "Preparations" section of this topic.

Note
  • The bucket parameter must be specified for at least one of the OSS server and OSS foreign table. For more information about the bucket parameter specified for the OSS foreign table, see the "Create an OSS foreign table" section of this topic.

  • If the bucket parameter is specified for both the OSS server and the OSS foreign table, the bucket parameter specified for the OSS foreign table takes effect.

speed_limit

NUMERIC

No

The transmission threshold that triggers a timeout. Default value: 1024. Unit: bytes.

If this parameter is specified, the speed_time parameter is required.

Note

By default, if less than 1,024 bytes of data is transmitted within 90 consecutive seconds, a timeout is triggered. For more information, see Error handling.

speed_time

NUMERIC

No

The timeout period threshold. Default value: 90. Unit: seconds.

If this parameter is specified, the speed_limit parameter is required.

Note

By default, if less than 1,024 bytes of data is transmitted within 90 consecutive seconds, a timeout is triggered. For more information, see Error handling.

connect_timeout

NUMERIC

No

The timeout period for connections. Default value: 10. Unit: seconds.

dns_cache_timeout

NUMERIC

No

The timeout period for DNS resolution. Default value: 60. Unit: seconds.

Example

CREATE SERVER oss_serv
    FOREIGN DATA WRAPPER oss_fdw
    OPTIONS (
        endpoint 'oss-cn-********.aliyuncs.com',
        bucket 'adb-pg'
  );

You can also execute an ALTER SERVER statement to modify the configurations of an OSS server. For more information, see ALTER SERVER.

Examples of modifying the configurations of an OSS server:

  • Modify a parameter.

    ALTER SERVER oss_serv OPTIONS(SET endpoint 'oss-cn-********.aliyuncs.com');
  • Add a parameter.

    ALTER SERVER oss_serv OPTIONS(ADD connect_timeout '20');
  • Remove a parameter.

    ALTER SERVER oss_serv OPTIONS(DROP connect_timeout);

You can also execute a DROP SERVER statement to delete an OSS server. For more information, see DROP SERVER.

Create a user mapping to the OSS server

After you create an OSS server, you must create a user that accesses the OSS server. You can execute a CREATE USER MAPPING statement to create a user mapping between an AnalyticDB for PostgreSQL database user and the user that accesses the OSS server. For more information, see CREATE USER MAPPING.

Syntax

CREATE USER MAPPING FOR { username | USER | CURRENT_USER | PUBLIC }
    SERVER servername
    [ OPTIONS ( option 'value' [, ... ] ) ]

Parameters

Parameter

Data type

Required

Description

username

STRING

Yes, one of four parameters

The database username of the AnalyticDB for PostgreSQL instance to be mapped.

USER

STRING

The current database username of the AnalyticDB for PostgreSQL instance to be mapped.

CURRENT_USER

STRING

PUBLIC

STRING

Creates a public mapping that can match all database usernames of the AnalyticDB for PostgreSQL instance, including usernames to be created later.

servername

STRING

Yes

The name of the OSS server.

The following table describes the parameters involved in OPTIONS.

Parameter

Data type

Required

Description

id

STRING

Yes

The AccessKey ID that is used to access the OSS bucket. For information about how to obtain an AccessKey ID, see Create an AccessKey pair.

key

STRING

Yes

The AccessKey secret that is used to access the OSS bucket. For information about how to obtain an AccessKey secret, see Create an AccessKey pair.

Example

CREATE USER MAPPING FOR PUBLIC
    SERVER oss_serv
    OPTIONS (
        id 'LTAI5t7Ge***************',
        key 'FikziJd2La*******************'
    );

You can also execute a DROP USER MAPPING statement to delete a user. For more information, see DROP USER MAPPING.

Create an OSS foreign table

After you create an OSS server and a user to access the server, you can execute a CREATE FOREIGN TABLE statement to create an OSS foreign table. For more information, see CREATE FOREIGN TABLE.

Syntax

CREATE FOREIGN TABLE [ IF NOT EXISTS ] table_name ( [
    column_name data_type [ OPTIONS ( option 'value' [, ... ] ) ] [ COLLATE collation ] [ column_constraint [ ... ] ]
      [, ... ]
] )
    SERVER server_name
  [ OPTIONS ( option 'value' [, ... ] ) ]

Parameters

Parameter

Data type

Required

Description

table_name

STRING

Yes

The name of the OSS foreign table.

column_name

STRING

Yes

The name of the column.

data_type

STRING

Yes

The data type of the column.

The following table describes the parameters involved in OPTIONS.

Parameter

Data type

Required

Description

filepath

STRING

Yes, one of three parameters

The object name that contains the OSS object path.

If the filepath parameter is specified, only the specified object is selected.

prefix

STRING

The prefix of the object path. Only the specified prefix is matched and regular expressions are not supported.

If the prefix parameter is specified, all OSS objects stored in the object path whose name starts with the prefix are selected. Examples:

  • If you set prefix to test/filename, the following objects are imported:

    • test/filename

    • test/filenamexxx

    • test/filename/aa

    • test/filenameyyy/aa

    • test/filenameyyy/bb/aa

  • If you set prefix to test/filename/, only the following object out of the preceding objects is imported:

    • test/filename/aa

dir

STRING

The OSS directory in which data objects are stored. The OSS directory must end with a forward slash (/). Example: test/mydir/.

If the dir parameter is specified, all objects stored in the directory are selected, excluding its subdirectories and objects in the subdirectories.

bucket

STRING

No

The name of the bucket that stores data objects. For information about how to obtain the bucket name, see the "Preparations" section of this topic.

Note
  • The bucket parameter must be specified for at least one of the OSS server and OSS foreign table.

  • If the bucket parameter is specified for both the OSS server and the OSS foreign table, the bucket parameter specified for the OSS foreign table takes effect.

format

STRING

Yes

The object format. Valid values:

  • csv

  • text

  • orc

  • avro

  • parquet

  • json

    For more information about JSON, see Introducing JSON.

  • jsonline

    In this format, JSON data is delimited by line feeds. All data that can be read by using JSON Lines can also be read by using JSON, but not vice versa. We recommend that you use JSON Lines if possible. For more information about JSON Lines, see JSON Lines.

filetype

STRING

No

The object type. Valid values:

  • plain (default): The system only reads the raw binary data.

  • gzip: The system reads the raw binary data and decompresses the package by using GZIP.

  • snappy: The system reads the raw binary data and decompresses the package by using Snappy.

    Only standard Snappy-compressed objects are supported. Hadoop Snappy-compressed objects are not supported.

Note
  • If the filetype parameter is specified, only CSV, TEXT, JSON, and JSON Lines objects are supported.

  • If the filetype parameter is set to snappy, JSON and JSON Lines objects are not supported.

log_errors

BOOLEAN

No

Specifies whether to record errors in log files. Default value: false. For more information, see the "Fault tolerance" section of this topic.

Note

This parameter takes effect only for CSV and TEXT objects.

segment_reject_limit

NUMERIC

No

The number of abort errors.

A value that includes a percent sign (%) indicates the percentage of error rows. A value without percent signs (%) indicates the number of error rows. Examples:

  • segment_reject_limit = '10' indicates that the scanning stops and quits with an error message if the number of error rows exceeds 10.

  • segment_reject_limit = '10%' indicates that the scanning stops and quits with an error message if the percentage of error rows exceeds 10% of the processed rows.

Note

This parameter takes effect only for CSV and TEXT objects.

header

BOOLEAN

No

Specifies whether to include the header row for fields in the source object. Valid values:

  • true: includes the header row.

  • false (default): does not include the header row.

Note

This parameter takes effect only for CSV objects.

delimiter

STRING

No

The delimiter between fields. It can be set only to a single-byte character.

  • Default value for CSV objects: comma (,)

  • Default value for TEXT objects: tab key

Note

This parameter takes effect only for CSV and TEXT objects.

quote

STRING

No

The quotation mark that encloses fields. It can be set only to a single-byte character. Default value: double quotation mark (").

Note

This parameter takes effect only for CSV objects.

escape

STRING

No

The string that matches the quote parameter. It can be set only to a single-byte character. Default value: double quotation mark (").

Note

This parameter takes effect only for CSV objects.

null

STRING

No

The representation of NULL strings in objects.

  • Default value for CSV objects: \N

  • Default value for TEXT objects: empty character without being enclosed by quotation marks

Note

This parameter takes effect only for CSV and TEXT objects.

encoding

STRING

No

The encoding format of data objects. Default value: encoding format of the client.

Note

This parameter takes effect only for CSV and TEXT objects.

force_not_null

BOOLEAN

No

Specifies whether the field value cannot be an empty string. Valid values:

  • true: The field value cannot be an empty string.

  • false (default): The field value can be an empty string.

Note

This parameter takes effect only for CSV and TEXT objects.

force_null

BOOLEAN

No

The method used to process an empty string. Valid values:

  • true: An empty string is returned as NULL no matter whether the field value is enclosed by quotation marks.

  • false (default): An empty string is returned as NULL only if the field value is not enclosed by quotation marks.

Note

This parameter takes effect only for CSV and TEXT objects.

Example

CREATE FOREIGN TABLE ossexample (
    date text,
    time text,
    open float,
    high float,
    low float,
    volume int
) SERVER oss_serv OPTIONS (dir 'dir_oss_adb/', format 'csv');
Note

After you create an OSS foreign table, you can use one of the following methods to check whether the OSS objects that the foreign table matches meet the expectation:

  • Method 1

    EXPLAIN VERBOSE SELECT * FROM <Name of the OSS foreign table>;
  • Method 2

    SELECT * FROM get_oss_table_meta('<Name of the OSS foreign table>');

You can also execute a DROP FOREIGN TABLE statement to delete an OSS foreign table. For more information, see DROP FOREIGN TABLE.

Query and analyze OSS data

You can query the data of OSS foreign tables in the same way as that for AnalyticDB for PostgreSQL tables. The following query methods can be used:

  • Query data by specifying a key-value pair.

    SELECT * FROM ossexample WHERE volume = 5;
  • Query data by using an aggregate function.

    SELECT count(*) FROM ossexample WHERE volume = 5;
  • Query data by specifying columns and using GROUP BY and LIMIT clauses.

    SELECT low, sum(volume)
      FROM ossexample
     GROUP BY low
     ORDER BY low
     limit 5;

Join the OSS foreign table with an AnalyticDB for PostgreSQL table for data analysis

  1. Create an AnalyticDB for PostgreSQL table named example for join analysis and insert data into the table.

    CREATE TABLE example (id int, volume int);
    INSERT INTO example VALUES(1,1), (2,3), (4,5);
  2. Join the OSS foreign table ossexample with the AnalyticDB for PostgreSQL table example for data query.

    SELECT example.volume, min(high), max(low)
    FROM
    ossexample,
    example
    WHERE ossexample.volume = example.volume
    GROUP BY(example.volume)
    ORDER BY example.volume;

Fault tolerance

OSS FDW uses the log_errors and segment_reject_limit parameters to provide the fault tolerance feature so that the scanning of OSS foreign tables is not interrupted by errors in raw data.

For more information about the log_errors and segment_reject_limit parameters, see the "Create an OSS foreign table" section of this topic.

  • Create an OSS foreign table that supports fault tolerance.

    CREATE FOREIGN TABLE oss_error_sales (id int, value float8, x text)
        SERVER oss_serv
        OPTIONS (log_errors 'true',         -- Record the information of error rows.
                 segment_reject_limit '10', -- Specify the threshold to stop scanning. If the number of error rows exceeds 10, the scanning stops. 
                 dir 'error_sales/',        -- Specify the OSS object directory that the foreign table matches.
                 format 'csv',              -- Specify CSV as the format to parse objects.
                 encoding 'utf8');          -- Specify the encoding format.
  • Query the logs of error rows.

    SELECT * FROM gp_read_error_log('oss_error_sales');
  • Delete the logs of error rows.

    SELECT gp_truncate_error_log('oss_error_sales');

FAQ

Q: Can data stored in OSS also be deleted if I delete data from an OSS foreign table?

A: No, data stored in OSS cannot be deleted if you delete data from an OSS foreign table.

References