AnalyticDB for PostgreSQL allows you to query external Hadoop data sources, such as Hadoop Distributed File System (HDFS) and Hive data, by using external tables.
Usage notes
This feature is available only for AnalyticDB for PostgreSQL instances in elastic storage mode. The AnalyticDB for PostgreSQL instances that you use for query must be in the same virtual private cloud (VPC) as the external data sources.
This feature is unavailable for AnalyticDB for PostgreSQL instances in elastic storage mode that were created before September 6, 2020. This is because these instances cannot be connected to external Hadoop data sources over different network architectures. If you want to use this feature on existing instances, we recommend that you contact Alibaba Cloud technical support to apply for new instances and migrate data.
Configure a server
Server configurations vary based on user requirements. If you want to query external Hadoop data sources for federated analytics, submit a ticket to obtain technical support to configure a server. The following table describes the information that is required when you submit a ticket.
External data source | Required information |
Hadoop (HDFS, Hive, and HBase) | core-site.xml, hdfs-site.xml, mapred-site.xml, yarn-site.xml, and hive-site.xml Note Profiles such as keytab and krb5.conf are also required for Kerberos authentication. |
Syntax
Create an extension
CREATE extension pxf;
Create an external table
CREATE EXTERNAL TABLE <table_name> ( <column_name> <data_type> [, ...] | LIKE <other_table> ) LOCATION('pxf://<path-to-data>?PROFILE[&<custom-option>=<value>[...]]&[SERVER=value]') FORMAT '[TEXT|CSV|CUSTOM]' (<formatting-properties>);
For more information about the syntax that is used to create an external table, see the "CREATE EXTERNAL TABLE" section of the SQL syntax topic.
Query HDFS data
The following table describes the supported data formats.
Data format | Profile |
text | hdfs:text |
csv | hdfs:text:multi and hdfs:text |
Avro | hdfs:avro |
JSON | hdfs:json |
Parquet | hdfs:parquet |
AvroSequenceFile | hdfs:AvroSequenceFile |
SequenceFile | hdfs:SequenceFile |
For information about FORMAT
and formatting-properties
, see the "Create an external table" section of this topic.
Example: Query an HDFS file
Create a test data file named
pxf_hdfs_simple.txt
on HDFS.echo 'Prague,Jan,101,4875.33 Rome,Mar,87,1557.39 Bangalore,May,317,8936.99 Beijing,Jul,411,11600.67' > /tmp/pxf_hdfs_simple.txt # Create a directory. hdfs dfs -mkdir -p /data/pxf_examples # Put the file into the directory on HDFS. hdfs dfs -put /tmp/pxf_hdfs_simple.txt /data/pxf_examples/ # View the file. hdfs dfs -cat /data/pxf_examples/pxf_hdfs_simple.txt
Create an external table and query it on an AnalyticDB for PostgreSQL instance.
postgres=# CREATE EXTERNAL TABLE pxf_hdfs_textsimple(location text, month text, num_orders int, total_sales float8) LOCATION ('pxf://data/pxf_examples/pxf_hdfs_simple.txt?PROFILE=hdfs:text&SERVER=hdp3') FORMAT 'TEXT' (delimiter=E','); postgres=# select * from pxf_hdfs_textsimple; location | month | num_orders | total_sales -----------+-------+------------+-------------------- Prague | Jan | 101 | 4875.3299999999999 Rome | Mar | 87 | 1557.3900000000001 Bangalore | May | 317 | 8936.9899999999998 Beijing | Jul | 411 | 11600.67 (4 rows)
The following list describes the parameters in the LOCATION clause:
pxf://: the pxf protocol. It cannot be modified.
data/pxf_examples/pxf_hdfs_simple.txt: the pxf_hdfs_simple.txt file on HDFS.
PROFILE=hdfs:text: the profile that is used to query HDFS data. In this example, hdfs:text is used.
SERVER=hdp3: the location of the profile used to query HDFS data based on PXF. In this example, the profile is located in the PXF_SERVER/hdp3/ directory. This value is provided by the technical support of AnalyticDB for PostgreSQL.
FORMAT 'TEXT' (delimiter=E','): the file format and formatting option of the external data source. In this example, the file format is set to
TEXT
, and the delimiter is set to commas (,
).
For information about the parameter descriptions, see the "Syntax" section of this topic.
Example: Write data to HDFS in the TEXT or CSV format
Prerequisites: The /data/pxf_examples/pxfwritable_hdfs_textsimple1 directory is created on HDFS by using the following command:
hdfs dfs -mkdir -p /data/pxf_examples/pxfwritable_hdfs_textsimple1
NoteTo execute INSERT statements on AnalyticDB for PostgreSQL, you must have write permissions on the preceding directory.
Create a writable external table on an AnalyticDB for PostgreSQL instance.
CREATE WRITABLE EXTERNAL TABLE pxf_hdfs_writabletbl_1(location text, month text, num_orders int, total_sales float8) LOCATION ('pxf://data/pxf_examples/pxfwritable_hdfs_textsimple1?PROFILE=hdfs:text&SERVER=hdp3') FORMAT 'TEXT' (delimiter=','); INSERT INTO pxf_hdfs_writabletbl_1 VALUES ( 'Frankfurt', 'Mar', 777, 3956.98 ); INSERT INTO pxf_hdfs_writabletbl_1 VALUES ( 'Cleveland', 'Oct', 3812, 96645.37 );
View information on HDFS.
# View the directory. hdfs dfs -ls /data/pxf_examples/pxfwritable_hdfs_textsimple1 # View data in the directory. hdfs dfs -cat /data/pxf_examples/pxfwritable_hdfs_textsimple1/* Frankfurt,Mar,777,3956.98 Cleveland,Oct,3812,96645.37
Query Hive data
Data format | Profile |
TextFile | Hive, HiveText |
SequenceFile | Hive |
RCFile | Hive, HiveRC |
ORC | Hive, HiveORC, HiveVectorizedORC |
Parquet | Hive |
For information about FORMAT and formatting-properties
, see the "Create an external table" section of this topic.
Example: Use the Hive profile
Generate data.
echo 'Prague,Jan,101,4875.33 Rome,Mar,87,1557.39 Bangalore,May,317,8936.99 Beijing,Jul,411,11600.67 San Francisco,Sept,156,6846.34 Paris,Nov,159,7134.56 San Francisco,Jan,113,5397.89 Prague,Dec,333,9894.77 Bangalore,Jul,271,8320.55 Beijing,Dec,100,4248.41' > /tmp/pxf_hive_datafile.txt
Create a Hive table.
hive> CREATE TABLE sales_info (location string, month string, number_of_orders int, total_sales double) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS textfile; hive> LOAD DATA LOCAL INPATH '/tmp/pxf_hive_datafile.txt' INTO TABLE sales_info; hive> SELECT * FROM sales_info;
Create an external table and query it on an AnalyticDB for PostgreSQL instance.
postgres=# create extension pxf; postgres=# CREATE EXTERNAL TABLE salesinfo_hiveprofile(location text, month text, num_orders int, total_sales float8) LOCATION ('pxf://default.sales_info?PROFILE=Hive&SERVER=hdp3') FORMAT 'custom' (formatter='pxfwritable_import'); postgres=# SELECT * FROM salesinfo_hiveprofile; location | month | num_orders | total_sales ---------------+-------+------------+------------- Prague | Jan | 101 | 4875.33 Rome | Mar | 87 | 1557.39 Bangalore | May | 317 | 8936.99 Beijing | Jul | 411 | 11600.67 San Francisco | Sept | 156 | 6846.34 Paris | Nov | 159 | 7134.56 ......
The following list describes the parameters in the LOCATION clause:
pxf://: the pxf protocol. It cannot be modified.
default.sales_info: the sales_info table that is stored in the default database on Hive.
PROFILE=Hive: the profile that is used to query Hive data. In this example, Hive is used.
SERVER=hdp3: the location of the profile used to query Hive data based on PXF. In this example, the profile is located in the PXF_SERVER/hdp3/ directory. This value is provided by the technical support of AnalyticDB for PostgreSQL.
FORMAT 'custom' (formatter='pxfwritable_import'): the file format and formatting option of the external data source. In this example, the file format is set to custom, and the formatter is set to pxfwritable_import.
For information about the parameter descriptions, see the "Syntax" section of this topic.
Example: Use the HiveText profile
postgres=# CREATE EXTERNAL TABLE salesinfo_hivetextprofile(location text, month text, num_orders int, total_sales float8) LOCATION ('pxf://default.sales_info?PROFILE=HiveText&SERVER=hdp3') FORMAT 'TEXT' (delimiter=E','); postgres=# select * from salesinfo_hivetextprofile; location | month | num_orders | total_sales ---------------+-------+------------+------------- Prague | Jan | 101 | 4875.33 Rome | Mar | 87 | 1557.39 Bangalore | May | 317 | 8936.99 ......
Example: Use the HiveRC profile
Create a Hive table.
hive> CREATE TABLE sales_info_rcfile (location string, month string, number_of_orders int, total_sales double) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS rcfile; OK ## Import data. hive> INSERT INTO TABLE sales_info_rcfile SELECT * FROM sales_info; ## View data. hive> SELECT * FROM sales_info_rcfile;
Create an external table and query it on an AnalyticDB for PostgreSQL instance.
postgres=# CREATE EXTERNAL TABLE salesinfo_hivercprofile(location text, month text, num_orders int, total_sales float8) LOCATION ('pxf://default.sales_info_rcfile?PROFILE=HiveRC&SERVER=hdp3') FORMAT 'TEXT' (delimiter=E','); postgres=# SELECT location, total_sales FROM salesinfo_hivercprofile; location | total_sales ---------------+------------- Prague | 4875.33 Rome | 1557.39 Bangalore | 8936.99 ......
Example: Use the HiveORC profile
ORC-supporting profiles, HiveORC and HiveVectorizedORC, provide the following features:
Read a single row of data at a time.
Support column projection.
Support complex types and allow you to query Hive tables that are composed of array, map, struct, and union data types.
Create a Hive table.
hive> CREATE TABLE sales_info_ORC (location string, month string, number_of_orders int, total_sales double) STORED AS ORC; hive> INSERT INTO TABLE sales_info_ORC SELECT * FROM sales_info; hive> SELECT * FROM sales_info_ORC;
Create an external table and query it on an AnalyticDB for PostgreSQL instance.
postgres=# CREATE EXTERNAL TABLE salesinfo_hiveORCprofile(location text, month text, num_orders int, total_sales float8) LOCATION ('pxf://default.sales_info_ORC?PROFILE=HiveORC&SERVER=hdp3') FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import'); postgres=# SELECT * FROM salesinfo_hiveORCprofile; ...... Prague | Dec | 333 | 9894.77 Bangalore | Jul | 271 | 8320.55 Beijing | Dec | 100 | 4248.41 (60 rows) Time: 420.920 ms
Example: Use the HiveVectorizedORC profile
This profile provides the following features: Reads up to 1,024 rows of data at a time.
Does not support column projection.
Does not support complex types or the timestamp data type.
Create an external table and query it on an AnalyticDB for PostgreSQL instance.
CREATE EXTERNAL TABLE salesinfo_hiveVectORC(location text, month text, num_orders int, total_sales float8) LOCATION ('pxf://default.sales_info_ORC?PROFILE=HiveVectorizedORC&SERVER=hdp3') FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import'); select * from salesinfo_hiveVectORC; location | month | num_orders | total_sales ---------------+-------+------------+------------- Prague | Jan | 101 | 4875.33 Rome | Mar | 87 | 1557.39 Bangalore | May | 317 | 8936.99 Beijing | Jul | 411 | 11600.67 San Francisco | Sept | 156 | 6846.34 ......
Example: Query a Parquet-format Hive table
Create a Hive table.
hive> CREATE TABLE hive_parquet_table (location string, month string, number_of_orders int, total_sales double) STORED AS parquet; INSERT INTO TABLE hive_parquet_table SELECT * FROM sales_info; select * from hive_parquet_table;
Create an external table and query it on an AnalyticDB for PostgreSQL instance.
postgres=# CREATE EXTERNAL TABLE pxf_parquet_table (location text, month text, number_of_orders int, total_sales double precision) LOCATION ('pxf://default.hive_parquet_table?profile=Hive&SERVER=hdp3') FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import'); postgres=# SELECT month, number_of_orders FROM pxf_parquet_table; month | number_of_orders -------+------------------ Jan | 101 Mar | 87 May | 317 Jul | 411 Sept | 156 Nov | 159 ......