All Products
Search
Document Center

AnalyticDB:Use external tables for federated analytics of Hadoop data sources

Last Updated:Jun 14, 2024

AnalyticDB for PostgreSQL allows you to query external Hadoop data sources, such as Hadoop Distributed File System (HDFS) and Hive data, by using external tables.

Usage notes

  • This feature is available only for AnalyticDB for PostgreSQL instances in elastic storage mode. The AnalyticDB for PostgreSQL instances that you use for query must be in the same virtual private cloud (VPC) as the external data sources.

  • This feature is unavailable for AnalyticDB for PostgreSQL instances in elastic storage mode that were created before September 6, 2020. This is because these instances cannot be connected to external Hadoop data sources over different network architectures. If you want to use this feature on existing instances, we recommend that you contact Alibaba Cloud technical support to apply for new instances and migrate data.

Configure a server

Server configurations vary based on user requirements. If you want to query external Hadoop data sources for federated analytics, submit a ticket to obtain technical support to configure a server. The following table describes the information that is required when you submit a ticket.

External data source

Required information

Hadoop (HDFS, Hive, and HBase)

core-site.xml, hdfs-site.xml, mapred-site.xml, yarn-site.xml, and hive-site.xml

Note

Profiles such as keytab and krb5.conf are also required for Kerberos authentication.

Syntax

  • Create an extension

    CREATE extension pxf; 
  • Create an external table

    CREATE EXTERNAL TABLE <table_name>
            ( <column_name> <data_type> [, ...] | LIKE <other_table> )
    LOCATION('pxf://<path-to-data>?PROFILE[&<custom-option>=<value>[...]]&[SERVER=value]')
    FORMAT '[TEXT|CSV|CUSTOM]' (<formatting-properties>);

For more information about the syntax that is used to create an external table, see the "CREATE EXTERNAL TABLE" section of the SQL syntax topic.

Query HDFS data

The following table describes the supported data formats.

Data format

Profile

text

hdfs:text

csv

hdfs:text:multi and hdfs:text

Avro

hdfs:avro

JSON

hdfs:json

Parquet

hdfs:parquet

AvroSequenceFile

hdfs:AvroSequenceFile

SequenceFile

hdfs:SequenceFile

For information about FORMAT and formatting-properties, see the "Create an external table" section of this topic.

  • Example: Query an HDFS file

    • Create a test data file named pxf_hdfs_simple.txt on HDFS.

      echo 'Prague,Jan,101,4875.33
      Rome,Mar,87,1557.39
      Bangalore,May,317,8936.99
      Beijing,Jul,411,11600.67' > /tmp/pxf_hdfs_simple.txt
      
      # Create a directory.
      hdfs dfs -mkdir -p /data/pxf_examples
      # Put the file into the directory on HDFS.
      hdfs dfs -put /tmp/pxf_hdfs_simple.txt /data/pxf_examples/
      
      # View the file.
      hdfs dfs -cat /data/pxf_examples/pxf_hdfs_simple.txt

      Create an external table and query it on an AnalyticDB for PostgreSQL instance.

      postgres=# CREATE EXTERNAL TABLE pxf_hdfs_textsimple(location text, month text, num_orders int, total_sales float8)
                  LOCATION ('pxf://data/pxf_examples/pxf_hdfs_simple.txt?PROFILE=hdfs:text&SERVER=hdp3')
                FORMAT 'TEXT' (delimiter=E',');
      
      postgres=# select * from pxf_hdfs_textsimple;
       location  | month | num_orders |    total_sales
      -----------+-------+------------+--------------------
       Prague    | Jan   |        101 | 4875.3299999999999
       Rome      | Mar   |         87 | 1557.3900000000001
       Bangalore | May   |        317 | 8936.9899999999998
       Beijing   | Jul   |        411 |           11600.67
      (4 rows)

      The following list describes the parameters in the LOCATION clause:

      • pxf://: the pxf protocol. It cannot be modified.

      • data/pxf_examples/pxf_hdfs_simple.txt: the pxf_hdfs_simple.txt file on HDFS.

      • PROFILE=hdfs:text: the profile that is used to query HDFS data. In this example, hdfs:text is used.

      • SERVER=hdp3: the location of the profile used to query HDFS data based on PXF. In this example, the profile is located in the PXF_SERVER/hdp3/ directory. This value is provided by the technical support of AnalyticDB for PostgreSQL.

      • FORMAT 'TEXT' (delimiter=E','): the file format and formatting option of the external data source. In this example, the file format is set to TEXT, and the delimiter is set to commas (,).

      For information about the parameter descriptions, see the "Syntax" section of this topic.

  • Example: Write data to HDFS in the TEXT or CSV format

    • Prerequisites: The /data/pxf_examples/pxfwritable_hdfs_textsimple1 directory is created on HDFS by using the following command:

      hdfs dfs -mkdir -p /data/pxf_examples/pxfwritable_hdfs_textsimple1
      Note

      To execute INSERT statements on AnalyticDB for PostgreSQL, you must have write permissions on the preceding directory.

      1. Create a writable external table on an AnalyticDB for PostgreSQL instance.

        CREATE WRITABLE EXTERNAL TABLE pxf_hdfs_writabletbl_1(location text, month text, num_orders int, total_sales float8)
                    LOCATION ('pxf://data/pxf_examples/pxfwritable_hdfs_textsimple1?PROFILE=hdfs:text&SERVER=hdp3')
                  FORMAT 'TEXT' (delimiter=',');
                  
        INSERT INTO pxf_hdfs_writabletbl_1 VALUES ( 'Frankfurt', 'Mar', 777, 3956.98 );
        INSERT INTO pxf_hdfs_writabletbl_1 VALUES ( 'Cleveland', 'Oct', 3812, 96645.37 );
      2. View information on HDFS.

        # View the directory.
        hdfs dfs -ls /data/pxf_examples/pxfwritable_hdfs_textsimple1
        
        # View data in the directory.
        hdfs dfs -cat /data/pxf_examples/pxfwritable_hdfs_textsimple1/*
        Frankfurt,Mar,777,3956.98
        Cleveland,Oct,3812,96645.37

Query Hive data

Data format

Profile

TextFile

Hive, HiveText

SequenceFile

Hive

RCFile

Hive, HiveRC

ORC

Hive, HiveORC, HiveVectorizedORC

Parquet

Hive

For information about FORMAT and formatting-properties, see the "Create an external table" section of this topic.

  • Example: Use the Hive profile

    1. Generate data.

      echo 'Prague,Jan,101,4875.33
      Rome,Mar,87,1557.39
      Bangalore,May,317,8936.99
      Beijing,Jul,411,11600.67
      San Francisco,Sept,156,6846.34
      Paris,Nov,159,7134.56
      San Francisco,Jan,113,5397.89
      Prague,Dec,333,9894.77
      Bangalore,Jul,271,8320.55
      Beijing,Dec,100,4248.41' > /tmp/pxf_hive_datafile.txt
    2. Create a Hive table.

      hive> CREATE TABLE sales_info (location string, month string,
              number_of_orders int, total_sales double)
              ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
              STORED AS textfile;
      
      hive> LOAD DATA LOCAL INPATH '/tmp/pxf_hive_datafile.txt'
              INTO TABLE sales_info;
      hive> SELECT * FROM sales_info;
    3. Create an external table and query it on an AnalyticDB for PostgreSQL instance.

      postgres=# create extension pxf;
      
      postgres=# CREATE EXTERNAL TABLE salesinfo_hiveprofile(location text, month text, num_orders int, total_sales float8)
      LOCATION ('pxf://default.sales_info?PROFILE=Hive&SERVER=hdp3')
      FORMAT 'custom' (formatter='pxfwritable_import');
      
      postgres=# SELECT * FROM salesinfo_hiveprofile;
         location    | month | num_orders | total_sales
      ---------------+-------+------------+-------------
       Prague        | Jan   |        101 |     4875.33
       Rome          | Mar   |         87 |     1557.39
       Bangalore     | May   |        317 |     8936.99
       Beijing       | Jul   |        411 |    11600.67
       San Francisco | Sept  |        156 |     6846.34
       Paris         | Nov   |        159 |     7134.56
      ......

      The following list describes the parameters in the LOCATION clause:

      • pxf://: the pxf protocol. It cannot be modified.

      • default.sales_info: the sales_info table that is stored in the default database on Hive.

      • PROFILE=Hive: the profile that is used to query Hive data. In this example, Hive is used.

      • SERVER=hdp3: the location of the profile used to query Hive data based on PXF. In this example, the profile is located in the PXF_SERVER/hdp3/ directory. This value is provided by the technical support of AnalyticDB for PostgreSQL.

      • FORMAT 'custom' (formatter='pxfwritable_import'): the file format and formatting option of the external data source. In this example, the file format is set to custom, and the formatter is set to pxfwritable_import.

      For information about the parameter descriptions, see the "Syntax" section of this topic.

  • Example: Use the HiveText profile

    postgres=# CREATE EXTERNAL TABLE salesinfo_hivetextprofile(location text, month text, num_orders int, total_sales float8)
                 LOCATION ('pxf://default.sales_info?PROFILE=HiveText&SERVER=hdp3')
               FORMAT 'TEXT' (delimiter=E',');
    postgres=# select * from salesinfo_hivetextprofile;
       location    | month | num_orders | total_sales
    ---------------+-------+------------+-------------
     Prague        | Jan   |        101 |     4875.33
     Rome          | Mar   |         87 |     1557.39
     Bangalore     | May   |        317 |     8936.99
     ......
  • Example: Use the HiveRC profile

    1. Create a Hive table.

      hive> CREATE TABLE sales_info_rcfile (location string, month string,
                   number_of_orders int, total_sales double)
                 ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
                 STORED AS rcfile;
      OK
      ## Import data.
      hive> INSERT INTO TABLE sales_info_rcfile SELECT * FROM sales_info;
      ## View data.
      hive> SELECT * FROM sales_info_rcfile;
    2. Create an external table and query it on an AnalyticDB for PostgreSQL instance.

      postgres=# CREATE EXTERNAL TABLE salesinfo_hivercprofile(location text, month text, num_orders int, total_sales float8)
                   LOCATION ('pxf://default.sales_info_rcfile?PROFILE=HiveRC&SERVER=hdp3')
                 FORMAT 'TEXT' (delimiter=E',');
      postgres=# SELECT location, total_sales FROM salesinfo_hivercprofile;
         location    | total_sales
      ---------------+-------------
       Prague        |     4875.33
       Rome          |     1557.39
       Bangalore     |     8936.99
       ......
  • Example: Use the HiveORC profile

    ORC-supporting profiles, HiveORC and HiveVectorizedORC, provide the following features:

    • Read a single row of data at a time.

    • Support column projection.

    • Support complex types and allow you to query Hive tables that are composed of array, map, struct, and union data types.

    1. Create a Hive table.

      hive> CREATE TABLE sales_info_ORC (location string, month string,
              number_of_orders int, total_sales double)
            STORED AS ORC;
      hive> INSERT INTO TABLE sales_info_ORC SELECT * FROM sales_info;
      hive> SELECT * FROM sales_info_ORC;
    2. Create an external table and query it on an AnalyticDB for PostgreSQL instance.

      postgres=# CREATE EXTERNAL TABLE salesinfo_hiveORCprofile(location text, month text, num_orders int, total_sales float8)
                   LOCATION ('pxf://default.sales_info_ORC?PROFILE=HiveORC&SERVER=hdp3')
                   FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import');
                   
      postgres=# SELECT * FROM salesinfo_hiveORCprofile;
      ......
       Prague        | Dec   |        333 |     9894.77
       Bangalore     | Jul   |        271 |     8320.55
       Beijing       | Dec   |        100 |     4248.41
      (60 rows)
      Time: 420.920 ms
  • Example: Use the HiveVectorizedORC profile

    • This profile provides the following features: Reads up to 1,024 rows of data at a time.

    • Does not support column projection.

    • Does not support complex types or the timestamp data type.

    Create an external table and query it on an AnalyticDB for PostgreSQL instance.

    CREATE EXTERNAL TABLE salesinfo_hiveVectORC(location text, month text, num_orders int, total_sales float8)
                 LOCATION ('pxf://default.sales_info_ORC?PROFILE=HiveVectorizedORC&SERVER=hdp3')
                 FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import');
                 
    select * from salesinfo_hiveVectORC;
       location    | month | num_orders | total_sales
    ---------------+-------+------------+-------------
     Prague        | Jan   |        101 |     4875.33
     Rome          | Mar   |         87 |     1557.39
     Bangalore     | May   |        317 |     8936.99
     Beijing       | Jul   |        411 |    11600.67
     San Francisco | Sept  |        156 |     6846.34
     ......
  • Example: Query a Parquet-format Hive table

    1. Create a Hive table.

      hive> CREATE TABLE hive_parquet_table (location string, month string,
                  number_of_orders int, total_sales double)
              STORED AS parquet;
              
      INSERT INTO TABLE hive_parquet_table SELECT * FROM sales_info;
      
      select * from hive_parquet_table;
    2. Create an external table and query it on an AnalyticDB for PostgreSQL instance.

      postgres=# CREATE EXTERNAL TABLE pxf_parquet_table (location text, month text, number_of_orders int, total_sales double precision)
          LOCATION ('pxf://default.hive_parquet_table?profile=Hive&SERVER=hdp3')
          FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import');
      postgres=# SELECT month, number_of_orders FROM pxf_parquet_table;
       month | number_of_orders
      -------+------------------
       Jan   |              101
       Mar   |               87
       May   |              317
       Jul   |              411
       Sept  |              156
       Nov   |              159
       ......