×
Community Blog Build an All-in-one Real-time Data Warehouse (Code-level) Based on AnalyticDB for PostgreSQL

Build an All-in-one Real-time Data Warehouse (Code-level) Based on AnalyticDB for PostgreSQL

This article introduces the process of building an all-in-one real-time data warehouse using AnalyticDB for PostgreSQL at the code level.

By Hanshu

1. Background

With the advancement of information technology and the widespread use of the internet, there is an increasing demand for business timeliness. In the field of data analysis, the need for timely data is also growing. Many businesses are transitioning from traditional batch-based offline modes to real-time stream-based modes. However, building a real-time data solution using stream processing engines and big data products comes with several disadvantages:

1) Different architectures for offline and real-time data warehouses, resulting in code reusability challenges and high development and operation costs;

2) Limited support for complex analysis services such as large offline dimension table associations, multi-layer nesting, and historical data tracing;

3) High debugging costs due to multiple components and the need to have a deep understanding of new concepts like window functions.

AnalyticDB for PostgreSQL provides an all-in-one real-time data warehouse solution based on real-time materialized views, allowing for the import of real-time data sources for analysis in the data warehouse.

2. Technical Solution

The process of building a real-time data warehouse typically involves three steps: real-time data writing, real-time processing, and real-time analysis (consumption). When dealing with small data volumes and simple business requirements (e.g., statistics on cumulative order numbers), the process can be accomplished using only a stream processing engine. However, for large-scale data, non-standardized data formats, complex computational logic, and heavy dependencies on intermediate tables, a simple real-time analysis link is insufficient to meet business needs. In such cases, it is necessary to adopt a layered architecture design similar to that used in data warehouses, such as ODS, DWD, DWS, and ADS. AnalyticDB for PostgreSQL seamlessly integrates the layered architecture of offline data warehouses using real-time materialized views, allowing users to focus on business design and application development. Additionally, the product ensures the reliability of real-time data flow.

2.1 Data Architecture

The architecture of a real-time data warehouse built with AnalyticDB for PostgreSQL is as follows:

1) Real-time writing: It offers high performance, low latency, and support for a wide range of data sources. The written data is immediately visible, and complete transaction support is provided.

2) Real-time processing: Incremental data is processed in real-time through real-time materialized views, enabling real-time ETL processing. Unlike normal materialized views, real-time materialized views automatically refresh when data is updated, eliminating the need for manual refresh commands. It is possible to create a real-time materialized view on a base table and create another real-time materialized view on the created one. This allows for automatic update of related cascaded real-time materialized views when the base table changes, enabling the construction of a real-time ETL processing chain for data analysis.

3) Real-time analysis: High-performance real-time analysis is achieved through the use of SIMD instruction set vectorized execution engine, CBO query optimizer, and columnar storage engine.

_1
(AnalyticDB for PostgreSQL real-time data warehouse architecture)

A typical workflow is as follows:

Real-time and offline multi-source data can be simultaneously written into real-time data warehouses with high performance. Real-time incremental importing and processing is performed based on pre-developed real-time materialized views through SQL. Data processing operations such as field cleansing and conversion, JOIN dimension addition, multi-dimensional GROUP BY pre-aggregation, and multi-link UNION ALL aggregation can be completed within the data warehouses. Finally, real-time materialized views can be directly used for real-time analysis, such as real-time Ad-hoc queries, analysis reports, real-time kanban, and real-time dashboards.

2.2 Core Advantages

Compared with traditional solutions, the all-in-one real-time data warehouse solution based on AnalyticDB for PostgreSQL has the following advantages:

2.2.1 Advanced Architecture and Optimal Cost

AnalyticDB for PostgreSQL allows you to build an all-in-one real-time data warehouse based on the real-time logs of ApsaraDB RDS for PostgreSQL databases. AnalyticDB for PostgreSQL provides more cost-effective services compared to stream processing engines and big data services. Additionally, it requires fewer components and provides better stability and performance in terms of operation and maintenance. The development and data transfer of an all-in-one real-time data warehouse are completed within the data warehouse, eliminating the need to transfer data between multiple systems repeatedly. Overall, this solution offers the following benefits:

1) Cost: The overall resource cost is optimized because there is only one copy of data, one system, and one write overhead.

2) Performance: There are no complex link flows, low resource overhead, and low data latency.

3) Development: A set of SQL statements is sufficient, without the need for multi-system adaptation and joint debugging.

4) O&M: You only need to maintain one system.

5) Maintenance: It is easy to identify and rectify data exceptions.

2.2.2 Support for Stream-batch Integration

AnalyticDB for PostgreSQL supports a wide range of data source writing methods in addition to synchronizing business database logs from Relational Database Service (RDS). This enables efficient fusion processing and fusion queries after data warehousing.

  1. Data Transmission Service (DTS) is connected to RDS binary logs to synchronize business data to the real-time data warehouse in a timely manner.
  2. It supports real-time data sources such as Message Queue Kafka and RocketMQ.
  3. It supports integration with real-time stream processing engines to facilitate data consumption.
  4. Data can be written to AnalyticDB for PostgreSQL by synchronizing data or reading foreign tables.

Real-time processing links built based on real-time materialized views support both stream and batch data processing with just one SQL statement. When a real-time materialized view is created for the first time, SQL statements are executed in batch mode to calculate the full result set. Subsequently, the SQL result set is incrementally and continuously maintained during the real-time writing process. In scenarios where only a portion of the data needs to be updated in real time and the rest of the data needs to be updated regularly, you can set some tables that participate in streaming JOIN to not participate in real-time updates. Then, at an appropriate time, you can refresh the result set that is regularly updated in batches using the Refresh batch mode.

_2
(Stream-batch integration advantages)

2.2.3 Support for Complex Batch Processing Tasks

One of the main challenges in building real-time data warehouses is converting complex batch processing tasks into real-time processing tasks. Generally, batch processing can easily handle complex SQL syntax, especially multiple nesting. However, stream processing has many limitations on SQL syntax. AnalyticDB for PostgreSQL leverages the advantages of traditional data warehouses in supporting complex SQL queries. Compared to stream computing engines, it reduces the cost of transforming complex batch processing tasks into real-time processing tasks. In fact, many customers' batch processing tasks can be directly processed in real-time.

2.2.4 Support for Unlimited Windows

Database engines are designed for disk storage, while stream computing engines are designed for memory. Compared to stream computing engines, database engines are better at supporting real-time JOIN of very large tables, especially complex real-time JOIN of multiple large tables. Real-time materialized views built on AnalyticDB for PostgreSQL support backtracking historical data without any window limit. Real-time materialized views are convenient for revising and backtracking historical data. You only need to update the raw data to automatically reflect it in real-time links.

2.2.5 Simple and Transparent Query Rewriting

In traditional data warehousing, business SQL statements need to explicitly specify access to preprocessed result sets. In an all-in-one real-time data warehouse, real-time analysis and real-time processing are combined in the same system, enabling them to interoperate and perceive each other. This easily supports transparent query rewriting. Real-time analysis can be accelerated or decelerated by establishing or revoking real-time processing links, without changing the fixed access SQL. With the powerful optimizer of the database, it automatically optimizes and selects the best acceleration scheme, and conveniently switches and maintains the hot and cold links.

The following image shows a typical example of transparent query rewriting. Business SQL only needs to access the base table for aggregation. The optimizer automatically optimizes and selects the real-time pre-aggregation links defined by the real-time materialized view, easily matching the optimal GROUP BY result set and switching between hot and cold links without modifying the code.

_3
(Transparent query rewriting)

3. Practice in Typical Scenarios

3.1 Dataset Loading

AnalyticDB for PostgreSQL comes with built-in standard TPCH datasets. For newly created or existing instances, you can use the following method to automatically load a TPC-H dataset. After the dataset is loaded, it will be stored in the adb_sampledata_realtime database.

3.1.1 Newly Created Instances

You can choose to load the Sample Dataset when creating a new instance.

_1

3.1.2 Existing Instances

For existing instances, you can load datasets through the console.

_2

3.2 Scenario 1: Real-time Large Wide Table

The main business scenarios include dimension table join, real-time wide table, and data summary, as shown in the following figure:

_5
(Real-time large wide table)

Taking the data provided by the TPC-H standard dataset as an example, the customer table stores customer information, the supplier table stores supplier information, the nation table stores nation information, the region table stores intercontinental region information such as Asia, Europe, and America, the orders table stores order information, the lineitem table stores detailed order information (one order can have multiple detailed orders), and the part table stores detailed information about products (the products are parts).

In real-time processing, the widening operation is performed in several steps:

1) The customer and supplier tables are joined with the nation and region tables to supplement the nation and intercontinental information of the customers and suppliers.

2) The lineitem table is joined with the customer table through the orders intermediate table. The result is filtered by nation (this is only for demonstration purposes, as in actual business scenarios, lineitem data sources from different countries may circulate through different channels) to generate detailed information of all orders from different countries (a wide table containing customer information of orders).

3) The lineitem_china and lineitem_us tables are unified into a lineitem_wide wide table that contains all data.

-- The supplier information dimension table is used to complete the names of regions and countries in real time.
DROP MATERIALIZED VIEW if exists supplier_info;
CREATE INCREMENTAL MATERIALIZED VIEW supplier_info AS 
SELECT 
  s.*, 
  r.r_name s_rname, 
  n.n_name s_nname
FROM 
  supplier s 
  join nation n ON s.s_nationkey = n.n_nationkey 
  join region r ON n.n_regionkey = r.r_regionkey
DISTRIBUTED BY (s_suppkey);
-- The customer dimension table is used to complete the names of regions and countries in real time.
DROP MATERIALIZED VIEW if exists customer_info;
CREATE INCREMENTAL MATERIALIZED VIEW customer_info AS 
SELECT 
  c.*, 
  r.r_name c_rname, 
  n.n_name c_nname
FROM 
  customer c 
  join nation n ON c.c_nationkey = n.n_nationkey 
  join region r ON n.n_regionkey = r.r_regionkey
DISTRIBUTED BY (c_custkey);
-- Clean and extract the factual data of buyers in China from the raw data of lineitem, and associate dimension tables such as customer and parts information to form a wide table from the perspective of customers in China.
DROP MATERIALIZED VIEW if exists lineitem_china;
CREATE INCREMENTAL MATERIALIZED VIEW lineitem_china AS
SELECT
  l_orderkey, l_partkey, l_suppkey, l_linenumber, l_quantity,
  l_extendedprice, l_discount, l_tax, l_returnflag, l_linestatus,
  l_shipdate, l_commitdate, l_receiptdate,
  CASE
    WHEN l_shipinstruct = 'NONE' THEN 0
    WHEN l_shipinstruct = 'COLLECT COD' THEN 1
    WHEN l_shipinstruct = 'DELIVER IN PERSON' THEN 2
    WHEN l_shipinstruct = 'TAKE BACK RETURN' THEN 3
  END l_shipinstruct_flag,
  CASE WHEN l_shipmode LIKE '%AIR%' THEN 0 ELSE 1 END l_shipmode_flag,
  o_orderkey, o_custkey, o_orderstatus, o_totalprice, o_orderdate,
  p_partkey, p_name, p_type, p_size, p_retailprice,
  c_custkey, c_name, c_phone, c_nname, c_rname
FROM
  lineitem l
  LEFT JOIN orders o ON l.l_orderkey = o.o_orderkey
  LEFT JOIN part p ON l.l_partkey = p.p_partkey
  LEFT JOIN customer_info c ON o.o_custkey = c.c_custkey
WHERE
  c_nname = 'CHINA'
DISTRIBUTED BY (l_orderkey);
-- Clean and extract the factual data of buyers in the United States from the original data of lineitem, and associate dimension tables such as customer and parts information to form a wide table from the perspective of customers in the United States.
DROP MATERIALIZED VIEW if exists lineitem_us;
CREATE INCREMENTAL MATERIALIZED VIEW lineitem_us AS
SELECT
  l_orderkey, l_partkey, l_suppkey, l_linenumber, l_quantity,
  l_extendedprice, l_discount, l_tax, l_returnflag, l_linestatus,
  l_shipdate, l_commitdate, l_receiptdate,
  CASE
    WHEN l_shipinstruct = 'NONE' THEN 0
    WHEN l_shipinstruct = 'COLLECT COD' THEN 1
    WHEN l_shipinstruct = 'DELIVER IN PERSON' THEN 2
    WHEN l_shipinstruct = 'TAKE BACK RETURN' THEN 3
  END l_shipinstruct_flag,
  CASE WHEN l_shipmode LIKE '%AIR%' THEN 0 ELSE 1 END l_shipmode_flag,
  o_orderkey, o_custkey, o_orderstatus, o_totalprice, o_orderdate,
  p_partkey, p_name, p_type, p_size, p_retailprice,
  c_custkey, c_name, c_phone, c_nname, c_rname
FROM
  lineitem l
  LEFT JOIN orders o ON l.l_orderkey = o.o_orderkey
  LEFT JOIN part p ON l.l_partkey = p.p_partkey
  LEFT JOIN customer_info c ON o.o_custkey = c.c_custkey
WHERE
  c_nname = 'UNITED STATES'
DISTRIBUTED BY (l_orderkey);
-- The summary wide table of detailed data from all dimensions, including supplier, customer, and part information.
DROP MATERIALIZED VIEW if exists lineitem_wide;
CREATE INCREMENTAL MATERIALIZED VIEW lineitem_wide AS
SELECT
  l.*,
  s_suppkey, s_name, s_nname, s_rname
FROM
  lineitem_china l
  LEFT JOIN supplier_info s ON l.l_suppkey = s.s_suppkey
UNION ALL
SELECT
  l.*,
  s_suppkey, s_name, s_nname, s_rname
FROM
  lineitem_us l
  LEFT JOIN supplier_info s ON l.l_suppkey = s.s_suppkey
DISTRIBUTED BY (l_orderkey);
Real-time Analysis Effect

When you perform analytical queries, such as the Q7 queries of TPCH standard queries, you can improve the query performance by 5 to 10 times when you use a wide table after real-time processing.

-- Original TPCH Q7
SELECT supp_nation,
         cust_nation,
         l_year,
         sum(volume) AS revenue
FROM 
    (SELECT n1.n_name AS supp_nation,
         n2.n_name AS cust_nation,
         extract( year
    FROM l_shipdate ) AS l_year, l_extendedprice * (1 - l_discount) AS volume
    FROM supplier, lineitem, orders, customer, nation n1, nation n2
    WHERE s_suppkey = l_suppkey
            AND o_orderkey = l_orderkey
            AND c_custkey = o_custkey
            AND s_nationkey = n1.n_nationkey
            AND c_nationkey = n2.n_nationkey
            AND ( ( n1.n_name = 'CHINA'
            AND n2.n_name = 'UNITED STATES' )
            OR ( n1.n_name = 'UNITED STATES'
            AND n2.n_name = 'CHINA' ) )
            AND l_shipdate
        BETWEEN date '1995-01-01'
            AND date '1996-12-31' ) AS shipping
GROUP BY  supp_nation, cust_nation, l_year
ORDER BY  supp_nation, cust_nation, l_year;
-- Use a real-time wide table for AD-HOC queries.
SELECT supp_nation,
         cust_nation,
         l_year,
         sum(volume) AS revenue
FROM 
    (SELECT s_nname AS supp_nation,
         c_nname AS cust_nation,
         extract( year
    FROM l_shipdate ) AS l_year, l_extendedprice * (1 - l_discount) AS volume
    FROM lineitem_wide
    WHERE ( ( s_nname = 'CHINA'
            AND c_nname = 'UNITED STATES' )
            OR ( s_nname = 'UNITED STATES'
            AND c_nname = 'CHINA' ) )
            AND l_shipdate
        BETWEEN date '1995-01-01'
            AND date '1996-12-31' ) AS shipping
GROUP BY  supp_nation, cust_nation, l_year
ORDER BY  supp_nation, cust_nation, l_year;

3.3 Scenario 2: Real-time Multi-dimensional Aggregate Summary

_6
(Real-time multi-dimensional aggregation)

Taking the tables provided by the TPC-H standard dataset as an example, the lineitem table stores detailed order information, and the supplier table stores supplier information. After joining the lineitem table and the supplier table and initially aggregating and summarizing the order status, a real-time materialized view called lineitem_sup_sum is generated. This view includes regional information. Additionally, the data can be further aggregated and summarized by sub-regions to create a real-time materialized view called lineitem_sum, which contains global summary data.

-- lineitem provides partial summary statistics by supplier dimension, and can provide analysis by supplier territory (country, region) dimension. At the same time, l_shipdate partition is used to filter by time condition during subsequent calculation.
DROP MATERIALIZED VIEW if exists lineitem_sup_sum;
CREATE INCREMENTAL MATERIALIZED VIEW lineitem_sup_sum AS
SELECT
  sum(l_quantity) as sum_qty, 
  sum(l_extendedprice) as sum_base_price, 
  sum(
    l_extendedprice * (1 - l_discount)
  ) as sum_disc_price, 
  sum(
    l_extendedprice * (1 - l_discount) * (1 + l_tax)
  ) as sum_charge, 
  count(*) as count_order, 
  l_returnflag, 
  l_linestatus, 
  l_shipdate, 
  l_shipmode, 
  s_rname, 
  s_nname
FROM
  lineitem 
  LEFT JOIN supplier_info s ON l_suppkey = s_suppkey 
GROUP BY
  l_returnflag, 
  l_linestatus, 
  l_shipdate, 
  l_shipmode, 
  s_rname, 
  s_nname
DISTRIBUTED BY (l_shipdate)
;
-- Based on the lineitem_sup_sum table and further summary, provide summary data excluding the supplier dimension. As the data volume is small, further partitioning is unnecessary.
DROP MATERIALIZED VIEW if exists lineitem_sum;
CREATE INCREMENTAL MATERIALIZED VIEW lineitem_sum AS
SELECT
  sum(sum_qty) as sum_qty, 
  sum(sum_base_price) as sum_base_price, 
  sum(sum_disc_price) as sum_disc_price, 
  sum(sum_charge) as sum_charge, 
  sum(count_order) as count_order, 
  l_returnflag, 
  l_linestatus, 
  l_shipdate, 
  l_shipmode
FROM
  lineitem_sup_sum
GROUP BY
  l_returnflag, 
  l_linestatus, 
  l_shipdate, 
  l_shipmode
DISTRIBUTED BY (l_shipdate);
Real-time Analysis Effect

You can use transparent query rewriting to leverage the result sets of real-time pre-aggregated processing (enabled by default). By enabling or disabling transparent query rewriting, you will notice a significant improvement in the effectiveness of aggregation analysis, with performance gains of more than 10 to 20 times.

-- AD-HOC aggregate queries (automatic transparent query rewriting).
set enable_incremental_matview_query_rewrite to off; -- 关闭查询改写查看原始SQL效果
SELECT
  sum(l_quantity) as sum_qty, 
  sum(l_extendedprice) as sum_base_price, 
  sum(
    l_extendedprice * (1 - l_discount)
  ) as sum_disc_price, 
  sum(
    l_extendedprice * (1 - l_discount) * (1 + l_tax)
  ) as sum_charge
FROM
  lineitem 
  LEFT JOIN supplier_info s ON l_suppkey = s_suppkey
WHERE
  l_shipdate BETWEEN date '1995-01-01' AND date '1996-12-31'
GROUP BY
  l_returnflag, 
  l_linestatus,
  l_shipmode;
-- Transparent query rewriting uses the result sets of real-time pre-aggregated materialized views for computing.
SELECT
  sum(l_quantity) as sum_qty, 
  sum(l_extendedprice) as sum_base_price, 
  sum(
    l_extendedprice * (1 - l_discount)
  ) as sum_disc_price, 
  sum(
    l_extendedprice * (1 - l_discount) * (1 + l_tax)
  ) as sum_charge
FROM
  lineitem 
  LEFT JOIN supplier_info s ON l_suppkey = s_suppkey
WHERE
  l_shipdate BETWEEN date '1995-01-01' AND date '1996-12-31'
GROUP BY
  l_returnflag, 
  l_linestatus,
  l_shipmode;

3.4 Scenario 3: Real-time and Offline Data Combination

In actual real-time scenarios, there are often large fact tables, such as detailed order information, that need to be reflected in the analysis results in real-time. Some dimension tables have infrequent updates, and changes in certain attribute fields do not require high timeliness. Considering the cost of real-time updates, a combination of stream and batch processing can be used.

_7
(Stream-batch Combination)

Taking the tables provided by the TPCH standard dataset as an example, the customer table stores customer information, the orders table stores order information, the lineitem table stores detailed order information (an order can have multiple detailed orders), and the part table stores detailed information about products (the products are parts).

By joining the lineitem table, orders table, customer table, and part table, we can create a real-time wide table called lineitem_stream, which contains complete order information. When creating real-time materialized views, we can use WITH(bypass='customer_info,part') to specify that changes in the customer_info and part tables do not need to be reflected in historical data in real-time, while updates in other tables are immediately reflected in the result set of the real-time wide table. Then, a full refresh operation is periodically performed to backfill the updates of the customer_info and part tables in the lineitem_stream real-time wide table.

-- WITH(bypass='customer_info,part') specifies that the updates of the customer_info and part tables do not backtrack the historical join data, the data newly written to lineitem and orders tables will be joined in real time with the latest customer_info and part information, but will not backtrack the historical information.
DROP MATERIALIZED VIEW if exists lineitem_stream;
CREATE INCREMENTAL MATERIALIZED VIEW lineitem_stream WITH(bypass='customer_info,part') AS
SELECT
  l_orderkey, l_partkey, l_suppkey, l_linenumber, l_quantity,
  l_extendedprice, l_discount, l_tax, l_returnflag, l_linestatus,
  l_shipdate, l_commitdate, l_receiptdate,
  CASE
    WHEN l_shipinstruct = 'NONE' THEN 0
    WHEN l_shipinstruct = 'COLLECT COD' THEN 1
    WHEN l_shipinstruct = 'DELIVER IN PERSON' THEN 2
    WHEN l_shipinstruct = 'TAKE BACK RETURN' THEN 3
  END l_shipinstruct_flag,
  CASE WHEN l_shipmode LIKE '%AIR%' THEN 0 ELSE 1 END l_shipmode_flag,
  o_orderkey, o_custkey, o_orderstatus, o_totalprice, o_orderdate,
  p_partkey, p_name, p_type, p_size, p_retailprice,
  c_custkey, c_name, c_phone, c_nname, c_rname
FROM
  lineitem l
  LEFT JOIN orders o ON l.l_orderkey = o.o_orderkey
  LEFT JOIN part p ON l.l_partkey = p.p_partkey
  LEFT JOIN customer_info c ON o.o_custkey = c.c_custkey
WHERE
  c_nname = 'CHINA'
DISTRIBUTED BY (l_orderkey);
-- Backfill the customer_info and part data in batches regularly.
REFRESH MATERIALIZED VIEW lineitem_stream;
0 1 0
Share on

ApsaraDB

439 posts | 93 followers

You may also like

Comments