By Digoal
In stream computing applications, retaining the window and data recalculability-which is, to put in simpler terms, the ability to set the start point to compute again-as well as the overwriting of existing computing results are two special but very useful requirements. What does all of this mean? Well, consider a few examples to better understand these requirements.
Consider this issue. Streaming data comes from multiple business lines and is generated at different times. However, due to system defects, the data arrival sequence may be incorrect. For example, data that was generated earlier may actually arrive later, rather than the chronological sequence that would be expected.
Now, here's a more real-worldish example from an e-commerce scenario: a customer places an order first, then pays for the order, and then, as the final step, good or service that the customer bought is delivered. The related data is generated by multiple business lines-one for the order placing, another for the payment, and another for the shipping or delivery of the order.
These should come into the system in the same chronological order that they occurred in reality: order, then payment, and then delivery. However, the actual data arrival sequence may be, and often is, as follows: payment, then order, and then delivery. This incorrect data arrival sequence affects the stream computing accuracy and can be corrected through recalculation.
Now to get back to the requirements above. These capabilities can help fix these sorts of issues. Let's explain a few things. Recalculation can be abstracted into the following stream computing functions:
To satisfy PostgreSQL's User-Defined Functions (UDFs), async batch consume with atomicity, transaction atomicity, result persistence, support for INSERT ON CONFLICT write merging, and flash back (rolling back computing results to a certain time point) can be used to address the preceding requirements.
This article uses SDT as an example to provide a design suggestion and demo for this type of stream computing in PostgreSQL.
1. Create an input table.
2. Create a window retention table.
3. Create a result table.
4. Write data into the input table (enable pgbench, stress testing, and real-time writing).
5. Compile the batch consumption and SDT functions (PLpgSQL).
Including the recalculation logic:
Result table without records
Input table. Delete up to N records from the record set R1 in chronological order.
Window table. Insert R1.
Compute R1 and write the result to the result table.
Result table with records (check whether any results need to be recalculated)
Window table. Retrieve the minimum time T2 and the maximum time T3.
Input table. Directly delete data if the data < T2.
Input table. Retrieve the minimum time T1.
Result table. Retrieve the maximum time T4.
If T1 <= T4:
Result table. Delete data if the data >= T1.
Window table. Retrieve the record set R2, which contains data >= T1.
Input table. Delete the record set R3 (where ts <= T3) in chronological order.
Window table. Insert R3.
Input table. Delete up to N records from the record set R4 in chronological order.
Window table. Insert R4.
Compute R2 + R3 + R4 and write the result to the result table.
If T1 > T4:
Window table. Retrieve the record set R5 that contains data > = T4.
Input table. Delete up to N records from the record set R6 in chronological order.
Window table. Insert R6.
Compute R5 + R6 and write the result to the result table.
6. Schedule the SDT function.
PipelineDB Sharding Cluster for Distributed Stream Computing
digoal - July 24, 2019
Alibaba Clouder - July 29, 2019
Alibaba Clouder - July 31, 2019
digoal - July 25, 2019
digoal - May 16, 2019
digoal - January 25, 2021
Alibaba Cloud PolarDB for PostgreSQL is an in-house relational database service 100% compatible with PostgreSQL and highly compatible with the Oracle syntax.
Learn MoreAn online MPP warehousing service based on the Greenplum Database open source program
Learn MoreAn array of powerful multimedia services providing massive cloud storage and efficient content delivery for a smooth and rich user experience.
Learn MoreAlibaba Cloud Function Compute is a fully-managed event-driven compute service. It allows you to focus on writing and uploading code without the need to manage infrastructure such as servers.
Learn MoreMore Posts by digoal