How to Do Incremental Processing Correctly

By Mujiao

1. Incremental Processing

More than ten years ago, incremental processing was not a method that was worth mentioning because the storage and computing performance of relational databases was very limited. At that time, incremental processing was the most common method; even the MPP database platform did not do all full processing.

The database system supports transactions, and characteristics of atomicity, consistency, isolation, and durability (ACID) can support updating, deleting, and inserting operations on a data table at the same time. The data storage of the database system is on each 4 KB or 8 KB data block, and the detailed statistical information and index structure help us do incremental data processing efficiently.

1.1. Brief Description of Problems

These operations are not easy in the current distributed file systems, such as MaxCompute. The data block has been 64 MB, rather than 4 KB or 8 KB. We also do not have the index structure that accelerates searching 50 rows of data from 10 million rows.

How do we do incremental processing in MaxCompute? Honestly, it is not easy. Since there is no index structure, we have to use full data retrieval for each processing. If the submission is still as frequent as in the relational database, incremental processing fails to gain the advantages of performance and resource. If we still need to use features (such as deleting and updating) supported by relational databases, we can refer to the new feature Transactional Table recently launched by MaxCompute public cloud.

Should we do it? To sum up, we can do incremental processing for those parts that can be done but do not force to do it or do it on a large scale. After all, it is not easy to do incremental processing.

2. Solutions

The premise of incremental processing is that we have obtained incremental data. Compared with full data, incremental data is a smaller set. Then, we expect to use this small set of incremental data to complete the data processing instead of using full data. Therefore, the entire data processing can be completed more quickly and economically.

However, incremental processing in MaxCompute is summarized into two scenarios:

Scenario 1: The resources required for full processing cannot meet the requirement for timeliness, and performance needs to be optimized.

Scenario 2: The incremental processing logic is simple and has advantages over full processing in performance.

2.1. Processing Principles

We need to establish some principles of using incremental processing. It is unreasonable or incorrect to break through or not abide by these principles.

Incremental table (incremental state [UDIK] and data update time)
Two incremental tables cannot be directly associated, and at least one of the tables is a full data table.
The result table generated by incremental processing needs to record the incremental state and data update time.
If multiple tables are associated, incremental identifiers of multiple tables need to be taken. As long as the associated row of a table is incremental, the incremental identifier of the table is used.
Only the INSERT and DELETE states of the primary table or the INNER JOIN table can be passed to the next layer, and the incremental states of other tables are UPDATE.

2.2. MERGE Logic

After incremental data is integrated into the MaxCompute platform, you need to perform a MERGE operation to generate full data at the ODS layer. Therefore, MERGE logic is the simplest and most classic incremental processing logic. The simplest MERGE logic is listed below:

INSERT OVERWRITE TABLE table_old PARTITION(ds='${ds}')
SELECT `(ds)?+.+`
FROM table_old a –Full table
LEFT ANTI JOIN
table_new b –Incremental table
ON a.pk = b.pk
AND b.ds = '${ds}'
WHERE a.ds = '${lastdate}'
UNION ALL
SELECT b.*
FROM table_new b
WHERE b.ds = '${ds}' 
-- AND b.operation not in('D')
;

This logic uses a JOIN logic and a UNION logic to implement a MERGE logic, which merges increment data into full data. Here is an option [-- AND b.operation not in('D')]. You can select whether to delete the physical deletion from the current full table according to actual business requirements.

2.3. Business Computational Logic

MERGE logic is the simplest logic that involves increment data, but the actual business computational logic is more complex than this scenario.

2.3.1. The Processing of Two Incremental Tables

We also have two tables in the MERGE logic, but they are incremental and full tables, respectively. If there are two incremental tables, how can we deal with them? Based on the principle that two incremental tables cannot be associated, we must introduce a full table.

We need to use the daily incremental data and full data of two tables, meaning four tables are involved in the computing.
If you do not want the full table to be directly associated, you need to find the set union of the primary keys of the two incremental tables. Then, the set union is separated from the full data of the two tables, and then the two tables are associated.

The logic is listed below:

- The incremental table of ta_add ta table

- The full table of ta_all ta table

- The incremental table of tb_add tb table

- The full table of tb_all tb table

- *Note:* MAPJOIN is used in this scenario. The amount of data in the incremental table is limited.

with tx_add as(
 select distinct pk from(
 select pk from ta_add where ds = '${ds}'
 union all
 select pk from tb_add where ds = '${ds}')t)
,ta_add2(
 select /*+mapjoin(t2)*/ t1.* 
   from ta_all t1 join tx_add t2 on t1.pk=t2.pk
  where t1.ds = '${ds}'
)
,tb_add2(
 select /*+mapjoin(t2)*/ t1.*
   from tb_all t1 join tx_add t2 on t1.pk=t2.pk
  where t1.ds = '${ds}'
)
insert overwrite table tc_add partition (ds='${ds}')
select *
from ta_add2 t1 join tb_add2 t2 on t1.pk=t2.pk
;

This logic takes advantage of the feature of small incremental tables and the use of MAPJOIN to quickly produce two associative set unions and then associate them. This avoids the redistribution of large tables, which improves operational efficiency and reduces resource consumption. The reason for performing incremental processing here is that the table is really large. If the full tables are two million-level tables, it is recommended to test the performance, which may be simpler and more efficient to associate directly. Therefore, it is unnecessary to perform incremental processing and computing in many scenarios in MaxCompute.

2.3.2. The Processing of More Than Two Incremental Tables

In general, tables of incremental processing still refer to business tables rather than small tables (such as code tables or parameter tables). The performance gap between incremental and full correlation computing can be ignored for such a small table that contains ten or more thousands of data entries. Incremental calculation is also of little significance for tables containing millions of data entries. Let's look at the long logic in the preceding section. It only needs two lines of code, but it has become more complicated. If more than two tables are associated with the same PK, the logic can still be used for the tables. If there are multiple association PKs, the problem will be changed from one dimension to two dimensions. Unless it is really necessary, it is not recommended to engage in incremental processing.

The scenario I encountered in the process of this optimization is incremental processing of much more than two tables, and the association PKs are also multiple. The developers originally selected the primary table as the incremental table, and the other tables are all the computational logic of the full table. Since this is a minute-level task, the original developers probably still expect to do some efficient design from the performance perspective.

However, the primary table is not too large in this scenario, which contains millions of data entries. On the contrary, some left-associated tables contain tens of millions of data entries. Therefore, I do not see the significance of incremental processing to bring huge performance improvement. In addition, the increment of the primary table is to identify the last time segment through a specified time interval, which is feasible when the integrated source is the source system. However, this happens to be the secondary database of an unstable secondary database. Therefore, using a fixed time interval may not be recognized as an increment due to data delay.

Simply, I will directly change to use full processing, so there is no problem. However, it is impossible to identify which data are incremental data, which involves the incremental push problem mentioned below.

2.4. Incremental Push Logic

There are two ways to gain the incremental data required to be pushed. One is to keep increment identifier fields from the original increment, and the other is to compare the full data T and T +1 to generate the increment data from the final result. Like the first scenario we mentioned above, we need to maintain the mark of the incremental identifier during the processing and calculate the result of this field after association.

2.4.1. Calculation of Incremental Identifier Field

The increment identifier data is calculable if it is to be used for calculation. In the preceding section, I mentioned that the data time of delayed business in the system is unreliable, so how can we judge the incremental data? It is difficult for the integration task to obtain reliable timestamps from the secondary database. However, the incremental data for this integration is a definite incremental set. Therefore, this ETL_DATE (which is usually the bizdate or yyyymmddhhmiss) of our DataWorks is the incremental timestamp available for batch processing in MaxCompute. The change states of the data identified by the database synchronization tool include inserting, deletion, update, and primary key update (I, D, U, and K), which we can use directly.

The logic we use here is listed below:

select ...
,case when a.etl_partition ='${ds}' then a.etl_partition
     when b.etl_partition ='${ds}' then b.etl_partition
     ...
     else a.etl_partition end as etl_date
,case when a.etl_partition ='${ds}' then a.operation
     when b.etl_partition ='${ds}' then 'U'
     ...
     else a.operation end as operation
from tablea a
left join tableb on a.pk=b.pk
...
where ...;

Therefore, this method can keep the incremental state. Nevertheless, the calculated result is superimposed over and over again, so you may not know whether it is right or not. As a result, we need to look at the specific business scenarios.

2.4.2. Full Field Comparison

Full field comparison is a violent calculation method. Without incremental processing, I can calculate incremental data with this method. This calculation result is still true and reliable, especially compared to a business result table after multi-layer calculation.

The full field comparison logic is listed below:

There are more records in the T+1-day table than in the T-day table: INSERT status.
The T-day table has more records than the T+1-day table: DELETE status.
Compare the T-day table and the T+1-day table, after the association non-primary key field values of the same primary key are inconsistent: UPDATE status.

This comparison consumes a lot of computing resources, especially transaction tables and event tables with the finest business granularity. However, the problem is not serious for some user tables. The comparison logic is listed below:

-- I
select a.*,'I' as operation
 from      table1 a
 left join table1 b on a.pk = b.pk and b.ds='${lastdate}'
 where a.ds='${ds}'
  and b.pk is null
;
-- D
select a.*,'D' as operation
 from      table1 a
 left join table1 b on a.pk = b.pk and b.ds='${ds}'
 where a.ds='${lastdate}'
  and b.pk is null
;
-- U
select a.*,'U' as operation
 from table1 a
 join table1 b on a.pk = b.pk and b.ds='${ds}'
 where a.ds='${lastdate}'
  and(coalesce(b.col,'')<>coalesce(b.col,'') -- cagracter
   or coalesce(b.col,0)<>coalesce(b.col,0)   -- numeric value
   or coalesce(b.col,'0001-01-01')<>coalesce(b.col,'0001-01-01'))  -- date
;

The full field comparison does not look elegant; it is a bit rough. You may have an easier way to identify incremental data. You can try other ways, but the full field comparison will be the guaranteed method.

3. Summary

This article explains the incremental processing method. I hope the method I explained in this article can help you to use incremental processing correctly in future projects and obtain significant performance improvements in some scenarios. In addition, I need to mention that incremental processing logic is more complex than full processing logic, and more complicated maintenance problems will be encountered, such as exception troubleshooting and data supplement. In the actual project, we must weigh the pros and cons before deciding on plans.

Community

How to Do Incremental Processing Correctly

1. Incremental Processing

1.1. Brief Description of Problems

2. Solutions

2.1. Processing Principles

2.2. MERGE Logic

2.3. Business Computational Logic

2.3.1. The Processing of Two Incremental Tables

2.3.2. The Processing of More Than Two Incremental Tables

2.4. Incremental Push Logic

2.4.1. Calculation of Incremental Identifier Field

2.4.2. Full Field Comparison

3. Summary

Read previous post:

Read next post:

Alibaba Cloud MaxCompute

You may also like

Comments

Alibaba Cloud MaxCompute

Related Products

Big Data Consulting for Data Technology Solution

MaxCompute

Big Data Consulting Services for Retail Solution

ApsaraDB for HBase