By Mujiao
By Mujiao. More than ten years ago, incremental processing was not a method worth mentioning separately, because the storage and computing capacity of relational databases was very limited. At that time, incremental processing was the most common approach; even MPP database platforms did not process everything in full.
Database systems support transactions, and the atomicity, consistency, isolation, and durability (ACID) properties allow update, delete, and insert operations to run on a data table at the same time. A database stores data in 4 KB or 8 KB blocks, and detailed statistics and index structures help us process incremental data efficiently.
These operations are not easy on today's distributed storage systems, such as MaxCompute. The data block is 64 MB rather than 4 KB or 8 KB, and there is no index structure to accelerate finding 50 rows among 10 million.
How do we do incremental processing in MaxCompute? Honestly, it is not easy. Since there is no index structure, every processing run has to scan the full data. If submissions are as frequent as in a relational database, incremental processing gains no advantage in performance or resource consumption. If we still need features supported by relational databases, such as delete and update, we can refer to the Transactional Table feature recently launched on the MaxCompute public cloud.
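For reference, a transactional table is declared through a table property and then accepts UPDATE and DELETE statements. The following is a minimal sketch; the table and column names are made up for illustration:
create table user_snapshot (user_id bigint, status string)
tblproperties ("transactional"="true");
-- row-level changes become possible on such a table
update user_snapshot set status = 'inactive' where user_id = 1001;
delete from user_snapshot where user_id = 1002;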
Should we do it? To sum up: do incremental processing where it can be done naturally, but do not force it or apply it on a large scale. After all, incremental processing is not easy.
The premise of incremental processing is that we have already obtained the incremental data. Compared with the full data, the incremental data is a much smaller set, and we expect to complete the data processing with this small set instead of the full data, so the whole job finishes faster and more economically.
Incremental processing in MaxCompute boils down to two scenarios:
Scenario 1: Full processing cannot meet the timeliness requirement with acceptable resources, and performance needs to be optimized.
Scenario 2: The incremental processing logic is simple and has advantages over full processing in performance.
We need to establish some principles for using incremental processing; breaking or ignoring these principles is usually unreasonable or incorrect.
After incremental data is integrated into the MaxCompute platform, a MERGE operation is needed to generate the full data at the ODS layer. The MERGE logic is therefore the simplest and most classic piece of incremental processing. Its simplest form is listed below:
INSERT OVERWRITE TABLE table_old PARTITION(ds='${ds}')
SELECT `(ds)?+.+` -- all columns except the partition column ds
FROM table_old a -- full table
LEFT ANTI JOIN
table_new b -- incremental table
ON a.pk = b.pk
AND b.ds = '${ds}'
WHERE a.ds = '${lastdate}'
UNION ALL
SELECT `(ds)?+.+` -- likewise exclude ds so both branches align
FROM table_new b
WHERE b.ds = '${ds}'
-- AND b.operation not in('D')
;
This logic uses one JOIN and one UNION to implement a MERGE that folds the incremental data into the full data. The commented line [-- AND b.operation not in('D')] is optional: enable it if records physically deleted at the source should also disappear from the current full table, depending on your business requirements.
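A side note on the `(ds)?+.+` expression used above: MaxCompute accepts a backquoted regular expression in the SELECT list to exclude columns, which matters here because the partition column ds must not appear in the body of a static-partition INSERT. A small illustration against a hypothetical table t:
-- all columns of t except ds
select `(ds)?+.+` from t;
-- all columns except ds and etl_date
select `(ds|etl_date)?+.+` from t;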
MERGE is the simplest logic that involves incremental data; actual business computation is usually more complex than this scenario.
The MERGE logic also involves two tables, but one is incremental and one is full. What if both tables are incremental? Based on the principle that two incremental tables cannot be associated directly, we must introduce a full table.
The logic is listed below:
- ta_add: the incremental table of table ta
- ta_all: the full table of table ta
- tb_add: the incremental table of table tb
- tb_all: the full table of table tb
- *Note:* MAPJOIN is used in this scenario. The amount of data in the incremental table is limited.
with tx_add as(
select distinct pk from(
select pk from ta_add where ds = '${ds}'
union all
select pk from tb_add where ds = '${ds}')t)
,ta_add2 as(
select /*+mapjoin(t2)*/ t1.*
from ta_all t1 join tx_add t2 on t1.pk=t2.pk
where t1.ds = '${ds}'
)
,tb_add2 as(
select /*+mapjoin(t2)*/ t1.*
from tb_all t1 join tx_add t2 on t1.pk=t2.pk
where t1.ds = '${ds}'
)
insert overwrite table tc_add partition (ds='${ds}')
select * -- in a real job, enumerate the output columns; * here would carry duplicate pk and ds columns
from ta_add2 t1 join tb_add2 t2 on t1.pk=t2.pk
;
This logic takes advantage of the small size of the incremental tables: MAPJOIN quickly produces the two associated subsets, which are then joined. This avoids redistributing the large full tables, which improves efficiency and reduces resource consumption. Incremental processing is worthwhile here only because the tables are really large. If the full tables are both at the million-row level, test the performance first; joining them directly may be simpler and just as efficient. In many MaxCompute scenarios, incremental processing and computing are simply unnecessary.
In general, incremental processing targets business tables rather than small tables such as code tables or parameter tables. For a small table with tens of thousands of rows, the performance gap between incremental and full association is negligible, and incremental calculation means little even for tables with millions of rows. Compare this with the long logic in the preceding section: the full version needs only a couple of lines of code (see the sketch below), yet the incremental version has become much more complicated. If more than two tables are associated on the same PK, the logic above can still be extended; if there are multiple association PKs, the problem goes from one dimension to two. Unless it is really necessary, incremental processing is not recommended.
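For comparison, the full version of the same computation is essentially the following; a minimal sketch using the same hypothetical tables as above:
insert overwrite table tc_add partition (ds='${ds}')
select t1.pk, ... -- enumerate the needed columns from both sides
from ta_all t1 join tb_all t2 on t1.pk = t2.pk
where t1.ds = '${ds}' and t2.ds = '${ds}'
;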
The scenario I encountered during this optimization involved incremental processing of far more than two tables, with multiple association PKs. The original developers chose the primary table as the incremental table and computed all the other tables in full. Since this was a minute-level task, they presumably wanted an efficient design from the performance perspective.
However, the primary table in this scenario is not that large, only millions of rows, while some of the left-joined tables contain tens of millions. So I do not see incremental processing bringing a huge performance improvement here. In addition, the increment of the primary table is identified by taking the latest time segment within a specified time interval, which works when data is integrated directly from the source system. Here, however, the source happened to be an unstable secondary database, so a fixed time interval may fail to capture records whose arrival is delayed.
The simple fix was to switch to full processing, which eliminated the problem. However, full processing cannot tell which data is incremental, which brings us to the incremental push problem discussed below.
There are two ways to obtain the incremental data to be pushed. One is to carry increment identifier fields through from the original increment; the other is to compare the full data of day T with day T+1 and derive the increment from the final result. In the first approach, as in the scenario above, we maintain the incremental identifier throughout the processing and calculate the value of this field after each association.
The increment identifier is usable only if it can be calculated reliably. As mentioned above, the business timestamps of a delayed system are unreliable, so how do we judge what is incremental? The integration task cannot obtain reliable timestamps from the secondary database, but the data it loads in each run is a definite incremental set. Therefore, the ETL_DATE of our DataWorks task (usually the bizdate or yyyymmddhhmiss) serves as the incremental timestamp for batch processing in MaxCompute. The change states identified by the database synchronization tool are insert, delete, update, and primary key update (I, D, U, and K), which we can use directly.
The logic we use here is listed below:
select ...
,case when a.etl_partition ='${ds}' then a.etl_partition
when b.etl_partition ='${ds}' then b.etl_partition
...
else a.etl_partition end as etl_date
,case when a.etl_partition ='${ds}' then a.operation
when b.etl_partition ='${ds}' then 'U'
...
else a.operation end as operation
from tablea a
left join tableb b on a.pk=b.pk
...
where ...;
This method preserves the incremental state. Nevertheless, the calculated result is superimposed layer after layer, so it can become hard to tell whether it is still correct; that depends on the specific business scenario.
Full field comparison is a brute-force method. Even without incremental processing, I can compute incremental data this way, and the result is true and reliable, especially compared with a business result table produced by multiple layers of calculation.
This comparison consumes a lot of computing resources, especially for transaction and event tables at the finest business granularity, but for tables such as user tables the cost is acceptable. The full field comparison logic is listed below:
-- I
select a.*,'I' as operation
from table1 a
left join table1 b on a.pk = b.pk and b.ds='${lastdate}'
where a.ds='${ds}'
and b.pk is null
;
-- D
select a.*,'D' as operation
from table1 a
left join table1 b on a.pk = b.pk and b.ds='${ds}'
where a.ds='${lastdate}'
and b.pk is null
;
-- U
select b.*,'U' as operation -- output the current image of the updated row
from table1 a
join table1 b on a.pk = b.pk and b.ds='${ds}'
where a.ds='${lastdate}'
and(coalesce(a.col,'')<>coalesce(b.col,'') -- character
or coalesce(a.col,0)<>coalesce(b.col,0) -- numeric value
or coalesce(a.col,'0001-01-01')<>coalesce(b.col,'0001-01-01')) -- date
;
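In practice, the three result sets are usually written out together as one incremental push table. A minimal sketch, assuming a made-up staging table table1_delta that has the same columns as table1 (including ds as a regular column) plus operation:
insert overwrite table table1_delta
select a.*, 'I' as operation
from table1 a
left join table1 b on a.pk = b.pk and b.ds = '${lastdate}'
where a.ds = '${ds}' and b.pk is null
union all
select a.*, 'D' as operation
from table1 a
left join table1 b on a.pk = b.pk and b.ds = '${ds}'
where a.ds = '${lastdate}' and b.pk is null
union all
select b.*, 'U' as operation -- current image of each updated row
from table1 a
join table1 b on a.pk = b.pk and b.ds = '${ds}'
where a.ds = '${lastdate}'
and coalesce(a.col, '') <> coalesce(b.col, '') -- repeat per compared column
;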
Full field comparison does not look elegant; it is a bit crude. You may have an easier way to identify incremental data, and you can try it, but full field comparison remains the guaranteed fallback.
This article has explained methods of incremental processing. I hope they help you use incremental processing correctly in future projects and obtain significant performance gains in some scenarios. Keep in mind that incremental processing logic is more complex than full processing logic and brings more complicated maintenance problems, such as exception troubleshooting and data backfilling. In a real project, weigh the pros and cons before deciding on a plan.