The Challenge of Data Correctness in Stream Computing Engines
Stream computing provides a fast data pipeline for real-time data warehouses, data lakes, real-time business indicators, business intelligence, and more. Stream computing is usually valued first for speed (low latency) and only second for the correctness of its results. In practice, obtaining correct results from stream computing is not easy: it requires sophisticated engine design, precise business data definitions, and cooperation between upstream and downstream systems. This article explains what data correctness means in stream computing and the challenges it poses, which leads to a key piece of engine design: reasoning about stream data integrity. The article first gives a formal definition of integrity and shows how it generalizes across engines, and then compares the design trade-offs that mainstream stream computing engines such as Flink and Kafka Streams make in integrity reasoning. The goal is to give readers more context when using a stream computing engine, along with practical references for obtaining correct results.
1、 Correctness in stream computing
Correct data means that the results produced by the stream computing engine correctly reflect objects in the real physical world. If a user has paid three bills over a period of time, totaling 100 yuan, then for the indicator "total user payment history" we should observe exactly one indicator value, and it should be 100, once the user's payments are complete. Because stream elements are unbounded and unordered, this simple logical derivation is not easy to realize in stream computing, and "incorrect" results can appear, such as an indicator value below 100 or multiple indicator values. To obtain accurate results, engineering practice still relies heavily on Lambda architectures that combine streaming and batch processing: the incremental data built in real time is merged with the full data obtained from batch processing to give downstream businesses data that is both fast and of controllable quality.
Incorrect data can distort the judgments that decision makers (people or machines) form about the business and the market, leading to slow or wrong decisions. Although there are many unified stream-batch computing practices, these solutions mostly address the operational and resource complexity of the traditional Lambda architecture; the data quality problems inherent to stream computing itself have not improved much as streaming engines have evolved. The following sections analyze the causes of incorrect results in stream computing and the necessary and sufficient conditions for obtaining correct ones.
2、 Necessary and sufficient conditions for correctness
The essence of stream computing is distributed computation over asynchronous messages. Because of clock skew, network delays, server pauses (such as garbage collection), and other factors, the logical order in which stream data is generated is usually inconsistent with the physical order in which it arrives at the stream processing system or its operators. A stream computing engine must be able to process out-of-order data; otherwise it will produce incorrect results.
2.1 Consistency is not equal to correctness
There are two concepts in the field of data computing: data consistency and data correctness, which are easily confused, but from the perspective of the concept itself, data consistency ≠ data correctness. The definitions of the two concepts are as follows:
1. Correctness: The results of stream computing correctly reflect the objects in the real physical world.
2. Consistency: The data in upstream and downstream systems across the entire data flow reflects the same information.
Consistency is usually associated with the term "exactly once": the stream computing engine can recover from failure to a consistent state, and the final result contains no duplicate entries and loses no data. In other words, the output of the engine looks as if the data had been processed exactly once, without any failure. Correctness is a stricter requirement than consistency. If the engine cannot process data consistently, correctness cannot be achieved; if the data is correct, it must be consistent. That is, data consistency is a necessary but not sufficient condition for data correctness. For example, if four records from the source are processed five times because of engine failures or faulty handling, the results may be incorrect. Conversely, even if the engine achieves consistency, correct results may still be unreachable because the source data is out of order or delayed.
The correctness of a computation is guaranteed jointly by data integrity and engine consistency. If the stream computation is viewed as a function mapping, output = f(input), then data integrity is a constraint on the unbounded, unordered data set, that is, it pins down the input, while engine consistency pins down the processing (and output) function f. Together they determine the output. (For engine end-to-end consistency, see the article "The Nature of Stream Computing Engine Data Consistency".)
2.2 Necessity of stream data integrity
What is data integrity in stream computing? Since the data in a stream computing engine is unbounded and unordered, we must somehow convert this indeterminate data set into a logical "current partition" so that the deterministic data fragment inside the partition can be analyzed. The "current partition" can be a period of time in the past (e.g. a sliding window) or a number of past records. Seen from another angle, both stream computing and batch computing need a deterministic input data set to produce correct results; batch computing is just the special case in which a single partition covers the entire bounded data set. The main reason stream computing engines are criticized as "inaccurate" is that existing technical solutions are not good at partitioning unbounded, out-of-order stream data.
Integrity reasoning is a way to represent data readiness: even if input streams arrive out of order, the stream computing engine will not emit results computed from incomplete input as final output. Integrity requires that the engine can track the current processing progress in a timely manner and estimate how complete the input underlying an output result is. This kind of reasoning is crucial in many stream computing scenarios. For example, in a stream-based alerting system, the engine must produce a single, correct alert indicator; emitting partial results early is meaningless, so the engine, as a distributed system, must be able to infer that "all the data required for the alert indicator is ready". As another example, when detecting missing business values with streaming CEP, without integrity reasoning it is impossible to distinguish data that is genuinely missing from data that is merely late.
Integrity reasoning is also useful for the engine's own state management. For example, Apache Spark Structured Streaming and Kafka Streams use a similar lateness rule (event time minus a fixed expiration interval) to reclaim state storage during computation and reduce memory consumption.
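As a minimal illustration of this kind of state reclamation (a hypothetical sketch, not the actual implementation of either engine), the code below evicts window state once the maximum observed event time has advanced past the window end plus a fixed expiration interval:

```java
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

/** Toy event-time based state eviction: windowEnd + expirationMs < maxObservedEventTime => evict. */
public class StateEvictionSketch {
    private final long expirationMs;
    private final Map<Long, Long> sumByWindowEnd = new HashMap<>(); // windowEnd -> running sum
    private long maxObservedEventTime = Long.MIN_VALUE;

    public StateEvictionSketch(long expirationMs) {
        this.expirationMs = expirationMs;
    }

    public void onEvent(long eventTime, long value, long windowEnd) {
        maxObservedEventTime = Math.max(maxObservedEventTime, eventTime);
        // Drop events whose window state has already been expired.
        if (windowEnd + expirationMs < maxObservedEventTime) {
            return;
        }
        sumByWindowEnd.merge(windowEnd, value, Long::sum);
        evictExpired();
    }

    private void evictExpired() {
        Iterator<Map.Entry<Long, Long>> it = sumByWindowEnd.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<Long, Long> entry = it.next();
            if (entry.getKey() + expirationMs < maxObservedEventTime) {
                it.remove(); // state for this window can no longer be updated
            }
        }
    }
}
```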
3、 A general solution to stream data integrity
This part starts from the original integrity problem, proposes a formal model of integrity, and then shows how several commonly used integrity reasoning schemes can be generalized under this model. Working top-down in this way makes the essence of integrity reasoning in stream computing much clearer: many well-known concepts (such as watermarks) are just concrete implementations within a particular framework, and it is better to abstract them at the level of the model (the "why"). Thinking in terms of models rather than frameworks helps us avoid being misled by the many "concepts" that individual engines introduce.
3.1 Formal definition of integrity reasoning
A data processing program running on a stream computing engine is generally defined as a directed acyclic graph (Timely Dataflow being an exception). Nodes in the graph represent stateful data processing operators, and directed edges represent data channels between operators. An operator in the dataflow topology is mapped to one or more physical execution nodes distributed across the server cluster. The elements entering the stream computing engine are called events. Each event has two attributes: event time and processing time. Processing time is the system time of the engine machine when the event is processed; event time is the time at which the event occurred, which is usually fixed before the event reaches the engine, and an event timestamp can be read from each event. By definition, an event's event time is naturally earlier than its processing time.
The engine's ability to reason about integrity can be described simply as follows: the system must be able to generate an integrity signal that is broadcast in some way to the entire dataflow topology, and every operator in the topology synchronizes its own processing progress against this signal. Define the input set of the stream computation as E, and let E(t) be the input set as of processing time t, including data being processed and data still buffered. Let State(t) be the engine state at that moment, including the state of each operator and the consumption offsets of the data sources (or file read offsets, etc.). Let ET be the function that extracts the event time (or another attribute) from an event element. When an operator processes events, integrity means that for any event e ∈ E(t):
Signal(t) < ET(e)
By combining this signal with the logical time of each event (that is, its event time), an operator can infer whether the data of the current input set E(t) is complete. In general, the integrity signal can be expressed as "the minimum event time of all events within (t-n, t)", and this minimum can be computed by some algorithm F from the current data source input, the engine state, and so on, for example:
① Signal(t) = Min{ET(e) | e ∈ Sort(E(t) - E(t-n))}
② Signal(t) = Min{ET(e) | e ∈ E(t) - E(t-n)}
③ Signal(t) = ET(e) - Δ (Δ is a fixed slack)
In general: Signal(t) = F(E(t), State(t))
Ideally, the above integrity constraint should hold throughout the evolution of the input set E(t) as t advances, but such a strong guarantee (i.e. such an algorithm F) is difficult to achieve in real scenarios, so in practice we relax some of the restrictions (for example, on time) and obtain integrity inference over a bounded period.
In the above signal generation algorithms, ① takes as the signal the minimum event time after sorting the events of a recent period; ② takes as the signal the minimum event time observed over the recent period; ③ always subtracts a fixed value from the observed event time to form the integrity signal. In the fourth part of this article we will see that these three formal variants correspond to three integrity reasoning schemes used in engineering: reordering, low watermark, and grace (slack) time.
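As a purely illustrative sketch (not taken from any engine; the class and method names are hypothetical), the following code shows how variants ② and ③ of the signal could be computed for a small in-memory stream:

```java
import java.util.PriorityQueue;

/** Toy integrity-signal generators for variants (2) and (3) above. */
public class IntegritySignals {
    // Variant (2): the signal is the minimum event time of all events still pending in E(t) - E(t-n).
    private final PriorityQueue<Long> pendingEventTimes = new PriorityQueue<>();

    public void onPendingEvent(long eventTime) {
        pendingEventTimes.add(eventTime);
    }

    public void onEventCompleted(long eventTime) {
        pendingEventTimes.remove(eventTime);
    }

    public long lowWatermarkSignal() {
        // Min{ET(e) | e still pending}; when nothing is pending the signal is unbounded.
        return pendingEventTimes.isEmpty() ? Long.MAX_VALUE : pendingEventTimes.peek();
    }

    // Variant (3): always subtract a fixed slack from the latest observed event time.
    public long slackTimeSignal(long latestObservedEventTime, long slackMs) {
        return latestObservedEventTime - slackMs;
    }
}
```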
3.2 System design of integrity reasoning
From the system architecture perspective, a stream computing engine that implements data integrity reasoning needs three modules: production, propagation, and consumption. The production module generates the integrity signal, using anything from simple heuristics to adaptive, more complex algorithms; these algorithms mainly combine indicators of the input source itself, such as the event times carried in the input events, source consumption offsets, and the state of the upstream producers of the data source. Production is the most complex part of integrity reasoning, and different engines make different designs and compromises in performance, complexity, user experience, and so on. The propagation module carries the integrity signal from its point of generation to the whole dataflow topology; this may be done by injecting special elements at the input source, by using information carried by the stream elements themselves, or by delivering the signal to each operator directly from outside the dataflow topology. The consumption of the integrity signal is relatively simple: after receiving the signal, an operator typically closes a computation window or evicts state.
3.3 Engineering realization of integrity reasoning
Industry approaches to integrity reasoning over streaming data can be roughly divided into two categories, depending on whether events need to be reordered. The first category, In-Order Processing (IOP) streaming systems, buffers and reorders stream elements; typical representatives include Trill (open-sourced by Microsoft), Spark Streaming (D-Streams), and Aurora. The second category, Out-of-Order Processing (OOP) systems, does not buffer and force-sort data, but instead tracks the processing progress of the data flow through signals. The integrity reasoning discussed in this article is mainly aimed at this second category, which includes MillWheel, Flink, and Kafka Streams.
An IOP system provides predictable integrity semantics mainly through buffering and reordering: the arrival of each event guarantees that no earlier event will follow, which greatly simplifies the design of downstream operators. The biggest problem with the reordering scheme is that it significantly increases computation latency, because a priori bounds on out-of-orderness in time or space are usually hard to obtain, and reordering also adds storage pressure. Moreover, operators such as summation, averaging, and counting do not depend on strict order at all, and reordering sacrifices this order-independence for no benefit.
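A minimal sketch of the IOP idea (hypothetical, not taken from any of the systems above): buffer incoming events, and only release, in event-time order, those that have fallen outside a fixed buffering horizon:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

/** Toy in-order-processing buffer: emits events in event-time order once they fall outside the horizon. */
public class ReorderBuffer {
    public static final class Event {
        final long eventTime;
        final String payload;
        Event(long eventTime, String payload) { this.eventTime = eventTime; this.payload = payload; }
    }

    private final long horizonMs; // assumed bound on out-of-orderness
    private final PriorityQueue<Event> buffer =
            new PriorityQueue<>((a, b) -> Long.compare(a.eventTime, b.eventTime));
    private long maxSeenEventTime = Long.MIN_VALUE;

    public ReorderBuffer(long horizonMs) { this.horizonMs = horizonMs; }

    /** Buffer the event and return any events that can now be emitted in order. */
    public List<Event> offer(Event e) {
        buffer.add(e);
        maxSeenEventTime = Math.max(maxSeenEventTime, e.eventTime);
        List<Event> ready = new ArrayList<>();
        while (!buffer.isEmpty() && buffer.peek().eventTime <= maxSeenEventTime - horizonMs) {
            ready.add(buffer.poll());
        }
        return ready;
    }
}
```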
An OOP system realizes integrity reasoning through weaker, attribute-based mechanisms, the main ones being Punctuation, Low Watermark, Slack Time, and Heartbeat. Punctuation is a general mechanism for passing information through the dataflow topology: special markers are inserted into the stream to delimit the data section by section, so that the unbounded, unordered stream can be logically divided into multiple finite data sets. The original Punctuation paper was published in 2003. Because the design scope of Punctuation is very general and its cost in engineering practice is high, many later engines borrowed its ideas and designed schemes that are more reasonable in terms of architectural complexity, user experience, and functional completeness, such as Low Watermark, Slack Time, and Heartbeat. The figure below shows some similarities and differences between these three schemes.
Slack Time is a simple integrity mechanism: the signal is obtained by subtracting a fixed length of time (the maximum expected lateness of elements arriving at the operator) from the event time of the stream element. This fixed time can be configured by the user based on the operator's actual cycle (for example, twice the window size). The slack (grace) time scheme does not need to inject special messages into the stream and is relatively simple to implement, but its drawback is limited integrity reasoning power.
Low Watermark. A low watermark is a special message embedded in the data stream; unlike ordinary messages, it generally expresses "the minimum event timestamp that may still appear in the stream". The idea is that when an operator receives a watermark, it gains extra information about the stream: no data with an event time earlier than the watermark will arrive afterwards, so the data in the corresponding window can be computed and emitted downstream. Low watermarks can be generated by heuristic algorithms (e.g. taking statistics over the minimum offsets of all consumed Kafka partitions) or by adaptive algorithms (e.g. combining characteristics of the data source).
Heartbeat. Heartbeat detection uses an external signal that carries progress information about the data stream. The signal contains a timestamp; stream elements whose timestamps are no greater than the heartbeat timestamp constitute a "complete data set". Heartbeat signals can be generated by the input sources, or inferred by the system from environmental parameters (such as network delay and clock skew between input sources). The advantage of the heartbeat scheme is that it is an engine-internal mechanism hidden from users. As with the low watermark scheme, generating an ideal heartbeat signal for integrity reasoning across different data source characteristics remains challenging.
Among the schemes above, apart from reordering, which applies to IOP systems, the other four — punctuation, low watermark, grace (slack) time, and heartbeat — are subtly connected yet different. Low watermark, grace time, and heartbeat can all be regarded as specialized realizations of punctuation. Grace time simplifies the propagation path of the signal: the information carried by the stream element itself, combined with the grace period, is enough to compute it. Heartbeat detection sends the integrity signal obtained from the data source directly to the engine's ingestion point. Both heartbeat and low watermark carry information about the data source, but the low watermark propagates the "integrity" information from the data source through the entire topology to the output, whereas the heartbeat signal only carries progress information about the data source.
3.4 Integrity reasoning and calculation delay
The point of stream processing is to process data online, so low processing latency is crucial, yet achieving strong integrity inevitably introduces delay. The delay has two main sources. First, there is the "waiting" needed to achieve integrity: for example, in an engine that uses a low watermark scheme, the aggregated data of a window is only sent downstream when the low watermark crosses the window boundary. Second, some engines (such as Cloud Dataflow) persist integrity signals for fault tolerance, which increases the processing delay of the whole pipeline.
When downstream consumers can accept intermediate values of an aggregate, one way to balance integrity and delay is to introduce periodic triggers that repeatedly (and eventually consistently) update the data in a materialized window, similar to the semantics of materialized views in the database world; downstream consumers then receive the latest results promptly while integrity is still eventually reached. When downstream consumers cannot accept intermediate results, only one result may be emitted, and the engine cannot wait for delayed data indefinitely. So once the engine has received the integrity signal and emitted the result, how should late data, if any, be handled? Common strategies are discarding it, recomputing (updating the emitted result), or routing it to a side path for separate processing. Recomputation gives better accuracy but requires keeping state for a long time; considering processing delay and storage cost, only a limited degree of "lateness grace" can be offered. In short, although many schemes for integrity reasoning exist today, there is clearly still a long way to go once factors such as processing delay are taken into account.
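For the case where downstream consumers can accept intermediate values, the Beam/Dataflow model expresses this trade-off with triggers. The sketch below (assuming the Apache Beam Java SDK; the window size, delays, and lateness values are arbitrary illustrative choices) fires an early speculative pane, an on-time pane when the watermark passes the end of the window, and late panes for data arriving within the allowed lateness:

```java
import org.apache.beam.sdk.transforms.windowing.AfterPane;
import org.apache.beam.sdk.transforms.windowing.AfterProcessingTime;
import org.apache.beam.sdk.transforms.windowing.AfterWatermark;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.joda.time.Duration;

public class TriggerExample {
    static PCollection<KV<String, Long>> windowed(PCollection<KV<String, Long>> input) {
        return input.apply(
            Window.<KV<String, Long>>into(FixedWindows.of(Duration.standardMinutes(1)))
                .triggering(
                    AfterWatermark.pastEndOfWindow()
                        // early, speculative firing every 30 seconds of processing time
                        .withEarlyFirings(
                            AfterProcessingTime.pastFirstElementInPane()
                                .plusDelayOf(Duration.standardSeconds(30)))
                        // late firing for each element arriving after the watermark
                        .withLateFirings(AfterPane.elementCountAtLeast(1)))
                .withAllowedLateness(Duration.standardMinutes(10))
                .accumulatingFiredPanes());
    }
}
```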
4、 Engine Implementation of Stream Data Integrity
Currently, many stream computing engines use low watermarks for stream data integrity inference; for example, Apache Flink and Google Cloud Dataflow expose watermarks directly in their user-facing APIs. Compared with processing bounded or ordered data, understanding and programming against an unordered data stream has a real learning curve. To lower this complexity for users, Kafka Streams and Spark Structured Streaming use a weakened form of the low watermark (grace time), either to achieve eventually consistent results or to control the state storage cost caused by late data. In what follows we map the integrity reasoning schemes of the previous chapter onto the most mature stream computing engines in the industry and analyze how each engine weighs functional completeness, ease of use, complexity, and other architectural concerns. From these engines we can also roughly see where the handling of unordered data is heading.
4.1 MillWheel
Before diving into MillWheel, it helps to clarify the relationship between MillWheel, Cloud Dataflow, and Beam. MillWheel is the underlying stream computing engine of Google Cloud Dataflow (inside Google it is gradually being replaced by Windmill). It solves data consistency and integrity reasoning and enables robust streaming data processing. Cloud Dataflow's most prominent contribution is a unified model for batch and streaming data processing. Beam consists mainly of a programming model, a general API layer, and a portability layer; it has no execution engine of its own. It is more like SQL in the relational database world, aiming to become a standard language for unified stream and batch data processing.
Integrity reasoning process: MillWheel/Cloud Dataflow uses the low watermark idea to provide integrity reasoning that supports event-time processing semantics. In the production stage, each physical node of every operator in the dataflow topology tracks the progress of the data it handles (processed plus pending). It persists this progress information in memory and local state, and periodically reports it to a central authority, which defines the operator's low watermark as the minimum of the progress reported by all nodes belonging to that operator. In the propagation stage, the low watermark aggregated by the central system is sent back to the physical nodes of the corresponding operators, and each node's final watermark = Min(the node's own processing progress, the minimum of the upstream global watermarks). In the consumption stage, a node uses the computed low watermark to scan the timers registered earlier, either implicitly (e.g. by a window) or explicitly (programmatically), and executes the timer callbacks whose timestamps are smaller than the low watermark, emitting results or changing state.
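As a hypothetical sketch of this centralized aggregation (an illustration of the idea only, not MillWheel code; all names are invented), each node reports its local progress and the authority returns the per-operator minimum, which a node then combines with the minimum upstream watermark:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Toy central authority for per-operator low watermarks. */
public class CentralAuthority {
    // operator -> (node -> oldest pending event time on that node)
    private final Map<String, Map<String, Long>> progress = new ConcurrentHashMap<>();

    /** A node periodically reports the oldest event time it still has to process. */
    public void report(String operator, String node, long oldestPendingEventTime) {
        progress.computeIfAbsent(operator, k -> new ConcurrentHashMap<>())
                .put(node, oldestPendingEventTime);
    }

    /** The operator-level low watermark is the minimum over all of its nodes. */
    public long operatorLowWatermark(String operator) {
        return progress.getOrDefault(operator, Map.of()).values().stream()
                .mapToLong(Long::longValue).min().orElse(Long.MIN_VALUE);
    }

    /** What a node finally uses: min(its own progress, minimum upstream watermark). */
    public static long nodeWatermark(long ownProgress, long minUpstreamWatermark) {
        return Math.min(ownProgress, minUpstreamWatermark);
    }
}
```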
Architectural trade-offs: MillWheel/Cloud Dataflow persists low watermarks at each node, and watermark updates require centralized reporting, so the IO overhead of these two parts increases stream processing latency. The centralized mode is needed because Cloud Dataflow operators are dynamically partitioned and data sources can be very complex, which calls for a more elaborate low watermark generation mechanism to keep integrity reasoning sound. Persisting the low watermark at each node introduces delay, but it also brings an advantage: failover of a MillWheel/Cloud Dataflow application is fast because the state granularity is finer. In addition, global watermark aggregation makes it easier to assess the engine's current overall processing progress; for example, latency and accuracy can be traded off by computing a 98% low watermark, which corresponds to the processing progress of 98% of the stream data in the system, allowing faster overall progress.
4.2 Apache Flink
The integrity reasoning scheme in Apache Flink is designed based on the DataFlow model and the low watermark.
Integrity reasoning process: In the production stage, a Flink program can generate watermarks either at the source nodes or at dedicated watermark-generation nodes. A source node can compute the watermark from the stream data entering the engine or from other information about the data source (such as Kafka partitions, offsets, or timestamps); a dedicated watermark-generation node computes the watermark from the timestamps of the stream elements it observes. In the propagation stage, watermarks travel through the entire dataflow topology as a special kind of metadata message interleaved with regular stream data. When a downstream node receives a new watermark message, it takes the minimum of the watermarks across all of its inputs as its own current watermark and updates it accordingly; a node only forwards watermarks larger than the previous one, which keeps the integrity signal strictly monotonic. In the consumption stage, similar to MillWheel/Cloud Dataflow, the arrival of a watermark at a node triggers a series of timers and the corresponding results are emitted downstream; the new watermark value is then broadcast to all downstream nodes so that the whole distributed application stays synchronized.
Architectural trade-offs: Because Flink does not persist the low watermark at each node, when a node fails the whole pipeline must pause and be restored from the last checkpoint; the failed node's low watermark is reset to a low constant and is re-established once the node has received new low watermark messages on all of its input edges. Not persisting the watermark gives Flink low end-to-end processing latency, but failover takes longer than in MillWheel/Cloud Dataflow. In the design of integrity reasoning, Flink is the most mature engine.
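A small example of how this surfaces in the Flink DataStream API (the Payment class, field names, and time values below are assumptions for illustration): a bounded-out-of-orderness watermark strategy assigns event-time timestamps, and an event-time window only fires once the watermark passes its end:

```java
import java.time.Duration;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

public class FlinkWatermarkExample {
    public static class Payment {
        public String userId;
        public long amount;
        public long eventTime; // epoch millis
    }

    static DataStream<Payment> withEventTimeAggregation(DataStream<Payment> payments) {
        return payments
            // Watermark = max observed event time - 5s; declares how much out-of-orderness we tolerate.
            .assignTimestampsAndWatermarks(
                WatermarkStrategy.<Payment>forBoundedOutOfOrderness(Duration.ofSeconds(5))
                    .withTimestampAssigner((p, ts) -> p.eventTime))
            .keyBy(p -> p.userId)
            // The window result is emitted only after the watermark passes the window end.
            .window(TumblingEventTimeWindows.of(Time.minutes(1)))
            .sum("amount");
    }
}
```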
4.3 Apache Kafka Streams
Although Cloud Dataflow and Flink both use a low watermark for stream data integrity reasoning, the Apache Kafka Streams engineers chose not to build their integrity mechanism on the low watermark scheme, mainly for two reasons. First, Kafka Streams is built on a "continuous incremental processing" stream-table model (similar to the stream/table concept in the Dataflow model; see the book Streaming Systems): if the results in a table are updated incrementally by the stream and intermediate results are emitted along the way, then the question of when to close a window becomes less important. (Personally I find this argument problematic in practice; for example, it is difficult for the downstream to establish the identity of the different result versions, and implementing accumulation and retraction logic for late data is generally complex.) Second, the engineers consider the low watermark scheme in the Dataflow model too complicated (for example, at least eight kinds of triggers need to be combined with the watermark) and want to offer users a more concise and intuitive integrity solution. For these two reasons, Apache Kafka Streams uses the grace time approach described above to address integrity.
Integrity reasoning process: Apache Kafka Streams neither embeds special metadata in the stream nor relies on a system-level low watermark timestamp; instead it allows fine-grained integrity decisions by configuring a grace period on each operator. In the production stage, as each event flows through an operator, the operator uses "event time - slack time" as the integrity signal; as stream data keeps flowing in, the observed event time keeps increasing, yielding progress inference similar to a low watermark. In the propagation stage, because the signal can be computed from information carried by the event itself, there is no separate propagation step. This does create problems: for example, when an upstream operator filters out large amounts of data, a downstream operator may not receive data for a long time and therefore cannot advance its own progress, which increases processing delay to some extent. In the consumption stage, when the timestamp derived from the grace time exceeds the upper bound of a window, the window is closed and its state is released.
Architectural trade-offs: If we generalize the low watermark scheme of Cloud Dataflow and Flink, the grace time here can be regarded as a subset of the watermark scheme, a design resembling a "high watermark" pushed down to individual operators. Its advantage is that it extends the low watermark concept from the global level to the operator level, so the waiting time of different operators can be decoupled: a 6-hour aggregation window and a 10-minute aggregation window generally need to wait different amounts of time for out-of-order data, and different grace time settings allow more flexible control. The disadvantage stems from the fact that the signal is derived from the current event itself. Besides the problem mentioned above, where an operator downstream of a filter cannot learn the processing progress in time, the dataflow topology lacks a global synchronization mechanism, and this kind of inference can produce incorrect results in some scenarios. Suppose one node upstream of an aggregation operator lags because of GC or host pressure while the other nodes process normally: since the event times received by the aggregation operator keep increasing, the window closes when the grace period expires, and when the lagging node catches up, the data it sends to the aggregation operator can no longer be used (the window has already closed). To solve this, Apache Kafka Streams plans to adopt an approach similar to the Dataflow model and track progress by injecting metadata carrying progress information into the stream. The metadata will not necessarily be a simple timestamp; it may be something like a vector clock, achieving global dataflow-graph synchronization on top of fine-grained operator control.
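In the Kafka Streams DSL this shows up as a per-window grace period (the topic name, serdes, and durations below are assumptions for illustration): the window keeps accepting records up to one minute late, and suppression holds the result until the window finally closes:

```java
import java.time.Duration;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Grouped;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.Suppressed;
import org.apache.kafka.streams.kstream.TimeWindows;
import org.apache.kafka.streams.kstream.Windowed;

public class KafkaStreamsGraceExample {
    static KStream<Windowed<String>, Long> totalsPerWindow(StreamsBuilder builder) {
        return builder.stream("payments", Consumed.with(Serdes.String(), Serdes.Long()))
            .groupByKey(Grouped.with(Serdes.String(), Serdes.Long()))
            // 5-minute windows that keep accepting out-of-order records for one extra minute.
            .windowedBy(TimeWindows.ofSizeAndGrace(Duration.ofMinutes(5), Duration.ofMinutes(1)))
            .reduce(Long::sum)
            // Emit one final result per window only after the grace period has expired.
            .suppress(Suppressed.untilWindowCloses(Suppressed.BufferConfig.unbounded()))
            .toStream();
    }
}
```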
4.4 Apache Spark Streaming
Because stream processing in Spark grew out of its batch design, its support for end-to-end consistency and event time was initially limited. Since Spark 2.1, the Structured Streaming API (SPARK-18124) has introduced a watermark-like data integrity scheme based on grace time: by specifying an event-time column and a grace period for late data, users let the engine bound the memory used by streaming state, for example by discarding late events and deleting old state (for aggregations or joins) that will never be updated again.
Integrity reasoning process: The watermark in Spark Structured Streaming is global. After each batch (trigger) finishes, the watermark is recomputed as: new watermark = MAX(maximum event timestamp seen before the trigger, maximum event timestamp in the data of this trigger) - grace time. With multiple input sources, Spark Structured Streaming tracks each input stream, computes a watermark for each, and takes the minimum as the global watermark. Based on this global watermark, Spark Structured Streaming keeps the state of data that has already arrived and updates it by aggregating late data into it: late data newer than the watermark is still aggregated, while data older than the watermark is discarded.
Architectural trade-offs: Because the Spark Structured Streaming watermark is designed primarily for state management, its integrity reasoning capability is relatively weak. For example, because the watermark is global, chained aggregations within one dataflow topology (where the input of a downstream aggregation is the output of an upstream aggregation) are not supported and could produce incorrect results. In addition, the watermark API is limited in scope; for example, the watermark's event-time column must be consistent with the subsequent grouping column, which greatly restricts what the aggregation operator can do.
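Expressed with the Structured Streaming Java API (the source, column names, and thresholds below are assumptions for illustration), the grace period is declared with withWatermark, and state for windows older than the watermark can then be dropped:

```java
import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.window;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class SparkWatermarkExample {
    static Dataset<Row> windowedCounts(SparkSession spark) {
        Dataset<Row> events = spark.readStream()
            .format("rate")   // built-in test source that produces a 'timestamp' column
            .load();

        return events
            // Accept events up to 10 minutes late; older window state can be discarded.
            .withWatermark("timestamp", "10 minutes")
            .groupBy(window(col("timestamp"), "5 minutes"))
            .count();
    }
}
```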
4.5 Comparison of Engine Integrity Implementation Schemes
Next, from the perspective of generation and propagation of integrity inference signals, we will compare the implementation of the four stream computing engines mentioned above.
5、 Summary and outlook
To obtain correct results from a stream computing engine, engine consistency and data integrity must work together. Engine consistency is essentially a fault-tolerance problem of distributed applications; engineering practice already has systematic, mature solutions, and the engines are converging on the same direction, namely "emitting results (state) transactionally". Data integrity provides deterministic data sets out of unordered, unbounded stream data, which is essential in scenarios that require a single aggregate result, missing-value detection, incremental processing, and so on. How to realize data integrity reasoning remains one of the hardest concepts for users to understand, and one of the less complete areas, in stream computing. The essence of integrity reasoning is to have a signal generation algorithm such that, for any pending event e at processing time t, Signal(t) < ET(e).
Not every stream computing engine can produce correct results, and understanding the constraints required to achieve data integrity is crucial for sound technology selection. MillWheel and Flink use a low watermark to synchronize the processing progress of the entire dataflow topology; the low watermark is propagated either as a special stream element or through a bypass channel, so every operator in the topology has a clear view of the current processing progress. Spark Structured Streaming and Kafka Streams adopt the grace time scheme, which simplifies architecture construction and maintenance and lowers the cost for users to understand integrity, but both engines also fall short in some aspects of integrity reasoning.
Given the inherent tension between absolute integrity and latency, stream computing engines are trying to improve the current solutions from several directions: for example, having the producer of the data source (rather than the engine) emit a perfect watermark to reduce engine design cost, or building adaptive watermark algorithms into the engine to make up for the shortcomings of heuristics. At the same time, compared with engine consistency, data integrity is a concept much closer to users, and many definitions of integrity require user participation. In my view, the integrity solutions in many engines today still present a high barrier to framework users, and giving them a friendlier way to express the constraints needed for correct results is a key open problem. If these problems can be solved well, true stream-batch unification, and stream computing that goes beyond batch processing, can become a reality.
1、 Correctness in stream computing
Correct data means that the results calculated by the stream computing engine correctly reflect the objects in the real physical world. If the user has paid three bills for a period of time, totaling 100 yuan, for the indicator "Total user payment history", when the user's payment behavior is completed, we should observe that there is only one indicator and its value is 100. Due to the unbounded and unordered nature of stream elements, the above logical derivation is not easy to implement in stream computing. Some "incorrect" results may exist, such as the indicator value is less than 100, and there are multiple indicator values. In order to obtain accurate results, there is still a large number of Lamda architectures that combine streaming and batch in engineering practice. They combine the incremental data built in real time with the stock data obtained through batch processing to provide more rapid and quality controllable data for downstream businesses.
Incorrect data may affect decision makers (people or machines) 'judgments on business and market, resulting in slow or wrong business decisions. Although there are many streaming and batching computing practices, these solutions are more to solve the complexity of traditional Lamda computing architecture in platform operation and maintenance and computing resources. The data quality problems caused by streaming computing itself have not been much improved due to the development of the streaming engine. The following section will analyze the causes of incorrect data processing in stream computing and the necessary and sufficient conditions for obtaining correct results.
2、 Necessary and sufficient conditions for correctness
The essence of stream computing is distributed computing through asynchronous messages. However, due to clock synchronization, network delays, server hangs (such as garbage collection) and other reasons, the logical order of stream data generation is usually inconsistent with the physical order of the arriving stream processing system or the operators of the stream processing system. The stream computing engine must be able to process unordered data, otherwise incorrect computing results will result.
2.1 Consistency is not equal to correctness
There are two concepts in the field of data computing: data consistency and data correctness, which are easily confused, but from the perspective of the concept itself, data consistency ≠ data correctness. The definitions of the two concepts are as follows:
1. Correctness: The results of stream computing correctly reflect the objects in the real physical world.
2. Consistency: The data of upstream and downstream systems across all flows reflect the same information.
Consistency is usually combined with the term "exactly once", which means that the stream computing engine can recover from failure to a consistent state, and the final calculation result does not contain duplicate entries or lose any data. In other words, the output of the stream computing engine is like data processing only once without any failure. The requirement of correctness is more strict than that of consistency. If the engine cannot achieve consistent data processing, the correctness cannot be achieved. If the data is correct, it must be consistent, that is, data consistency is a necessary and insufficient condition for data correctness. For example, four records in the stream data source have been processed 5 times in the engine due to engine failure/incorrect processing, and the results may also be incorrect. On the other hand, if the engine achieves consistency, the correct results may not be obtained due to the disorder and delay of the source data.
The correctness of calculation is jointly guaranteed by data integrity and engine consistency. If the stream computing process is considered as a function mapping: output=f (input), in the above model, data integrity is a constraint on unbounded and unordered data sets, that is, the input in is defined, and the engine data consistency defines the data processing process (including output) f. Therefore, the output is determined (for engine end-to-end consistency, refer to the article "The Nature of Stream Computing Engine Data Consistency").
2.2 Necessity of stream data integrity
What is data integrity in stream computing? Since the data in the stream computing engine is unbounded and unordered, we must convert this kind of uncertain data set into a logical "current partition" in some way to analyze the deterministic data fragments in the partition. The "current partition" can be a period of time in the past (e.g. sliding window) or a number of records in the past. From another point of view, whether it is stream computing or batch computing, in order to get the correct calculation results, a deterministic input dataset is required. The batch computing mode is just a special case scenario where a partition covers the entire bounded dataset. At present, the stream computing engine is criticized as "inaccurate" largely because the existing technical solutions are not good at partitioning unbounded and disordered stream data.
Integrity reasoning is a means of representing data readiness. Even if the input streams may arrive out of order, the flow computing engine will not use incomplete input calculations as the final output results. Integrity requires that the calculation engine can track the current calculation progress in a timely manner, and estimate the degree of completion corresponding to the output result and its input stream. This reasoning about data integrity is crucial in many flow computing scenarios: for example, in a flow based alarm system, the flow computing engine must generate a single and correct alarm indicator. It is meaningless to send some results in advance, which requires that the flow computing engine, a distributed system, has the ability to infer that "all the data required for the alarm indicator is ready"; For another example, in the scenario of business missing value detection through streaming CEP, if there is no integrity reasoning, it is impossible to distinguish the difference between real data missing and data arriving late.
Integrity reasoning is also useful for state management of the engine itself. For example, Apache Spark Structured Streaming and Kafka Streams use similar delay algorithms (event time - fixed expiration time) to recycle the state storage in the calculation process to reduce memory consumption.
3、 A general solution to stream data integrity
This part will start from the original integrity problem, propose a formal expression model of integrity, and then generalize several commonly used integrity reasoning schemes under this general framework. This top-down expansion allows us to more clearly see the nature of integrity reasoning in the field of stream computing. Many well-known concepts (such as watermarks) are just concrete implementations of a certain framework. A better way is to abstract them from the perspective of models (Why). Thinking from models rather than frameworks can help us avoid being misled by many engine "concepts".
3.1 Formal definition of integrity reasoning
The data processing program running on the flow computing engine is generally defined as a directed acyclic graph (except for Timely DataFlow). Nodes in the graph represent stateful data processing operators, and directed edges represent data channels between operators. An operator in the data flow topology will be mapped to one or more physical execution nodes, which are distributed in the entire server cluster. The elements that enter the stream computing engine are called events. Each event has two attributes: event time and processing time. The processing event refers to the system time of the engine machine when the event is processed. The event time is the time when an event occurs, which is usually determined before the event arrives at the engine. The event timestamp can be obtained from each event. From the definition, the event time of an event is naturally less than the processing time.
The inference ability of the stream computing engine on integrity can be simply described as: the system needs to be able to generate an integrity signal, which can be broadcast to the entire data stream topology in some way. Each operator in the topology needs to synchronize its own data processing progress according to this signal. If the input set of stream computing is defined as: E, the input set since time t (event processing) is E (t), including the data being processed and the buffered data. The engine state is State (t) at this time. State (t) includes the state of each operator, the consumption offset (or file read offset, etc.) of the data source, etc. ET function represents the event time (or other attributes) to obtain the event element. When the operator handles events, integrity means that for any event e ∈ E (t), there are:
Signal(t) < ET(e)
The operator can infer whether the data of the current input set E (t) is complete by combining this signal with the logical time of the event (that is, the event time). Generally, the integrity signal can be expressed as "the minimum event time of all events between (t-n, t)". The minimum value can be calculated by some algorithm F combined with the current data source input, engine state, etc., as shown below:
Signal(t) = Min{ET(e) | e ∈ E(t) - E(t-n)} = F(E(t), State(t))
Ideally, the above integrity constraints should be satisfied in the whole process of the change of the input set E (t) with t, but this very strong guarantee is difficult to achieve in real scenes (i.e. algorithm F), so we generally relax some restrictions (such as time), and we can get integrity inference within a period of time.
In the implementation of the above integrity signal generation algorithm, ① means that the minimum value of the event time sorted in a period of time can be taken as the semaphore; ② Indicates that the minimum event time in the past period can be counted as the semaphore; ③ Indicates that a fixed value can always be subtracted from the event time as a semaphore of integrity. In the fourth part of this paper, we can see that the above three formal generalization expressions correspond to three integrity reasoning schemes in engineering: reordering, low watermark and grace time.
3.2 System design of integrity reasoning
From the perspective of system architecture, when the stream computing engine implements data integrity reasoning, the three necessary modules are: production, dissemination and consumption. The production module is used to generate integrity signals. There will be some simple heuristic algorithms or some adaptive complex algorithms. The algorithms mainly combine some indicators of the input source itself, such as the events in the input events, the source consumption offset, the status of upstream production of the data source, etc. The production module is the most complex part of integrity reasoning. Different engines will have different designs and compromises in terms of performance, complexity, user experience, etc. The propagation module is the process of integrity signal from generation to broadcast to the whole data stream topology. The process may be realized by injecting special elements into the input source, or by some features carried by the stream elements themselves, or by directly transmitting signals to each operator from outside the data stream topology. The integrity signal consumption process is relatively simple. After receiving the signal, the operator is generally used to close a computing window or eliminate the status.
3.3 Engineering realization of integrity reasoning
The integrity reasoning of streaming data in the industry can be roughly divided into two categories according to whether events need to be reordered: one is called In Order Processing (IOP) streaming system, which performs buffering and reordering of streaming elements. Typical representatives of IOP system include Trill (Microsoft open source), Spark Streaming (D-Streams), Aurora, etc. The other type of system is called Out of Order Processing (OOP) system. This type of system does not buffer data to force sorting, but tracks the processing progress of data flow through signals. The content of integrity reasoning discussed in this paper is mainly aimed at such systems, including MillWheel, Flink, Kafka Streams, etc.
In the IOP system, it can provide predictable integrity semantics for the system mainly through buffering and reordering. The arrival of each event ensures that no earlier event will occur, which significantly simplifies the design of the sequential system. The biggest problem with the reordering scheme is that it will greatly increase the calculation delay, because it is usually difficult to obtain unordered prior time or space limits, and reordering will bring some storage pressure, requiring order independent operator behavior, such as summation, averaging, counting, etc., which can not rely on strict order, and reordering will sacrifice the characteristics of this type of operator.
In the OOP system, we use a weak attribute concept to realize integrity reasoning. The main schemes are: Puncture, Low Watermark, Slack Time, Heartbeat, etc. Punctuation is a general mechanism to transfer information through data flow topology. The idea of this scheme is to add some special identifiers to the flow data to identify the data section by section. Therefore, the unbounded and unordered flow data can be logically divided into multiple limited data sets. The original paper of the Punctuation scheme was published in 2003. Because the design site of the Punctuation scheme is too universal and the cost is very high in engineering practice, many subsequent engines borrowed the ideas of the Punctuation scheme and designed some more reasonable schemes in terms of architecture complexity, user experience, complete functions, etc., such as Low Watermark, Slack Time, Heartbeat, The figure below shows some similarities and differences between these three schemes.
Slack Time is a simple integrity measurement mechanism. Generally, the Slack Time is quantized by subtracting a fixed length of time (the longest lag time when the element arrives at the operator) from the event time of the stream element. This fixed Time can be configured by the user based on the actual operator cycle (such as twice the window size). The grace time scheme does not need to inject special messages into the flow, and its implementation is relatively simple, but its disadvantage is that the integrity reasoning ability is insufficient.
Low Watermark. Low watermark is a special message embedded in the data stream, but unlike ordinary messages, it is generally expressed as "the minimum timestamp that may appear in the stream". The idea of low watermark is that when the operator receives the watermark, it will get some additional information about the stream: no data arrives later than the watermark time, so the data in the window can be calculated and output to the downstream. The generation of low watermark includes heuristic algorithm (e.g. statistics of the minimum offset of all consumption partitions in Kafka) and some adaptive algorithms (e.g. combination of data source characteristics).
Heartbeat Heartbeat detection consists of an external signal that carries progress information about the data stream. The signal contains a timestamp. Stream elements larger than the heartbeat timestamp constitute a "complete data set". Heartbeat detection signals are generated by input sources, and can also be inferred by the system by observing environmental parameters (such as network delay, clock offset between input sources, etc.). The advantage of the heartbeat detection scheme is that it is an internal mechanism of the engine hidden from users. Like the low watermark scheme, it is challenging to generate an ideal heartbeat detection signal for integrity reasoning according to different data source characteristics.
Among the schemes mentioned above, in addition to the reordering scheme applicable to the IOP system, there are subtle connections and differences between the four schemes: marking, low watermark, grace time and heartbeat detection. Low watermark, grace time and heartbeat detection can all be regarded as a special case realization of the mark. The grace time simplifies the path of signal propagation: the information carried by the flow element itself can be calculated in combination with the grace period. Heartbeat detection sends the integrity signal obtained from the data source directly to the engine entrance (Ingestion point). Heartbeat detection and low watermark both carry information about the data source, but the difference is that the low watermark transmits the "integrity" information from the data source to the entire topology of the output source, while the heartbeat detection signal only contains the progress information of the data source.
3.4 Integrity reasoning and calculation delay
The idea of stream processing is to process data online, so low processing latency is crucial. While achieving strong integrity, there will inevitably be delay. The reason for delay is mainly composed of two parts: one is due to the "wait" of achieving integrity. For example, in an engine using a low watermark scheme, the window aggregated data will only be sent to the downstream when the low watermark crosses the watermark boundary; Second, some engines (such as Cloud DataFlow) persist integrity signals for fault tolerance, which will lead to increased processing delay of the whole link.
In the calculation of intermediate values of aggregate results that can be accepted by downstream, a solution to balance integrity and delay is to introduce periodic triggers: repeatedly (and ultimately consistently) update the data in the materialized window, similar to the semantics obtained by using materialized views in the database world, so that the downstream can obtain the latest calculation results in time while achieving integrity. However, in the calculation of intermediate results that cannot be accepted by the downstream, only one result can be output. For delayed data, the engine cannot wait indefinitely. When the engine receives the integrity signal and outputs the results, how to deal with the late data if any? General processing strategies include discarding, double counting, or bypass processing. Double counting can achieve better accuracy, but it requires a long time to retain the state. Considering the processing delay and storage cost, only a limited degree of "delay grace" can be achieved. In a word, although there are many schemes to realize integrity reasoning at present, it can be seen that there is still a long way to go after taking into account factors such as processing delay.
4、 Engine Implementation of Stream Data Integrity
At present, many stream computing engines use low watermarks for stream data integrity inference. For example, Apache Flink and Google DataFlow put watermarks in APIs to make them directly user oriented. Compared with processing bounded or ordered data, there is a certain threshold for understanding and encoding in an unordered data stream. From the perspective of convergence (user use) complexity, Kafka Streams and Spark Structured Streaming use a weakened version of the low watermark (grace time) to achieve the final consistent results, or to manage the state storage costs caused by late data. Next, we will combine the integrity reasoning schemes listed in the previous chapter with the most mature stream computing engines in the industry to analyze the thinking and trade-offs of different engines on the functional perfection, ease of use, complexity and other architectural elements of integrity reasoning. From these engines, we can roughly see the development trend of many engines in the management of unordered data.
4.1 MillWheel
Before expanding MillWheel, briefly explain the relationship between MillWheel/Cloud DataFlow/Beam. MillWheel is the underlying stream computing engine of Google Cloud DataFlow (now Google is gradually being replaced by Windmill internally). It solves the problem of data consistency and integrity reasoning, and can achieve robust streaming data processing. Cloud Dataflow's most prominent contribution is to provide a unified model for batch processing and streaming data processing. Beam is mainly composed of a programming model, a general API layer and a portable layer. The underlying layer has no specific execution engine. It is more like SQL in a relational database, trying to become a standard language for streaming batch data processing.
Integrity reasoning process: MillWheel/Cloud DataFlow builds integrity reasoning that supports event-time processing semantics around the core idea of the low watermark. In the production stage, every physical node of each operator in the data flow topology tracks its own data processing progress (both processed and pending work). While persisting this progress information to memory and local state, the node regularly reports it to a central authority, and the operator's low watermark is defined as the minimum of the progress reported by all nodes belonging to that operator. In the propagation stage, the low watermark computed by the central aggregation system is sent back to the physical nodes of the corresponding operator, and the final watermark of a node = min(the node's own processing progress, the minimum of the upstream operators' low watermarks). In the consumption stage, a node uses the computed low watermark to traverse the timers registered earlier, either implicitly (e.g. by a window) or explicitly (by user code), and executes the callbacks of all timers whose timestamps are smaller than the low watermark, thereby emitting results or changing state.
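To make the min-aggregation concrete, here is a toy Java sketch (not MillWheel's actual implementation) of a central authority that collects per-node progress reports and derives an operator's low watermark as the minimum of the reported values.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Toy illustration of the centralized aggregation described above: each physical node
// of an operator reports its processing progress (the oldest timestamp of work that is
// processed or still pending), and the operator's low watermark is the minimum of all
// reported values.
class LowWatermarkTracker {
    private final Map<String, Long> progressByNode = new ConcurrentHashMap<>();

    // Called periodically by each node with its locally persisted progress timestamp.
    void report(String nodeId, long oldestPendingEventTime) {
        progressByNode.put(nodeId, oldestPendingEventTime);
    }

    // Low watermark of the operator = min over all nodes. In the propagation stage this
    // value is sent back to each node, which then takes
    // min(own progress, upstream operators' low watermarks) as its final watermark.
    long lowWatermark() {
        return progressByNode.values().stream()
                .mapToLong(Long::longValue)
                .min()
                .orElse(Long.MIN_VALUE);
    }
}
```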
Compromise in architecture design: MillWheel/Cloud DataFlow has to persist low watermarks at each node, and updating watermarks requires centralized reporting, so the I/O overhead of these two parts increases stream processing delay. Under the centralized reporting model, Cloud DataFlow operators can be dynamically repartitioned and data sources can be complex, so a more elaborate low watermark generation mechanism is needed to keep the integrity reasoning sound. Persisting the low watermark at each node introduces delay, but it also brings an advantage: failover of a MillWheel/Cloud DataFlow application is fast, because state is tracked at a finer granularity. In addition, the global watermark aggregation gives a better view of the engine's current processing progress. For example, a compromise between computation delay and accuracy can be made by computing a 98% low watermark, which corresponds to the processing progress of 98% of the stream data in the system and allows faster overall processing.
4.2 Apache Flink
The integrity reasoning scheme in Apache Flink is designed based on the DataFlow model and the low watermark.
Integrity reasoning process: In the production stage, a Flink program can generate watermarks at source nodes or at dedicated watermark-generation nodes. A source node can compute the watermark from the stream data entering the engine or from other information about the data source (such as a Kafka partition, offset or timestamp), while a dedicated watermark-generation node computes the watermark from the timestamps of the stream elements it observes. In the propagation stage, watermarks travel through the entire data flow topology as a special kind of metadata message interleaved with the regular stream data. When a downstream node receives a new watermark message, it takes the minimum of the watermarks over all its inputs as its own current watermark. A node only forwards watermarks that are larger than its previous watermark, which keeps the integrity signal strictly monotonic. In the consumption stage, similarly to MillWheel/Cloud DataFlow, when a watermark arrives at a node a series of timers are triggered and the results are sent downstream; the new watermark value is then broadcast to all downstream nodes, so that the state of the entire distributed application stays synchronized.
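As a concrete example of the production stage, the sketch below uses Flink's public `WatermarkStrategy` API to derive a bounded-out-of-orderness watermark from element timestamps. The `Payment` type, its `getEventTime()` accessor and the `paymentStream` variable are assumed placeholders.

```java
import java.time.Duration;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.streaming.api.datastream.DataStream;

// forBoundedOutOfOrderness() emits watermarks that lag the largest observed event time
// by a fixed bound, i.e. a heuristic low watermark derived from the observed elements.
WatermarkStrategy<Payment> strategy = WatermarkStrategy
        .<Payment>forBoundedOutOfOrderness(Duration.ofSeconds(30))
        .withTimestampAssigner((payment, recordTimestamp) -> payment.getEventTime());

// "paymentStream" is an assumed DataStream<Payment>; from here the watermarks flow
// through the topology as the metadata messages described above.
DataStream<Payment> withWatermarks = paymentStream.assignTimestampsAndWatermarks(strategy);
```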
Architecture design tradeoff: In Flink, nodes do not persist the low watermark, so when a node fails the entire pipeline must be suspended and restored from the last checkpoint. The low watermark of the recovered node is reset to a very low constant, and once the node has received a new low watermark message on all of its input edges, its watermark is re-established. Not persisting the low watermark gives Flink low end-to-end processing delay; the downside is that failover takes longer than in MillWheel/Cloud DataFlow. Flink is the most mature engine in the design of integrity reasoning.
4.3 Apache Kafka Streams
Although Cloud DataFlow and Flink both use the low watermark for stream data integrity reasoning, the Apache Kafka Streams engineers chose not to build their integrity mechanism on the low watermark scheme, for two main reasons. First, Apache Kafka Streams adopts a "continuous incremental processing" stream-table model (similar to the stream/table concept in DataFlow; see the book Streaming Systems): if the results in a table are updated incrementally by the stream and intermediate results are emitted continuously, then the question of when to close a window becomes less important (personally, I find this argument problematic in practice; for example, it is difficult for downstream systems to identify which of several emitted results is which, and implementing accumulation and retraction logic for late data is generally complex). Secondly, the engineers felt that the low watermark scheme in the DataFlow model is too complex (for example, at least eight kinds of triggers need to be combined with the watermark) and that users should be given a simpler and more intuitive integrity solution. For these two reasons, Apache Kafka Streams uses the grace time mentioned above to address the integrity problem.
Integrity reasoning process: Apache Kafka Streams neither embeds special metadata in the stream nor relies on a system-level low watermark timestamp; instead it allows fine-grained integrity decisions by configuring a grace period on each operator. In the production stage, as each event flows through an operator, the operator uses "event time minus grace (slack) time" as the integrity signal; as stream data keeps flowing in, the observed event time keeps increasing, yielding a progress inference similar to a low watermark. In the propagation stage, because the signal is computed from information carried by the events themselves, there is no separate signal propagation. This has drawbacks: for example, when an upstream operator filters out a large amount of data, a downstream operator may receive no data for a long time and therefore cannot advance its processing progress, which increases processing delay to some extent. In the consumption stage, when the timestamp derived from the grace time exceeds the upper bound of a window, the window is closed and its state is released.
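For illustration, the sketch below shows how a grace period is configured per windowed operator in the Kafka Streams DSL. The `payments` stream and its key/value types are assumptions, and older Kafka versions express the same configuration via `TimeWindows.of(size).grace(grace)`.

```java
import java.time.Duration;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.TimeWindows;
import org.apache.kafka.streams.kstream.Windowed;

// "payments" is an assumed KStream<String, Long> of per-user payment amounts.
// The window accepts records up to 1 minute late; once stream time has advanced past
// window end + grace, the window is closed and its state released, and later records
// for that window are dropped.
KTable<Windowed<String>, Long> totals = payments
        .groupByKey()
        .windowedBy(TimeWindows.ofSizeAndGrace(Duration.ofMinutes(10), Duration.ofMinutes(1)))
        .reduce(Long::sum);
```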
Architecture design compromise: If we generalize the low watermark scheme of Cloud DataFlow and Flink, the grace time here can clearly be seen as a subset of the watermark scheme, a kind of "high watermark" design. Its advantage is that it moves the low watermark concept from the global level down to the operator level, so the waiting time of different operators can be decoupled: a 6-hour aggregation window and a 10-minute aggregation window generally need to wait different amounts of time for out-of-order data, and different grace time settings give more flexible control. Its disadvantages stem mainly from the fact that the signal is derived from the current event's own information. Besides the problem mentioned above, that an operator downstream of a filter cannot learn the processing progress in time, the data flow topology lacks a global synchronization property, and this inference can produce incorrect results in some scenarios. Suppose one node upstream of an aggregation operator lags behind because of GC or host pressure while the other nodes process normally. Because the event times received by the aggregation operator keep increasing, the window closes when the grace period expires; when the lagging node catches up, the data it sends to the aggregation operator can no longer be used (the window has already been closed). To address this problem, Apache Kafka Streams plans to adopt an approach similar to the DataFlow model and track progress by injecting metadata carrying progress information into the stream data. This metadata will not necessarily be a simple timestamp; it may be something like a vector clock that achieves global synchronization of the data flow graph on top of fine-grained operator control.
4.4 Apache Spark Streaming
Since stream processing in Spark grew out of its batch processing design, its support for end-to-end consistency and event-time processing is not very strong. Starting with Spark 2.1, the Structured Streaming API (SPARK-18124) has introduced a watermark-like data integrity scheme based on grace time. By specifying the event time column and a grace time for delayed data, users let the engine bound the memory used by streaming state, for example by discarding late events and deleting old state that will never be updated again (e.g. aggregation or join state).
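A minimal sketch of this API in Java, assuming a streaming `Dataset<Row>` named `payments` with `eventTime`, `userId` and `amount` columns (all placeholders), might look as follows.

```java
import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.window;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// The engine keeps each window's aggregation state until the global watermark
// (max observed event time - 10 minutes) passes the window end, then drops the state;
// events older than the watermark are discarded.
Dataset<Row> totals = payments
        .withWatermark("eventTime", "10 minutes")
        .groupBy(window(col("eventTime"), "10 minutes"), col("userId"))
        .sum("amount");
```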
Integrity reasoning process: The watermark in Spark Structured Streaming is global. After each micro-batch (trigger) completes, the watermark is recomputed as: new watermark = max(maximum event timestamp seen before this trigger, maximum event timestamp in this trigger's data) - grace time. With multiple input sources, Spark Structured Streaming tracks each input stream, computes a watermark for each one separately, and takes the minimum as the global watermark. Based on the global watermark, Spark Structured Streaming keeps the state of data that has already arrived and merges late data into it: late data whose event time is still above the watermark is aggregated into the existing state, while data whose event time falls below the watermark is discarded.
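To illustrate the update rule, here is a toy calculation (not Spark's internal code) of the global watermark over two hypothetical input sources with a 10-minute grace time.

```java
import java.util.List;

// Each input stream contributes (max event time observed so far - grace time);
// the global watermark is the minimum across streams, so the slowest source does
// not have its late data dropped prematurely.
long graceMillis = 10 * 60 * 1000L;                                  // 10-minute grace
List<Long> maxEventTimePerSource = List.of(1_000_000L, 940_000L);    // assumed observed maxima

long globalWatermark = maxEventTimePerSource.stream()
        .mapToLong(maxTs -> maxTs - graceMillis)
        .min()
        .orElse(Long.MIN_VALUE);

// The source whose newest event is at 940 000 ms dominates:
// globalWatermark = 940 000 - 600 000 = 340 000 ms.
```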
Architecture design compromise: Since the Spark Structured Streaming watermark is designed primarily for state management, its integrity reasoning ability is relatively weak. For example, because the watermark is global, chained aggregations cannot be performed within one data stream topology (where the input of a downstream aggregation operator is the output of an upstream aggregation operator), as this could lead to incorrect aggregation results. The watermark API is also limited in scope: for instance, the event time column on which the watermark is defined must be used in the subsequent grouping, which greatly restricts what the aggregation operator can do.
4.5 Comparison of Engine Integrity Implementation Schemes
Finally, we compare the implementations of the four stream computing engines discussed above from the perspective of how their integrity signals are generated and propagated.
5、 Summary and outlook
To obtain correct results from a stream computing engine, engine consistency and data integrity must work together. Engine consistency is in essence the fault tolerance problem of distributed applications; engineering practice already offers systematic and mature solutions, and the engines are all evolving in a similar direction, namely "emit results (state) transactionally". Data integrity provides a deterministic view over unordered, unbounded data in stream computing, which is crucial in scenarios that require a single aggregation result, missing-data detection, incremental processing, and so on. How to realize data integrity reasoning is one of the hardest concepts for users to understand, and its implementation in the stream computing field is still far from complete. The essence of integrity reasoning is to have an integrity signal generation algorithm Signal such that for any event e arriving after processing time t, its event time satisfies eventTime(e) > Signal(t); in other words, Signal(t) is a lower bound on the event times of all data not yet observed.
Not every stream computing engine can produce correct results, so understanding the constraints of achieving data integrity is crucial for sound technology selection. MillWheel and Flink use a low watermark to synchronize the processing progress of the entire data flow topology; the low watermark is propagated either as a special stream element or through a bypass channel, so that every operator in the topology has a clear picture of the current data processing progress. Spark Structured Streaming and Kafka Streams adopt the grace time scheme, which simplifies the construction and maintenance of the architecture and lowers the cost for users to understand integrity, but these two engines also have shortcomings in their integrity reasoning capabilities.
Given the inherent tension between absolutely correct integrity and low delay, stream computing engines are trying to improve the current technical solutions from different directions: for example, having the producer of the data source (rather than the engine) emit a perfect watermark to reduce engine design cost, or building adaptive watermark algorithms into the engine to make up for the shortcomings of heuristic ones. At the same time, compared with engine consistency, the concept of data integrity sits much closer to users, and many definitions of integrity require user participation. In my opinion, the integrity solutions in many current engines still present a high threshold for framework users, and providing them with a friendlier way to express the correctness constraints their indicators require is a key problem to be solved. In time, if these problems can be solved well, true stream-batch unification and stream computing that surpasses batch processing can become reality.