This article is reprinted from the Good Future Technology official account. It uses Flink SQL cases to introduce the use of Flink CDC 2.0 and to interpret the core design of CDC.
In August 2021, Flink CDC released version 2.0.0. Compared with version 1.0, Flink CDC supports distributed reads and checkpoints in the full read phase and ensures data consistency without locking tables during full + incremental read.
The data reading logic of Flink CDC 2.0 is not complicated; what is complicated is the design of FLIP-27: Refactor Source Interface and unfamiliarity with the Debezium APIs. This article focuses on the processing logic of Flink CDC; the FLIP-27 design and the Debezium API calls are not explained in detail.
Based on CDC version 2.0.0, this article introduces the use of Flink CDC 2.0 with a Flink SQL case, explains the core design of CDC (including split division, split reading, and incremental reading), and walks through the flink-mysql-cdc interface calls and implementations involved in data processing.
Read full and incremental data from a MySQL table and write it to Kafka in the changelog-json format. Observe the RowKind types and the number of affected rows in the output:
public static void main(String[] args) {
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
EnvironmentSettings envSettings = EnvironmentSettings.newInstance()
.useBlinkPlanner()
.inStreamingMode()
.build();
env.setParallelism(3);
// note: checkpointing (CK) must be enabled for incremental synchronization.
env.enableCheckpointing(10000);
StreamTableEnvironment tableEnvironment = StreamTableEnvironment.create(env, envSettings);
tableEnvironment.executeSql(" CREATE TABLE demoOrders (\n" +
" `order_id` INTEGER ,\n" +
" `order_date` DATE ,\n" +
" `order_time` TIMESTAMP(3),\n" +
" `quantity` INT ,\n" +
" `product_id` INT ,\n" +
" `purchaser` STRING,\n" +
" primary key(order_id) NOT ENFORCED" +
" ) WITH (\n" +
" 'connector' = 'mysql-cdc',\n" +
" 'hostname' = 'localhost',\n" +
" 'port' = '3306',\n" +
" 'username' = 'cdc',\n" +
" 'password' = '123456',\n" +
" 'database-name' = 'test',\n" +
" 'table-name' = 'demo_orders'," +
// Full data and incremental data synchronization.
" 'scan.startup.mode' = 'initial' " +
" )");
tableEnvironment.executeSql("CREATE TABLE sink (\n" +
" `order_id` INTEGER ,\n" +
" `order_date` DATE ,\n" +
" `order_time` TIMESTAMP(3),\n" +
" `quantity` INT ,\n" +
" `product_id` INT ,\n" +
" `purchaser` STRING,\n" +
" primary key (order_id) NOT ENFORCED " +
") WITH (\n" +
" 'connector' = 'kafka',\n" +
" 'properties.bootstrap.servers' = 'localhost:9092',\n" +
" 'topic' = 'mqTest02',\n" +
" 'format' = 'changelog-json' "+
")");
tableEnvironment.executeSql("insert into sink select * from demoOrders");}
Full data output:
{"data":{"order_id":1010,"order_date":"2021-09-17","order_time":"2021-09-22 10:52:12.189","quantity":53,"product_id":502,"purchaser":"flink"},"op":"+I"}
{"data":{"order_id":1009,"order_date":"2021-09-17","order_time":"2021-09-22 10:52:09.709","quantity":31,"product_id":500,"purchaser":"flink"},"op":"+I"}
{"data":{"order_id":1008,"order_date":"2021-09-17","order_time":"2021-09-22 10:52:06.637","quantity":69,"product_id":503,"purchaser":"flink"},"op":"+I"}
{"data":{"order_id":1007,"order_date":"2021-09-17","order_time":"2021-09-22 10:52:03.535","quantity":52,"product_id":502,"purchaser":"flink"},"op":"+I"}
{"data":{"order_id":1002,"order_date":"2021-09-17","order_time":"2021-09-22 10:51:51.347","quantity":69,"product_id":503,"purchaser":"flink"},"op":"+I"}
{"data":{"order_id":1001,"order_date":"2021-09-17","order_time":"2021-09-22 10:51:48.783","quantity":50,"product_id":502,"purchaser":"flink"},"op":"+I"}
{"data":{"order_id":1000,"order_date":"2021-09-17","order_time":"2021-09-17 17:40:32.354","quantity":30,"product_id":500,"purchaser":"flink"},"op":"+I"}
{"data":{"order_id":1006,"order_date":"2021-09-17","order_time":"2021-09-22 10:52:01.249","quantity":31,"product_id":500,"purchaser":"flink"},"op":"+I"}
{"data":{"order_id":1005,"order_date":"2021-09-17","order_time":"2021-09-22 10:51:58.813","quantity":69,"product_id":503,"purchaser":"flink"},"op":"+I"}
{"data":{"order_id":1004,"order_date":"2021-09-17","order_time":"2021-09-22 10:51:56.153","quantity":50,"product_id":502,"purchaser":"flink"},"op":"+I"}
{"data":{"order_id":1003,"order_date":"2021-09-17","order_time":"2021-09-22 10:51:53.727","quantity":30,"product_id":500,"purchaser":"flink"},"op":"+I"}
Modify table data and capture incremental data:
## Update the record with order_id = 1005
{"data":{"order_id":1005,"order_date":"2021-09-17","order_time":"2021-09-22 02:51:58.813","quantity":69,"product_id":503,"purchaser":"flink"},"op":"-U"}
{"data":{"order_id":1005,"order_date":"2021-09-17","order_time":"2021-09-22 02:55:43.627","quantity":80,"product_id":503,"purchaser":"flink"},"op":"+U"}
## Delete the record with order_id = 1000
{"data":{"order_id":1000,"order_date":"2021-09-17","order_time":"2021-09-17 09:40:32.354","quantity":30,"product_id":500,"purchaser":"flink"},"op":"-D"}
In the full phase, data is read in a distributed manner. The table is first divided into multiple chunks by primary key, and subtasks then read the data within each chunk range. Depending on whether the primary key column is an auto-increment integer, the table is split into evenly sized chunks or unevenly sized chunks.
When the primary key column is auto-increment and of an integer type (int, bigint, decimal), the minimum and maximum values of the column are queried and the data is divided evenly by chunkSize. Because the primary key is numeric, the end position of each chunk can be computed directly from the chunk's start position plus chunkSize.
Note: In the latest version, the trigger condition for even splitting no longer depends on whether the primary key column is auto-increment. The primary key column must be of an integer type, and a data distribution factor is calculated as (max(id) - min(id)) / rowCount. Even splitting is used only when the calculated factor is less than or equal to the configured distribution factor (evenly-distribution.factor, default 1000.0d).
// Calculate the data range of the primary key column.
select min(`order_id`), max(`order_id`) from demo_orders;
// Divide data into chunkSize-sized splits
chunk-0:    [min, min + chunkSize)
chunk-1:    [min + chunkSize, min + 2 * chunkSize)
.......
chunk-last: [previous chunk end, null)
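To make the note above concrete, the even-vs-uneven decision can be sketched as follows. This is a simplified illustration rather than the connector's actual code; the method name and the 1000.0d default are assumptions taken from the note.
// Illustrative sketch of the even-split decision described in the note above
// (not the connector's actual implementation).
static boolean useEvenlySizedChunks(long min, long max, long rowCount, double factorUpperBound) {
    if (rowCount <= 0) {
        return false; // an empty table is handled separately (a single full-table chunk)
    }
    // distribution factor ~= key-space width per row; small values mean a dense key space
    double distributionFactor = (double) (max - min) / rowCount;
    return distributionFactor <= factorUpperBound; // e.g. evenly-distribution.factor = 1000.0d
}
With order_id as the split key, a table whose keys are dense (for example, strictly auto-increment without large gaps) yields a factor close to 1 and is split evenly.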
When the primary key column is not auto-increment or is of a non-integer type, the key may be non-numeric. For each division, the not-yet-divided data is sorted in ascending order by primary key, and the maximum value among the first chunkSize rows is taken as the end position of the current chunk.
Note: In the latest version, the trigger condition for uneven splitting is that the primary key column is of a non-integer type, or that the calculated distribution factor (distributionFactor) is greater than the configured distribution factor (evenly-distribution.factor).
// After sorting the unsplit data, take the maximum value of the first chunkSize rows as the end position of the split.
chunkend = SELECT MAX(`order_id`) FROM (
SELECT `order_id` FROM `demo_orders`
WHERE `order_id` >= [start position of the current chunk, i.e. the end of the previous chunk]
ORDER BY `order_id` ASC
LIMIT [chunkSize]
) AS T
Flink divides table data into multiple chunks, and subtasks read chunk data concurrently without locking. Since no lock is held during split reading, other transactions may modify the data within the split range while it is being read, so the read alone cannot guarantee consistency. Therefore, in the full phase Flink combines snapshot-record reading with binlog-based correction to ensure data consistency.
Execute SQL to query the data records of the split range using JDBC:
## Read SQL for snapshot records
SELECT * FROM `test`.`demo_orders`
WHERE order_id >= [chunkStart]
AND NOT (order_id = [chunkEnd])
AND order_id <= [chunkEnd]
SHOW MASTER STATUS is executed before and after the snapshot read to obtain the current binlog offset (the low and high watermarks). After the snapshot is read, the binlog data within that offset range is consumed and used to correct the snapshot records that were read.
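Putting the two steps together, the per-chunk read can be sketched as the pseudo-Java below. The helper methods (showMasterStatus, readSnapshotRecords, readBinlogBetween, upsertBinlogIntoSnapshot) are hypothetical placeholders used only to illustrate the flow; the real logic lives in SnapshotSplitReader and RecordUtils#normalizedSplitRecords.
// Sketch of the lock-free read of one chunk (hypothetical helpers, illustration only).
BinlogOffset lowWatermark = showMasterStatus();              // offset before the snapshot query
List<SourceRecord> snapshot = readSnapshotRecords(chunk);    // SELECT ... within [chunkStart, chunkEnd]
BinlogOffset highWatermark = showMasterStatus();             // offset after the snapshot query
if (!lowWatermark.equals(highWatermark)) {
    // other transactions changed data while the snapshot was being read:
    // replay the binlog between the two watermarks and merge it into the snapshot records
    List<SourceRecord> binlogEvents = readBinlogBetween(lowWatermark, highWatermark, chunk);
    snapshot = upsertBinlogIntoSnapshot(snapshot, binlogEvents);
}
// output: [low watermark event][normalized events][high watermark event]
emit(snapshot);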
The data organization structure during snapshot reading and Binlog data reading is shown below:
Rules for correcting SnapshotEvents with BinlogEvents:
Revised data organization structure:
The data in the split range [1, 11] is used as an example to describe how split data is processed; c, d, and u represent the insert, delete, and update operations captured by Debezium.
Data and structure before revision:
Revised data and structure:
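As a simplified, hypothetical illustration of the correction rule: binlog events whose key falls inside the split range are applied to the snapshot records by primary key (c and u upsert the latest image, d removes the record), while events whose key falls outside the range are ignored when correcting this split.
Snapshot records read for split [1, 11]:       k1, k2, k3, ..., k11
Binlog events between low and high watermark:  u(k2), d(k3), c(k12)
Corrected output for the split:                k1, k2 (updated image), k4, ..., k11
(k3 is dropped; the c(k12) event lies outside [1, 11] and is ignored for this split)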
After a single split is processed, the data range (ChunkStart, ChunkEnd) of the completed split and the maximum binlog offset (high watermark) read for it are reported to the SplitEnumerator. They are used to determine the start offset for incremental reading.
After the full read phase, the SplitEnumerator issues a BinlogSplit for incremental data reading. The most important attribute of the BinlogSplit is its starting offset: if the offset is too small, duplicate data may appear downstream; if it is too large, data may be lost. The start offset of Flink CDC incremental reading is the smallest binlog offset among all completed full splits, and only data that meets the delivery condition is sent downstream. Data delivery condition: a captured binlog record is emitted only if its offset is greater than the high watermark of the snapshot split that its key belongs to.
For example, the completed split information retained by the SplitEnumerator is listed below:
| Split Index | Chunk Data Range | Maximum Binlog Offset Read by the Split |
| --- | --- | --- |
| 0 | [1, 100] | 1000 |
| 1 | [101, 200] | 800 |
| 2 | [201, 300] | 1500 |
During incremental reading, binlog data is consumed starting from offset 800. When a record such as <data:123, offset:1500> is captured, the snapshot split that key 123 belongs to ([101, 200]) is located first, together with that split's maximum binlog offset 800. If the current offset is greater than the maximum offset read for that snapshot split, the record is emitted; otherwise it is discarded.
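The delivery condition can also be written as a small predicate. This is a simplified, illustrative sketch with a hypothetical FinishedSplit type; the actual logic is the shouldEmit() method of BinlogSplitReader shown later in this article.
// Hypothetical, simplified illustration of the binlog delivery condition.
static class FinishedSplit {
    final long start;
    final long end;
    final long highWatermark;
    FinishedSplit(long start, long end, long highWatermark) {
        this.start = start;
        this.end = end;
        this.highWatermark = highWatermark;
    }
    boolean containsKey(long key) {
        return key >= start && key <= end;
    }
}

static boolean shouldEmit(long key, long offset, java.util.List<FinishedSplit> finishedSplits) {
    for (FinishedSplit split : finishedSplits) {
        if (split.containsKey(key)) {
            // emit only if the change happened after this split's snapshot was taken
            return offset > split.highWatermark;
        }
    }
    return false; // the key is not covered by any finished snapshot split
}
For the record <data:123, offset:1500> above, key 123 falls into split [101, 200] whose high watermark is 800, so the record is emitted.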
The design of FLIP-27: Refactor Source Interface is not described in detail here; this section focuses on the flink-mysql-cdc interface calls and their implementation.
The SourceCoordinator, the OperatorCoordinator implementation for a Source, runs on the master node. At startup, it calls MySqlParallelSource#createEnumerator to create a MySqlSourceEnumerator, invokes its start method, and performs some initialization work:
1) Create a MySqlSourceEnumerator, use MySqlHybridSplitAssigner to split full + incremental data, and use MySqlValidator to verify the MySQL version and configuration
2) MySqlValidator verification:
3) MySqlSplitAssigner initialization:
4) Start a periodic scheduling thread that asks the SourceReaders to report splits that have been finished but for which no ACK event has been received by the SourceEnumerator.
private void syncWithReaders(int[] subtaskIds, Throwable t) {
if (t != null) {
throw new FlinkRuntimeException("Failed to list obtain registered readers due to:", t);
}
// when the SourceEnumerator restores or the communication failed between
// SourceEnumerator and SourceReader, it may missed some notification event.
// tell all SourceReader(s) to report there finished but unacked splits.
if (splitAssigner.waitingForFinishedSplits()) {
for (int subtaskId : subtaskIds) {
// note: Send FinishedSnapshotSplitsRequestEvent
context.sendEventToSourceReader(
subtaskId, new FinishedSnapshotSplitsRequestEvent());
}
}
}
SourceOperator integrates SourceReader and interacts with SourceCoordinator through OperatorEventGateway.
1. During initialization, the SourceOperator creates a MySqlSourceReader through MySqlParallelSource. The MySqlSourceReader creates a fetcher through the SingleThreadFetcherManager to pull split data, and the data is written to elementsQueue in the MySqlRecords format.
MySqlParallelSource#createReader
public SourceReader<T, MySqlSplit> createReader(SourceReaderContext readerContext) throws Exception {
// note: Data storage queue
FutureCompletingBlockingQueue<RecordsWithSplitIds<SourceRecord>> elementsQueue =
new FutureCompletingBlockingQueue<>();
final Configuration readerConfiguration = getReaderConfig(readerContext);
// note: Split Reader factory class
Supplier<MySqlSplitReader> splitReaderSupplier =
() -> new MySqlSplitReader(readerConfiguration, readerContext.getIndexOfSubtask());
return new MySqlSourceReader<>(
elementsQueue,
splitReaderSupplier,
new MySqlRecordEmitter<>(deserializationSchema),
readerConfiguration,
readerContext);
}
2. The created MySqlSourceReader is registered with the SourceCoordinator by sending a reader registration event. After receiving the event, the SourceCoordinator saves the reader's address and index.
SourceCoordinator#handleReaderRegistrationEvent
// note: SourceCoordinator handle the reader registration event
private void handleReaderRegistrationEvent(ReaderRegistrationEvent event) {
context.registerSourceReader(new ReaderInfo(event.subtaskId(), event.location()));
enumerator.addReader(event.subtaskId());
}
3. After the MySqlSourceReader starts, it sends a split request event to the MySqlSourceEnumerator to obtain the splits assigned to it.
4. After the SourceOperator is initialized, emitNext is called continuously: the SourceReaderBase obtains the data sets from elementsQueue and hands them to the MySqlRecordEmitter. Interface call diagram:
When the MySqlSourceReader starts, it sends a RequestSplitEvent to the MySqlSourceEnumerator to obtain a split range to read. In the full read phase, the MySqlSourceEnumerator processes the split request and finally returns a MySqlSnapshotSplit.
1. Handle the split request event, assign a split to the requesting reader, and deliver the MySqlSplit (MySqlSnapshotSplit in the full stage, MySqlBinlogSplit in the incremental stage) by sending an AddSplitEvent.
MySqlSourceEnumerator#handleSplitRequest
public void handleSplitRequest(int subtaskId, @Nullable String requesterHostname) {
if (!context.registeredReaders().containsKey(subtaskId)) {
// reader failed between sending the request and now. skip this request.
return;
}
// note: store the requesting reader's subtaskId in a TreeSet; task-0 is preferentially assigned when the binlog split is processed
readersAwaitingSplit.add(subtaskId);
assignSplits();
}
// note: Assign a split
private void assignSplits() {
final Iterator<Integer> awaitingReader = readersAwaitingSplit.iterator();
while (awaitingReader.hasNext()) {
int nextAwaiting = awaitingReader.next();
// if the reader that requested another split has failed in the meantime, remove
// it from the list of waiting readers
if (!context.registeredReaders().containsKey(nextAwaiting)) {
awaitingReader.remove();
continue;
}
//note: The split is assigned by the MySqlSplitAssigner
Optional<MySqlSplit> split = splitAssigner.getNext();
if (split.isPresent()) {
final MySqlSplit mySqlSplit = split.get();
// note: Send AddSplitEvent and return split information for Reader
context.assignSplit(mySqlSplit, nextAwaiting);
awaitingReader.remove();
LOG.info("Assign split {} to subtask {}", mySqlSplit, nextAwaiting);
} else {
// there is no available splits by now, skip assigning
break;
}
}
}
2. The logic of MySqlHybridSplitAssigner for handling full data splits and incremental data splits is shown below.
MySqlHybridSplitAssigner#getNext
@Override
public Optional<MySqlSplit> getNext() {
if (snapshotSplitAssigner.noMoreSplits()) {
// binlog split assigning
if (isBinlogSplitAssigned) {
// no more splits for the assigner
return Optional.empty();
} else if (snapshotSplitAssigner.isFinished()) {
// we need to wait snapshot-assigner to be finished before
// assigning the binlog split. Otherwise, records emitted from binlog split
// might be out-of-order in terms of same primary key with snapshot splits.
isBinlogSplitAssigned = true;
//note: after all snapshot splits are finished, create the BinlogSplit.
return Optional.of(createBinlogSplit());
} else {
// binlog split is not ready by now
return Optional.empty();
}
} else {
// note: SnapshotSplit is created by the MySqlSnapshotSplitAssigner
// snapshot assigner still have remaining splits, assign split from it
return snapshotSplitAssigner.getNext();
}
}
3. MySqlSnapshotSplitAssigner handles the full split logic. The splits are generated by the ChunkSplitter and kept in the remainingSplits collection, from which they are taken through an iterator.
@Override
public Optional<MySqlSplit> getNext() {
if (!remainingSplits.isEmpty()) {
// return remaining splits firstly
Iterator<MySqlSnapshotSplit> iterator = remainingSplits.iterator();
MySqlSnapshotSplit split = iterator.next();
iterator.remove();
//note: The allocated splits are stored in the assignedSplits collection
assignedSplits.put(split.splitId(), split);
return Optional.of(split);
} else {
// note: remainingTables stores the name of the table to be read in the initialization phase
TableId nextTable = remainingTables.pollFirst();
if (nextTable != null) {
// split the given table into chunks (snapshot splits)
// note: ChunkSplitter is created in the initialization phase, and generateSplits is called to divide splits
Collection<MySqlSnapshotSplit> splits = chunkSplitter.generateSplits(nextTable);
// note: retain all generated split information
remainingSplits.addAll(splits);
// note: The table that has been split
alreadyProcessedTables.add(nextTable);
// note: Call this method recursively
return getNext();
} else {
return Optional.empty();
}
}
}
4. ChunkSplitter divides the table into evenly or unevenly distributed splits. The table being read must have a physical primary key.
public Collection<MySqlSnapshotSplit> generateSplits(TableId tableId) {
Table schema = mySqlSchema.getTableSchema(tableId).getTable();
List<Column> primaryKeys = schema.primaryKeyColumns();
// note: Must have a primary key
if (primaryKeys.isEmpty()) {
throw new ValidationException(
String.format(
"Incremental snapshot for tables requires primary key,"
+ " but table %s doesn't have primary key.",
tableId));
}
// use first field in primary key as the split key
Column splitColumn = primaryKeys.get(0);
final List<ChunkRange> chunks;
try {
// note: Divide data into multiple splits by primary key column
chunks = splitTableIntoChunks(tableId, splitColumn);
} catch (SQLException e) {
throw new FlinkRuntimeException("Failed to split chunks for table " + tableId, e);
}
//note: convert the primary key data type and package each ChunkRange into a MySqlSnapshotSplit.
// convert chunks into splits
List<MySqlSnapshotSplit> splits = new ArrayList<>();
RowType splitType = splitType(splitColumn);
for (int i = 0; i < chunks.size(); i++) {
ChunkRange chunk = chunks.get(i);
MySqlSnapshotSplit split =
createSnapshotSplit(
tableId, i, splitType, chunk.getChunkStart(), chunk.getChunkEnd());
splits.add(split);
}
return splits;
}
5. splitTableIntoChunks divides splits based on physical primary keys.
private List<ChunkRange> splitTableIntoChunks(TableId tableId, Column splitColumn)
        throws SQLException {
final String splitColumnName = splitColumn.name();
// select min, max
final Object[] minMaxOfSplitColumn = queryMinMax(jdbc, tableId, splitColumnName);
final Object min = minMaxOfSplitColumn[0];
final Object max = minMaxOfSplitColumn[1];
if (min == null || max == null || min.equals(max)) {
// empty table, or only one row, return full table scan as a chunk
return Collections.singletonList(ChunkRange.all());
}
final List<ChunkRange> chunks;
if (splitColumnEvenlyDistributed(splitColumn)) {
// use evenly-sized chunks which is much efficient
// note: Evenly divided by primary key
chunks = splitEvenlySizedChunks(min, max);
} else {
// note: Non-uniform division by primary key
// use unevenly-sized chunks which will request many queries and is not efficient.
chunks = splitUnevenlySizedChunks(tableId, splitColumnName, min, max);
}
return chunks;
}
/** Checks whether split column is evenly distributed across its range. */
private static boolean splitColumnEvenlyDistributed(Column splitColumn) {
// only column is auto-incremental are recognized as evenly distributed.
// TODO: we may use MAX,MIN,COUNT to calculate the distribution in the future.
if (splitColumn.isAutoIncremented()) {
DataType flinkType = MySqlTypeUtils.fromDbzColumn(splitColumn);
LogicalTypeRoot typeRoot = flinkType.getLogicalType().getTypeRoot();
// currently, we only support split column with type BIGINT, INT, DECIMAL
return typeRoot == LogicalTypeRoot.BIGINT
|| typeRoot == LogicalTypeRoot.INTEGER
|| typeRoot == LogicalTypeRoot.DECIMAL;
} else {
return false;
}
}
/**
* Split table into evenly sized chunks based on the numeric min and max value of split column,
* and tumble chunks in {@link #chunkSize} step size.
*/
private List<ChunkRange> splitEvenlySizedChunks(Object min, Object max) {
if (ObjectUtils.compare(ObjectUtils.plus(min, chunkSize), max) > 0) {
// there is no more than one chunk, return full table as a chunk
return Collections.singletonList(ChunkRange.all());
}
final List<ChunkRange> splits = new ArrayList<>();
Object chunkStart = null;
Object chunkEnd = ObjectUtils.plus(min, chunkSize);
// chunkEnd <= max
while (ObjectUtils.compare(chunkEnd, max) <= 0) {
splits.add(ChunkRange.of(chunkStart, chunkEnd));
chunkStart = chunkEnd;
chunkEnd = ObjectUtils.plus(chunkEnd, chunkSize);
}
// add the ending split
splits.add(ChunkRange.of(chunkStart, null));
return splits;
}
/** Split table into unevenly sized chunks by continuously calculating next chunk max value. */
private List<ChunkRange> splitUnevenlySizedChunks(
TableId tableId, String splitColumnName, Object min, Object max) throws SQLException {
final List<ChunkRange> splits = new ArrayList<>();
Object chunkStart = null;
Object chunkEnd = nextChunkEnd(min, tableId, splitColumnName, max);
int count = 0;
while (chunkEnd != null && ObjectUtils.compare(chunkEnd, max) <= 0) {
// we start from [null, min + chunk_size) and avoid [null, min)
splits.add(ChunkRange.of(chunkStart, chunkEnd));
// may sleep a while to avoid DDOS on MySQL server
maySleep(count++);
chunkStart = chunkEnd;
chunkEnd = nextChunkEnd(chunkEnd, tableId, splitColumnName, max);
}
// add the ending split
splits.add(ChunkRange.of(chunkStart, null));
return splits;
}
private Object nextChunkEnd(
Object previousChunkEnd, TableId tableId, String splitColumnName, Object max)
throws SQLException {
// chunk end might be null when max values are removed
Object chunkEnd =
queryNextChunkMax(jdbc, tableId, splitColumnName, chunkSize, previousChunkEnd);
if (Objects.equals(previousChunkEnd, chunkEnd)) {
// we don't allow equal chunk start and end,
// should query the next one larger than chunkEnd
chunkEnd = queryMin(jdbc, tableId, splitColumnName, chunkEnd);
}
if (ObjectUtils.compare(chunkEnd, max) >= 0) {
return null;
} else {
return chunkEnd;
}
}
After the MySqlSourceReader receives an assigned split, it creates a SplitFetcher thread, which first adds an AddSplitsTask to the taskQueue and runs it, and then runs the FetchTask to read data through the Debezium API. The data read is stored in elementsQueue, from which the SourceReaderBase takes it and hands it to the MySqlRecordEmitter.
1. Create a SplitFetcher to add an AddSplitsTask to the taskQueue when processing the Split Assignment event:
SingleThreadFetcherManager#addSplits
public void addSplits(List<SplitT> splitsToAdd) {
SplitFetcher<E, SplitT> fetcher = getRunningFetcher();
if (fetcher == null) {
fetcher = createSplitFetcher();
// Add the splits to the fetchers.
fetcher.addSplits(splitsToAdd);
startFetcher(fetcher);
} else {
fetcher.addSplits(splitsToAdd);
}
}
// Create a SplitFetcher
protected synchronized SplitFetcher<E, SplitT> createSplitFetcher() {
if (closed) {
throw new IllegalStateException("The split fetcher manager has closed.");
}
// Create SplitReader.
SplitReader<E, SplitT> splitReader = splitReaderFactory.get();
int fetcherId = fetcherIdGenerator.getAndIncrement();
SplitFetcher<E, SplitT> splitFetcher =
new SplitFetcher<>(
fetcherId,
elementsQueue,
splitReader,
errorHandler,
() -> {
fetchers.remove(fetcherId);
elementsQueue.notifyAvailable();
});
fetchers.put(fetcherId, splitFetcher);
return splitFetcher;
}
public void addSplits(List<SplitT> splitsToAdd) {
enqueueTask(new AddSplitsTask<>(splitReader, splitsToAdd, assignedSplits));
wakeUp(true);
}
2. The SplitFetcher thread runs the AddSplitsTask first to register the splits and then runs the FetchTask to pull data.
SplitFetcher#runOnce
void runOnce() {
try {
if (shouldRunFetchTask()) {
runningTask = fetchTask;
} else {
runningTask = taskQueue.take();
}
if (!wakeUp.get() && runningTask.run()) {
LOG.debug("Finished running task {}", runningTask);
runningTask = null;
checkAndSetIdle();
}
} catch (Exception e) {
throw new RuntimeException(
String.format(
"SplitFetcher thread %d received unexpected exception while polling the records",
id),
e);
}
maybeEnqueueTask(runningTask);
synchronized (wakeUp) {
// Set the running task to null. It is necessary for the shutdown method to avoid
// unnecessarily interrupt the running task.
runningTask = null;
// Set the wakeUp flag to false.
wakeUp.set(false);
LOG.debug("Cleaned wakeup flag.");
}
}
3. AddSplitsTask calls the MySqlSplitReader#handleSplitsChanges method to add the assigned split information to the split queue. In the next fetch() call, a split is taken from the queue and its data is read.
AddSplitsTask#run
public boolean run() {
for (SplitT s : splitsToAdd) {
assignedSplits.put(s.splitId(), s);
}
splitReader.handleSplitsChanges(new SplitsAddition<>(splitsToAdd));
return true;
}
MySqlSplitReader#handleSplitsChanges
public void handleSplitsChanges(SplitsChange<MySqlSplit> splitsChanges) {
if (!(splitsChanges instanceof SplitsAddition)) {
throw new UnsupportedOperationException(
String.format(
"The SplitChange type of %s is not supported.",
splitsChanges.getClass()));
}
//note: Add a split to the queue.
splits.addAll(splitsChanges.splits());
}
4. MySqlSplitReader executes fetch(): the DebeziumReader reads data into an event queue and, after the data is corrected, returns it in the MySqlRecords format.
MySqlSplitReader#fetch
@Override
public RecordsWithSplitIds<SourceRecord> fetch() throws IOException {
// note: creates a reader and reads data
checkSplitOrStartNext();
Iterator<SourceRecord> dataIt = null;
try {
// note: corrects the read data
dataIt = currentReader.pollSplitRecords();
} catch (InterruptedException e) {
LOG.warn("fetch data failed.", e);
throw new IOException(e);
}
// note: The returned data is encapsulated as MySqlRecords for transmission
return dataIt == null
? finishedSnapshotSplit()
: MySqlRecords.forRecords(currentSplitId, dataIt);
}
private void checkSplitOrStartNext() throws IOException {
// the binlog reader should keep alive
if (currentReader instanceof BinlogSplitReader) {
return;
}
if (canAssignNextSplit()) {
// note: reads MySqlSplit from the split queue
final MySqlSplit nextSplit = splits.poll();
if (nextSplit == null) {
throw new IOException("Cannot fetch from another split - no split remaining");
}
currentSplitId = nextSplit.splitId();
// note: distinguishes between full split reading and the incremental split reading
if (nextSplit.isSnapshotSplit()) {
if (currentReader == null) {
final MySqlConnection jdbcConnection = getConnection(config);
final BinaryLogClient binaryLogClient = getBinaryClient(config);
final StatefulTaskContext statefulTaskContext =
new StatefulTaskContext(config, binaryLogClient, jdbcConnection);
// note: creates a SnapshotSplitReader and uses the Debezium API to read the allocated data and the binlog value of the range
currentReader = new SnapshotSplitReader(statefulTaskContext, subtaskId);
}
} else {
// point from snapshot split to binlog split
if (currentReader != null) {
LOG.info("It's turn to read binlog split, close current snapshot reader");
currentReader.close();
}
final MySqlConnection jdbcConnection = getConnection(config);
final BinaryLogClient binaryLogClient = getBinaryClient(config);
final StatefulTaskContext statefulTaskContext =
new StatefulTaskContext(config, binaryLogClient, jdbcConnection);
LOG.info("Create binlog reader");
// note: Create a BinlogSplitReader and use the Debezium API to perform incremental data reading
currentReader = new BinlogSplitReader(statefulTaskContext, subtaskId);
}
// note: Reader is executed to read data
currentReader.submitSplit(nextSplit);
}
}
The DebeziumReader covers both full split reading and incremental split reading. After the data is read, it is stored in the ChangeEventQueue and corrected during pollSplitRecords.
1. SnapshotSplitReader: full split reading. In the full phase, the table data within the split range is queried with a SELECT statement. SHOW MASTER STATUS is executed before and after the query, and the current binlog offsets are written to the queue as the low and high watermarks.
public void submitSplit(MySqlSplit mySqlSplit) {
......
executor.submit(
() -> {
try {
currentTaskRunning = true;
// note: the current binlog offset is recorded before and after the data read (low/high watermark)
// 1. execute snapshot read task.
final SnapshotSplitChangeEventSourceContextImpl sourceContext =
new SnapshotSplitChangeEventSourceContextImpl();
SnapshotResult snapshotResult =
splitSnapshotReadTask.execute(sourceContext);
// note: prepares for incremental reading, including the start offset
final MySqlBinlogSplit appendBinlogSplit = createBinlogSplit(sourceContext);
final MySqlOffsetContext mySqlOffsetContext =
statefulTaskContext.getOffsetContext();
mySqlOffsetContext.setBinlogStartPoint(
appendBinlogSplit.getStartingOffset().getFilename(),
appendBinlogSplit.getStartingOffset().getPosition());
// note: reads from the start offset
// 2. execute binlog read task
if (snapshotResult.isCompletedOrSkipped()) {
// we should only capture events for the current table,
Configuration dezConf =
statefulTaskContext
.getDezConf()
.edit()
.with(
"table.whitelist",
currentSnapshotSplit.getTableId())
.build();
// task to read binlog for current split
MySqlBinlogSplitReadTask splitBinlogReadTask =
new MySqlBinlogSplitReadTask(
new MySqlConnectorConfig(dezConf),
mySqlOffsetContext,
statefulTaskContext.getConnection(),
statefulTaskContext.getDispatcher(),
statefulTaskContext.getErrorHandler(),
StatefulTaskContext.getClock(),
statefulTaskContext.getTaskContext(),
(MySqlStreamingChangeEventSourceMetrics)
statefulTaskContext
.getStreamingChangeEventSourceMetrics(),
statefulTaskContext
.getTopicSelector()
.getPrimaryTopic(),
appendBinlogSplit);
splitBinlogReadTask.execute(
new SnapshotBinlogSplitChangeEventSourceContextImpl());
} else {
readException =
new IllegalStateException(
String.format(
"Read snapshot for mysql split %s fail",
currentSnapshotSplit));
}
} catch (Exception e) {
currentTaskRunning = false;
LOG.error(
String.format(
"Execute snapshot read task for mysql split %s fail",
currentSnapshotSplit),
e);
readException = e;
}
});
}
2. SnapshotSplitReader binlog reading. The key point of the binlog read within a snapshot split is to determine when the MySqlBinlogSplitReadTask should stop: reading ends once the offset reaches the high watermark recorded at the end of the snapshot read.
MySqlBinlogSplitReadTask#handleEvent
protected void handleEvent(Event event) {
// note: Event delivery queue
super.handleEvent(event);
// note: in the full read phase, the binlog read is bounded and must be terminated once the high watermark is reached
// check do we need to stop for read binlog for snapshot split.
if (isBoundedRead()) {
final BinlogOffset currentBinlogOffset =
new BinlogOffset(
offsetContext.getOffset().get(BINLOG_FILENAME_OFFSET_KEY).toString(),
Long.parseLong(
offsetContext
.getOffset()
.get(BINLOG_POSITION_OFFSET_KEY)
.toString()));
// note: stop reading once currentBinlogOffset reaches the high watermark (HW)
// reach the high watermark, the binlog reader should finished
if (currentBinlogOffset.isAtOrBefore(binlogSplit.getEndingOffset())) {
// send binlog end event
try {
signalEventDispatcher.dispatchWatermarkEvent(
binlogSplit,
currentBinlogOffset,
SignalEventDispatcher.WatermarkKind.BINLOG_END);
} catch (InterruptedException e) {
logger.error("Send signal event error.", e);
errorHandler.setProducerThrowable(
new DebeziumException("Error processing binlog signal event", e));
}
// Terminate binlog reading
// tell reader the binlog task finished
((SnapshotBinlogSplitChangeEventSourceContextImpl) context).finished();
}
}
}
3. The original data in the queue is corrected when the SnapshotSplitReader executes the pollSplitRecords. Please see RecordUtils#normalizedSplitRecords for more information about the processing logic.
public Iterator<SourceRecord> pollSplitRecords() throws InterruptedException {
if (hasNextElement.get()) {
// data input: [low watermark event][snapshot events][high watermark event][binlogevents][binlog-end event]
// data output: [low watermark event][normalized events][high watermark event]
boolean reachBinlogEnd = false;
final List<SourceRecord> sourceRecords = new ArrayList<>();
while (!reachBinlogEnd) {
// note: Handle DataChangeEvent events written in queues
List<DataChangeEvent> batch = queue.poll();
for (DataChangeEvent event : batch) {
sourceRecords.add(event.getRecord());
if (RecordUtils.isEndWatermarkEvent(event.getRecord())) {
reachBinlogEnd = true;
break;
}
}
}
// snapshot split return its data once
hasNextElement.set(false);
// ************ Correct data ***********
return normalizedSplitRecords(currentSnapshotSplit, sourceRecords, nameAdjuster)
.iterator();
}
// the data has been polled, no more data
reachEnd.compareAndSet(false, true);
return null;
}
4. BinlogSplitReader data reading. The read logic itself is simple; the key point is the starting offset, which is the minimum high watermark (HW) of all completed snapshot splits.
5. The raw data in the queue is also corrected when the BinlogSplitReader executes pollSplitRecords to ensure data consistency. Binlog reading in the incremental phase is unbounded and all data is put into the event queue, so the BinlogSplitReader uses shouldEmit() to decide whether each record should be emitted.
BinlogSplitReader#pollSplitRecords
public Iterator<SourceRecord> pollSplitRecords() throws InterruptedException {
checkReadException();
final List<SourceRecord> sourceRecords = new ArrayList<>();
if (currentTaskRunning) {
List<DataChangeEvent> batch = queue.poll();
for (DataChangeEvent event : batch) {
if (shouldEmit(event.getRecord())) {
sourceRecords.add(event.getRecord());
}
}
}
return sourceRecords.iterator();
}
Event delivery conditions:
/**
*
* Returns the record should emit or not.
*
* <p>The watermark signal algorithm is the binlog split reader only sends the binlog event that
* belongs to its finished snapshot splits. For each snapshot split, the binlog event is valid
* since the offset is after its high watermark.
*
* <pre> E.g: the data input is :
* snapshot-split-0 info : [0, 1024) highWatermark0
* snapshot-split-1 info : [1024, 2048) highWatermark1
* the data output is:
* only the binlog event belong to [0, 1024) and offset is after highWatermark0 should send,
* only the binlog event belong to [1024, 2048) and offset is after highWatermark1 should send.
* </pre>
*/
private boolean shouldEmit(SourceRecord sourceRecord) {
if (isDataChangeRecord(sourceRecord)) {
TableId tableId = getTableId(sourceRecord);
BinlogOffset position = getBinlogPosition(sourceRecord);
// aligned, all snapshot splits of the table has reached max highWatermark
// note: emit when the position of the newly received event is greater than the table's maximum high watermark
if (position.isAtOrBefore(maxSplitHighWatermarkMap.get(tableId))) {
return true;
}
Object[] key =
getSplitKey(
currentBinlogSplit.getSplitKeyType(),
sourceRecord,
statefulTaskContext.getSchemaNameAdjuster());
for (FinishedSnapshotSplitInfo splitInfo : finishedSplitsInfo.get(tableId)) {
/**
* note: emit the data when the record's key belongs to a finished snapshot split and its offset is greater than that split's high watermark (HW)
*/
if (RecordUtils.splitKeyRangeContains(
key, splitInfo.getSplitStart(), splitInfo.getSplitEnd())
&& position.isAtOrBefore(splitInfo.getHighWatermark())) {
return true;
}
}
// not in the monitored splits scope, do not emit
return false;
}
// always send the schema change event and signal event
// we need record them to state of Flink
return true;
}
The SourceReaderBase obtains the collection of data read for a split from the queue and converts it from Debezium's DataChangeEvent type to Flink's RowData type.
1. SourceReaderBase processing Split Data:
org.apache.flink.connector.base.source.reader.SourceReaderBase#pollNext
public InputStatus pollNext(ReaderOutput<T> output) throws Exception {
// make sure we have a fetch we are working on, or move to the next
RecordsWithSplitIds<E> recordsWithSplitId = this.currentFetch;
if (recordsWithSplitId == null) {
recordsWithSplitId = getNextFetch(output);
if (recordsWithSplitId == null) {
return trace(finishedOrAvailableLater());
}
}
// we need to loop here, because we may have to go across splits
while (true) {
// Process one record.
// note: read a single piece of data from the iterator through MySqlRecords
final E record = recordsWithSplitId.nextRecordFromSplit();
if (record != null) {
// emit the record.
recordEmitter.emitRecord(record, currentSplitOutput, currentSplitContext.state);
LOG.trace("Emitted record: {}", record);
// We always emit MORE_AVAILABLE here, even though we do not strictly know whether
// more is available. If nothing more is available, the next invocation will find
// this out and return the correct status.
// That means we emit the occasional 'false positive' for availability, but this
// saves us doing checks for every record. Ultimately, this is cheaper.
return trace(InputStatus.MORE_AVAILABLE);
} else if (!moveToNextSplit(recordsWithSplitId, output)) {
// The fetch is done and we just discovered that and have not emitted anything, yet.
// We need to move to the next fetch. As a shortcut, we call pollNext() here again,
// rather than emitting nothing and waiting for the caller to call us again.
return pollNext(output);
}
// else fall through the loop
}
}
private RecordsWithSplitIds<E> getNextFetch(final ReaderOutput<T> output) {
splitFetcherManager.checkErrors();
LOG.trace("Getting next source data batch from queue");
// note: obtain data from elementsQueue
final RecordsWithSplitIds<E> recordsWithSplitId = elementsQueue.poll();
if (recordsWithSplitId == null || !moveToNextSplit(recordsWithSplitId, output)) {
return null;
}
currentFetch = recordsWithSplitId;
return recordsWithSplitId;
}
2. MySqlRecords returns a single data collection:
com.ververica.cdc.connectors.mysql.source.split.MySqlRecords#nextRecordFromSplit
public SourceRecord nextRecordFromSplit() {
final Iterator<SourceRecord> recordsForSplit = this.recordsForCurrentSplit;
if (recordsForSplit != null) {
if (recordsForSplit.hasNext()) {
return recordsForSplit.next();
} else {
return null;
}
} else {
throw new IllegalStateException();
}
}
3. MySqlRecordEmitter converts the data to RowData through RowDataDebeziumDeserializeSchema.
com.ververica.cdc.connectors.mysql.source.reader.MySqlRecordEmitter#emitRecord
public void emitRecord(SourceRecord element, SourceOutput<T> output, MySqlSplitState splitState)
throws Exception {
if (isWatermarkEvent(element)) {
BinlogOffset watermark = getWatermark(element);
if (isHighWatermarkEvent(element) && splitState.isSnapshotSplitState()) {
splitState.asSnapshotSplitState().setHighWatermark(watermark);
}
} else if (isSchemaChangeEvent(element) && splitState.isBinlogSplitState()) {
HistoryRecord historyRecord = getHistoryRecord(element);
Array tableChanges =
historyRecord.document().getArray(HistoryRecord.Fields.TABLE_CHANGES);
TableChanges changes = TABLE_CHANGE_SERIALIZER.deserialize(tableChanges, true);
for (TableChanges.TableChange tableChange : changes) {
splitState.asBinlogSplitState().recordSchema(tableChange.getId(), tableChange);
}
} else if (isDataChangeRecord(element)) {
// note: data process
if (splitState.isBinlogSplitState()) {
BinlogOffset position = getBinlogPosition(element);
splitState.asBinlogSplitState().setStartingOffset(position);
}
debeziumDeserializationSchema.deserialize(
element,
new Collector<T>() {
@Override
public void collect(final T t) {
output.collect(t);
}
@Override
public void close() {
// do nothing
}
});
} else {
// unknown element
LOG.info("Meet unknown element {}, just skip.", element);
}
}
4. The RowDataDebeziumDeserializeSchema deserialization process:
com.ververica.cdc.debezium.table.RowDataDebeziumDeserializeSchema#deserialize
public void deserialize(SourceRecord record, Collector<RowData> out) throws Exception {
Envelope.Operation op = Envelope.operationFor(record);
Struct value = (Struct) record.value();
Schema valueSchema = record.valueSchema();
if (op == Envelope.Operation.CREATE || op == Envelope.Operation.READ) {
GenericRowData insert = extractAfterRow(value, valueSchema);
validator.validate(insert, RowKind.INSERT);
insert.setRowKind(RowKind.INSERT);
out.collect(insert);
} else if (op == Envelope.Operation.DELETE) {
GenericRowData delete = extractBeforeRow(value, valueSchema);
validator.validate(delete, RowKind.DELETE);
delete.setRowKind(RowKind.DELETE);
out.collect(delete);
} else {
GenericRowData before = extractBeforeRow(value, valueSchema);
validator.validate(before, RowKind.UPDATE_BEFORE);
before.setRowKind(RowKind.UPDATE_BEFORE);
out.collect(before);
GenericRowData after = extractAfterRow(value, valueSchema);
validator.validate(after, RowKind.UPDATE_AFTER);
after.setRowKind(RowKind.UPDATE_AFTER);
out.collect(after);
}
}
After the MySqlSourceReader finishes processing a full split, it sends the completed split information, including the split ID and the high watermark, to the MySqlSourceEnumerator, and then continues to request splits.
com.ververica.cdc.connectors.mysql.source.reader.MySqlSourceReader#onSplitFinished
protected void onSplitFinished(Map<String, MySqlSplitState> finishedSplitIds) {
for (MySqlSplitState mySqlSplitState : finishedSplitIds.values()) {
MySqlSplit mySqlSplit = mySqlSplitState.toMySqlSplit();
finishedUnackedSplits.put(mySqlSplit.splitId(), mySqlSplit.asSnapshotSplit());
}
/**
* note: send the split read completion event
*/
reportFinishedSnapshotSplitsIfNeed();
// Continue to send split requests after the previous split is finished
context.sendSplitRequest();
}
private void reportFinishedSnapshotSplitsIfNeed() {
if (!finishedUnackedSplits.isEmpty()) {
final Map<String, BinlogOffset> finishedOffsets = new HashMap<>();
for (MySqlSnapshotSplit split : finishedUnackedSplits.values()) {
// note: record the split ID and its maximum binlog offset (high watermark) in finishedOffsets
finishedOffsets.put(split.splitId(), split.getHighWatermark());
}
FinishedSnapshotSplitsReportEvent reportEvent =
new FinishedSnapshotSplitsReportEvent(finishedOffsets);
context.sendSourceEventToCoordinator(reportEvent);
LOG.debug(
"The subtask {} reports offsets of finished snapshot splits {}.",
subtaskId,
finishedOffsets);
}
}
After all splits have been read in the full phase, the MySqlHybridSplitAssigner creates a BinlogSplit for subsequent incremental reading. When the BinlogSplit is created, the smallest BinlogOffset is selected from all completed full splits. Note: in the 2.0.0 branch, createBinlogSplit always starts from offset 0; this bug has been fixed on the latest master branch.
private MySqlBinlogSplit createBinlogSplit() {
final List<MySqlSnapshotSplit> assignedSnapshotSplit =
snapshotSplitAssigner.getAssignedSplits().values().stream()
.sorted(Comparator.comparing(MySqlSplit::splitId))
.collect(Collectors.toList());
Map<String, BinlogOffset> splitFinishedOffsets =
snapshotSplitAssigner.getSplitFinishedOffsets();
final List<FinishedSnapshotSplitInfo> finishedSnapshotSplitInfos = new ArrayList<>();
final Map<TableId, TableChanges.TableChange> tableSchemas = new HashMap<>();
BinlogOffset minBinlogOffset = null;
// note: filters the minimum offset from all assignedSnapshotSplit
for (MySqlSnapshotSplit split : assignedSnapshotSplit) {
// find the min binlog offset
BinlogOffset binlogOffset = splitFinishedOffsets.get(split.splitId());
if (minBinlogOffset == null || binlogOffset.compareTo(minBinlogOffset) < 0) {
minBinlogOffset = binlogOffset;
}
finishedSnapshotSplitInfos.add(
new FinishedSnapshotSplitInfo(
split.getTableId(),
split.splitId(),
split.getSplitStart(),
split.getSplitEnd(),
binlogOffset));
tableSchemas.putAll(split.getTableSchemas());
}
final MySqlSnapshotSplit lastSnapshotSplit =
assignedSnapshotSplit.get(assignedSnapshotSplit.size() - 1).asSnapshotSplit();
return new MySqlBinlogSplit(
BINLOG_SPLIT_ID,
lastSnapshotSplit.getSplitKeyType(),
minBinlogOffset == null ? BinlogOffset.INITIAL_OFFSET : minBinlogOffset,
BinlogOffset.NO_STOPPING_OFFSET,
finishedSnapshotSplitInfos,
tableSchemas);
}