This topic describes how to troubleshoot issues that may occur on a merge subtask generated by a full and incremental synchronization task used to synchronize data to MaxCompute.
Description of a merge subtask
How it works
When a full and incremental synchronization task that synchronizes data to MaxCompute is run, the task first starts a batch synchronization subtask that synchronizes full data from a source table to a destination base table, and starts a real-time synchronization subtask that reads the change logs of the source database and writes the incremental data to a log table. The full data and incremental data are merged in T+1 mode, where T represents the day on which the synchronization task is configured. In the early morning of the T+1 day, the merge subtask generated by the synchronization task is started. It merges the incremental data in the T partition of the log table with the full data in the T-1 partition of the base table, and writes the result to the T partition of the base table. The following figure shows the data synchronization process in which partitioned tables are used.
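The T+1 merge described above can be illustrated with a minimal Python sketch. It models partitions as in-memory dictionaries keyed by primary key. The record layout (`op`, `key`, `row`), the operation names, and the function name `merge_partition` are assumptions made for illustration; they do not reflect the actual log table schema or the SQL that the merge subtask generates.

```python
from typing import Dict, List


def merge_partition(base_prev: Dict[int, dict], logs: List[dict]) -> Dict[int, dict]:
    """Apply the change logs of day T, in order, to the full data of day T-1."""
    merged = dict(base_prev)              # start from the T-1 full partition
    for log in logs:
        if log["op"] == "DELETE":
            merged.pop(log["key"], None)  # drop the row if it exists
        else:
            # INSERT and UPDATE both keep the latest row image (hypothetical format).
            merged[log["key"]] = log["row"]
    return merged


# T-1 partition of the base table (illustrative data).
base_t_minus_1 = {1: {"name": "a"}, 2: {"name": "b"}}

# T partition of the log table (illustrative data).
log_t = [
    {"op": "UPDATE", "key": 1, "row": {"name": "a2"}},
    {"op": "DELETE", "key": 2, "row": None},
    {"op": "INSERT", "key": 3, "row": {"name": "c"}},
]

# T partition of the base table: the new full data.
base_t = merge_partition(base_t_minus_1, log_t)
print(base_t)
```

The T-1 partition is copied rather than modified, which mirrors the fact that the merge writes a new partition instead of updating the previous one.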
Check subtasks and merge subtask
After a full and incremental synchronization task used to synchronize data to MaxCompute finishes running, Data Integration generates three subtasks in DataStudio, as shown in the following figure.
CheckStreamXDone: a subtask that determines the consumption offset of incremental data and checks whether all incremental data in partitions of the previous day is synchronized. The merge subtask generated by the synchronization task can start to merge full data with incremental data only if synchronization of all incremental data is complete.
FullDataInitSuccessCheckDone: a subtask that checks whether full synchronization is complete. Full synchronization is performed once when you run the synchronization task for the first time or add source tables to the synchronization task to synchronize full data of all source tables to destination MaxCompute tables. This subtask ensures that all existing data in source tables is synchronized to desired partitions of destination tables after you run the synchronization task for the first time or add source tables to the synchronization task.
MergeInto: a merge subtask that merges the incremental data in the log table with the full data in the base table to generate new full partition data.
Subtask running
The DataWorks scheduling system schedules the check subtasks and merge subtask after midnight every day. The MergeInto subtask depends on the FullDataInitSuccessCheckDone subtask, and the FullDataInitSuccessCheckDone subtask depends on the CheckStreamXDone subtask. You can view scheduling dependencies between the subtasks on the Cycle Task page in Operation Center.
The instances generated for the three subtasks are run after midnight every day based on the following sequence: CheckStreamXDone, FullDataInitSuccessCheckDone, and MergeInto.
The MergeInto subtask is configured with a self-dependency: the instance generated for the MergeInto subtask in the current cycle can start to run only after the instance generated for the same subtask in the previous cycle is successfully run. The instance of the MergeInto subtask is run on the T+1 day to merge incremental data with full partition data of the T-1 day to ensure data integrity. The instance generated for the merge subtask on the T day generates the full partition data of the T-1 day. If that instance fails to run, the full partition data of the T-1 day is not generated, and the instance generated for the merge subtask on the T+1 day cannot be run.
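The scheduling rules above (the run sequence plus the self-dependency of MergeInto) can be sketched as a small readiness check. This is only an illustration: the status values `"SUCCESS"` and `"FAILED"`, the `status` dictionary, and the function `can_run` are hypothetical and are not part of DataWorks. The sketch also does not model the first cycle, in which no previous MergeInto instance exists.

```python
from datetime import date, timedelta
from typing import Dict, Tuple

# Order in which the three instances run after midnight, as described above.
RUN_ORDER = ["CheckStreamXDone", "FullDataInitSuccessCheckDone", "MergeInto"]


def can_run(subtask: str, day: date, status: Dict[Tuple[str, date], str]) -> bool:
    """Return True if every instance that `subtask` waits for has succeeded."""
    idx = RUN_ORDER.index(subtask)
    if idx > 0:
        # Same-cycle dependency on the preceding subtask in the sequence.
        if status.get((RUN_ORDER[idx - 1], day)) != "SUCCESS":
            return False
    if subtask == "MergeInto":
        # Self-dependency: the MergeInto instance of the previous cycle must have succeeded.
        if status.get(("MergeInto", day - timedelta(days=1))) != "SUCCESS":
            return False
    return True


today = date(2023, 1, 6)
yesterday = today - timedelta(days=1)
status = {
    ("CheckStreamXDone", today): "SUCCESS",
    ("FullDataInitSuccessCheckDone", today): "SUCCESS",
    ("MergeInto", yesterday): "FAILED",
}

# Blocked: the instance of the previous cycle failed.
print(can_run("MergeInto", today, status))

# After the previous instance is rerun successfully, the current one can start.
status[("MergeInto", yesterday)] = "SUCCESS"
print(can_run("MergeInto", today, status))
```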
Running of the merge subtask
When the merge subtask is run, it generates a separate SQL subtask for each source table. Each SQL subtask reads the incremental data of its source table from the incremental change logs in the log table, and merges the incremental data with the full data in the previous day's partition of the base table to generate the full partition data of the current day.
SQL subtasks generated by the merge subtask are run in parallel. A failed SQL subtask is automatically rerun if it meets the rerun conditions. If an SQL subtask still fails after the reruns, the merge subtask fails.
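The behavior described above (parallel SQL subtasks, automatic reruns, and failure of the merge subtask if any SQL subtask finally fails) can be sketched with a thread pool. This is a simplified model, not the actual implementation: the table names, the `max_attempts` limit, and the `concurrency` argument are assumptions, and the real rerun conditions are not modeled.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List


def run_with_rerun(job: Callable[[], None], max_attempts: int) -> bool:
    """Run one SQL job, rerunning it after a failure, up to max_attempts runs in total."""
    for _ in range(max_attempts):
        try:
            job()
            return True
        except Exception:
            continue  # a real system would check the rerun conditions here
    return False


def run_merge(jobs: Dict[str, Callable[[], None]],
              concurrency: int = 4, max_attempts: int = 3) -> List[str]:
    """Run all per-table jobs in parallel and return the tables that finally failed."""
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        futures = {table: pool.submit(run_with_rerun, job, max_attempts)
                   for table, job in jobs.items()}
        return sorted(table for table, fut in futures.items() if not fut.result())


attempts = {"orders": 0}


def flaky() -> None:
    """Fails on the first run and succeeds on the rerun."""
    attempts["orders"] += 1
    if attempts["orders"] < 2:
        raise RuntimeError("transient failure")


def broken() -> None:
    """Fails on every run."""
    raise RuntimeError("permanent failure")


failed = run_merge({"orders": flaky, "users": lambda: None, "items": broken})
print("merge subtask failed" if failed else "merge subtask succeeded", failed)
```

In this model, lowering `concurrency` reduces how many jobs are in flight at once, at the cost of a longer total run time.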
Troubleshoot common issues of a merge subtask
A merge subtask fails to run
Base table partition not exists.
Cause: Full data in partitions of the previous day is not generated, which rarely occurs. Possible causes for the error:
You create a data backfill instance for the merge subtask to manually backfill data for the subtask. However, the instance generated for the merge subtask is not run on the previous day. You must make sure that the instance generated for the merge subtask is successfully run on the previous day.
Some data fails to be synchronized during the full synchronization that is performed when you run the synchronization task for the first time or add source tables to the synchronization task. If full data of a source table fails to be synchronized, you can remove the table from the synchronization task and re-add the table to the synchronization task to synchronize full data from the table again.
If the error is reported due to other causes, unsupported scenarios may be triggered. In this case, contact on-duty engineers for further troubleshooting.
Run job failed,instance:XXXX.
Cause: An ODPS SQL subtask failed to run. If you search for the subtask by instance ID, an error log similar to the following is displayed:
Instance: XXX, Status: FAILED result: ODPS-0110061: Failed to run ddltask - Persist ddl plans failed. , Logview: http://Logview.odps.aliyun.com/Logview/?h=http://service.ap-southeast-1.maxcompute.aliyun-inc.com/api&p=sgods&i=20220807101011355goyu43wa&token=NFBwc2tzaEpJNGF0OVFINmJuREZrem1OamQ4PSxPRFBTX09CTzo1OTMwMzI1NTY1MTk1MzAzLDE2NjAxMjYyMTEseyJTdGF0ZW1lbnQiOlt7IkFjdGlvbiI6WyJvZHBzOlJlYWQiXSwiRWZmZWN0IjoiQWxsb3ciLCJSZXNvdXJjZSI6WyJhY3M6b2RwczoqOnByb2plY3RzL3Nnb2RzL2luc3RhbmNlcy8yMDIyMDgwNzEwMTAxMTM1NWdveXU0M3dhIl19XSwiVmVyc2lvbiI6IjEifQ== ]
In most cases, an error message in the ODPS-XXXXXXX format indicates that an internal ODPS execution error occurs. You can refer to SQL errors (ODPS-01CCCCX) to look up the detailed information about the error specific to your case and its solution. If the documentation cannot resolve your issue or you have other questions, contact MaxCompute technical support.
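When many failures need to be triaged, the error code and the Logview link can be pulled out of a log line such as the one above with a short script. The following Python sketch assumes that the error code always has the form `ODPS-` followed by seven digits and that the `i` query parameter of the Logview URL carries the instance ID, as it appears to in the sample. The function name `parse_failure` is hypothetical, and the sample line omits the `token` parameter for brevity.

```python
import re
from typing import Dict, Optional
from urllib.parse import parse_qs, urlsplit

ERROR_CODE = re.compile(r"ODPS-\d{7}")
LOGVIEW_URL = re.compile(r"Logview:\s*(\S+)")


def parse_failure(log_line: str) -> Dict[str, Optional[str]]:
    """Extract the ODPS error code, Logview URL, and instance ID from a log line."""
    code = ERROR_CODE.search(log_line)
    url = LOGVIEW_URL.search(log_line)
    instance_id = None
    if url:
        # Assumption: the `i` query parameter holds the instance ID.
        query = parse_qs(urlsplit(url.group(1)).query)
        instance_id = query.get("i", [None])[0]
    return {
        "error_code": code.group(0) if code else None,
        "logview_url": url.group(1) if url else None,
        "instance_id": instance_id,
    }


sample = (
    "Instance: XXX, Status: FAILED result: ODPS-0110061: Failed to run ddltask - "
    "Persist ddl plans failed. , Logview: http://Logview.odps.aliyun.com/Logview/"
    "?h=http://service.ap-southeast-1.maxcompute.aliyun-inc.com/api&p=sgods"
    "&i=20220807101011355goyu43wa ]"
)

info = parse_failure(sample)
print(info["error_code"], info["instance_id"])
```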
Request rejected by flow control. You have exceeded the limit for the number of tasks you can run concurrently in this project. Please try later.
Cause: An excessively large number of SQL subtasks that are generated by the merge subtask are run at the same time. This triggers throttling of MaxCompute on the project that contains the table to which the merge subtask writes data.
Solution:
Reduce the number of parallel threads for the merge subtask by modifying the concurrency parameter.
Change the scheduling settings of the merge subtask to space out the points in time when the SQL subtasks generated by the merge subtask are run to prevent high parallelism.
Contact MaxCompute technical support to resolve the issue.
Partition data is not generated
Cause: The instance generated for the merge subtask in the current cycle fails to run or does not finish running. In this case, go to the Cycle Instance page in Operation Center to check the details of the instance generated for the merge subtask in the current cycle:
If the status of the instance is Running, wait until the instance finishes running on the current day.
If the status of the instance is Failed, view the run logs of the merge subtask to identify the cause of the failure. After you fix the issue, right-click the instance in its directed acyclic graph (DAG) and select Rerun to rerun the instance.
If the status of the instance is Not Run, troubleshoot the issue based on the following scenarios:
Check whether the CheckStreamXDone subtask finishes running. If the CheckStreamXDone subtask is still running, check whether latency occurs on the subtask. If latency occurs, resolve the latency first. Then, the merge subtask can be triggered to run. You can view logs of the CheckStreamXDone subtask to check whether the latency occurs. Sample logs:
2023-01-06 00:15:04,692 INFO [DwcheckStreamXDoneNode.java:168] - Time for the current offset of data: 1672921729000
2023-01-06 00:15:04,692 WARN [DwcheckStreamXDoneNode.java:183] - Rerun the subtask.
2023-01-06 00:20:04,873 INFO [DwcheckStreamXDoneNode.java:168] - Time for the current offset of data: 1672921729000
2023-01-06 00:20:04,873 WARN [DwcheckStreamXDoneNode.java:183] - Rerun the subtask.
The instance generated for the merge subtask in one of previous scheduling cycles does not finish running or fails to run. In this case, you can go to Operation Center to view the details of the ancestor instances on which the instance that is scheduled to run on the current day depends, and find the latest ancestor instance that does not finish running or fails to run.
If the ancestor instance fails to run, view the run logs of the instance to identify the cause of the failure. After you fix the issue, right-click the instance in its DAG and select Rerun to rerun the instance.
If the ancestor instance does not finish running, you can check whether the instance generated for the CheckStreamXDone subtask finishes running and whether latency occurs on the CheckStreamXDone subtask.
Issues are identified in the current data synchronization task, and then you recreate or rerun the one-click data synchronization task for synchronizing data from a database. As a result, newly generated instances for the merge subtask cannot be run because the self-dependency is configured for the merge subtask. To solve this issue, find the first instance that is generated for the merge subtask after you rerun the current synchronization task, and right-click the instance and choose Emergency Operations > Delete Dependencies to remove the dependency of the current instance on other ancestor instances generated for the merge subtask. This way, the newly generated instances for the merge subtask can be triggered to run.
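To check for latency on the CheckStreamXDone subtask, the offset values in its sample logs shown earlier can be compared across consecutive checks. The following Python sketch extracts the offsets and flags the case in which the offset does not advance between two checks. Interpreting the offset as epoch milliseconds is an assumption based on the magnitude of the value, and the helper names are hypothetical.

```python
import re
from datetime import datetime, timezone
from typing import List

OFFSET_LINE = re.compile(r"Time for the current offset of data: (\d+)")


def offsets_ms(log_text: str) -> List[int]:
    """Return the offset values in the order in which they were logged."""
    return [int(value) for value in OFFSET_LINE.findall(log_text)]


def is_stalled(offsets: List[int]) -> bool:
    """True if at least two checks exist and the offset did not advance between the last two."""
    return len(offsets) >= 2 and offsets[-1] <= offsets[-2]


def to_utc(offset_ms: int) -> datetime:
    """Convert an offset to a UTC datetime, assuming epoch milliseconds."""
    return datetime.fromtimestamp(offset_ms / 1000, tz=timezone.utc)


log_text = """\
2023-01-06 00:15:04,692 INFO [DwcheckStreamXDoneNode.java:168] - Time for the current offset of data: 1672921729000
2023-01-06 00:15:04,692 WARN [DwcheckStreamXDoneNode.java:183] - Rerun the subtask.
2023-01-06 00:20:04,873 INFO [DwcheckStreamXDoneNode.java:168] - Time for the current offset of data: 1672921729000
2023-01-06 00:20:04,873 WARN [DwcheckStreamXDoneNode.java:183] - Rerun the subtask.
"""

offsets = offsets_ms(log_text)
print("stalled:", is_stalled(offsets))
print("latest offset (UTC):", to_utc(offsets[-1]).isoformat())
```

An offset that stays the same across checks while the subtask keeps rerunning suggests that the real-time synchronization is not advancing, so the latency must be resolved before the merge subtask can be triggered.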
The merge subtask takes a long time to complete
You can go to Operation Center to view the run logs of the running SQL subtasks that are generated by the merge subtask. Run logs of the SQL subtasks:
2022-08-07 18:10:58,919 INFO [LogUtils.java:20] - Wait instance 20220807101058817gbb6ghx5 to finish...
2022-08-07 18:10:58,938 INFO [LogUtils.java:20] - Wait instance 20220807101058818g46v43wa to finish...
To view the logs of an SQL subtask in the LogView portal, search the run logs of the SQL subtask in Operation Center by instance ID, such as 20220807101058817gbb6ghx5. The LogView portal is then displayed, in which you can view the running details of the SQL subtask. Possible causes for slow running of SQL subtasks:
The amount of data in the base table is large, and an excessively large number of mappers and reducers are started to process the data in the base table. To resolve this issue, adjust the settings of the related MaxCompute project-level parameters.
An excessively large number of SQL subtasks are started. As a result, resources are insufficient when the SQL subtasks are committed. In the LogView portal, you can view that the SQL subtasks are in the waiting state. In this case, you must resolve the resource insufficiency issue. You can contact on-duty MaxCompute engineers for support if necessary.
If you cannot locate the preceding logs, an ODPS SQL subtask may be stuck when it is committed. To solve this issue, you can find and view the logs of the last ODPS SQL subtask in the LogView portal or contact MaxCompute on-duty engineers to analyze the logs.
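The instance IDs of the SQL subtasks that the merge subtask is waiting for can be collected from run logs like the ones shown above. The following Python sketch assumes that each relevant line has the form `Wait instance <id> to finish...`, as in the sample; the function name `waiting_instances` is hypothetical.

```python
import re
from typing import List

WAIT_LINE = re.compile(r"Wait instance (\S+) to finish")


def waiting_instances(run_log: str) -> List[str]:
    """Return the distinct instance IDs from the run log, in order of first appearance."""
    return list(dict.fromkeys(WAIT_LINE.findall(run_log)))


run_log = """\
2022-08-07 18:10:58,919 INFO [LogUtils.java:20] - Wait instance 20220807101058817gbb6ghx5 to finish...
2022-08-07 18:10:58,938 INFO [LogUtils.java:20] - Wait instance 20220807101058818g46v43wa to finish...
"""

for instance_id in waiting_instances(run_log):
    print(instance_id)
```

Each printed ID can then be used to look up the corresponding SQL subtask in the LogView portal.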
MaxCompute resources are used up by the merge subtask
You can specify the number of parallel threads for the merge subtask to control the number of ODPS SQL subtasks that can be run at the same time. Open the merge subtask in DataStudio. On the configuration tab of the merge subtask, add the concurrency parameter in the code editor to control the parallelism of the merge subtask. After you modify the subtask configurations, commit and deploy the subtask again. The following figure provides an example.
Note: This is a subtask-level modification. The modification is overwritten when a table is added to or removed from the synchronization task or when the one-click synchronization task is rerun. In this case, you must add the concurrency parameter for the merge subtask again and commit and deploy the subtask.
Change the scheduling settings of the merge subtask. Open the merge subtask in DataStudio. On the configuration tab of the merge subtask, change the scheduling time of the merge subtask to space out the points in time when the SQL subtasks generated by the merge subtask are run to prevent high parallelism.