You can backfill data of a historical or future period of time for an auto triggered task to write the data to time-based partitions. Scheduling parameters that are used in the task code are automatically replaced with specific values based on the data timestamp that you configure to backfill data for the task. The data that corresponds to the data timestamp is written to specific partitions based on the business code. The partitions to which the data is written are related to the logic and content of the task code. This topic describes how to backfill data for an auto triggered task and manage data backfill instances generated for the task on the Data Backfill page.
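The parameter replacement described above can be pictured with a minimal sketch. It assumes that the task code uses a placeholder written as `${bizdate}` and that the data timestamp is rendered in `yyyymmdd` format. The table names and the helper `render_task_code` are hypothetical and serve only as an illustration.

```python
from datetime import date


def render_task_code(code: str, data_timestamp: date) -> str:
    """Replace the assumed ${bizdate} placeholder with the data timestamp (yyyymmdd)."""
    return code.replace("${bizdate}", data_timestamp.strftime("%Y%m%d"))


# Hypothetical task code that writes to a time-based partition.
sql = "INSERT OVERWRITE TABLE sales PARTITION (ds='${bizdate}') SELECT * FROM staging;"

# Backfilling for the data timestamp March 17, 2024 targets the matching partition.
print(render_task_code(sql, date(2024, 3, 17)))
```

Which partition receives the data still depends on how the task code uses the parameter, as noted above.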
Limits
Instance cleanup principles
Data backfill instances cannot be manually deleted. The system deletes data backfill instances after their validity period elapses. The validity period of data backfill instances is approximately 30 days. If you do not need to use a data backfill instance, you can freeze it.
Instances that run on the shared resource group for scheduling are retained for one month (30 days), and logs for the instances are retained for one week (7 days).
Instances that run on exclusive resource groups for scheduling or serverless resource groups are retained for one month (30 days), and logs for the instances are also retained for one month (30 days).
The system regularly clears excess run logs every day when the size of run logs generated for all the auto triggered task instances that finish running exceeds 3 MB.
Limits on permissions
For root tasks or their descendant tasks for which you want to backfill data, if you do not have required permissions on the workspaces to which the tasks belong, you cannot backfill data for these tasks. If a task in a workspace is an intermediate task for data backfilling, a dry-run is performed for the task to ensure that descendant tasks of the task can be run as expected. A dry-run directly returns success and does not generate data. However, an exception may occur on the data output of the descendant tasks. Proceed with caution. If data needs to be backfilled for both ancestor tasks and descendant tasks of a task, the task is considered an intermediate task.
Precautions
Instance running
When DataWorks backfills data of a specified time range for a task, if an instance generated for the task fails on a day, the status of the other data backfill instances of the task for that day is also set to failed. In this case, DataWorks does not run the instances generated for this task on the next day. DataWorks runs the instances generated for a task on a day only after all instances generated for the task on the previous day are successfully run.
If you backfill data of a specific day for a task scheduled by hour or minute, whether all instances generated to run on that day for the task are run in parallel depends on whether you configure the self-dependency for the task.
If both an auto triggered task instance and a data backfill instance are triggered to run for a task, you must stop the data backfill instance to ensure that the auto triggered task instance can be run as expected.
You can add tasks that do not require data backfilling to a blacklist. If a task in the blacklist is an intermediate task for the data backfill operation, a dry-run is performed for the task to ensure that descendant tasks of the task can be run as expected. A dry-run directly returns success and does not generate data. However, an exception may occur on the data output of the descendant tasks.
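The day-by-day rule in this section (instances for a day run only after all instances for the previous day succeed) can be sketched as a simple loop. This is an illustration under stated assumptions, not the scheduler's implementation: `run_day` is a hypothetical callback that returns `True` when every instance for that day succeeds, and the status labels are placeholders.

```python
from datetime import date
from typing import Callable, Dict, List


def run_backfill_in_sequence(
    days: List[date], run_day: Callable[[date], bool]
) -> Dict[date, str]:
    """Run days in ascending order and stop scheduling after the first failed day."""
    statuses: Dict[date, str] = {}
    blocked = False
    for day in sorted(days):
        if blocked:
            # A previous day failed, so this day's instances are not run.
            statuses[day] = "not run"
            continue
        succeeded = run_day(day)
        statuses[day] = "succeeded" if succeeded else "failed"
        blocked = not succeeded
    return statuses


# Hypothetical range in which the instances for January 12 fail.
days = [date(2024, 1, 11), date(2024, 1, 12), date(2024, 1, 13)]
result = run_backfill_in_sequence(days, lambda d: d != date(2024, 1, 12))
```

In this sketch, January 11 succeeds, January 12 fails, and January 13 is never started.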
Scheduling resources
If a large number of data backfill instances are run or a high data backfilling parallelism is configured, scheduling resources may be insufficient. Make sure that your configurations meet your business requirements.
To prevent data backfill instances from occupying large amounts of resources and affecting the running of auto triggered task instances, you must abide by the following rules that are formulated for data backfill instances:
If you backfill data for a task whose data timestamp is the previous day, the priority of a data backfill task created for the task is determined by the priority of the baseline to which the task belongs.
If you backfill data for a task whose data timestamp is the day before the previous day, you must abide by the following rules to downgrade the priority of the task:
If the priority of the task is 7 or 8, downgrade the priority of the task to 3.
If the priority of the task is 3 or 5, downgrade the priority of the task to 2.
If the priority of the task is 1, keep the priority unchanged.
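The downgrade rules listed above amount to a small lookup table. The sketch below encodes only the priorities that the list names. The function name is hypothetical, and how other priority values are handled is not stated in this topic, so the sketch rejects them.

```python
# Mapping taken from the rules above: original priority -> downgraded priority.
DOWNGRADE_RULES = {8: 3, 7: 3, 5: 2, 3: 2, 1: 1}


def downgraded_priority(priority: int) -> int:
    """Return the priority used when the data timestamp is the day before the previous day."""
    if priority not in DOWNGRADE_RULES:
        # Priorities that the rules do not mention are rejected rather than guessed.
        raise ValueError(f"no downgrade rule listed for priority {priority}")
    return DOWNGRADE_RULES[priority]
```

For a data timestamp of the previous day, no downgrade applies, and the priority of the baseline to which the task belongs is used instead.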
Create a data backfill task
Step 1: Go to the Data Backfill page
Go to the Operation Center page.
Log on to the DataWorks console. In the top navigation bar, select the desired region. In the left-side navigation pane, choose . On the page that appears, select the desired workspace from the drop-down list and click Go to Operation Center.
In the left-side navigation pane, choose
.
If you want to backfill data for a single auto triggered task, you can also perform the following operations: In the left-side navigation pane of the Operation Center page, choose
. On the page that appears, find the desired auto triggered task and click Backfill Data in the Actions column.
Step 2: Create a data backfill task
On the Data Backfill page, click Create Data Backfill Task and configure parameters based on your business requirements.
Configure parameters in the Basic Information section.
DataWorks automatically generates a data backfill task name. You can change the name based on your business requirements.
Configure parameters in the Tasks That Require Data Backfill section.
You can backfill data for tasks on which you have required permissions by using one of the following methods: Manually Select, Select by Link, Select by Workspace, and Specify Task and All Descendant Tasks. You can also select other tasks for which you want to backfill data based on the tasks. The parameters that you can configure vary based on the method that you select.
Manually Select
Select one or more tasks as root tasks. This way, you can manually select specific descendant tasks of the root tasks for which you want to backfill data.
Note: The original plans of backfilling data for the current task, backfilling data for the current task and its descendant tasks, and backfilling data in advanced mode are compatible with this method.
You can select up to 500 root tasks and up to 2,000 total tasks. The total tasks consist of root tasks and their descendant tasks.
The following table describes the parameters.
Parameter
Description
Task Selection Method
Select Manually Select.
Add Root Tasks
You can search for and add a root task by task name or ID. You can also click Batch Add and specify specific conditions such as resource group, scheduling cycle, and workspace, to add multiple root tasks at a time.
Note: You can select only tasks in the workspaces to which you are added as a member.
Selected Root Tasks
The tasks for which you want to backfill data. The list displays the added root tasks. You can select descendant tasks for which you want to backfill data based on the root tasks.
Note: You can filter descendant tasks based on dependency levels. Direct descendant tasks of root tasks are listed at Level 1.
You can select up to 500 root tasks and up to 2,000 total tasks.
Task Blacklist
If you do not need to backfill data for a task, you can add the task to a blacklist. Data is not backfilled for tasks that are added to the blacklist.
Note: You can add only root tasks to the blacklist. If data does not need to be backfilled for descendant tasks of root tasks, remove the descendant tasks from the Selected Root Tasks list.
If a task in the blacklist is an intermediate task for the data backfill operation, a dry-run is performed for the task to ensure that descendant tasks of the task can be run as expected. A dry-run directly returns success and does not generate data. However, an exception may occur on the data output of the descendant tasks.
Select by Link
Select a start task as the root task and one or more end tasks. Then, the system automatically determines that all tasks from the start task to the end task require data backfilling.
The following table describes the parameters.
Parameter
Description
Task Selection Method
Select Select by Link.
Select Tasks
Enter a task name or task ID to search for and add a start task and use the same method to add one or more end tasks. Then, the system identifies intermediate tasks based on the start task and end tasks. An intermediate task serves as a direct or indirect descendant task of a start task and serves as a direct or indirect ancestor task of an end task.
Intermediate Tasks
The list of intermediate tasks that are automatically identified by the system based on the start task and end tasks.
Note: The list can display up to 2,000 tasks. Extra tasks are not displayed in the list, but data is backfilled for all the tasks as expected.
Task Blacklist
If you do not need to backfill data for a task, you can add the task to a blacklist. Data is not backfilled for tasks that are added to the blacklist.
Note: If a task in the blacklist is an intermediate task for the data backfill operation, a dry-run is performed for the task to ensure that descendant tasks of the task can be run as expected. A dry-run directly returns success and does not generate data. However, an exception may occur on the data output of the descendant tasks.
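The intermediate-task rule for the Select by Link method (a direct or indirect descendant of the start task that is also a direct or indirect ancestor of an end task) can be sketched as a set intersection on a dependency graph. The graph representation, the task names, and the helper functions below are assumptions made for illustration. They do not describe how the system is implemented.

```python
from typing import Dict, Iterable, List, Set


def reachable(graph: Dict[str, List[str]], start: str) -> Set[str]:
    """Return every node that can be reached from start by following edges."""
    seen: Set[str] = set()
    stack = [start]
    while stack:
        node = stack.pop()
        for nxt in graph.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def intermediate_tasks(
    children: Dict[str, List[str]], start: str, ends: Iterable[str]
) -> Set[str]:
    """Tasks that are descendants of the start task and ancestors of an end task."""
    # Invert the child edges so that ancestors can be found with the same traversal.
    parents: Dict[str, List[str]] = {}
    for parent, kids in children.items():
        for kid in kids:
            parents.setdefault(kid, []).append(parent)
    descendants = reachable(children, start)
    ancestors: Set[str] = set()
    for end in ends:
        ancestors |= reachable(parents, end)
    return descendants & ancestors


# Hypothetical dependency graph: task -> direct descendant tasks.
dag = {"a": ["b", "c"], "b": ["d"], "c": ["e"], "d": ["f"]}
found = intermediate_tasks(dag, "a", ["d"])
```

In this example, only task `b` lies on a path from start task `a` to end task `d`, so it is the only intermediate task.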
Select by Workspace
Select a task as the root task, and determine the tasks for which you want to backfill data based on the workspaces to which descendant tasks of the root task belong.
Note: The original plan of backfilling data for massive nodes is compatible with this method.
You cannot configure a task blacklist.
The following table describes the parameters.
Parameter
Description
Task Selection Method
Select Select by Workspace.
Add Root Tasks
You can search for and add a root task by task name or ID. Data is backfilled for tasks in the workspaces to which descendant tasks of the root task belong.
Note: You can select only tasks in the workspaces to which you are added as a member.
Include Root Node
Specifies whether to backfill data for the root task.
Workspaces for Data Backfill
Select the workspaces in which tasks require data backfilling based on the workspaces to which descendant tasks of the root task belong.
Note: You can select only workspaces that reside in the current region.
After you select a workspace, data will be backfilled for all tasks in the workspace by default. You can specify a custom task blacklist or whitelist based on your business requirements.
Add to Whitelist
Add other tasks for which you want to backfill data to the whitelist, in addition to the tasks in the workspaces that you selected.
Task Blacklist
Add the tasks that do not require data backfilling in the selected workspaces to the blacklist.
Specify Task and All Descendant Tasks
Select a root task. Then, the system automatically determines that the root task and all its descendant tasks require data backfilling.
Important: You can view the tasks that are triggered to run only while the data backfill task is running. Proceed with caution.
The following table describes the parameters.
Parameter
Description
Task Selection Method
Select Specify Task and All Descendant Tasks.
Add Root Task
You can search for and add a root task by task name or ID. Data will be backfilled for the selected root task and all its descendant tasks.
Note: You can select only tasks in the workspaces to which you are added as a member.
If no task depends on the selected root task, data is backfilled for only the root task after you submit the data backfill task.
Task Blacklist
If you do not need to backfill data for a task, you can add the task to a blacklist. Data is not backfilled for tasks that are added to the blacklist.
Note: If a task in the blacklist is an intermediate task for the data backfill operation, a dry-run is performed for the task to ensure that descendant tasks of the task can be run as expected. A dry-run directly returns success and does not generate data. However, an exception may occur on the data output of the descendant tasks.
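How the blacklist interacts with the Specify Task and All Descendant Tasks method can be sketched as follows. The sketch assumes that a blacklisted task counts as an intermediate task when data is backfilled for at least one of its ancestors and at least one of its descendants, which follows the definition given earlier in this topic. The labels `run`, `dry-run`, and `skip`, the task names, and the function names are hypothetical.

```python
from typing import Dict, List, Set


def reachable(graph: Dict[str, List[str]], start: str) -> Set[str]:
    """Return every node that can be reached from start by following edges."""
    seen: Set[str] = set()
    stack = [start]
    while stack:
        node = stack.pop()
        for nxt in graph.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def plan_backfill(
    children: Dict[str, List[str]], root: str, blacklist: Set[str]
) -> Dict[str, str]:
    """Classify the root task and all its descendants as run, dry-run, or skip."""
    parents: Dict[str, List[str]] = {}
    for parent, kids in children.items():
        for kid in kids:
            parents.setdefault(kid, []).append(parent)
    selected = {root} | reachable(children, root)
    to_run = selected - blacklist
    plan: Dict[str, str] = {}
    for task in sorted(selected):
        if task in to_run:
            plan[task] = "run"
        elif reachable(children, task) & to_run and reachable(parents, task) & to_run:
            # Blacklisted intermediate task: returns success without generating data.
            plan[task] = "dry-run"
        else:
            plan[task] = "skip"
    return plan


# Hypothetical chain of tasks: root -> clean -> report -> archive.
dag = {"root": ["clean"], "clean": ["report"], "report": ["archive"]}
plan = plan_backfill(dag, "root", {"clean", "archive"})
```

Here `clean` is dry-run because `report` still has to run after it, while `archive` has no descendants and is skipped. Because the dry-run of `clean` generates no data, the output of `report` may be affected, which is the risk that the note above describes.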
Configure parameters in the Data Backfill Policy section.
Configure information, including the running time of the data backfill task, whether to allow parallelism, whether to trigger an alert, and the resource group to be used, based on your business requirements.
The following table describes the parameters.
Parameter
Description
Data Timestamp
Specifies the data timestamp of data to be backfilled for selected tasks. The value of this parameter is accurate to the day.
If you want to backfill data of multiple non-consecutive time ranges, click Add Multiple Data Timestamp Ranges to specify multiple data timestamps.
If the specified data timestamp is later than the current date, you can select Run Retroactive Instances Scheduled to Run after the Current Time. The system immediately runs the data backfill instance after the data timestamp elapses.
For example, if the current date is March 12, 2024, the data timestamp is March 17, 2024, and you select Run Retroactive Instances Scheduled to Run after the Current Time, the system runs the data backfill instance on March 18, 2024.
Note: In batch processing, the most common scenario is to process data that was generated on the previous day on the current day. The previous day is the data timestamp. In the process of backfilling data for a task, DataWorks generates instances for the task based on the data timestamp that you selected. This way, you can backtrack the data at the specified time.
We recommend that you do not set this parameter to a long time range. Otherwise, data backfill instances may be delayed due to insufficient resources.
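The relationship between a data timestamp and the day on which its instance runs can be sketched with simple date arithmetic. The sketch assumes a task scheduled by day, for which the instance runs on the day after the data timestamp, as in the March 17 and March 18 example above. The function names are hypothetical.

```python
from datetime import date, timedelta


def expected_run_date(data_timestamp: date) -> date:
    """Assumed rule for a daily task: the instance runs on the day after the data timestamp."""
    return data_timestamp + timedelta(days=1)


def is_future_data_timestamp(data_timestamp: date, today: date) -> bool:
    """True if the data timestamp is later than the current date."""
    return data_timestamp > today


today = date(2024, 3, 12)
data_timestamp = date(2024, 3, 17)
run_on = expected_run_date(data_timestamp)
```

With the dates from the example, `is_future_data_timestamp` returns `True` and `run_on` is March 18, 2024.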
Time Range
Specifies the time period during which the selected tasks need to be run. Instances whose scheduling time is within the time period can be generated and run. You can configure this parameter to allow tasks that are scheduled by hour or minute to backfill only data in the specified time period. Default value: 00:00 to 23:59.
Note: Instances whose scheduling time is not within the time period are not generated. For example, if tasks scheduled by day depend on tasks scheduled by hour, isolated task instances may be generated, and task running is blocked.
We recommend that you modify this parameter only if data that is within a specific time period needs to be backfilled for tasks that are scheduled by hour or minute.
Parallelism
If you want to backfill data of multiple data timestamps for a task, you can set this parameter to Yes and specify the number of groups. Valid values:
Yes: The system will generate data backfill instances based on the specified number of groups and run the data backfill instances for different data timestamps in parallel.
No: Data backfill instances are run in sequence based on the data timestamps.
Note: If you backfill data of a specific day for a task scheduled by hour or minute, whether instances for the task are run in parallel depends on whether you configure the self-dependency for the task.
The number of groups that you can specify ranges from 2 to 10. The following rules apply when multiple data backfill instances are run in parallel:
If the number of data timestamps is less than the number of groups, all the data backfill instances are run in parallel.
For example, the data timestamps are January 11 to January 13, and you set the number of groups to 4. In this case, a data backfill instance is generated for each of the three data timestamps. The three data backfill instances are run in parallel.
If the number of data timestamps is greater than the number of groups, the system runs some tasks in sequence and the other tasks in parallel based on the data timestamps.
For example, the data timestamps are January 11 to January 13, and you set the number of groups to 2. In this case, two data backfill instances are generated and run in parallel. One of the data backfill instances has two data timestamps, and the tasks that correspond to the two data timestamps are run in sequence.
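The grouping rules for parallel data backfill can be sketched as a split of the data timestamps into at most the specified number of groups. The sketch assumes that consecutive data timestamps are placed in the same group and that larger groups come first. This topic does not state how the system distributes data timestamps among groups, so treat the split as one possible reading. The function name and the year in the example are hypothetical.

```python
from datetime import date
from typing import List


def split_into_groups(timestamps: List[date], groups: int) -> List[List[date]]:
    """Split data timestamps into groups; groups run in parallel, members in sequence."""
    if not 2 <= groups <= 10:
        raise ValueError("the number of groups must be from 2 to 10")
    ordered = sorted(timestamps)
    # Never create more groups than there are data timestamps.
    count = min(groups, len(ordered))
    if count == 0:
        return []
    base, extra = divmod(len(ordered), count)
    result: List[List[date]] = []
    start = 0
    for index in range(count):
        size = base + (1 if index < extra else 0)
        result.append(ordered[start:start + size])
        start += size
    return result


# Data timestamps January 11 to January 13 (the year is a placeholder).
january = [date(2024, 1, 11), date(2024, 1, 12), date(2024, 1, 13)]
four_groups = split_into_groups(january, 4)
two_groups = split_into_groups(january, 2)
```

With four groups, each data timestamp gets its own instance and all three run in parallel. With two groups, one instance covers two data timestamps that run in sequence, which matches the two examples above.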
Alert for Data Backfill
Specifies whether to enable the alerting feature for data backfill.
Yes: An alert is generated for data backfill if the trigger condition is met.
No: The alerting feature is disabled for data backfill.
Trigger Condition
The trigger condition of an alert for data backfill. Valid values:
Alert on Failure or Success: An alert is generated regardless of whether data backfill is successful or fails.
Alert on Success: An alert is generated if data backfill is successful.
Alert on Failure: An alert is generated if data backfill fails.
Note: This parameter is required only if you select Yes for the Alert for Data Backfill parameter.
Alert Notification Method
The notification method for an alert. The alert recipient is the initiator of the data backfill operation. Valid values: Text Message and Email; Text Message; Email.
Note: This parameter is required only if you select Yes for the Alert for Data Backfill parameter.
You can click Check Contact Information to check whether the mobile phone number or email address of the alert recipient is registered. If not, you can refer to Configure and view alert contacts to configure the information.
Order
The sequence based on which data backfill instances are run. Valid values: Ascending by Business Date and Descending by Business Date.
Resource Group for Scheduling
Specifies whether to select another resource group for scheduling to run a data backfill instance.
Follow Task Configuration: The resource group for scheduling that is configured for the current auto triggered task is used to run the data backfill instance.
Specify Resource Group for Scheduling: Select a resource group for scheduling to run the data backfill instance. This prevents the data backfill instance from competing for resources with auto triggered task instances.
Note: Make sure that network connections are established for the resource group. Otherwise, tasks may fail to run. If the specified resource group is not associated with the desired workspace, the resource group that is used to run the auto triggered task is used.
Execution Period
Specifies the time period during which data is backfilled. Valid values:
Follow Task Configuration: Data is backfilled when the scheduling time of data backfill instances arrives.
Specify Time Period: Data is backfilled within a specified time period. Specify a time period based on the number of tasks for which you want to backfill data.
Note: Data is not backfilled for the tasks that are in the Not Run state when the time period elapses. Data is continuously backfilled for the tasks that are in the Running state when the time period elapses.
Configure parameters in the Data Backfill Task Verification section.
Configure the Terminate Task Running upon Verification Failure parameter to determine whether to terminate task running if the data backfill task verification fails. The system checks the basic information about the data backfill task and also checks potential risk items.
Basic information: Check the number of tasks involved in the data backfill operation, the number of generated instances, whether a task dependency loop is formed, whether task isolation occurs, and whether you have required permissions on workspaces.
Risk items: Check whether a task dependency loop is formed or whether task isolation occurs. If the risk detection fails, a task running exception will occur. You can enable the system to terminate the data backfill task when the risk detection fails.
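One of the risk checks above, whether a task dependency loop is formed, can be illustrated with a standard topological-sort check (Kahn's algorithm). This is a generic sketch of loop detection on a dependency graph, not the verification logic of the product. The graph representation and the function name are assumptions.

```python
from collections import deque
from typing import Dict, List


def has_dependency_loop(children: Dict[str, List[str]]) -> bool:
    """Return True if the task graph (task -> direct descendants) contains a cycle."""
    indegree: Dict[str, int] = {}
    for parent, kids in children.items():
        indegree.setdefault(parent, 0)
        for kid in kids:
            indegree[kid] = indegree.get(kid, 0) + 1

    # Start from tasks without ancestors and remove them layer by layer.
    queue = deque(task for task, degree in indegree.items() if degree == 0)
    visited = 0
    while queue:
        task = queue.popleft()
        visited += 1
        for kid in children.get(task, []):
            indegree[kid] -= 1
            if indegree[kid] == 0:
                queue.append(kid)

    # Tasks that were never reached are part of, or downstream of, a loop.
    return visited != len(indegree)
```

A graph such as `{"a": ["b"], "b": ["a"]}` is reported as a loop, while a simple chain is not.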
Click Submit. A data backfill task is created.
Step 3: Run the data backfill task
When the scheduling time of the data backfill task arrives and no exception occurs, the data backfill task is automatically triggered to run.
A data backfill task cannot be run if one of the following conditions is met:
The verification feature is enabled for the data backfill task and the verification fails. For more information, see the description of the Data Backfill Task Verification section in the "Step 2: Create a data backfill task" section in this topic.
The extension-based check feature is enabled for the data backfill task, and the check fails. For more information, see Overview.
Manage data backfill instances
Search for data backfill instances
In the left-side navigation pane, choose
. On the page that appears, click Show Search Options and specify filter conditions, such as Retroactive Instance Name, Status, and Node Type, to search for data backfill instances. You can also terminate multiple running data backfill tasks at a time.
View information about data backfill instances
In this area, you can view the following information about a data backfill instance:
Node Name: the name of the data backfill instance. Click the icon before the name of the data backfill instance and view the information about the instance in this area, such as the date when the data backfill instance is run, the status of the data backfill instance, and details of the tasks for which the instance is generated.
Check Status: the check status of the data backfill instance.
Running Status: the status of the data backfill instance. Valid values: Succeeded, Run failed, Waiting for resources, and Pending. You can troubleshoot issues based on an abnormal state.
Nodes: the number of tasks for which the data backfill instance is generated.
Data Timestamp: the date when the data backfill instance is run.
View Task Analysis Results: You can view the estimated number of instances that are generated, the running date, and risk detection results and handle blocked tasks at the earliest opportunity.
Actions: You can perform the operations that are described in the following table on data backfill instances.
Operation
Description
Stop
Terminate multiple data backfill instances that are in the running state at a time. After you perform this operation, the status of the related instances is set to failed.
Note: You cannot terminate data backfill instances that are not run, are successfully run, or failed to run.
Batch Rerun
Rerun multiple data backfill instances at a time.
Note: Only data backfill instances that are successfully run or failed to run can be rerun.
If you perform this operation, selected data backfill instances are immediately rerun at the same time. The scheduling dependencies between the instances are not considered. If you want to rerun data backfill instances in sequence, you can select Rerun Descendant Nodes or perform the data backfill operation again.
Reuse
Reuse a group of tasks for which data is backfilled. This allows you to quickly select tasks for which you want to backfill data.
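The status restrictions on the Stop and Batch Rerun operations can be summarized as a small lookup. The status labels and the function name below are hypothetical placeholders chosen for this sketch. Only the two operations whose restrictions are stated above are covered.

```python
# Operation -> statuses in which the operation is allowed, per the notes above.
ALLOWED_STATUSES = {
    "stop": {"running"},
    "rerun": {"succeeded", "failed"},
}


def can_apply(operation: str, status: str) -> bool:
    """Return True if the operation is allowed for an instance in the given status."""
    if operation not in ALLOWED_STATUSES:
        raise ValueError(f"unknown operation: {operation}")
    return status in ALLOWED_STATUSES[operation]
```

For example, `can_apply("stop", "running")` is `True`, while `can_apply("rerun", "running")` is `False` because only instances that succeeded or failed can be rerun.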
Manage data backfill tasks
In this area, you can view the following information about each task for which the data backfill instance is generated:
Name: the name of the task for which the data backfill instance is generated. You can click the task name to open the directed acyclic graph (DAG) of the task and view the details of the task.
Scheduling Time: the scheduling time of the task.
Start run time: the time when the task starts to run.
End Time: the time when the task finishes running.
Runtime: the time consumed to run the task.
Actions: You can perform the operations that are described in the following table on the task.
Operation
Description
DAG
View the DAG of the task to identify ancestor and descendant tasks of the task. For more information, see Appendix: Use the features provided in a DAG.
Stop
Stop the task. You can stop tasks that are in the running state. After you perform this operation, the status of the task is set to failed.
Note: You cannot stop a task that is not run, is successfully run, or failed to run.
This operation causes the instance that is generated for the task to fail and blocks the descendant instances of that instance from running. Exercise caution when you perform this operation.
Rerun
Rerun the task.
Note: You can rerun only tasks that failed to run or are successfully run.
More
Rerun Descendant Nodes
Rerun the descendant tasks of the task.
Set Status to Successful
Set the status of the task to successful.
Freeze
Freeze the task to pause the scheduling of the task.
Note: You cannot freeze a task that is in the waiting for resources, waiting for scheduling time to arrive, or running state. If the code of the task is being run or data quality of the task is being checked, the task cannot be frozen.
Unfreeze
Unfreeze the task to resume the scheduling of the task.
View Lineage
View the lineage of the task.
You can select multiple tasks and click Stop or Rerun to terminate or rerun the selected tasks at a time.
Instance status
Status
Successful
Not run
Failed
Running
Waiting
Frozen