Synchronize data from a Simple Log Service Logstore to DLF 2.0 in real time - DataWorks

This topic describes how to synchronize data from a Simple Log Service Logstore to Data Lake Formation (DLF) 2.0 in real time.

Limits

Only serverless resource groups are supported. For more information about serverless resource groups, see the topics in the Use serverless resource groups directory.

Prerequisites

A Simple Log Service data source is added. For more information, see Simple Log Service data source.
A DLF 2.0 data source is added. For more information, see DLF 2.0 data source.
A serverless resource group is created, and network connections between the resource group and the data sources are established. For more information, see Network connectivity solutions.

Create a synchronization task

Create a synchronization task.
1. Go to the Data Integration page.
  Log on to the DataWorks console. In the top navigation bar, select the desired region. In the left-side navigation pane, choose Data Development and Governance > Data Integration. On the page that appears, select the desired workspace from the drop-down list and click Go to Data Integration.
2. On the Synchronization Task page, select LogHub from the Source drop-down list and Data Lake Formation 2.0 (DLF 2.0) from the Destination drop-down list, and click Create Synchronization Task.
Configure basic information and network settings for the synchronization task.
1. In the Basic Settings section, configure basic information.
  - Source And Destination: Select LogHub as the source type and Data Lake Formation 2.0 (DLF 2.0) as the destination type.
  - New Node Name: Specify a name for the synchronization task based on your business requirements.
  - Synchronization Method: Select Single Logstore Realtime Sync.
  - Synchronization Mode: Structural migration and Incremental synchronization are selected by default and cannot be cleared.
2. In the Network and Resource Configuration section, configure the data sources and a resource group and test network connectivity.
  - Resource Group: Select the serverless resource group that you prepared.
  - Source: Select the Simple Log Service data source that you prepared.
  - Destination: Select the DLF 2.0 data source that you prepared.
  After you complete the preceding configuration, click Test Connectivity to test the network connectivity between the resource group and the data sources.
  If the network connectivity test is successful, click Next.
Configure the source.
In the wizard in the upper part of the page that appears, click SLS to configure information about the source.
1. Select the Logstore from which you want to synchronize data.
  - Logstore: Select the Logstore from which you want to synchronize data.
  - Data Sampling: Click Data Sampling. In the Preview Data Output dialog box, configure the Start Time and Sampled Data Records parameters, click Start Collection to enable the system to collect data from the Logstore, and then preview the log information that is displayed.
2. Configure output fields.
  After you select a Logstore, the system automatically loads data in the Logstore and generates field names based on the data. You can change the data type of an output field, delete an output field, or manually add an output field.
  Note
  If an output field does not exist in the Simple Log Service data source, NULL is written to the destination.
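  The behavior described in the note can be pictured with a minimal sketch. This is not DataWorks code: the function name project_record and the sample field names are hypothetical, and Python's None stands in for NULL.

```python
from typing import Any, Dict, List, Optional


def project_record(record: Dict[str, Any], output_fields: List[str]) -> Dict[str, Optional[Any]]:
    """Keep only the configured output fields; a missing field becomes None (NULL)."""
    return {field: record.get(field) for field in output_fields}


# Hypothetical log record and output field list, for illustration only.
log_record = {"request_uri": "/index.html", "status": "200"}
output_fields = ["request_uri", "status", "user_agent"]

row = project_record(log_record, output_fields)
print(row)  # user_agent is not in the log record, so its value is None
```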

Configure the destination.

In the wizard in the upper part of the page that appears, click Data Lake Formation 2.0 (DLF 2.0) to configure information about the destination.

Configure basic information about the destination.

Parameter	Description
Metadata Catalog	The default value of this parameter is the DLF catalog that is configured when you add the DLF 2.0 data source. The default value cannot be changed.
Writing Format	The default value of this parameter is the data format that is configured when you add the DLF 2.0 data source. The default value cannot be changed. Example: PAIMON.
Destination Database	The default value of this parameter is the name of the database that is selected when you add the DLF 2.0 data source. The default value cannot be changed.
Destination Table	The generation method of a destination table. Valid values: Create tables automatically and Use Existing Table.
Table Name	If you set the Destination Table parameter to Create tables automatically, a destination table is automatically generated. You can edit the name or schema of the table and click Save to save the modification. Then, you can preview mappings between the fields in the source Logstore and the fields in the destination table. If you set the Destination Table parameter to Use Existing Table, you can select a table name from the Table Name drop-down list. Then, you can click View Table Schema to view the detailed schema information about the selected table. After you select a destination table, you can preview mappings between the fields in the source Logstore and the fields in the destination table.

Configure mappings between fields in the source Logstore and fields in the destination table.
After you save the modified table schema or set the Destination Table parameter to Use Existing Table, the system automatically establishes mappings between fields in the source Logstore and fields in the destination table based on the same-name mapping rule. You can adjust the mappings based on your business requirements. One field in the source Logstore can map to multiple fields in the destination table. However, multiple fields in the source Logstore cannot map to one field in the destination table. If a field in the source Logstore does not have a mapped field in the destination table, the data in the field is not written to the destination table.
After the preceding configuration is complete, click Perform Simulated Running. In the Preview Data Output dialog box, preview the data that is synchronized to the destination table.
If some data fails to be written to the destination table, you can view the cause of the write failure in the Preview Data Output dialog box. For example, a write can fail if a value cannot be converted to the data type of the destination field.
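
The mapping rules described above can be sketched as follows. This is an illustration under stated assumptions, not the DataWorks implementation: the helper names (same_name_mapping, validate_mapping, apply_mapping) and the sample field names are hypothetical.

```python
from typing import Any, Dict, List, Optional, Tuple

# A mapping is a list of (source_field, destination_field) pairs.
Mapping = List[Tuple[str, str]]


def same_name_mapping(source_fields: List[str], dest_fields: List[str]) -> Mapping:
    """Pair each source field with the destination field that has the same name."""
    dest = set(dest_fields)
    return [(name, name) for name in source_fields if name in dest]


def validate_mapping(mapping: Mapping) -> None:
    """Reject mappings in which several source fields feed one destination field."""
    seen: Dict[str, str] = {}
    for src, dst in mapping:
        if dst in seen and seen[dst] != src:
            raise ValueError(
                f"destination field {dst!r} is mapped from both {seen[dst]!r} and {src!r}"
            )
        seen[dst] = src


def apply_mapping(record: Dict[str, Any], mapping: Mapping) -> Dict[str, Optional[Any]]:
    """Build a destination row; source fields without a mapping are not written."""
    return {dst: record.get(src) for src, dst in mapping}


# Hypothetical fields, for illustration only.
source_fields = ["host", "status", "latency"]
dest_fields = ["host", "status", "status_copy"]

mapping = same_name_mapping(source_fields, dest_fields)
mapping.append(("status", "status_copy"))  # one source field, two destination fields: allowed
validate_mapping(mapping)

row = apply_mapping({"host": "web-1", "status": "200", "latency": "12"}, mapping)
print(row)  # latency has no mapped destination field, so it is dropped
```

In this sketch, adding a pair such as ("latency", "status") would make validate_mapping raise an error, because two source fields would then map to the same destination field.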

Optional. Configure advanced parameters.

In the upper-right corner of the configuration page, click Configure Advanced Parameters. In the Configure Advanced Parameters panel, configure the parameters that control the runtime behavior of the synchronization task. The system provides default values for the advanced parameters based on the configurations of the synchronization task. You can override these values based on your business requirements. The following table describes the advanced parameters.

Parameter	Value scope	Description
Auto configure runtime parameters	Default value: true. Valid values: true and false.	If you set this parameter to true, the system automatically assigns values to all runtime configuration items based on the configurations of the synchronization task.
Worker Number	Minimum value: 1. Maximum value: 100.	The total number of workers that are started for the synchronization task.
Worker Concurrent	Minimum value: 1. Maximum value: 100.	The number of threads that are started by each worker.
Flush Interval (Second)	Default value: 60. Minimum value: 60. Maximum value: 180.	The interval at which data is flushed. Unit: seconds. A larger value can improve data synchronization efficiency but also increases the latency of data in the destination table.
Threshold for the number of failures in the failover restart strategy	Default value: 3. Minimum value: 1. Maximum value: 100.	The maximum number of failover restarts that are allowed for the synchronization task within the time window specified by Failover restart strategy time window (minutes).
Failover restart strategy time window (minutes)	Default value: 30. Minimum value: 1. Maximum value: 60.	The time window within which failovers are counted against the failure threshold. Unit: minutes.
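
The value ranges in the table above can be checked before you enter them in the console. The following is a minimal sketch: the parameter keys, the function names, and the sample values are hypothetical, and only the minimum and maximum values are taken from the table. The total thread count is computed on the assumption that every worker starts the same number of threads.

```python
from typing import Dict, Tuple

# (minimum, maximum) ranges copied from the table; the keys are hypothetical names.
RANGES: Dict[str, Tuple[int, int]] = {
    "worker_number": (1, 100),
    "worker_concurrent": (1, 100),
    "flush_interval_seconds": (60, 180),
    "failover_failure_threshold": (1, 100),
    "failover_window_minutes": (1, 60),
}


def check_advanced_parameters(params: Dict[str, int]) -> Dict[str, str]:
    """Return an error message for every parameter that is unknown or out of range."""
    errors: Dict[str, str] = {}
    for name, value in params.items():
        if name not in RANGES:
            errors[name] = "unknown parameter"
            continue
        low, high = RANGES[name]
        if not low <= value <= high:
            errors[name] = f"{value} is outside [{low}, {high}]"
    return errors


def total_threads(params: Dict[str, int]) -> int:
    """Number of workers multiplied by the number of threads per worker."""
    return params["worker_number"] * params["worker_concurrent"]


# Hypothetical values; the flush interval is deliberately below the minimum of 60.
params = {
    "worker_number": 4,
    "worker_concurrent": 8,
    "flush_interval_seconds": 30,
    "failover_failure_threshold": 3,
    "failover_window_minutes": 30,
}

errors = check_advanced_parameters(params)
print(errors)                 # only flush_interval_seconds is reported
print(total_threads(params))  # 4 workers x 8 threads each
```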

Perform O&M operations on the synchronization task

Start the synchronization task

After you complete the configuration of the synchronization task, the Tasks section of the Synchronization Task page appears. Find the synchronization task and click Start in the Actions column to start the task.

View the running status of the synchronization task

In the Tasks section of the Synchronization Task page, find the synchronization task. Then, click the task name, or click the blank area next to a stage displayed in the Execution Overview column, to view the running details of the task. The running details page displays the following information about the synchronization task:

Basic information: You can view the basic information about the synchronization task, such as the data sources and resource group.
Running status: The synchronization task contains the schema migration and real-time synchronization stages. You can view the running status of the synchronization task in each stage.
Details: You can view the details of the synchronization task in the schema migration stage and real-time synchronization stage on the Schema Migration tab and the Real-time Data Synchronization tab.
- Schema Migration: This tab displays information such as whether a destination table is a newly created table or an existing table. For a newly created table, the DDL statement that is used to create the table is displayed.
- Real-time Data Synchronization: This tab displays statistics about real-time synchronization, including real-time synchronization details, DDL records, and alert information.

Rerun the synchronization task

Directly rerun the data synchronization task.
In the Tasks section of the Synchronization Task page, find the synchronization task and choose More > Rerun in the Actions column to rerun the synchronization task without modifying the configurations of the synchronization task.
Modify the configurations of the synchronization task and then rerun the synchronization task.
In the Tasks section of the Synchronization Task page, find the synchronization task, modify its configurations, and then click Complete. Then, click Apply Updates in the Actions column of the synchronization task to rerun the task so that the latest configurations take effect.