DataWorks Data Integration – The Central Hub of Big Data Migration to the Cloud and Removal from the Cloud

By Wang Zhe, Product Manager of DataWorks

This article is a part of the One-stop Big Data Development and Governance DataWorks Use Collection.

1. An Introduction to the Basis of Data Integration

Data integration is the central hub for the big data platform to migrate to the cloud and remove from the cloud. Its main function aims to get through the data in different business systems, realize the free offline or real-time flow of data, complete data migration to the cloud, and remove it from the cloud using network solutions.

Data integration completed publishing DataX 1.0 and 2.0 in 2011 and officially published DataX 3.0 in 2014. In 2018, the three versions were integrated, which unified the internal functions of public cloud, Apsara Stack, and Alibaba to establish the Data Integration service of Alibaba, which can optimize development efficiency and reduce O&M costs. In 2019, the exclusive resource group was published, and data integration was formally commercialized. In 2020, real-time simultaneous was published, allowing a full incremental solution and separate testing of resource group connectivity.

Data integration has two major functions. The first function works a as data middle platform to connect to the external central hub. It is responsible for the connection of data flow among the connected cloud systems. The second is a one-stop data transmission service across clouds and networks. In data integration, all data can be accessed to provide a one-stop solution.

(1) Business Scenarios

Business Scenario 1 – Migrate to the Cloud: It can be used when you need to quickly and securely migrate data to the cloud storage for further business analysis, such as offline MySQL, Oracle-to-cloud MaxCompute, or MC-Hologres and E-MapReduce. One-time full or continuous increment is supported during cloud migration, including offline or real-time modes.

Business Scenario 2 – Real-Time Data Warehouses: It supports the aggregation of batch or streaming data into real-time warehouses. It can also be used for data aggregation and analysis related to IoT and the collection and analysis of log data streams.

Business Scenario 3 – Disaster Recovery Backup: It can be used when your original data needs to be transferred to the storage system for disaster recovery or long-term backup. For example, the data on MaxCompute can be written into OSS and made into files, and the compression method can be used to save storage space. When needed, OSS files can also be restored to original tables and other forms.

Business Scenario 4 – Platform Convergence: You can synchronize the data of various products on the cloud, such as third-party cloud AWS, GCP, and other products. Alibaba Cloud internal data can be synchronized among various products, such as ApsaraDB for RDS, MaxCompute, OSS, PolarDB, AnalyticDB, and Datahub.

(2) Business Value

Coverage of Regions and Industries: In terms of geographical coverage, eight regions have been opened in China, including the Hong Kong area. Eleven regions have been opened internationally, which is virtually the same as Alibaba Cloud. In terms of industry distribution, it has covered common fields, including government, finance, insurance, energy, power, manufacturing, Internet, and retail. It has extensive experience in business scenarios.

(3) Core Capabilities

The first core capability is offline/real-time full coverage. Since 2020, data integration has supported real-time data synchronization, covering almost all data circulation scenarios. It includes offline data synchronization, active data extraction, passive real-time data receiving, data changes perception, and real-time synchronization.

The second core capability aims to support a wide range of disparate data sources. It supports more than 50 data sources offline and more than 10 data sources in real-time. It can implement any combination of reading and writing and covers ApsaraDB RDS, MPP, NoSQL, NAS, and message flow.

The third core capability aims to support complex networks, regardless of where data sources are located, such as public networks, IDCs, and VPCs. Data integration is equipped with mature solutions to connect to data sources.

The fourth core capability is the synchronization solution, including full database migration, batch cloud migration, incremental synchronization, database and table sharding, and one-click real-time full increment. These have been productized and can be completed by setting up a simple configuration.

The fifth core capability is security control. In tasks and data sources, development and production environments can be isolated, data source permission security control can be achieved, personnel and projects can be specified to share data sources, and minimum granularity control can be achieved.

The sixth core capability is O&M monitoring, which can implement network traffic control, dirty data control, and resource group usage monitoring and support task alert settings, including phone calls, SMS messages, emails, and DingTalk.

2. Technical Architecture of Data Integration

(1) Technical Architecture

Data Integration is deeply integrated with DataWorks.

The underlying layer is the execution layer, which can be connected to the Alisa engine of DataWorks, including the Yarn engine and Kubernetes, which can be connected immediately. A center upward aims to control the data source plug-ins, which can be expanded upward to enhance the reading and writing capabilities of various data.

You can perform rendering, scheduling, control, execution, and resource group management for jobs at the control layer. You can also perform detailed feedback on measurement and billing and deep integration with the permission control of DataWorks. These tasks rely on DataWorks for code management.

The access layer is further up. First of all, visual development is realized using the IDE visual environment provided by DataWorks. At the same time, it can also support the development of OpenAPI by working with DataWorks. In addition to using the interface of DataWorks, you can use OpenAPI to package data integration capabilities in your functions or products, so the upstream and downstream are connected to the capabilities of major DataWorks modules.

(2) Engine Architecture

The data integration adopts a new engine architecture. No matter what kind of data source, it can form a one-to-one or many-to-one synchronization process with other data sources as long as it is connected to this engine. Any two data sources can be combined and matched, and dozens of possibilities are added for each additional data source. This capability provides support for data process scalability.

3. Resource Groups and Network Capabilities

(1) Resource Groups Comparison

The data integration consists of three resource groups, exclusive resource groups, shared resource groups, and custom resource groups. The differences can be seen in eight dimensions.

The first is the attribution of computing resources. The computing resources of public resource groups are shared by all tenants, and the overall maintenance is undertaken by DataWorks. Exclusive resource groups are also maintained by DataWorks but are exclusive to users. Custom resource groups are your IDC machines and are maintained by users.

The second is network support. Public resource groups support the Internet, Alibaba Cloud classic network, and Alibaba Cloud internal network but do not support Alibaba Cloud VPC. Exclusive resource groups and custom resource groups support the Internet, Alibaba Cloud VPC, and Alibaba Cloud internal network but do not support the Alibaba Cloud classic network.

The third is the DataWorks version. Shared resource groups and exclusive resource groups are supported in all versions. Custom resource groups only support the professional or later versions.

The fourth aims to support data sources. Shared resource groups and custom resource groups support some data sources. Exclusive resource groups support all data sources.

The fifth is safety and reliability. Shared resource groups are open to users. Users that preempt more resources will cause delays and other problems to users that preempt fewer resources, so the security and reliability are low. Exclusive resource groups have high security and reliability. Custom resource groups depend on the user machine environment.

The sixth is the efficiency of task execution. Public resource groups have low task execution efficiency due to resource preemption, which is not guaranteed. The task execution efficiency of exclusive resource groups is high. Custom resource groups depend on the user machine environment.

The seventh is the applicable scenarios. Shared resource groups are suitable for non-important and non-urgent scattered tasks or testing tasks. Exclusive resource groups are suitable for a large number of important production-level tasks. A custom resource group is generally the user's existing resources, or all data sources are inside the IDC. In this case, the cost of using custom resources is low from one IDC to another.

The eighth is the billing method. Shared resource groups are charged on a pay-as-you-go basis. The exclusive resource groups are available on an annual or monthly subscription basis and have different computing resource specifications. Custom resource groups are only charged for the professional version of DataWorks.

In summary, we recommend using exclusive resource groups.

(2) Network Solutions

Different resource groups have different network solutions and capabilities, and there are usually four categories of situations.

1. The data source can access the public network. All three resource groups can connect with this group of data sources to connect the network to realize data synchronization.

2. If the data source is in a VPC, shared resource groups cannot be connected. You can only use exclusive or custom resource groups. This is the most common case. There are two situations:

Situation 1: The data source VPC and DataWorks are in the same region. In the exclusive resource group, the corresponding solution is to bind a VPC first. If the availability zones are different, you can add a router in the console to connect the network.
Situation 2: The data source VPC and DataWorks are in different regions. You can create a VPC in the region where DataWorks is located, connect the two VPCs with an express connect or VPN, bind the resource group to the VPC in the region where DataWorks is located, and add a router in the console to connect to the destination VPC.

3. The data source is in the IDC. Connect through an exclusive resource group by creating a VPC in the region where DataWorks resides. Then, bind the VPC in the region where DataWorks resides using an express connect or VPN to connect the VPC to the IDC. Add a router in the console to connect to the destination VPC.

4. The data source is in the classic network. We recommend moving the data source to the VPC. We have corresponding solutions, but this situation is relatively rare. Currently, only public resource groups can directly connect to the classic network.

The solution is more flexible for custom resource groups because the machine environment is in the user's place, but you may encounter various situations. For example, it needs to have the ability to access the Internet to set a machine in a specified VPC, IDC, or classic network as a resource group. Clients need to use their Express Connect, VPN, or other connection means to access a data source in a VPC, IDC, or classic network or access other VPCs, IDCs, or classic networks elsewhere.

In summary, we recommend using exclusive resource groups. The interface is easy to operate, and the solutions are abundant.

4. Offline Synchronization

(1) The Core Functions of Offline Synchronization Have Four Main Features

A variety of data sources
Various Solutions: It supports full and incremental synchronization, full database and batch data synchronization, and database and table sharding.
Fine-Grained Permission Control: It includes data source permission security control, development, and production environment isolation. It is convenient for development process control. It has clear permissions and realizes synchronous network traffic control.
Supports Complex Scheduling: It is deeply integrated with DataWorks and supports tens of millions of scale timing and conditional scheduling. It supports baseline setting and alarm processing.

(2) Synchronization Principle

Offline synchronization from the source to the destination is achieved using the data integration engine. Based on the JDBC protocol of the database or the SDK of the data source, the tasks of pulling and dividing the data of the source database are carried out. A task is divided into multiple tasks, and the data can be read concurrently to speed up data synchronization. Each task has a corresponding read-write subthread. During reading, the source database data can be extracted from the divided area and put into the cache. Finally, the writer at the write end injects the data into the target end by calling the JDBC or SDK interface.

(3) Data Sources Supported by the Synchronization Engine

Offline synchronization supports data sources, such as 50 + ApsaraDB RDS, unstructured storage, big data storage, and MSMQ. The reads and writes of these data sources are crossed and can be combined into a variety of synchronization processes to match the needs in production.

5. Real-Time Synchronization

(1) Real-Time Synchronization Features

Real-time synchronization has five major features:

It supports various data sources. Read supports MySQL, Oracle, SQLServer, OceanBase, Kafka, LogHub, DataHub, and PolarDB. Write supports MaxCompute, MC-Hologres (supports dynamic addition of columns), DataHub, and Kafka. It supports the star process combination. Each input source can be combined with an output source to form a synchronization process. It also supports one task with multiple writes or multiple outputs.
It supports simple data processing, including data filtering and Groovy function string replacement.
It supports real-time database monitoring and extraction, including database sharding and table sharding, single table, multiple tables in a database, and DDL messages.
Monitoring and O&M Alerting can monitor service latency, failover, dirty data, heartbeat check, and failure information and implement alerting through emails, phone calls, and DingTalk.
The solution system integrates real-time and offline and can support the full increment of common databases to MaxCompute and MC-Hologres (first full and then continuous increment).

(2) Real-Time Synchronization Architecture

Input data sources are extracted from real-time databases, including database sharding and table sharding, single table or multiple tables in a database, DDL, or real-time message stream docking. Then, data processing is performed, including data filtering, Groovy functions, and string replacement. Finally, it supports multiple outputs, referring to the implementation of one read and multiple writes. Reading a data source can be written to multiple data sources at the same time. Then, the data is finally output.

(3) Dirty Data Collection Mechanism

Dirty data can be collected in a unified manner. After collection, it can be written from the interface to the collection place, including local logs, Loghub, and MaxCompute.

(4) Real-Time Synchronization of Core Technology

MySQL, Oracle, Datahub, Loghub, and Kafka can be synchronized to Datahub, Kafka, and other destinations in real-time. You can complete development by dragging. It also supports simple data filtering, replacement, and Groovy custom processing capabilities.

6. Real-Time Synchronization Solution

The solution does not refer to the configuration or synchronization of a single task but rather a series of usage scenarios. It automatically helps you establish tasks or start tasks and integrates these processes into solutions.

(1) Full Database Migration

The full database migration is the full database synchronization. You can synchronize all tables in the ApsaraDB RDS to MaxCompute at one time, helping users improve efficiency and reduce usage costs.

(2) Batch Cloud Migration

Batch cloud migration allows users to upload multiple data sources to MaxCompute at the same time. Users can configure rules, such as table name conversion, field name conversion, field type conversion, destination table new fields, destination table field assignment, data filtering, and destination table name prefixes to meet the needs of business scenarios.

(3) Solution System

The solution system can easily configure the existing databases and complete the full migration of existing data and the real-time synchronization of subsequent increments.

The following situations are supported:

Create tables in the destination database automatically
Establish offline synchronization tasks automatically
Establish real-time tasks automatically
Start offline tasks automatically
Start real-time tasks automatically
Establish and start incremental and full fusion tasks automatically
Full-process distribution monitoring and display and support distribution retries

7. Billing Method

The public cloud billing method is related to resource groups and is based on four major points:

Charge by the Number of Instances: The premise of tiered pricing by instance concurrency uses shared resource groups. Users can use subscription resource packs to save costs.
Charge by the Annual or Monthly Subscription: The premise of this mode is to use exclusive resource groups, which are only charged based on the machine specifications and duration, regardless of the number of running instances.
The Version Fee: If you use a custom resource group, you can activate the professional version or later versions.
Network Cost: The premise is if you use the Internet and Alibaba Cloud for data transmission, you can be charged based on the network traffic. The unified standard of Alibaba Cloud in China is 0.8 CNY/GB.

Community