DataX is an offline data synchronization tool and platform that is widely used within Alibaba Group. It synchronizes data efficiently between heterogeneous data sources, including MySQL, SQL Server, Oracle, PostgreSQL, HDFS, Hive, HBase, OTS, and ODPS.
As a data synchronization framework, DataX abstracts each data source into two plug-ins: a Reader plug-in that reads data from the source, and a Writer plug-in that writes data to the target. In principle, this lets DataX synchronize data for any type of data source. The plug-in system also works as an ecosystem: once a new data source is connected through its plug-ins, it can exchange data with every data source that is already supported.
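The Reader/Writer split described above can be illustrated with a minimal Python sketch. This is not DataX's actual plug-in API (DataX plug-ins are written in Java); the class names `Reader`, `Writer`, `ListReader`, `ListWriter`, and the `sync` function are hypothetical and exist only to show why any reader can be paired with any writer.

```python
from abc import ABC, abstractmethod
from typing import Iterable, Iterator, List


class Reader(ABC):
    """Hypothetical reader plug-in: yields records from a source."""

    @abstractmethod
    def read(self) -> Iterator[list]:
        ...


class Writer(ABC):
    """Hypothetical writer plug-in: stores records in a target."""

    @abstractmethod
    def write(self, records: Iterable[list]) -> int:
        ...


class ListReader(Reader):
    """Reads records from an in-memory list (stand-in for a real source)."""

    def __init__(self, rows: List[list]):
        self.rows = rows

    def read(self) -> Iterator[list]:
        yield from self.rows


class ListWriter(Writer):
    """Collects records in an in-memory list (stand-in for a real target)."""

    def __init__(self) -> None:
        self.rows: List[list] = []

    def write(self, records: Iterable[list]) -> int:
        count = 0
        for record in records:
            self.rows.append(record)
            count += 1
        return count


def sync(reader: Reader, writer: Writer) -> int:
    """Move every record from the reader to the writer; return the count."""
    return writer.write(reader.read())


if __name__ == "__main__":
    target = ListWriter()
    moved = sync(ListReader([[10, "hello"], [11, "world"]]), target)
    print(moved, target.rows)
```

Because `sync` depends only on the two abstract interfaces, adding one new reader makes it usable with every existing writer, which is the property the plug-in ecosystem relies on.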
Install DataX:
After downloading the DataX package, decompress it to a local directory, then enter the bin directory and run a synchronization job:
$ cd {YOUR_DATAX_HOME}/bin
$ python datax.py {YOUR_JOB.json}
Self-test script: python {YOUR_DATAX_HOME}/bin/datax.py {YOUR_DATAX_HOME}/job/job.json
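If jobs need to be launched from another script, the command above can be wrapped as in the following sketch. The helper names `build_datax_command` and `run_job` are hypothetical, and the interpreter name `python` is taken from the commands shown above; adjust it if your DataX version requires a specific Python interpreter.

```python
import subprocess
from pathlib import Path


def build_datax_command(datax_home: str, job_file: str, python: str = "python") -> list:
    """Return the argument list equivalent to: python {home}/bin/datax.py {job}."""
    launcher = Path(datax_home) / "bin" / "datax.py"
    return [python, str(launcher), job_file]


def run_job(datax_home: str, job_file: str) -> int:
    """Run one job and return the exit code of the datax.py process."""
    completed = subprocess.run(build_datax_command(datax_home, job_file))
    return completed.returncode
```

Passing the arguments as a list rather than a single shell string avoids quoting problems when the installation path or job file name contains spaces.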
Alternatively, build DataX from the source code.
(1) Download the DataX source code:
$ git clone git@github.com:alibaba/DataX.git
(2) Package the code through Maven:
$ cd {DataX_source_code_home}
$ mvn -U clean package assembly:assembly -Dmaven.test.skip=true
When packaging succeeds, the log ends with output similar to the following:
[INFO] BUILD SUCCESS
[INFO] -----------------------------------------------------------------
[INFO] Total time: 08:12 min
[INFO] Finished at: 2015-12-13T16:26:48+08:00
[INFO] Final Memory: 133M/960M
[INFO] -----------------------------------------------------------------
After packaging, the DataX package is located in {DataX_source_code_home}/target/datax/datax/. Its directory structure is as follows:
$ cd {DataX_source_code_home}
$ ls ./target/datax/datax/
bin conf job lib log log_perf plugin script tmp
Configuration example: read data from a stream and print it to the console.
You can view the configuration template for a given reader and writer with the following command: python datax.py -r {YOUR_READER} -w {YOUR_WRITER}.
$ cd {YOUR_DATAX_HOME}/bin
$ python datax.py -r streamreader -w streamwriter
DataX (UNKNOWN_DATAX_VERSION), From Alibaba !
Copyright (C) 2010-2015, Alibaba Group. All Rights Reserved.
Please refer to the streamreader document:
https://github.com/alibaba/DataX/blob/master/streamreader/doc/streamreader.md
Please refer to the streamwriter document:
https://github.com/alibaba/DataX/blob/master/streamwriter/doc/streamwriter.md
Please save the following configuration as a json file and use
python {DATAX_HOME}/bin/datax.py {JSON_FILE_NAME}.json
to run the job.
{
    "job": {
        "content": [
            {
                "reader": {
                    "name": "streamreader",
                    "parameter": {
                        "column": [],
                        "sliceRecordCount": ""
                    }
                },
                "writer": {
                    "name": "streamwriter",
                    "parameter": {
                        "encoding": "",
                        "print": true
                    }
                }
            }
        ],
        "setting": {
            "speed": {
                "channel": ""
            }
        }
    }
}
Configure the JSON file according to the template:
#stream2stream.json
{
    "job": {
        "content": [
            {
                "reader": {
                    "name": "streamreader",
                    "parameter": {
                        "sliceRecordCount": 10,
                        "column": [
                            {
                                "type": "long",
                                "value": "10"
                            },
                            {
                                "type": "string",
                                "value": "hello,你好,世界-DataX"
                            }
                        ]
                    }
                },
                "writer": {
                    "name": "streamwriter",
                    "parameter": {
                        "encoding": "UTF-8",
                        "print": true
                    }
                }
            }
        ],
        "setting": {
            "speed": {
                "channel": 5
            }
        }
    }
}
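If you prefer to generate the job file rather than edit it by hand, the same configuration can be built in a short Python script. This is only a sketch: the function names `build_stream_job` and `write_job` are hypothetical, and the keys and values simply mirror the stream2stream.json example above.

```python
import json


def build_stream_job(slice_record_count: int, channels: int) -> dict:
    """Build a streamreader -> streamwriter job mirroring the example above."""
    return {
        "job": {
            "content": [
                {
                    "reader": {
                        "name": "streamreader",
                        "parameter": {
                            "sliceRecordCount": slice_record_count,
                            "column": [
                                {"type": "long", "value": "10"},
                                {"type": "string", "value": "hello,你好,世界-DataX"},
                            ],
                        },
                    },
                    "writer": {
                        "name": "streamwriter",
                        "parameter": {"encoding": "UTF-8", "print": True},
                    },
                }
            ],
            "setting": {"speed": {"channel": channels}},
        }
    }


def write_job(path: str, job: dict) -> None:
    """Write the job as UTF-8 JSON, keeping non-ASCII characters readable."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(job, handle, ensure_ascii=False, indent=4)


if __name__ == "__main__":
    write_job("stream2stream.json", build_stream_job(10, 5))
```

Writing with `ensure_ascii=False` and an explicit UTF-8 encoding keeps the Chinese characters in the string column intact, matching the `"encoding": "UTF-8"` setting of the writer.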
$ cd {YOUR_DATAX_DIR_BIN}
$ python datax.py ./stream2stream.json
When the synchronization finishes, the log ends with a summary similar to the following:
...
2015-12-17 11:20:25.263 [job-0] INFO JobContainer -
Task start time : 2015-12-17 11:20:15
Task end time : 2015-12-17 11:20:25
Total time consumption : 10s
Average task traffic : 205B/s
Record write speed : 5rec/s
Total read records : 50
Write and Read Failures : 0
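The figures in this summary are consistent with the job configuration: 5 channels with a sliceRecordCount of 10 give 50 records, and 50 records in 10 seconds give 5 rec/s. The sketch below parses the summary lines and reproduces that arithmetic. It assumes that each channel emits sliceRecordCount records (which matches the log above) and that summary lines have the "name : value" shape shown in the sample; the function names are hypothetical.

```python
import re

# A summary line looks like "Total read records : 50". The colon must be
# followed by whitespace, so timestamps such as 11:20:25 are not split.
SUMMARY_LINE = re.compile(r"^\s*(.+?)\s*:\s+(.*\S)\s*$")


def parse_summary(log_text: str) -> dict:
    """Collect the "name : value" pairs found in the log text."""
    summary = {}
    for line in log_text.splitlines():
        match = SUMMARY_LINE.match(line)
        if match:
            summary[match.group(1)] = match.group(2)
    return summary


def expected_total_records(slice_record_count: int, channels: int) -> int:
    """Assumption: every channel emits slice_record_count records."""
    return slice_record_count * channels


def records_per_second(total_records: int, seconds: int) -> float:
    """Average record throughput over the whole job."""
    return total_records / seconds


if __name__ == "__main__":
    sample = "\n".join([
        "Task start time : 2015-12-17 11:20:15",
        "Total time consumption : 10s",
        "Total read records : 50",
    ])
    summary = parse_summary(sample)
    total = int(summary["Total read records"])
    seconds = int(summary["Total time consumption"].rstrip("s"))
    print(total == expected_total_records(10, 5), records_per_second(total, seconds))
```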
Google Groups: DataX-user