DataX is an open source tool provided by Alibaba Cloud. This topic describes how to use DataX to configure a synchronization task to import full data from a table in ApsaraDB for Cassandra (Cassandra) to a wide table in Lindorm.
Prerequisites
- You have read and understood the limits of Lindorm CQL. For more information, see Limits.
- The IP address of your client is added to the whitelist of the Lindorm instance. For more information, see Configure whitelists.
- A keyspace and a wide table are created in LindormTable.
  - For more information about how to create a keyspace, see CREATE KEYSPACE.
  - For more information about how to use the CREATE TABLE statement to create a Lindorm wide table, see CREATE TABLE.
Usage notes
If DataX is deployed on an Elastic Compute Service (ECS) instance and connects to the Lindorm instance over a virtual private cloud (VPC), make sure that the Lindorm instance and the ECS instance meet the following requirements to ensure network connectivity:
- Your Lindorm instance and ECS instance are deployed in the same region. We recommend that you deploy the two instances in the same zone to reduce network latency.
- Your Lindorm instance and ECS instance are deployed in the same VPC.
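If you want to confirm that the ECS instance can reach the Lindorm instance before you install DataX, you can run a quick TCP reachability check against the Cassandra-compatible endpoint on port 9042. The following Python sketch is an illustration only and is not part of DataX; the endpoint is a placeholder that you must replace with the value obtained from the Lindorm console.
# Minimal TCP reachability check for the Lindorm Cassandra-compatible endpoint.
# The host below is a placeholder; replace it with the endpoint from your Lindorm console.
import socket

HOST = "<your-lindorm-cassandra-endpoint>"  # placeholder endpoint
PORT = 9042  # CQL port used by DataX

try:
    with socket.create_connection((HOST, PORT), timeout=5):
        print(f"Reachable: {HOST}:{PORT}")
except OSError as exc:
    # A timeout or connection refusal usually indicates a region, VPC, or whitelist issue.
    print(f"Not reachable: {HOST}:{PORT} ({exc})")
If the connection fails, check the whitelist configuration and confirm that both instances are deployed in the same VPC.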
Procedure
The following steps provide an example on how to import data from a Cassandra cluster to Lindorm by using DataX. In this example, DataX is deployed on an ECS instance.
- Run the following command to download the installation package of DataX:
wget https://github.com/alibaba/DataX/archive/refs/tags/datax_v202303.tar.gz
- Decompress the downloaded installation package.
tar zxvf datax_v202303.tar.gz
- Run the following commands to create a directory named job in the DataX_datax_v202303 project and then create a synchronization task file named JOB.json in that directory:
mkdir job
touch job/JOB.json
- Open the synchronization task file JOB.json.
vi job/JOB.json
- Configure the parameters in the synchronization task file. The following example shows how to configure these parameters in the task file; each parameter is described after the example.
{ "job": { "setting": { "speed": { "channel": 1 } }, "content": [ { "reader": { "name": "cassandrareader", "parameter": { "host": "ld-bp17j28j2y7pm****-proxy-lindorm-pub.lindorm.rds.aliyuncs.com", "port": 9042, "username": "TestUser01", "password": "testPassword", "useSSL": false, "consistancyLevel": "LOCAL_ONE", "timeout": 600000, "fetchsize": 1, "keyspace": "db", "table": "tt", "column": [ "id", "n", "id1" ], "where": "id > ${split_task_min} and id < ${split_task_max}" } }, "writer": { "name": "cassandrawriter", "parameter": { "host": "ld-bp17j28j2y7pm****-proxy-lindorm-pub.lindorm.rds.aliyuncs.com", "port": 9042, "username": "TestUser01", "password": "testPassword", "useSSL": false, "keyspace": "t1", "table": "tt", "column": [ "id", "n", "id1" ] } } } ] } }
  - channel (required): The concurrency of the synchronization task. You can configure this parameter to accelerate the execution of the synchronization task.
  - host (required):
    - reader.parameter.host: the endpoint that is used to connect to the Cassandra cluster. You can obtain the endpoint in the Cassandra console. For more information, see Obtain endpoints.
    - writer.parameter.host: the domain name in the Cassandra-compatible endpoint that is used to connect to LindormTable. You can obtain the domain name in the Lindorm console. For more information, see Prerequisites.
  - port (required): Set the port number to 9042.
    - reader.parameter.port: the port number that is used to connect to the Cassandra cluster.
    - writer.parameter.port: the port number in the Cassandra-compatible endpoint that is used to connect to LindormTable.
  - useSSL (required): Specifies whether to enable SSL encryption. Valid values:
    - true: enables SSL encryption.
    - false: disables SSL encryption.
  - keyspace (required):
    - reader.parameter.keyspace: the name of the source keyspace in the Cassandra cluster. This keyspace contains the source table from which you want to import data.
    - writer.parameter.keyspace: the name of the destination namespace in the Lindorm instance. This namespace contains the destination wide table to which you want to import data.
  - table (required):
    - reader.parameter.table: the name of the source table from which you want to import data.
    - writer.parameter.table: the name of the destination wide table.
  - column (required):
    - reader.parameter.column: the names of the columns that you want to import from the source table.
    - writer.parameter.column: the names of the corresponding columns in the destination wide table.
  - where (optional): The conditions based on which the synchronization task is split. You can specify this parameter to split a synchronization task into multiple tasks.
- Run the following command in the DataX_datax_v202303 project directory to create the executable of DataX:
mvn -U package assembly:assembly -Dmaven.test.skip=true
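Before you run the synchronization task, you can optionally verify that the reader settings in JOB.json (host, port, credentials, keyspace, table, and columns) are correct by querying a few rows from the source table. The following sketch is an illustration only and is not part of DataX; it assumes the DataStax Python driver (installed with pip install cassandra-driver) and reuses the placeholder values from the example task file.
# Optional sanity check of the cassandrareader settings before running DataX.
# Requires the DataStax Python driver: pip install cassandra-driver
# The host is a placeholder; the credentials, keyspace, table, and columns
# mirror the example values in JOB.json.
from cassandra.auth import PlainTextAuthProvider
from cassandra.cluster import Cluster

auth = PlainTextAuthProvider(username="TestUser01", password="testPassword")
cluster = Cluster(["<your-cassandra-endpoint>"], port=9042, auth_provider=auth)
session = cluster.connect("db")  # reader.parameter.keyspace

# Preview the columns that the synchronization task will read.
for row in session.execute("SELECT id, n, id1 FROM tt LIMIT 5"):
    print(row)

cluster.shutdown()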
- Optional: Optimize the parameters of the synchronization task. To accelerate the execution of the task, replace the code block in the same format in the datax.py script in the target/datax/datax/bin/ directory with the following code block.
DEFAULT_JVM = "-Xms8g -Xmx8g -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=%s/log" % (DATAX_HOME)
DEFAULT_PROPERTY_CONF = "-Dfile.encoding=UTF-8 -Dcom.datastax.driver.NATIVE_TRANSPORT_MAX_FRAME_SIZE_IN_MB=1900 -Dlogback.statusListenerClass=ch.qos.logback.core.status.NopStatusListener -Djava.security.egd=file:///dev/urandom -Ddatax.home=%s -Dlogback.configurationFile=%s" % (DATAX_HOME, LOGBACK_FILE)
- Run the following command to execute a single synchronization task:
# Format: python target/datax/datax/bin/datax.py --help
python target/datax/datax/bin/datax.py job/JOB.json -p "-Dsplit_task_min=100 -Dsplit_task_max=1000"
In the preceding command, JOB.json specifies the name of the synchronization task file.
Note You can also create multiple synchronization task files in the job directory based on the CPU, memory, and network resources of the source and destination clusters, and then run the preceding command once for each task file to execute multiple synchronization tasks at the same time. A sketch that runs several split tasks in parallel follows this note.
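The following Python sketch illustrates one way to run several split tasks in parallel. Instead of copying the task file, it reuses the single template job/JOB.json and passes a different ID range to each run through the -p option, which matches the command in the previous step. The ranges and the number of parallel tasks are illustrative assumptions that you should adjust to the resources of your clusters.
# Sketch: run several DataX split tasks in parallel against one template job file.
# The template keeps the ${split_task_min}/${split_task_max} placeholders in its
# "where" condition; each task substitutes its own range through the -p option.
import subprocess

DATAX = "target/datax/datax/bin/datax.py"
JOB = "job/JOB.json"
# Illustrative ranges; the template's where clause uses exclusive bounds
# (id > min and id < max), so pick boundaries that cover all IDs.
RANGES = [(0, 1000), (1000, 2000), (2000, 3000)]

procs = []
for lo, hi in RANGES:
    cmd = [
        "python", DATAX, JOB,
        "-p", f"-Dsplit_task_min={lo} -Dsplit_task_max={hi}",
    ]
    procs.append(subprocess.Popen(cmd))

# Wait for every task to finish and report its exit code.
for proc in procs:
    proc.wait()
    print(proc.args, "->", proc.returncode)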
Examples
If you need to refer to the sample code when you perform the preceding steps, or if you want to accelerate the execution of the mvn -U package command in the build step, see the source code files in datax.tar.gz.