DataX is an open source tool provided by Alibaba Cloud. This topic describes how to use DataX to configure a synchronization task to import full data from a table in ApsaraDB for Cassandra (Cassandra) to a wide table in Lindorm.
Prerequisites
- You have read and understood the limits of Lindorm CQL. For more information, see Limits.
- The IP address of your client is added to the whitelist of the Lindorm instance. For more information, see Configure whitelists.
- A keyspace and a wide table are created in LindormTable.
  - For more information about how to create a keyspace, see CREATE KEYSPACE.
  - For more information about how to use the CREATE TABLE statement to create a Lindorm wide table, see CREATE TABLE.
Usage notes
If DataX is deployed on an Elastic Compute Service (ECS) instance and connects to the Lindorm instance over a virtual private cloud (VPC), make sure that the Lindorm instance and the ECS instance meet the following requirements to ensure network connectivity:
- Your Lindorm instance and ECS instance are deployed in the same region. We recommend that you deploy the two instances in the same zone to reduce network latency.
- Your Lindorm instance and ECS instance are deployed in the same VPC.
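If you want to confirm that the ECS instance can reach the Lindorm instance before you install DataX, you can run a quick TCP reachability check against the Cassandra-compatible endpoint on port 9042. The following Python sketch is an illustration only and is not part of DataX; the endpoint is a placeholder that you must replace with the value obtained from the Lindorm console.
# Minimal TCP reachability check for the Lindorm Cassandra-compatible endpoint.
# The host below is a placeholder; replace it with the endpoint from your Lindorm console.
import socket

HOST = "<your-lindorm-cassandra-endpoint>"  # placeholder endpoint
PORT = 9042  # CQL port used by DataX

try:
    with socket.create_connection((HOST, PORT), timeout=5):
        print(f"Reachable: {HOST}:{PORT}")
except OSError as exc:
    # A timeout or connection refusal usually indicates a region, VPC, or whitelist issue.
    print(f"Not reachable: {HOST}:{PORT} ({exc})")
If the connection fails, check the whitelist configuration and confirm that both instances are deployed in the same VPC.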
Procedure
The following steps provide an example on how to import data from a Cassandra cluster to Lindorm by using DataX. In this example, DataX is deployed on an ECS instance.
- Run the following command to download the installation package of DataX:
wget https://github.com/alibaba/DataX/archive/refs/tags/datax_v202303.tar.gz
- Decompress the downloaded installation package.
tar zxvf datax_v202303.tar.gz
- Run the following commands to create a directory named job in the DataX_datax_v202303 project and then create a synchronization task file named JOB.json in that directory:
mkdir job
touch job/JOB.json
- Open the synchronization task file JOB.json.
vi job/JOB.json
- Configure the parameters in the synchronization task file. The following example shows how to configure these parameters in the task file; each parameter is described after the example.
{ "job": { "setting": { "speed": { "channel": 1 } }, "content": [ { "reader": { "name": "cassandrareader", "parameter": { "host": "ld-bp17j28j2y7pm****-proxy-lindorm-pub.lindorm.rds.aliyuncs.com", "port": 9042, "username": "TestUser01", "password": "testPassword", "useSSL": false, "consistancyLevel": "LOCAL_ONE", "timeout": 600000, "fetchsize": 1, "keyspace": "db", "table": "tt", "column": [ "id", "n", "id1" ], "where": "id > ${split_task_min} and id < ${split_task_max}" } }, "writer": { "name": "cassandrawriter", "parameter": { "host": "ld-bp17j28j2y7pm****-proxy-lindorm-pub.lindorm.rds.aliyuncs.com", "port": 9042, "username": "TestUser01", "password": "testPassword", "useSSL": false, "keyspace": "t1", "table": "tt", "column": [ "id", "n", "id1" ] } } } ] } }
  - channel (required): The concurrency of the synchronization task. You can configure this parameter to accelerate the execution of the synchronization task.
  - host (required):
    - reader.parameter.host: the endpoint that is used to connect to the Cassandra cluster. You can obtain the endpoint in the Cassandra console. For more information, see Obtain endpoints.
    - writer.parameter.host: the domain name in the Cassandra-compatible endpoint that is used to connect to LindormTable. You can obtain the domain name in the Lindorm console. For more information, see Prerequisites.
  - port (required): Set the port number to 9042.
    - reader.parameter.port: the port number that is used to connect to the Cassandra cluster.
    - writer.parameter.port: the port number in the Cassandra-compatible endpoint that is used to connect to LindormTable.
  - useSSL (required): Specifies whether to enable SSL encryption. Valid values:
    - true: enables SSL encryption.
    - false: disables SSL encryption.
  - keyspace (required):
    - reader.parameter.keyspace: the name of the source keyspace in the Cassandra cluster. This keyspace contains the source table from which you want to import data.
    - writer.parameter.keyspace: the name of the destination namespace in the Lindorm instance. This namespace contains the destination wide table to which you want to import data.
  - table (required):
    - reader.parameter.table: the name of the source table from which you want to import data.
    - writer.parameter.table: the name of the destination wide table.
  - column (required):
    - reader.parameter.column: the names of the columns that you want to import from the source table.
    - writer.parameter.column: the names of the corresponding columns in the destination wide table.
  - where (optional): The conditions based on which the synchronization task is split. You can specify this parameter to split a synchronization task into multiple tasks.
- Run the following command in the DataX_datax_v202303 project directory to create the executable of DataX:
mvn -U package assembly:assembly -Dmaven.test.skip=true
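Before you run the synchronization task, you can optionally verify that the reader settings in JOB.json (host, port, credentials, keyspace, table, and columns) are correct by querying a few rows from the source table. The following sketch is an illustration only and is not part of DataX; it assumes the DataStax Python driver (installed with pip install cassandra-driver) and reuses the placeholder values from the example task file.
# Optional sanity check of the cassandrareader settings before running DataX.
# Requires the DataStax Python driver: pip install cassandra-driver
# The host is a placeholder; the credentials, keyspace, table, and columns
# mirror the example values in JOB.json.
from cassandra.auth import PlainTextAuthProvider
from cassandra.cluster import Cluster

auth = PlainTextAuthProvider(username="TestUser01", password="testPassword")
cluster = Cluster(["<your-cassandra-endpoint>"], port=9042, auth_provider=auth)
session = cluster.connect("db")  # reader.parameter.keyspace

# Preview the columns that the synchronization task will read.
for row in session.execute("SELECT id, n, id1 FROM tt LIMIT 5"):
    print(row)

cluster.shutdown()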
- Optional: Optimize the parameters of the synchronization task. To accelerate the execution of the task, replace the code block in the same format in the datax.py script in the target/datax/datax/bin/ directory with the following code block.
DEFAULT_JVM = "-Xms8g -Xmx8g -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=%s/log" % (DATAX_HOME)
DEFAULT_PROPERTY_CONF = "-Dfile.encoding=UTF-8 -Dcom.datastax.driver.NATIVE_TRANSPORT_MAX_FRAME_SIZE_IN_MB=1900 -Dlogback.statusListenerClass=ch.qos.logback.core.status.NopStatusListener -Djava.security.egd=file:///dev/urandom -Ddatax.home=%s -Dlogback.configurationFile=%s" % (DATAX_HOME, LOGBACK_FILE)
- Run the following command to execute a single synchronization task:
# Format: python target/datax/datax/bin/datax.py --help
python target/datax/datax/bin/datax.py job/JOB.json -p "-Dsplit_task_min=100 -Dsplit_task_max=1000"
In the preceding command, JOB.json specifies the name of the synchronization task file.
Note You can also create multiple synchronization task files in the job directory based on the CPU, memory, and network resources of the source and destination clusters, and then run the preceding command once for each task file to execute multiple synchronization tasks at the same time. A sketch that runs several split tasks in parallel follows this note.
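The following Python sketch illustrates one way to run several split tasks in parallel. Instead of copying the task file, it reuses the single template job/JOB.json and passes a different ID range to each run through the -p option, which matches the command in the previous step. The ranges and the number of parallel tasks are illustrative assumptions that you should adjust to the resources of your clusters.
# Sketch: run several DataX split tasks in parallel against one template job file.
# The template keeps the ${split_task_min}/${split_task_max} placeholders in its
# "where" condition; each task substitutes its own range through the -p option.
import subprocess

DATAX = "target/datax/datax/bin/datax.py"
JOB = "job/JOB.json"
# Illustrative ranges; the template's where clause uses exclusive bounds
# (id > min and id < max), so pick boundaries that cover all IDs.
RANGES = [(0, 1000), (1000, 2000), (2000, 3000)]

procs = []
for lo, hi in RANGES:
    cmd = [
        "python", DATAX, JOB,
        "-p", f"-Dsplit_task_min={lo} -Dsplit_task_max={hi}",
    ]
    procs.append(subprocess.Popen(cmd))

# Wait for every task to finish and report its exit code.
for proc in procs:
    proc.wait()
    print(proc.args, "->", proc.returncode)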
Examples
If you need to refer to the sample code when you perform the preceding steps, or if you want to accelerate the execution of the mvn -U package command in the build step, see the source code files in datax.tar.gz.