By Fu Shuai
Before data migration, you must ensure that your Kafka cluster works properly. In this article, we use Alibaba Cloud E-MapReduce (EMR) to build a Kafka cluster automatically. For details, see Kafka Quick Start.
The EMR cluster in this article runs Kafka 1.0.1, as reflected in the ZooKeeper path used in the commands below. The network type of this Kafka cluster is VPC in China East 1 (Hangzhou), and the ECS compute resource of the master instance group is configured with both a public IP address and an internal network IP address.
Activate MaxCompute and create a project. In this article, we've created a project named bigdata_DOC in China East 1 (Hangzhou) and enabled the related DataWorks services. For more information, see Activate MaxCompute.
Kafka is distributed publish-subscribe message-oriented middleware that is widely used for its high performance and high throughput; it can process millions of messages per second. Kafka is suited to stream data processing and is mainly used in scenarios such as user behavior tracking and log collection.
A typical Kafka cluster contains several Producers, Brokers, Consumers, and a ZooKeeper cluster. The Kafka cluster manages its own configuration and performs service collaboration through ZooKeeper.
A Topic is the most commonly used collection of messages in a Kafka cluster, and it is a logical concept for message storage: the Topic itself is not stored on disk; instead, the messages in the Topic are stored on the disks of the cluster's nodes according to their partitions. Multiple Producers can send messages to a Topic, and multiple Consumers can pull (consume) messages from it.
When a message is added to a partition, it is assigned an offset (numbered from 0), which is the unique identifier of the message within that partition.
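For example, once you know a message's partition and offset, you can read it directly. The following is a minimal sketch (the Topic, partition, and broker address are illustrative) that consumes partition 0 of a Topic starting from offset 0; note that the console consumer's --offset flag requires --partition:
kafka-console-consumer.sh --bootstrap-server emr-header-1:9092 --topic testkafka --partition 0 --offset 0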
To ensure that you can log on to the Header host of the EMR cluster and that MaxCompute and DataWorks can communicate with the Header host smoothly, first configure the security group of the EMR cluster's Header host to open TCP ports 22 and 9092.
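Before you continue, you can quickly check that the two ports are reachable from outside. A minimal sketch using netcat (assuming nc is installed and <header-public-ip> stands for your Header host's public IP address):
nc -zv <header-public-ip> 22
nc -zv <header-public-ip> 9092
Both commands should report that the connection succeeded.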
In the EMR console, go to the Cluster Management > Host List page to confirm the address of the EMR cluster's Header host, and then connect and log on to it remotely through SSH.
Run the following command to create the testkafka Topic used for the test:
kafka-topics.sh --zookeeper emr-header-1:2181/kafka-1.0.1 --partitions 10 --replication-factor 3 --topic testkafka --create
You can then view the created Topic with the following command:
kafka-topics.sh --list --zookeeper emr-header-1:2181/kafka-1.0.1
[root@emr-header-1 ~]# kafka-topics.sh --zookeeper emr-header-1:2181/kafka-1.0.1 --partitions 10 --replication-factor 3 --topic testkafka --create
Created topic "testkafka".
[root@emr-header-1 ~]# kafka-topics.sh --list --zookeeper emr-header-1:2181/kafka-1.0.1
__consumer_offsets
_emr-client-metrics
_schemas
connect-configs
connect-offsets
connect-status
testkafka
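To confirm that the Topic was created with the expected number of partitions and replicas, you can also describe it (a sketch using the same ZooKeeper address as above):
kafka-topics.sh --describe --zookeeper emr-header-1:2181/kafka-1.0.1 --topic testkafka
The output lists the leader, replicas, and in-sync replicas for each of the 10 partitions.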
You can use the following command to simulate the Producer writing data to the testkafka Topic:
kafka-console-producer.sh --broker-list emr-header-1:9092 --topic testkafka
Kafka is used to process streaming data, so you can keep writing data into it continuously. To ensure meaningful test results, we recommend that you write more than 10 data records; a quick way to do so is sketched after the transcript below.
[root@emr-header-1 ~]# kafka-console-producer.sh --broker-list emr-header-1:9092 --topic testkafka
123
abc
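If you prefer not to type the records one by one, you can pipe generated lines into the Producer instead. A minimal sketch that writes 20 messages (the numbers 1 through 20, one message per line):
seq 1 20 | kafka-console-producer.sh --broker-list emr-header-1:9092 --topic testkafka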
To verify that the data was successfully written to Kafka, open another SSH window at the same time and use the following command to simulate the Consumer:
kafka-console-consumer.sh --bootstrap-server emr-header-1:9092 --topic testkafka --from-beginning
If the write succeeded, the written data appears in the output, as shown below.
[root@emr-header-1 ~]# kafka-console-consumer.sh --bootstrap-server emr-header-1:9092 --topic testkafka --from-beginning
123
abc
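You can also check how many messages each partition currently holds, which is useful later when you choose the endOffset parameter for the synchronization task. A sketch using Kafka's GetOffsetShell tool (--time -1 returns the latest offset of each partition):
kafka-run-class.sh kafka.tools.GetOffsetShell --broker-list emr-header-1:9092 --topic testkafka --time -1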
To ensure that MaxCompute can successfully receive Kafka data, you must first create a table on MaxCompute. In this example, a non-partitioned table is used to facilitate the test.
Log on to DataWorks to create a table. For more information, see Table Management.
You can click DDL mode to create a table. The table creation statement for this example is as follows:
CREATE TABLE testkafka (
    `key` string,
    `value` string,
    `partition1` string,
    `timestamp1` string,
    `offset` string,
    `t123` string,
    `event_id` string,
    `tag` string
);
Each of these columns corresponds to one of the columns output by the Kafka Reader in DataWorks Data Integration (the default columns such as __key__, __value__, __partition__, __timestamp__, and __offset__, shown in the reader configuration below), and you can name the table columns yourself.
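If you prefer the command line to the DataWorks UI, you can also create the table with the MaxCompute client. A sketch, assuming odpscmd is installed and configured to use the bigdata_DOC project: save the CREATE TABLE statement above to a file such as create_testkafka.sql, and then run
odpscmd -f create_testkafka.sql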
Currently, the default DataWorks resource group does not fully support the Kafka plug-in, so you need to use a custom resource group to synchronize the data. For more information about custom resource groups, see Add Task Resources.
In this article, to save resources, we use the Header host of the EMR cluster as the custom resource group. After adding it, wait until the server status changes to Available.
In your business flow, right-click Data Integration and choose Create Data Integration Node > Data Synchronization.
After creating the data synchronization node, choose Kafka as the data source and ODPS as the data destination, using the default data source odps_first, and choose the newly created testkafka table as the destination table. After completing the preceding configuration, switch to script mode.
The script configuration is as follows.
{
    "type": "job",
    "steps": [
        {
            "stepType": "kafka",
            "parameter": {
                "server": "47.xxx.xxx.xxx:9092",
                "kafkaConfig": {
                    "group.id": "console-consumer-83505"
                },
                "valueType": "ByteArray",
                "column": [
                    "__key__",
                    "__value__",
                    "__partition__",
                    "__timestamp__",
                    "__offset__",
                    "'123'",
                    "event_id",
                    "tag.desc"
                ],
                "topic": "testkafka",
                "keyType": "ByteArray",
                "waitTime": "10",
                "beginOffset": "0",
                "endOffset": "3"
            },
            "name": "Reader",
            "category": "reader"
        },
        {
            "stepType": "odps",
            "parameter": {
                "partition": "",
                "truncate": true,
                "compress": false,
                "datasource": "odps_first",
                "column": [
                    "key",
                    "value",
                    "partition1",
                    "timestamp1",
                    "offset",
                    "t123",
                    "event_id",
                    "tag"
                ],
                "emptyAsNull": false,
                "table": "testkafka"
            },
            "name": "Writer",
            "category": "writer"
        }
    ],
    "version": "2.0",
    "order": {
        "hops": [
            {
                "from": "Reader",
                "to": "Writer"
            }
        ]
    },
    "setting": {
        "errorLimit": {
            "record": ""
        },
        "speed": {
            "throttle": false,
            "concurrent": 1,
            "dmu": 1
        }
    }
}
You can use the following command on the Header host to view the group.id parameter, that is, the Group name used by the Consumer:
kafka-consumer-groups.sh --bootstrap-server emr-header-1:9092 --list
[root@emr-header-1 ~]# kafka-consumer-groups.sh --bootstrap-server emr-header-1:9092 --list
Note: This will not show information about old ZooKeeper-based consumers.
Taking console-consumer-83505 as an example, you can use the following command on the Header host to confirm the beginOffset and endOffset parameters:
kafka-consumer-groups.sh --bootstrap-server emr-header-1:9092 --describe --group console-consumer-83505
[root@emr-header-1 ~]# kafka-consumer-groups.sh --bootstrap-server emr-header-1:9092 --describe --group console-consumer-83505
Note: This will not show information about old ZooKeeper-based consumers.
Consumer group "console-consumer-83505" has no active members.
TOPIC       PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG  CONSUMER-ID  HOST  CLIENT-ID
testkafka   6          0               0               0    -            -     -
test        6          3               3               0    -            -     -
testkafka   0          0               0               0    -            -     -
testkafka   1          1               1               0    -            -     -
testkafka   5          0               0               0    -            -     -
After the script configuration is complete, first switch the task resource group to the resource group you just created, and then click Run. After the task completes, you can confirm in the operational log that it ran successfully.
To check whether the data synchronized from Kafka now exists in the table, create a new data development task and run an SQL statement. In this example, use the statement select * from testkafka; and click Run.
In this example, multiple data records were written to the testkafka Topic to ensure the accuracy of the result. Check whether the queried data is consistent with what you entered.
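As a final consistency check, you can also count the synchronized records and compare the total with the number of messages you produced (a sketch that again assumes a configured odpscmd client):
odpscmd -e "select count(*) from testkafka;"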