The Tablestore Sink Connector polls messages from Kafka for the subscribed topics, parses the message records, and then batch-imports the data into a Tablestore data table.
Prerequisites
Kafka is installed, and ZooKeeper and Kafka have been started. For more information, see the official Kafka documentation.
The Tablestore service is activated, and an instance and a data table have been created. For the specific procedure, see the usage workflow.
Note: You can also let the Tablestore Sink Connector create the destination table automatically by setting auto.create to true.
An AccessKey has been obtained. For the specific procedure, see Obtain an AccessKey.
Step 1: Deploy the Tablestore Sink Connector
Obtain the Tablestore Sink Connector in either of the following ways.
Download the source code from GitHub and compile it yourself. The source code is available in the Tablestore Sink Connector repository on GitHub.
Run the following Git command to download the Tablestore Sink Connector source code.
git clone https://github.com/aliyun/kafka-connect-tablestore.git
Go to the downloaded source directory and run the following command to build the package with Maven.
mvn clean package -DskipTests
After compilation completes, the generated package (for example, kafka-connect-tablestore-1.0.jar) is placed in the target directory.
Alternatively, download the precompiled kafka-connect-tablestore package directly.
Copy the package to the $KAFKA_HOME/libs directory on each node.
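For example, on a single-node test setup the copy step might look like the following sketch (the JAR name, build path, and $KAFKA_HOME location are assumptions; adjust them to your environment):
# Copy the connector JAR into Kafka's library directory so the Connect worker can load it.
# The JAR version and the KAFKA_HOME path are assumptions; adjust them to your environment.
cp target/kafka-connect-tablestore-1.0.jar $KAFKA_HOME/libs/
# Repeat this step on every node that will run a Connect worker.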
Step 2: Start the Tablestore Sink Connector
The Tablestore Sink Connector can run in standalone mode or distributed mode. Choose the mode that fits your scenario.
To configure standalone mode, follow these steps:
Modify the worker configuration file connect-standalone.properties and the connector configuration file connect-tablestore-sink-quickstart.properties as needed.
Sample worker configuration file connect-standalone.properties
The worker configuration includes the Kafka connection parameters, the serialization format, the offset commit frequency, and other settings. The official Kafka sample is used here as an example. For more information, see Kafka Connect.
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements. See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at
#
#    http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# These are defaults. This file just demonstrates how to override some settings.
bootstrap.servers=localhost:9092

# The converters specify the format of data in Kafka and how to translate it into Connect data. Every Connect user will
# need to configure these based on the format they want their data in when loaded from or stored into Kafka
key.converter=org.apache.kafka.connect.json.JsonConverter
value.converter=org.apache.kafka.connect.json.JsonConverter
# Converter-specific settings can be passed in by prefixing the Converter's setting with the converter we want to apply
# it to
key.converter.schemas.enable=true
value.converter.schemas.enable=true

offset.storage.file.filename=/tmp/connect.offsets
# Flush much faster than normal, which is useful for testing/debugging
offset.flush.interval.ms=10000

# Set to a list of filesystem paths separated by commas (,) to enable class loading isolation for plugins
# (connectors, converters, transformations). The list should consist of top level directories that include
# any combination of:
# a) directories immediately containing jars with plugins and their dependencies
# b) uber-jars with plugins and their dependencies
# c) directories immediately containing the package directory structure of classes of plugins and their dependencies
# Note: symlinks will be followed to discover dependencies or plugins.
# Examples:
# plugin.path=/usr/local/share/java,/usr/local/share/kafka/plugins,/opt/connectors,
#plugin.path=
Sample connector configuration file connect-tablestore-sink-quickstart.properties
The connector configuration includes the connector class, the Tablestore connection parameters, the data mapping, and other settings. For more information, see the configuration description.
# Connector name.
name=tablestore-sink
# Connector class.
connector.class=TableStoreSinkConnector
# Maximum number of tasks.
tasks.max=1
# List of Kafka topics to export data from.
topics=test
# The following settings are the Tablestore connection parameters.
# Endpoint of the Tablestore instance.
tablestore.endpoint=https://xxx.xxx.ots.aliyuncs.com
# AccessKey ID and AccessKey secret.
tablestore.access.key.id=xxx
tablestore.access.key.secret=xxx
# Name of the Tablestore instance.
tablestore.instance.name=xxx

# Format string that specifies the name of the destination Tablestore table, where <topic> is a placeholder for the original topic. Default value: <topic>.
# Examples:
# table.name.format=kafka_<topic>. Message records from the topic named test are written to the table named kafka_test.
# table.name.format=

# Primary key mode. Default value: kafka.
# <topic>_<partition> (the Kafka topic and partition, separated by "_") and <offset> (the offset of the message record within the partition) are used as the primary key of the Tablestore table.
# primarykey.mode=

# Automatically create the destination table. Default value: false.
auto.create=true
Go to the $KAFKA_HOME directory and run the following command to start standalone mode.
bin/connect-standalone.sh config/connect-standalone.properties config/connect-tablestore-sink-quickstart.properties
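After the standalone worker starts, you can optionally confirm that the connector was loaded through the Connect REST interface. This is a minimal check, assuming the worker listens on the default REST port 8083:
# The response should be a JSON array that contains "tablestore-sink".
curl http://localhost:8083/connectors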
To configure distributed mode, follow these steps:
Modify the worker configuration file connect-distributed.properties as needed.
The worker configuration includes the Kafka connection parameters, the serialization format, the offset commit frequency, and other settings, as well as the topics that store connector-related information. We recommend that you create these topics manually in advance (see the example after this list). The official Kafka sample is used here as an example. For more information, see Kafka Connect.
offset.storage.topic: the compacted topic that stores the offsets of each connector.
config.storage.topic: the compacted topic that stores connector and task configurations. This topic must have exactly one partition.
status.storage.topic: the compacted topic that stores Kafka Connect status information.
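If you want to create these topics before starting the workers, the following sketch shows one way to do it on a single-broker test cluster. The topic names match the sample file below; the partition counts, the replication factor of 1, and the use of --bootstrap-server (Kafka 2.2 or later) are assumptions suited only to a test environment:
# Pre-create the three internal Kafka Connect topics as compacted topics.
bin/kafka-topics.sh --create --bootstrap-server localhost:9092 --topic connect-offsets --partitions 25 --replication-factor 1 --config cleanup.policy=compact
bin/kafka-topics.sh --create --bootstrap-server localhost:9092 --topic connect-configs --partitions 1 --replication-factor 1 --config cleanup.policy=compact
bin/kafka-topics.sh --create --bootstrap-server localhost:9092 --topic connect-status --partitions 5 --replication-factor 1 --config cleanup.policy=compact
In a production cluster you would use a higher replication factor, as the comments in the sample file below point out.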
##
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements. See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at
#
#    http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
##

# This file contains some of the configurations for the Kafka Connect distributed worker. This file is intended
# to be used with the examples, and some settings may differ from those used in a production system, especially
# the `bootstrap.servers` and those specifying replication factors.

# A list of host/port pairs to use for establishing the initial connection to the Kafka cluster.
bootstrap.servers=localhost:9092

# unique name for the cluster, used in forming the Connect cluster group. Note that this must not conflict with consumer group IDs
group.id=connect-cluster

# The converters specify the format of data in Kafka and how to translate it into Connect data. Every Connect user will
# need to configure these based on the format they want their data in when loaded from or stored into Kafka
key.converter=org.apache.kafka.connect.json.JsonConverter
value.converter=org.apache.kafka.connect.json.JsonConverter
# Converter-specific settings can be passed in by prefixing the Converter's setting with the converter we want to apply
# it to
key.converter.schemas.enable=true
value.converter.schemas.enable=true

# Topic to use for storing offsets. This topic should have many partitions and be replicated and compacted.
# Kafka Connect will attempt to create the topic automatically when needed, but you can always manually create
# the topic before starting Kafka Connect if a specific topic configuration is needed.
# Most users will want to use the built-in default replication factor of 3 or in some cases even specify a larger value.
# Since this means there must be at least as many brokers as the maximum replication factor used, we'd like to be able
# to run this example on a single-broker cluster and so here we instead set the replication factor to 1.
offset.storage.topic=connect-offsets
offset.storage.replication.factor=1
#offset.storage.partitions=25

# Topic to use for storing connector and task configurations; note that this should be a single partition, highly replicated,
# and compacted topic. Kafka Connect will attempt to create the topic automatically when needed, but you can always manually create
# the topic before starting Kafka Connect if a specific topic configuration is needed.
# Most users will want to use the built-in default replication factor of 3 or in some cases even specify a larger value.
# Since this means there must be at least as many brokers as the maximum replication factor used, we'd like to be able
# to run this example on a single-broker cluster and so here we instead set the replication factor to 1.
config.storage.topic=connect-configs
config.storage.replication.factor=1

# Topic to use for storing statuses. This topic can have multiple partitions and should be replicated and compacted.
# Kafka Connect will attempt to create the topic automatically when needed, but you can always manually create
# the topic before starting Kafka Connect if a specific topic configuration is needed.
# Most users will want to use the built-in default replication factor of 3 or in some cases even specify a larger value.
# Since this means there must be at least as many brokers as the maximum replication factor used, we'd like to be able
# to run this example on a single-broker cluster and so here we instead set the replication factor to 1.
status.storage.topic=connect-status
status.storage.replication.factor=1
#status.storage.partitions=5

# Flush much faster than normal, which is useful for testing/debugging
offset.flush.interval.ms=10000

# These are provided to inform the user about the presence of the REST host and port configs
# Hostname & Port for the REST API to listen on. If this is set, it will bind to the interface used to listen to requests.
#rest.host.name=
#rest.port=8083

# The Hostname & Port that will be given out to other workers to connect to i.e. URLs that are routable from other servers.
#rest.advertised.host.name=
#rest.advertised.port=

# Set to a list of filesystem paths separated by commas (,) to enable class loading isolation for plugins
# (connectors, converters, transformations). The list should consist of top level directories that include
# any combination of:
# a) directories immediately containing jars with plugins and their dependencies
# b) uber-jars with plugins and their dependencies
# c) directories immediately containing the package directory structure of classes of plugins and their dependencies
# Examples:
# plugin.path=/usr/local/share/java,/usr/local/share/kafka/plugins,/opt/connectors,
#plugin.path=
Go to the $KAFKA_HOME directory and run the following command to start distributed mode.
Important: You must start a worker process on every node.
bin/connect-distributed.sh config/connect-distributed.properties
Manage connectors through the REST API. For more information, see REST API.
Create a connect-tablestore-sink-quickstart.json file in the config directory and fill in the following sample content.
The connector configuration file specifies the parameters as key-value pairs in JSON format, including the connector class, the Tablestore connection parameters, the data mapping, and other settings. For more information, see the configuration description.
{ "name": "tablestore-sink", "config": { "connector.class":"TableStoreSinkConnector", "tasks.max":"1", "topics":"test", "tablestore.endpoint":"https://xxx.xxx.ots.aliyuncs.com", "tablestore.access.key.id":"xxx", "tablestore.access.key.secret":"xxx", "tablestore.instance.name":"xxx", "table.name.format":"<topic>", "primarykey.mode":"kafka", "auto.create":"true" } }
Run the following command to start a Tablestore Sink Connector.
curl -i -k -H "Content-type: application/json" -X POST -d @config/connect-tablestore-sink-quickstart.json http://localhost:8083/connectors
In this command, http://localhost:8083/connectors is the address of the Kafka Connect REST service. Modify it as needed.
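After the connector is created, you can check it through the same REST interface. A minimal sketch, assuming the default host and port and the connector name tablestore-sink from the sample configuration above:
# List all connectors registered with the Connect cluster.
curl http://localhost:8083/connectors
# Show the running state of the tablestore-sink connector and its tasks.
curl http://localhost:8083/connectors/tablestore-sink/status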
Step 3: Produce new records
Go to the $KAFKA_HOME directory and run the following command to start a console producer.
bin/kafka-console-producer.sh --broker-list localhost:9092 --topic test
The following table describes the command-line options.
Option | Example value | Description
--broker-list | localhost:9092 | The broker address and port of the Kafka cluster.
--topic | test | The topic name. By default, the topic is created automatically when the Tablestore Sink Connector starts; you can also create it manually (see the example below).
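If you choose to create the topic manually, a minimal sketch for a single-broker test cluster might look like the following (the partition count, the replication factor of 1, and the use of --bootstrap-server are assumptions for a test environment):
# Create the topic that the console producer writes to and the connector reads from.
bin/kafka-topics.sh --create --bootstrap-server localhost:9092 --topic test --partitions 1 --replication-factor 1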
Write some new messages to the topic test. The console producer sends each input line as one message, so enter each JSON record on a single line.
Struct type message
{ "schema":{ "type":"struct", "fields":[ { "type":"int32", "optional":false, "field":"id" }, { "type":"string", "optional":false, "field":"product" }, { "type":"int64", "optional":false, "field":"quantity" }, { "type":"double", "optional":false, "field":"price" } ], "optional":false, "name":"record" }, "payload":{ "id":1, "product":"foo", "quantity":100, "price":50 } }
Map type message
{ "schema":{ "type":"map", "keys":{ "type":"string", "optional":false }, "values":{ "type":"int32", "optional":false }, "optional":false }, "payload":{ "id":1 } }
Log on to the Tablestore console to view the data.
A data table named test is automatically created in the Tablestore instance. The data in the table is shown in the following figure: the first row is the result of importing the Map type message, and the second row is the result of importing the Struct type message.