The Tablestore Sink Connector polls messages from Kafka for the subscribed topics, parses the message records, and then batch-imports the data into a Tablestore data table.
Prerequisites
Kafka is installed, and ZooKeeper and Kafka have been started. For more information, see the official Kafka documentation.
The Tablestore service is activated, and an instance and a data table have been created. For the specific procedure, see the usage workflow.
Note: You can also let the Tablestore Sink Connector create the destination table automatically by setting auto.create to true.
An AccessKey has been obtained. For the specific procedure, see Obtain an AccessKey.
Step 1: Deploy the Tablestore Sink Connector
Obtain the Tablestore Sink Connector in either of the following ways.
Download the source code from GitHub and compile it yourself. The source code is available in the Tablestore Sink Connector repository on GitHub.
Run the following Git command to download the Tablestore Sink Connector source code.
git clone https://github.com/aliyun/kafka-connect-tablestore.git
Go to the downloaded source directory and run the following command to build the package with Maven.
mvn clean package -DskipTests
After compilation completes, the generated package (for example, kafka-connect-tablestore-1.0.jar) is placed in the target directory.
Alternatively, download the precompiled kafka-connect-tablestore package directly.
Copy the package to the $KAFKA_HOME/libs directory on each node.
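For example, on a single-node test setup the copy step might look like the following sketch (the JAR name, build path, and $KAFKA_HOME location are assumptions; adjust them to your environment):
# Copy the connector JAR into Kafka's library directory so the Connect worker can load it.
# The JAR version and the KAFKA_HOME path are assumptions; adjust them to your environment.
cp target/kafka-connect-tablestore-1.0.jar $KAFKA_HOME/libs/
# Repeat this step on every node that will run a Connect worker.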
Step 2: Start the Tablestore Sink Connector
The Tablestore Sink Connector can run in standalone mode or distributed mode. Choose the mode that fits your scenario.
To configure standalone mode, follow these steps:
Modify the worker configuration file connect-standalone.properties and the connector configuration file connect-tablestore-sink-quickstart.properties as needed.
Sample worker configuration file connect-standalone.properties
The worker configuration includes the Kafka connection parameters, the serialization format, the offset commit frequency, and other settings. The official Kafka sample is used here as an example. For more information, see Kafka Connect.
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements. See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at
#
#    http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# These are defaults. This file just demonstrates how to override some settings.
bootstrap.servers=localhost:9092

# The converters specify the format of data in Kafka and how to translate it into Connect data. Every Connect user will
# need to configure these based on the format they want their data in when loaded from or stored into Kafka
key.converter=org.apache.kafka.connect.json.JsonConverter
value.converter=org.apache.kafka.connect.json.JsonConverter
# Converter-specific settings can be passed in by prefixing the Converter's setting with the converter we want to apply
# it to
key.converter.schemas.enable=true
value.converter.schemas.enable=true

offset.storage.file.filename=/tmp/connect.offsets
# Flush much faster than normal, which is useful for testing/debugging
offset.flush.interval.ms=10000

# Set to a list of filesystem paths separated by commas (,) to enable class loading isolation for plugins
# (connectors, converters, transformations). The list should consist of top level directories that include
# any combination of:
# a) directories immediately containing jars with plugins and their dependencies
# b) uber-jars with plugins and their dependencies
# c) directories immediately containing the package directory structure of classes of plugins and their dependencies
# Note: symlinks will be followed to discover dependencies or plugins.
# Examples:
# plugin.path=/usr/local/share/java,/usr/local/share/kafka/plugins,/opt/connectors,
#plugin.path=
Sample connector configuration file connect-tablestore-sink-quickstart.properties
The connector configuration includes the connector class, the Tablestore connection parameters, the data mapping, and other settings. For more information, see the configuration description.
# Connector name.
name=tablestore-sink
# Connector class.
connector.class=TableStoreSinkConnector
# Maximum number of tasks.
tasks.max=1
# List of Kafka topics to export data from.
topics=test
# The following settings are the Tablestore connection parameters.
# Endpoint of the Tablestore instance.
tablestore.endpoint=https://xxx.xxx.ots.aliyuncs.com
# AccessKey ID and AccessKey secret.
tablestore.access.key.id=xxx
tablestore.access.key.secret=xxx
# Name of the Tablestore instance.
tablestore.instance.name=xxx

# Format string that specifies the name of the destination Tablestore table, where <topic> is a placeholder for the original topic. Default value: <topic>.
# Examples:
# table.name.format=kafka_<topic>. Message records from the topic named test are written to the table named kafka_test.
# table.name.format=

# Primary key mode. Default value: kafka.
# <topic>_<partition> (the Kafka topic and partition, separated by "_") and <offset> (the offset of the message record within the partition) are used as the primary key of the Tablestore table.
# primarykey.mode=

# Automatically create the destination table. Default value: false.
auto.create=true
Go to the $KAFKA_HOME directory and run the following command to start standalone mode.
bin/connect-standalone.sh config/connect-standalone.properties config/connect-tablestore-sink-quickstart.properties
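After the standalone worker starts, you can optionally confirm that the connector was loaded through the Connect REST interface. This is a minimal check, assuming the worker listens on the default REST port 8083:
# The response should be a JSON array that contains "tablestore-sink".
curl http://localhost:8083/connectors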
To configure distributed mode, follow these steps:
Modify the worker configuration file connect-distributed.properties as needed.
The worker configuration includes the Kafka connection parameters, the serialization format, the offset commit frequency, and other settings, as well as the topics that store connector-related information. We recommend that you create these topics manually in advance (see the example after this list). The official Kafka sample is used here as an example. For more information, see Kafka Connect.
offset.storage.topic: the compacted topic that stores the offsets of each connector.
config.storage.topic: the compacted topic that stores connector and task configurations. This topic must have exactly one partition.
status.storage.topic: the compacted topic that stores Kafka Connect status information.
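If you want to create these topics before starting the workers, the following sketch shows one way to do it on a single-broker test cluster. The topic names match the sample file below; the partition counts, the replication factor of 1, and the use of --bootstrap-server (Kafka 2.2 or later) are assumptions suited only to a test environment:
# Pre-create the three internal Kafka Connect topics as compacted topics.
bin/kafka-topics.sh --create --bootstrap-server localhost:9092 --topic connect-offsets --partitions 25 --replication-factor 1 --config cleanup.policy=compact
bin/kafka-topics.sh --create --bootstrap-server localhost:9092 --topic connect-configs --partitions 1 --replication-factor 1 --config cleanup.policy=compact
bin/kafka-topics.sh --create --bootstrap-server localhost:9092 --topic connect-status --partitions 5 --replication-factor 1 --config cleanup.policy=compact
In a production cluster you would use a higher replication factor, as the comments in the sample file below point out.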
##
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements. See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at
#
#    http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
##

# This file contains some of the configurations for the Kafka Connect distributed worker. This file is intended
# to be used with the examples, and some settings may differ from those used in a production system, especially
# the `bootstrap.servers` and those specifying replication factors.

# A list of host/port pairs to use for establishing the initial connection to the Kafka cluster.
bootstrap.servers=localhost:9092

# unique name for the cluster, used in forming the Connect cluster group. Note that this must not conflict with consumer group IDs
group.id=connect-cluster

# The converters specify the format of data in Kafka and how to translate it into Connect data. Every Connect user will
# need to configure these based on the format they want their data in when loaded from or stored into Kafka
key.converter=org.apache.kafka.connect.json.JsonConverter
value.converter=org.apache.kafka.connect.json.JsonConverter
# Converter-specific settings can be passed in by prefixing the Converter's setting with the converter we want to apply
# it to
key.converter.schemas.enable=true
value.converter.schemas.enable=true

# Topic to use for storing offsets. This topic should have many partitions and be replicated and compacted.
# Kafka Connect will attempt to create the topic automatically when needed, but you can always manually create
# the topic before starting Kafka Connect if a specific topic configuration is needed.
# Most users will want to use the built-in default replication factor of 3 or in some cases even specify a larger value.
# Since this means there must be at least as many brokers as the maximum replication factor used, we'd like to be able
# to run this example on a single-broker cluster and so here we instead set the replication factor to 1.
offset.storage.topic=connect-offsets
offset.storage.replication.factor=1
#offset.storage.partitions=25

# Topic to use for storing connector and task configurations; note that this should be a single partition, highly replicated,
# and compacted topic. Kafka Connect will attempt to create the topic automatically when needed, but you can always manually create
# the topic before starting Kafka Connect if a specific topic configuration is needed.
# Most users will want to use the built-in default replication factor of 3 or in some cases even specify a larger value.
# Since this means there must be at least as many brokers as the maximum replication factor used, we'd like to be able
# to run this example on a single-broker cluster and so here we instead set the replication factor to 1.
config.storage.topic=connect-configs
config.storage.replication.factor=1

# Topic to use for storing statuses. This topic can have multiple partitions and should be replicated and compacted.
# Kafka Connect will attempt to create the topic automatically when needed, but you can always manually create
# the topic before starting Kafka Connect if a specific topic configuration is needed.
# Most users will want to use the built-in default replication factor of 3 or in some cases even specify a larger value.
# Since this means there must be at least as many brokers as the maximum replication factor used, we'd like to be able
# to run this example on a single-broker cluster and so here we instead set the replication factor to 1.
status.storage.topic=connect-status
status.storage.replication.factor=1
#status.storage.partitions=5

# Flush much faster than normal, which is useful for testing/debugging
offset.flush.interval.ms=10000

# These are provided to inform the user about the presence of the REST host and port configs
# Hostname & Port for the REST API to listen on. If this is set, it will bind to the interface used to listen to requests.
#rest.host.name=
#rest.port=8083

# The Hostname & Port that will be given out to other workers to connect to i.e. URLs that are routable from other servers.
#rest.advertised.host.name=
#rest.advertised.port=

# Set to a list of filesystem paths separated by commas (,) to enable class loading isolation for plugins
# (connectors, converters, transformations). The list should consist of top level directories that include
# any combination of:
# a) directories immediately containing jars with plugins and their dependencies
# b) uber-jars with plugins and their dependencies
# c) directories immediately containing the package directory structure of classes of plugins and their dependencies
# Examples:
# plugin.path=/usr/local/share/java,/usr/local/share/kafka/plugins,/opt/connectors,
#plugin.path=
Go to the $KAFKA_HOME directory and run the following command to start distributed mode.
Important: You must start a worker process on every node.
bin/connect-distributed.sh config/connect-distributed.properties
Manage connectors through the REST API. For more information, see REST API.
Create a connect-tablestore-sink-quickstart.json file in the config directory and fill in the following sample content.
The connector configuration file specifies the parameters as key-value pairs in JSON format, including the connector class, the Tablestore connection parameters, the data mapping, and other settings. For more information, see the configuration description.
{ "name": "tablestore-sink", "config": { "connector.class":"TableStoreSinkConnector", "tasks.max":"1", "topics":"test", "tablestore.endpoint":"https://xxx.xxx.ots.aliyuncs.com", "tablestore.access.key.id":"xxx", "tablestore.access.key.secret":"xxx", "tablestore.instance.name":"xxx", "table.name.format":"<topic>", "primarykey.mode":"kafka", "auto.create":"true" } }
Run the following command to start a Tablestore Sink Connector.
curl -i -k -H "Content-type: application/json" -X POST -d @config/connect-tablestore-sink-quickstart.json http://localhost:8083/connectors
In this command, http://localhost:8083/connectors is the address of the Kafka Connect REST service. Modify it as needed.
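After the connector is created, you can check it through the same REST interface. A minimal sketch, assuming the default host and port and the connector name tablestore-sink from the sample configuration above:
# List all connectors registered with the Connect cluster.
curl http://localhost:8083/connectors
# Show the running state of the tablestore-sink connector and its tasks.
curl http://localhost:8083/connectors/tablestore-sink/status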
Step 3: Produce new records
Go to the $KAFKA_HOME directory and run the following command to start a console producer.
bin/kafka-console-producer.sh --broker-list localhost:9092 --topic test
The following table describes the command-line options.
Option | Example value | Description
--broker-list | localhost:9092 | The broker address and port of the Kafka cluster.
--topic | test | The topic name. By default, the topic is created automatically when the Tablestore Sink Connector starts; you can also create it manually (see the example below).
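If you choose to create the topic manually, a minimal sketch for a single-broker test cluster might look like the following (the partition count, the replication factor of 1, and the use of --bootstrap-server are assumptions for a test environment):
# Create the topic that the console producer writes to and the connector reads from.
bin/kafka-topics.sh --create --bootstrap-server localhost:9092 --topic test --partitions 1 --replication-factor 1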
Write some new messages to the topic test. The console producer sends each input line as one message, so enter each JSON record on a single line.
Struct type message
{ "schema":{ "type":"struct", "fields":[ { "type":"int32", "optional":false, "field":"id" }, { "type":"string", "optional":false, "field":"product" }, { "type":"int64", "optional":false, "field":"quantity" }, { "type":"double", "optional":false, "field":"price" } ], "optional":false, "name":"record" }, "payload":{ "id":1, "product":"foo", "quantity":100, "price":50 } }
Map type message
{ "schema":{ "type":"map", "keys":{ "type":"string", "optional":false }, "values":{ "type":"int32", "optional":false }, "optional":false }, "payload":{ "id":1 } }
Log on to the Tablestore console to view the data.
A data table named test is automatically created in the Tablestore instance. The data in the table is shown in the following figure: the first row is the result of importing the Map type message, and the second row is the result of importing the Struct type message.