Simple Log Service: Import data from Kafka to Simple Log Service

Last Updated: Jun 21, 2024

This topic describes how to import data from Kafka to Simple Log Service. After you import data to Simple Log Service, you can query, analyze, and transform data in Simple Log Service.

Prerequisites

A project and a Logstore are created. For more information, see Create a project and Create a Logstore.

Supported versions

Only Kafka 2.2.0 and later are supported.

Create a data import configuration

  1. Log on to the Simple Log Service console.

  2. In the Quick Data Import section, click Import Data. On the Data Import tab of the dialog box that appears, click Kafka - Data Import.

  3. Select the project and Logstore. Then, click Next.

  4. Configure the parameters for the data import configuration.

    1. In the Import Configuration step, configure the following parameters.


      Job Name

      The ID of the import job.

      Display Name

      The name of the import job.

      Job Description

      The description of the import job.

      Endpoint

      The address that is used to connect to the Kafka cluster. You can obtain the address from the bootstrap.servers field that is configured for the Kafka cluster. Separate multiple addresses with commas (,).

      • If you use a Kafka cluster that is provided by an Alibaba Cloud ApsaraMQ for Kafka instance, you must enter the IP address or domain name of the instance endpoint.

      • If you use a Kafka cluster that is deployed on an Alibaba Cloud Elastic Compute Service (ECS) instance, you must enter an IP address of the ECS instance.

      • If you use other Kafka clusters, you must enter the public IP address or domain name of a broker in the Kafka cluster.
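
      Before you create the import job, you can optionally confirm that the endpoint is reachable. The following is a minimal sketch, assuming the third-party kafka-python library; the address is a placeholder.

      # Minimal connectivity check (assumes the third-party kafka-python
      # library; the address below is a placeholder).
      from kafka import KafkaConsumer

      consumer = KafkaConsumer(bootstrap_servers="192.168.XX.XX:9092")
      print(consumer.topics())  # lists topics if the broker is reachable
      consumer.close()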

      Topics

      The Kafka topics. Separate multiple topics with commas (,).

      Consumer Group

      If you use a Kafka cluster that is provided by an Alibaba Cloud ApsaraMQ for Kafka instance and do not enable the flexible group creation feature, you must select a consumer group. For more information about the feature, see Use the flexible group creation feature. For more information about how to create a consumer group, see Create a consumer group.

      Starting Position

      The position from which you want the system to start importing data. Valid values:

      • Earliest: The system starts to import data from the first Kafka data entry that exists.

      • Latest: The system starts to import data from the most recent Kafka data entry that is generated.

      Data Format

      The format of the data that you want to import. Valid values:

      • Simple Mode: If the data that you want to import is in the single-line format, you can select Simple Mode.

      • JSON String: If the data that you want to import is in the JSON format, you can select JSON String. The import job parses the imported data into key-value pairs and parses only the first layer of the data.

      Parse Array Elements

      After you turn on Parse Array Elements, the system splits data in the JSON array format into multiple pieces of data based on array elements and then imports the data. For example, [{"status":200},{"status":404}] is split into two separate entries. For sample messages in both formats, see the producer sketch after this procedure.

      Encoding Format

      The encoding format or character set of the data that you want to import. Valid values: UTF-8 and GBK.

      VPC-based Instance ID

      If your ApsaraMQ for Kafka instance or ECS instance resides in a virtual private cloud (VPC), you can specify the ID of the VPC to allow Simple Log Service to read data from the Kafka cluster over an internal network of Alibaba Cloud.

      Reading data over the internal network of Alibaba Cloud provides higher security and network stability.

      Important

      Make sure that the Kafka cluster can be accessed from the 100.104.0.0/16 CIDR block.

      Time Configuration

      Time Field

      The time field that is used to record the log time. You can enter the name of the column that represents time in the Kafka data.

      Regular Expression to Extract Time

      If you set Data Format to Simple Mode, you must specify a regular expression to extract time from the Kafka data.

      For example, if a Kafka data entry is message with time 2022-08-08 14:20:20, you can set Regular Expression to Extract Time to \d\d\d\d-\d\d-\d\d \d\d:\d\d:\d\d.
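
      To sanity-check such an expression locally before you run Preview, you can use a sketch like the following. This is an illustration only, reusing the sample string and pattern from the example above; always confirm the final expression with the Preview feature in the console.

      # Local sanity check for the time-extraction pattern shown above
      # (illustrative only; confirm the final expression with Preview).
      import re

      sample = "message with time 2022-08-08 14:20:20"
      match = re.search(r"\d\d\d\d-\d\d-\d\d \d\d:\d\d:\d\d", sample)
      print(match.group(0) if match else "no time found")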

      Time Field Format

      The time format that is used to parse the value of the time field.

      • You can specify a time format that is supported by Java SimpleDateFormat. Example: yyyy-MM-dd HH:mm:ss. For more information about the time format syntax, see Class SimpleDateFormat. For more information about the common time formats, see Time formats.

      • You can specify an epoch time format. Valid values: epoch, epochMillis, epochMacro, and epochNano.

      Time Zone

      The time zone of the time field.

      If you set Time Field Format to an epoch time format, you do not need to configure Time Zone.

      Default Time Source

      If no time extraction information is provided or time extraction fails, the system uses the time source that you specify. Valid values: Current System Time and Kafka Message Timestamp.

      Advanced Settings

      Log Context

      After you turn on Log Context, you can use the contextual query feature. You can view the context of the data that you want to import in a source Kafka partition.

      Communication Protocol

      The information about the communication protocol that is used to connect to the Kafka cluster. If you want to import data over the Internet, we recommend that you encrypt your connections between Simple Log Service and the Kafka cluster and implement user authentication. The following sample code provides an example.

      The protocol field supports the following values: plaintext, ssl, sasl_plaintext, and sasl_ssl. The recommended value is sasl_ssl, which requires connection encryption and user authentication.

      If you set protocol to sasl_plaintext or sasl_ssl, you must also configure the sasl node. The mechanism field under the sasl node specifies the username-password authentication mechanism and supports the following values: PLAIN, SCRAM-SHA-256, and SCRAM-SHA-512.

      {
          "protocol": "sasl_plaintext",
          "sasl": {
              "mechanism": "PLAIN",
              "username": "xxx",
              "password": "yyy"
          }
      }
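
      For reference, a Kafka client that connects with the same settings might be configured as in the following sketch. It assumes the third-party kafka-python library; the endpoint and credentials are placeholders.

      # Hypothetical client connection that mirrors the sasl configuration
      # above (third-party kafka-python library; the endpoint and
      # credentials are placeholders).
      from kafka import KafkaProducer

      producer = KafkaProducer(
          bootstrap_servers="192.168.XX.XX:9092",  # placeholder endpoint
          security_protocol="SASL_PLAINTEXT",      # or SASL_SSL for encrypted connections
          sasl_mechanism="PLAIN",                  # or SCRAM-SHA-256 / SCRAM-SHA-512
          sasl_plain_username="xxx",
          sasl_plain_password="yyy",
      )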

      Private Domain Resolution

      If you use a Kafka cluster that is deployed on an ECS instance and the brokers in the cluster are connected to each other over an internal endpoint, you must specify the mapping between the internal hostname and the IP address of the ECS instance for each broker. Example:

      {
          "hostname#1": "192.168.XX.XX",
          "hostname#2": "192.168.XX.XX",
          "hostname#3": "192.168.XX.XX"
      }
    2. Click Preview to preview the import result.

    3. After you confirm the result, click Next.

  5. Preview data, configure indexes, and then click Next.

    By default, full-text indexing is enabled for Simple Log Service. You can also configure field indexes based on collected logs in manual or automatic mode. To configure field indexes in automatic mode, click Automatic Index Generation. Then, Simple Log Service automatically creates field indexes. For more information, see Create indexes.

    Important

    If you want to query and analyze logs, you must enable full-text indexing or field indexing. If you enable both full-text indexing and field indexing, the system uses only field indexes.

  6. Click Query Log. On the query and analysis page, check whether Kafka data is imported.

    Wait approximately 1 minute. If the expected Kafka data is displayed, the import is successful.
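
To verify the end-to-end flow, you can write a few test messages to the source topic and then check the query results. The following sketch is one way to generate such messages; it assumes the third-party kafka-python library, and the endpoint, topic, and field names are placeholders.

    # Hypothetical test producer (third-party kafka-python library;
    # the endpoint, topic, and field names are placeholders).
    import json
    from kafka import KafkaProducer

    producer = KafkaProducer(
        bootstrap_servers="192.168.XX.XX:9092",  # placeholder endpoint
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    )

    # JSON String mode: parsed into key-value pairs (first layer only).
    producer.send("test-topic", {"level": "INFO", "time": "2022-08-08 14:20:20"})

    # With Parse Array Elements turned on, this array is split into two entries.
    producer.send("test-topic", [{"status": 200}, {"status": 404}])

    producer.flush()
    producer.close()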

View a data import configuration

After you create a data import configuration, you can view the configuration details and related statistical reports in the Simple Log Service console.

  1. In the Projects section, click the project to which the data import configuration belongs.

  2. Find and click the Logstore to which the data import configuration belongs, choose Data Collection > Data Import, and then click the name of the data import configuration.

  3. On the Import Configuration Overview page, view the basic information and statistical reports of the data import configuration.

What to do next

On the Import Configuration Overview page, you can perform the following operations on the data import configuration:

  • Modify the data import configuration

    To modify the data import configuration, click Edit Configurations. For more information, see Create a data import configuration.

  • Delete the data import configuration

    To delete the data import configuration, click Delete Configuration.

    Warning

    After the data import configuration is deleted, it cannot be restored.

  • Stop an import job

    To stop a data import job, click Stop.

FAQ


A broker connection error occurs during preview. Error code: Broker transport failure.

Possible causes:

  • The address that is specified to connect to the Kafka cluster is invalid.

  • The IP addresses that are used by the import job to access the Kafka cluster are not added to the whitelist of the cluster. As a result, the import job cannot access the cluster.

  • Your Kafka cluster is deployed on Alibaba Cloud, but the VPC-based Instance ID parameter is not configured.

Solutions:

  • Make sure that the specified address for the Kafka cluster is valid.

  • Add the IP addresses that are used by the import job to access the Kafka cluster to the whitelist of the cluster. For more information, see IP address whitelists.

  • If data is imported from a Kafka cluster over an internal network of Alibaba Cloud, make sure that the VPC-based Instance ID parameter is configured.

A timeout error occurs during preview. Error code: preview request timed out.

Possible cause: The Kafka topics that are specified in the data import configuration do not contain data.

Solution: Write data to the topics and preview the data again.

Garbled characters exist in the imported data.

Possible cause: The encoding format that is specified in the data import configuration does not match the actual encoding format of the Kafka data.

Solution: Update the data import configuration based on the actual encoding format of the Kafka data. To handle data that is already garbled, create a new Logstore and a new data import configuration, and then import the data again.

The log time displayed in Simple Log Service is different from the actual time of the imported data.

Possible cause: No time field is specified in the data import configuration, or the specified time format or time zone is invalid.

Solution: Specify a time field, or specify a valid time format and time zone. For more information, see Create a data import configuration.

After data is imported, the data cannot be queried or analyzed.

Possible causes:

  • The data is not within the query time range.

  • No indexes are configured.

  • Configured indexes failed to take effect.

Solutions:

  • Check whether the time of the data that you want to query is within the specified query time range. If it is not, adjust the query time range and query the data again.

  • Check whether indexes are configured for the Logstore to which the data is imported. If they are not, configure indexes first. For more information, see Create indexes and Reindex logs for a Logstore.

  • If indexes are configured for the Logstore and the volume of imported data is displayed as expected on the Data Processing Insight dashboard, the indexes may not have taken effect. In this case, reindex the data. For more information, see Reindex logs for a Logstore.

The number of imported data entries is less than expected.

Possible cause: The size of some Kafka messages exceeds 3 MB. You can check the sizes of Kafka messages on the Data Processing Insight dashboard.

Solution: Make sure that the size of each Kafka message does not exceed 3 MB.

A large latency exists during the import.

Possible causes:

  • The bandwidth limit of the Kafka cluster is reached.

  • The network is unstable when data is imported over the Internet.

  • The number of partitions for a Kafka topic is excessively small.

  • The number of shards in the Logstore is excessively small.

  • For more information about other possible causes, see Limits on performance.

Solutions:

  • Check whether the traffic of the Kafka cluster, especially a Kafka cluster deployed on Alibaba Cloud, reaches the bandwidth limit. If the traffic reaches or approaches the limit, scale out the bandwidth resources of the cluster.

  • If the number of partitions for a Kafka topic is excessively small, increase the number of partitions and monitor the latency. A sketch that increases the partition count appears after this list.

  • If the number of shards in the Logstore is excessively small, increase the number of shards and monitor the latency. For more information, see Manage shards.
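
The following sketch shows one way to increase the number of partitions for a topic, as referenced in the latency solutions above. It assumes the third-party kafka-python library; the endpoint, topic, and partition count are placeholders, and you should confirm the target count with your Kafka administrator.

    # Hypothetical example that raises a topic's partition count
    # (third-party kafka-python library; the endpoint, topic, and count
    # are placeholders). Partition counts can only be increased.
    from kafka.admin import KafkaAdminClient, NewPartitions

    admin = KafkaAdminClient(bootstrap_servers="192.168.XX.XX:9092")
    admin.create_partitions({"test-topic": NewPartitions(total_count=12)})
    admin.close()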

Error handling


A network connection error occurs.

The import job is periodically retried. After the network connection is restored, the import job continues to consume data from the offset of the previous data import interruption.

A Kafka topic does not exist.

If a Kafka topic that contains data to import does not exist, the import job skips the topic. This does not affect the import of data from other topics.

After the topic is re-created, the import job resumes consuming data from the topic as expected, but with a latency of approximately 10 minutes.