All Products
Search
Document Center

Realtime Compute for Apache Flink:Ingest log data into data warehouses in real time

Last Updated:Dec 12, 2024

Realtime Compute for Apache Flink allows you to ingest log data into data warehouses in real time. This topic describes how to create a draft that synchronizes data from an ApsaraMQ for Kafka instance to a Hologres instance in the development console of Realtime Compute for Apache Flink.

Background information

For example, a topic named users is created for an ApsaraMQ for Kafka instance and this topic contains 100 JSON data records. These JSON data records are the log data that is written to ApsaraMQ for Kafka by using a log file collection tool or application. The following figure shows the data distribution.数据分布

In this topic, the CREATE TABLE AS statement provided by Realtime Compute for Apache Flink is used to synchronize log data with one click and synchronize table schema changes in real time.

Prerequisites

Step 1: Configure IP address whitelists

To allow Realtime Compute for Apache Flink to access the ApsaraMQ for Kafka instance and Hologres instance, you must add the CIDR block of the vSwitch to which the Realtime Compute for Apache Flink workspace belongs to the whitelists of the ApsaraMQ for Kafka instance and Hologres instance.

  1. Obtain the CIDR block of the vSwitch to which the Realtime Compute for Apache Flink workspace belongs.

    1. Log on to the management console of Realtime Compute for Apache Flink.

    2. On the Fully Managed Flink tab, find your workspace and choose More > Workspace Details in the Actions column.

    3. In the Workspace Details dialog box, view the CIDR block about the vSwitch to which the Realtime Compute for Apache Flink workspace belongs.

      网段信息

  2. Add the CIDR block of the vSwitch to which the Realtime Compute for Apache Flink workspace belongs to the IP address whitelist of the ApsaraMQ for Kafka instance.

    For more information, see Configure whitelists.Kafka白名单

  3. Add the CIDR block of the vSwitch to which the Realtime Compute for Apache Flink workspace belongs to the IP address whitelist of the Hologres instance.

    For more information, see Configure an IP address whitelist.Holo白名单

Step 2: Prepare test data of the ApsaraMQ for Kafka instance

Use a Faker source table of Realtime Compute for Apache Flink as a data generator and write the data to the ApsaraMQ for Kafka instance. You can perform the following steps to write data to an ApsaraMQ for Kafka instance in the development console of Realtime Compute for Apache Flink.

  1. Create a topic named users in the ApsaraMQ for Kafka console.

    For more information, see the "Step 1: Create a topic" section of the Step 3: Create resources topic.

  2. Create a draft that writes data to a specific ApsaraMQ for Kafka instance.

    1. Log on to the management console of Realtime Compute for Apache Flink.

    2. Find the workspace that you want to manage and click Console in the Actions column.

    3. In the left-side navigation pane, choose Development > ETL. On the page that appears, click New.

    4. In the New Draft dialog box, select a template to create a draft. On the SQL Scripts tab of the New Draft dialog box, click Blank Stream Draft. Then, click Next. In the New Draft dialog box, configure the parameters of the draft. The following table describes the parameters.

      Parameter

      Example

      Description

      Name

      kafka-data-input

      The name of the draft that you want to create.

      Note

      The draft name must be unique in the current project.

      Location

      Development

      The folder in which the code file of the draft is stored. By default, the code file of a draft is stored in the Development folder.

      You can also click the 新建文件夹 icon to the right of an existing folder to create a subfolder.

      Engine Version

      vvr-8.0.5-flink-1.17

      Select the engine version of the draft from the Engine Version drop-down list.

    5. Click Create.

    6. Copy the following code of a draft to the code editor.

      CREATE TEMPORARY TABLE source (
        id INT,
        first_name STRING,
        last_name STRING,
        `address` ROW<`country` STRING, `state` STRING, `city` STRING>,
        event_time TIMESTAMP
      ) WITH (
        'connector' = 'faker',
        'number-of-rows' = '100',
        'rows-per-second' = '10',
        'fields.id.expression' = '#{number.numberBetween ''0'',''1000''}',
        'fields.first_name.expression' = '#{name.firstName}',
        'fields.last_name.expression' = '#{name.lastName}',
        'fields.address.country.expression' = '#{Address.country}',
        'fields.address.state.expression' = '#{Address.state}',
        'fields.address.city.expression' = '#{Address.city}',
        'fields.event_time.expression' = '#{date.past ''15'',''SECONDS''}'
      );
      
      CREATE TEMPORARY TABLE sink (
        id INT,
        first_name STRING,
        last_name STRING,
        `address` ROW<`country` STRING, `state` STRING, `city` STRING>,
        `timestamp` TIMESTAMP METADATA
      ) WITH (
        'connector' = 'kafka',
        'properties.bootstrap.servers' = 'alikafka-host1.aliyuncs.com:9000,alikafka-host2.aliyuncs.com:9000,alikafka-host3.aliyuncs.com:9000',
        'topic' = 'users',
        'format' = 'json'
      );
      
      INSERT INTO sink SELECT * FROM source;
    7. Modify the following parameter configurations based on your business requirements.

      Parameter

      Example

      Description

      properties.bootstrap.servers

      alikafka-host1.aliyuncs.com:9000,alikafka-host2.aliyuncs.com:9000,alikafka-host3.aliyuncs.com:9000

      The IP addresses or endpoints of Kafka brokers.

      Format: host:port,host:port,host:port. Separate multiple host:port pairs with commas (,).

      topic

      users

      The name of the Kafka topic.

  3. Start the deployment for the draft.

    1. In the left-side navigation pane, choose Development > ETL. On the page that appears, click Deploy.

    2. In the Deploy draft dialog box, click Confirm.

    3. On the Deployments page, configure resources for the deployment of the draft. For more information, see Configure resources for a deployment.

    4. In the left-side navigation pane, choose O&M > Deployments. On the Deployments page, find the deployment that you want to manage and click Start in the Actions column. For more information about the parameters that you must configure when you start your deployment, see Start a deployment.

    5. On the Deployments page, view the status and information about the deployment.image

      The Faker data source provides bounded streams. Therefore, the deployment becomes complete about one minute after the deployment remains in the RUNNING state. When the deployment is complete, data in the deployment is written to the users topic of the ApsaraMQ for Kafka instance. The following sample code shows the format of the JSON data that is written to the ApsaraMQ for Kafka instance.

      {
        "id": 765,
        "first_name": "Barry",
        "last_name": "Pollich",
        "address": {
          "country": "United Arab Emirates",
          "state": "Nevada",
          "city": "Powlowskifurt"
        }
      }

Step 3: Create a Hologres catalog

If you want to perform single-table synchronization, you must create a destination table in a destination catalog. You can create a destination catalog in the development console of Realtime Compute for Apache Flink. In this topic, a Hologres catalog is used as the destination catalog. This section describes how to create a Hologres catalog.

Create a Hologres catalog named holo,For more information, see the "Create a Hologres catalog" section of the Manage Hologres catalogs topic.holo catalog

Important

You must make sure that a database named flink_test_db is created in the instance to which you want to synchronize data. Otherwise, an error is returned when you create a catalog.

Step 4: Create a data synchronization draft and start a data synchronization deployment

  1. Log on to the development console of Realtime Compute for Apache Flink and create a data synchronization draft.

    1. Log on to the management console of Realtime Compute for Apache Flink.

    2. Find the workspace that you want to manage and click Console in the Actions column.

    3. In the left-side navigation pane, choose Development > ETL. On the page that appears, click New.

    4. In the New Draft dialog box, select a template to create a draft. On the SQL Scripts tab of the New Draft dialog box, click Blank Stream Draft. Then, click Next. In the New Draft dialog box, configure the parameters of the draft. The following table describes the parameters.

      Parameter

      Example

      Description

      Name

      flink-quickstart-test

      The name of the draft that you want to create.

      Note

      The draft name must be unique in the current project.

      Location

      Development

      The folder in which the code file of the draft is stored. By default, the code file of a draft is stored in the Development folder.

      You can also click the 新建文件夹 icon to the right of an existing folder to create a subfolder.

      Engine Version

      vvr-8.0.5-flink-1.17

      Select the engine version of the draft from the Engine Version drop-down list.

    5. Click Create.

  2. Copy the following code of a draft to the code editor.

    You can use one of the following methods to synchronize data of the users topic from the ApsaraMQ for Kafka instance to the sync_kafka_users table of the flink_test_db database in Hologres. You can use one of the following methods to specify the data types of the input and output values:

    • Use the CREATE TABLE AS statement

      If you execute the CREATE TABLE AS statement to synchronize data, you do not need to manually create the table in Hologres or configure the types of the columns to which data is written as JSON or JSONB.

      CREATE TEMPORARY TABLE kafka_users (
        `id` INT NOT NULL,
        `address` STRING,
        `offset` BIGINT NOT NULL METADATA,
        `partition` BIGINT NOT NULL METADATA,
        `timestamp` TIMESTAMP METADATA,
        `date` AS CAST(`timestamp` AS DATE),
        `country` AS JSON_VALUE(`address`, '$.country'),
        PRIMARY KEY (`partition`, `offset`) NOT ENFORCED
      ) WITH (
        'connector' = 'kafka',
        'properties.bootstrap.servers' = 'alikafka-host1.aliyuncs.com:9000,alikafka-host2.aliyuncs.com:9000,alikafka-host3.aliyuncs.com:9000',
        'topic' = 'users',
        'format' = 'json',
        'json.infer-schema.flatten-nested-columns.enable' = 'true', -- Automatically expand nested columns. 
        'scan.startup.mode' = 'earliest-offset'
      );
      
      CREATE TABLE IF NOT EXISTS holo.flink_test_db.sync_kafka_users
      WITH (
        'connector' = 'hologres'
      ) AS TABLE kafka_users;
      Note

      To prevent duplicate data from being written to Hologres after a deployment failover, you can add the related primary key to the table to uniquely identify data. If data is retransmitted, Hologres ensures that only one copy of data that has the same values of the partition and offset fields is retained.

    • Use the INSERT INTO statement

      A special method is used to optimize JSON and JSONB data in Hologres. Therefore, you can use the INSERT INTO statement to write nested JSON data to Hologres.

      If you use the INSERT INTO statement to synchronize data, you must manually create a table in Hologres and configure the types of the columns to which data is written as JSON or JSONB. Then, you can execute the INSERT INTO statement to write the address data to the column of the JSON type in Hologres.

      CREATE TEMPORARY TABLE kafka_users (
        `id` INT NOT NULL,
        'address' STRING, -- The data in this column is nested JSON data. 
        `offset` BIGINT NOT NULL METADATA,
        `partition` BIGINT NOT NULL METADATA,
        `timestamp` TIMESTAMP METADATA,
        `date` AS CAST(`timestamp` AS DATE),
        `country` AS JSON_VALUE(`address`, '$.country')
      ) WITH (
        'connector' = 'kafka',
        'properties.bootstrap.servers' = 'alikafka-host1.aliyuncs.com:9000,alikafka-host2.aliyuncs.com:9000,alikafka-host3.aliyuncs.com:9000',
        'topic' = 'users',
        'format' = 'json',
        'json.infer-schema.flatten-nested-columns.enable' = 'true', -- Automatically expand nested columns. 
        'scan.startup.mode' = 'earliest-offset'
      );
      
      CREATE TEMPORARY TABLE holo (
        `id` INT NOT NULL,
        `address` STRING,
        `offset` BIGINT,
        `partition` BIGINT,
        `timestamp` TIMESTAMP,
        `date` DATE,
        `country` STRING
      ) WITH (
        'connector' = 'hologres',
        'endpoint' = 'hgpostcn-****-cn-beijing-vpc.hologres.aliyuncs.com:80',
        'username' = 'LTAI5tE572UJ44Xwhx6i****',
        'password' = 'KtyIXK3HIDKA9VzKX4tpct9xTm****',
        'dbname' = 'flink_test_db',
        'tablename' = 'sync_kafka_users'
      );
      
      INSERT INTO holo
      SELECT * FROM kafka_users;
  3. The following table describes the parameters that you can configure in the code.

    Parameter

    Example

    Description

    properties.bootstrap.servers

    alikafka-host1.aliyuncs.com:9000,alikafka-host2.aliyuncs.com:9000,alikafka-host3.aliyuncs.com:9000

    The IP addresses or endpoints of Kafka brokers.

    Format: host:port,host:port,host:port. Separate multiple host:port pairs with commas (,).

    topic

    users

    The name of the Kafka topic.

    endpoint

    hgpostcn-****-cn-beijing-vpc.hologres.aliyuncs.com:80

    The endpoint of the Hologres instance.

    Format: <ip>:<port>.

    username

    LTAI5tE572UJ44Xwhx6i****

    The username that is used to access the Hologres database. You must enter the AccessKey ID of your Alibaba Cloud account.

    password

    KtyIXK3HIDKA9VzKX4tpct9xTm****

    The password that is used to access the Hologres database. You must enter the AccessKey secret of your Alibaba Cloud account.

    dbname

    flink_test_db

    The name of the Hologres database that you want to access.

    tablename

    sync_kafka_users

    The name of the Hologres table.

    Note
    • If you use the INSERT INTO statement to synchronize data, you must create the sync_kafka_users table and define required fields in the database of the destination instance.

    • If the public schema is not used, you must set tablename to schema.tableName.

  4. Click Save.

  5. In the left-side navigation pane, choose Development > ETL. On the page that appears, click Deploy.

  6. In the left-side navigation pane, choose O&M > Deployments. On the Deployments page, find the deployment that you want to manage and click Start in the Actions column. For more information about the parameters that you must configure when you start your deployment, see Start a deployment.

  7. In the Start Job dialog box, click Start.

    You can view the status and information of the deployment on the Deployments page after the deployment is started.image

Step 5: View the result of full data synchronization

  1. Log on to the Hologres console.

  2. On the Instances page, click the name of the instance that you want to manage.

  3. In the upper-right corner of the page, click Connect to Instance.

  4. On the Metadata Management tab, view the schema and data of the sync_kafka_users table to which data is synchronized in the users database.

    sync_kafka_users表

    The following figures show the schema and data of the sync_kafka_users table after full data synchronization.

    • Table schema

      Double-click the name of the sync_kafka_users table to view the table schema.

      表结构

      Note

      During data synchronization, we recommend that you declare the partition and offset fields of Kafka as the primary key for the Hologres table. This way, if data is retransmitted due to a deployment failover, only one copy of the data that has the same values of the partition and offset fields is stored.

    • Table data

      In the upper-right corner of the page for the sync_kafka_users table, click Query table. In the SQL editor, enter the following statement and click Run.

      SELECT * FROM public.sync_kafka_users order by partition, "offset";

      The following figure shows the data of the sync_kafka_users table.表数据

Step 6: Check whether table schema changes are automatically synchronized

  1. Manually send a message that contains a new column in the ApsaraMQ for Kafka console.

    1. Log on to the ApsaraMQ for Kafka console.

    2. On the Instances page, click the name of the instance that you want to manage.

    3. In the left-side navigation pane of the page that appears, click Topics. On the page that appears, find the topic named users.

    4. Choose More > Quick Start in the Actions column.

    5. In the Start to Send and Consume Message panel, configure the parameters and enter the content of the test message.

      消息内容

      Parameter

      Example

      Method of Sending

      Select Console.

      Message Key

      Enter flinktest.

      Message Content

      Copy and paste the following JSON content to the Message Content field.

      {
        "id": 100001,
        "first_name": "Dennise",
        "last_name": "Schuppe",
        "address": {
          "country": "Isle of Man",
          "state": "Montana",
          "city": "East Coleburgh"
        },
        "house-points": {
          "house": "Pukwudgie",
          "points": 76
        }
      }
      Note

      In this example, house-points is a new nested column.

      Send to Specified Partition

      Select Yes.

      Partition ID

      Enter 0.

    6. Click OK.

  2. In the Hologres console, view the changes in the schema and data of the sync_kafka_users table.

    1. Log on to the Hologres console.

    2. On the Instances page, click the name of the instance that you want to manage.

    3. In the upper-right corner of the page, click Connect to Instance.

    4. On the Metadata Management tab, double-click the name of the sync_kafka_users table.

    5. In the upper-right corner of the page for the sync_kafka_users table, click Query table. In the SQL editor, enter the following statement and click Run.

      SELECT * FROM public.sync_kafka_users order by partition, "offset";
    6. View the data of the table.

      The following figure shows the data of the sync_kafka_users table.Hologres表结果

      The figure shows that the data record whose id is 100001 is written to Hologres. In addition, the house-points.house and house-points.points columns are added to Hologres.

      Note

      Only data in the nested column house-points is included in the data that is inserted into the table of ApsaraMQ for Kafka. However, json.infer-schema.flatten-nested-columns.enable is declared in the parameters of the WITH clause for the kafka_users table. In this case, Realtime Compute for Apache Flink automatically expands the new nested column. After the column is expanded, the path to access the column is used as the name of the column.

(Optional) Step 7: Adjust deployment resource configurations

To ensure optimal deployment performance, we recommend that you adjust the parallelism of deployments and resource configurations of different nodes based on the amount of data that needs to be processed. To adjust the parallelism of deployments and the number of CUs in a simple manner, use the basic resource configuration mode. To adjust the parallelism of deployments and resource configurations of nodes in a more fine-grained manner, use the expert resource configuration mode.

  1. Log on to the development console of Realtime Compute for Apache Flink and go to the deployment details page.

    1. Log on to the management console of Realtime Compute for Apache Flink.

    2. Find the workspace that you want to manage and click Console in the Actions column.

    3. In the left-side navigation pane of the development console of Realtime Compute for Apache Flink, choose O&M > Deployments.

  2. Modify resource configurations.

    1. On the Configuration tab, click Edit in the upper-right corner of the Resources section and select Expert for the Mode parameter.

    2. In the Resource Plan section, click Get Plan Now.

    3. Move the pointer over More and click Expand All.

      You can view the complete topology to learn the data synchronization plan of the deployment. The plan shows the tables that need to be synchronized.

    4. Manually configure PARALLELISM for each operator.

      The table in the users topic of ApsaraMQ for Kafka has four partitions. Therefore, you can set the PARALLELISM parameter for ApsaraMQ for Kafka to 4. Log data is written to only one Hologres table. To reduce the number of connections to Hologres, you can set the PARALLELISM parameter for Hologres to 2. For more information about how to configure resource parameters, see Configure a deployment. The following figure shows the resource configuration plan of the deployment after the parallelism is adjusted.作业配置计划

    5. Click OK.

    6. Configure the parameters in the Basic section. On the Deployments page, find the desired deployment and click Start in the Actions column. For more information about the parameters that you must configure when you start your deployment, see Start a deployment.

  3. In the left-side navigation pane, choose O&M > Deployments. On the Deployments page, click the name of the deployment that you want to manage.

  4. On the Status tab, view the effect after resource reconfiguration of the deployment.

    image.png

References