DataWorks: Configure DataHub Reader

Last Updated: Feb 14, 2025

DataHub Reader reads data from DataHub in real time by using the DataHub SDK.

Background information

After DataHub Reader is started, it keeps running and reads data from DataHub whenever new data is written to DataHub. DataHub Reader provides the following features:

  • Reads data in real time.

  • Reads data in parallel based on the number of shards in DataHub.
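The idea of reading in parallel based on the shard count can be illustrated with a minimal sketch. It does not use the DataHub SDK: the topic is a hypothetical in-memory dictionary that maps shard IDs to records, and the names read_shard and read_topic_in_parallel are illustrative assumptions. It only shows one worker being started per shard.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical in-memory stand-in for a topic: shard ID -> records in that shard.
# This is NOT the DataHub SDK; it only models the "one reader per shard" idea.
topic_shards = {
    "0": ["a1", "a2"],
    "1": ["b1"],
    "2": ["c1", "c2", "c3"],
}


def read_shard(shards, shard_id):
    """Return the records of one shard, tagged with the shard ID, in shard order."""
    return [(shard_id, record) for record in shards[shard_id]]


def read_topic_in_parallel(shards):
    """Start one worker per shard, so the degree of parallelism equals the shard count."""
    shard_ids = sorted(shards)
    with ThreadPoolExecutor(max_workers=len(shard_ids)) as pool:
        # map() yields results in the order of shard_ids, whichever worker finishes first.
        per_shard = list(pool.map(lambda sid: read_shard(shards, sid), shard_ids))
    return [item for batch in per_shard for item in batch]


records = read_topic_in_parallel(topic_shards)
```

In this sketch, a topic with more shards gets more workers. Order is preserved within each shard but is not meaningful across shards.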

Procedure

  1. Go to the DataStudio page.

    Log on to the DataWorks console. In the top navigation bar, select the desired region. In the left-side navigation pane, choose Data Development and O&M > Data Development. On the page that appears, select the desired workspace from the drop-down list and click Go to Data Development.

  2. In the Scheduled Workflow pane of the DataStudio page, move the pointer over the Create icon and choose Create Node > Data Integration > Real-time Synchronization.

    Alternatively, find the desired workflow in the Scheduled Workflow pane, right-click the workflow name, and then choose Create Node > Data Integration > Real-time Synchronization.

  3. In the Create Node dialog box, set the Sync Method parameter to End-to-end ETL and configure the Name and Path parameters.

    Important

    The node name cannot exceed 128 characters in length and can contain only letters, digits, underscores (_), and periods (.).

  4. Click Confirm.

  5. On the configuration tab of the real-time synchronization node, drag DataHub in the Input section to the canvas on the right.

  6. Click the DataHub node. In the configuration panel that appears, configure the parameters.

    Parameter: Data source
    Description: The name of the DataHub data source that you added to DataWorks. You can select only a DataHub data source. If no data source is available, click New data source on the right to go to the Data Sources page in Management Center and add a DataHub data source. For more information, see Add a DataHub data source.

    Parameter: Topic
    Description: The name of the DataHub topic from which you want to synchronize data. You can click Data preview on the right to preview the data in the selected topic.

    Parameter: Use Subscription Feature
    Description: If you turn on Use Subscription Feature, the system automatically generates a subscription ID and subscribes to data in DataHub based on that ID. This improves stability and performance. We recommend that you do not delete a subscription ID that is in use from DataHub. If you delete it, the related task fails.

    Parameter: Output Fields
    Description: The fields whose data you want to synchronize.

  7. In the top toolbar of the configuration tab of the real-time synchronization node, click the Save icon to save the node.
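The node naming rule in step 3 (at most 128 characters; only letters, digits, underscores, and periods) can be checked before you create the node. The following is a minimal sketch, assuming that "letters" means ASCII letters; the console may accept a wider character set. The helper name is_valid_node_name is hypothetical and is not part of DataWorks.

```python
import re

# Assumption: "letters" is interpreted as ASCII letters A-Z and a-z.
# Allowed characters: letters, digits, underscores (_), and periods (.).
# Length: 1 to 128 characters.
NODE_NAME_PATTERN = re.compile(r"[A-Za-z0-9_.]{1,128}")


def is_valid_node_name(name: str) -> bool:
    """Return True if the whole name matches the naming rule sketched above."""
    return NODE_NAME_PATTERN.fullmatch(name) is not None
```

For example, under this assumption is_valid_node_name("datahub_reader.v1") returns True, and a name that contains a space or exceeds 128 characters is rejected.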