DataWorks: Manage images

Last Updated: Nov 15, 2024

If tasks that run on a serverless resource group require a specific development environment, such as third-party dependencies, you can use the image management feature to create a custom image that bundles the required packages and dependencies. You can then run the tasks on the serverless resource group and specify the custom image as the runtime environment.

Prerequisites

Limits

  • The image management feature can be used only together with a serverless resource group.

    Note

    If a third-party package is required when you use an old-version exclusive resource group for scheduling to run PyODPS nodes, you can use the O&M Assistant feature to install the third-party package. For more information, see Use an exclusive resource group for scheduling to configure a third-party open source package.

  • The maximum number of custom images that can be created varies based on the DataWorks edition.

    • DataWorks Basic Edition and Standard Edition: 10

    • DataWorks Professional Edition: 50

    • DataWorks Enterprise Edition: 100
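As a quick reference, the quotas above can be expressed as a simple lookup. This is only an illustrative sketch: the dictionary `MAX_CUSTOM_IMAGES` and the function `remaining_image_quota` are hypothetical helpers, not part of any DataWorks API.

```python
# Quotas copied from the list above, keyed by DataWorks edition name.
MAX_CUSTOM_IMAGES = {
    "Basic": 10,
    "Standard": 10,
    "Professional": 50,
    "Enterprise": 100,
}


def remaining_image_quota(edition: str, existing_images: int) -> int:
    """Return how many more custom images can be created (never negative)."""
    if edition not in MAX_CUSTOM_IMAGES:
        raise ValueError(f"Unknown DataWorks edition: {edition!r}")
    return max(MAX_CUSTOM_IMAGES[edition] - existing_images, 0)
```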

Procedure

Step 1: View the official images of DataWorks

  1. Log on to the DataWorks console. In the top navigation bar, select the desired region. Then, click Image Management in the left-side navigation pane.

  2. On the DataWorks Official Images tab, view the official images of DataWorks. The following official images are supported:

    • dataworks_shell_task_pod: available for Shell tasks

    • dataworks_pyodps_task_pod: available for PyODPS 2 and PyODPS 3 tasks

    • dataworks_emr_datalake_5.15.1_task_pod: available for E-MapReduce (EMR) Spark, EMR Spark SQL, and EMR Shell tasks

      Note

      You can use this image to commit tasks in EMR DataLake clusters of V5.15.1.
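The mapping between official images and task types listed above can be sketched as a lookup table, which may help when you decide which base image to extend. The names `OFFICIAL_IMAGES` and `base_image_for` are hypothetical and exist only for this illustration.

```python
# Task types supported by each official image, as listed above.
OFFICIAL_IMAGES = {
    "dataworks_shell_task_pod": ["Shell"],
    "dataworks_pyodps_task_pod": ["PyODPS 2", "PyODPS 3"],
    "dataworks_emr_datalake_5.15.1_task_pod": ["EMR Spark", "EMR Spark SQL", "EMR Shell"],
}


def base_image_for(task_type: str) -> str:
    """Return the official image that supports the given task type."""
    for image, task_types in OFFICIAL_IMAGES.items():
        if task_type in task_types:
            return image
    raise LookupError(f"No official image supports task type {task_type!r}")
```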

Step 2: Create a custom image

Official images are preconfigured base images that provide a standardized runtime environment for tasks of specific types. A custom image extends an official image with the packages and dependencies that your tasks require, so that you can tailor the runtime environment to your scenario.

  1. Use one of the following methods to go to the entry point for creating a custom image:

    • On the DataWorks Official Images tab, find the basic official image based on which you want to create a custom image and click Create Custom Image in the Actions column.

    • On the Custom Images tab, click Create Image.

  2. In the Create Image panel, configure the following parameters:

    • Image Name: The name of the custom image.

    • Image Description: The description of the custom image.

    • Reference Type: Fixed to DataWorks Official Image. You can create custom images based only on DataWorks official images.

    • Image Namespace: Fixed to DataWorks Default.

    • Image Repository: Fixed to DataWorks Default.

    • Image Name/ID: The DataWorks official image on which the custom image is based. Valid values:

      • dataworks_shell_task_pod

      • dataworks_pyodps_task_pod

      • dataworks_emr_datalake_5.15.1_task_pod

    • Visible Scope: The scope in which the custom image is visible. Valid values: Visible Only to Creator and Visible to all.

    • Module: The service to which the custom image can be applied. Only DataStudio is supported.

    • Supported Task Type: The task types that can use the custom image. The available task types depend on the selected official image:

      • dataworks_shell_task_pod: available for Shell tasks

      • dataworks_pyodps_task_pod: available for PyODPS 2 and PyODPS 3 tasks

      • dataworks_emr_datalake_5.15.1_task_pod: available for EMR Spark, EMR Spark SQL, and EMR Shell tasks

    • Installation Package: The third-party packages to install in the custom image. Use one of the following methods:

      • Quick installation: Select Python2, Python3, or Yum from the Installation Package drop-down list, and then select the package that you want to install.

      • Manual input: Select Script from the Installation Package drop-down list, and then enter installation commands in the command box. Examples:

        • pip install <package name> (Python 2)

        • /home/tops/bin/pip3 install 'urllib3<2.0' (Python 3)

        • yum install -y git

        • wget <URL of the file to download>

  3. Click OK.
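When you use the Script option, version specifiers such as `urllib3<2.0` must be quoted so that the shell does not treat `<` as a redirection. The following sketch builds the commands from the examples above with correct quoting. The `INSTALLERS` table and the `install_command` function are hypothetical helpers for illustration; only the command prefixes come from this topic.

```python
import shlex

# Command prefixes taken from the Script examples above.
INSTALLERS = {
    "python2": ["pip", "install"],
    "python3": ["/home/tops/bin/pip3", "install"],
    "yum": ["yum", "install", "-y"],
}


def install_command(installer: str, package: str) -> str:
    """Build a shell-safe install command to paste into the command box."""
    if installer not in INSTALLERS:
        raise ValueError(f"Unknown installer: {installer!r}")
    # shlex.join quotes any argument that contains shell metacharacters.
    return shlex.join(INSTALLERS[installer] + [package])
```

For example, `install_command("python3", "urllib3<2.0")` produces the quoted Python 3 command shown in the list above.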

Step 3: Publish the custom image

  1. On the Custom Images tab, find the created custom image.

  2. Click Publish in the Actions column.

  3. In the Publish Image panel, configure the Test Resource Group parameter and click Test to the right of Test Result.

    Note

    Select a serverless resource group for Test Resource Group.

  4. After the test succeeds, click Publish.

Note

  • Only images that pass the test can be published.

  • If the custom image installs a third-party package from the Internet and the test does not pass after a long period of time, check whether the VPC that is associated with the selected test resource group can access the Internet. If the VPC cannot access the Internet, enable Internet access for the VPC. For more information, see Use the SNAT feature of an Internet NAT gateway to access the Internet.

  • If an image fails the test, modify its configuration: find the custom image, move the pointer over the image icon in the Actions column, and then select Modify.
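To narrow down the Internet access issue described in the note above, a small connectivity probe can help. The following is a minimal sketch that uses only the Python standard library. Running it from a node on the same serverless resource group is an assumption about your setup, and `can_reach` is a hypothetical helper name.

```python
import socket


def can_reach(host: str, port: int = 443, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures, refused connections, and timeouts.
        return False
```

For example, `can_reach("pypi.org")` checks whether the default Python package index is reachable over HTTPS. Replace the host with the repository that your installation commands actually use.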

Step 4: Associate the custom image with a workspace

  1. On the Custom Images tab, find the custom image that is published.

  2. Move the pointer over the image icon in the Actions column and select Change Workspace to associate the custom image with a workspace.

Step 5: Use the custom image

  1. Go to the DataStudio page.

    Log on to the DataWorks console. In the top navigation bar, select the desired region. In the left-side navigation pane, choose Data Development and Governance > Data Development. On the page that appears, select the desired workspace from the drop-down list and click Go to Data Development.

  2. Find a desired node on the DataStudio page and double-click the node name to go to the configuration tab of the node. Click Properties in the right-side navigation pane and configure parameters in the Resource Group section.

    • Resource Group: Select a serverless resource group.

      Note

      To ensure smooth running of the node, we recommend that you set the Resource Group parameter to the test resource group that you selected when you published the custom image.

    • Image: Select a custom image that is published and associated with the current workspace.

  3. Save and commit the node.

    Note

    The image that is selected in DataStudio cannot be synchronized to the production environment. You must follow the instructions that are described in Deploy nodes to deploy the node to allow the image to take effect in the production environment.
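After you select the custom image for a node, you may want to confirm that the packages you installed are actually available at run time. The following sketch, which could be placed at the top of a PyODPS 3 node, reports top-level packages that cannot be imported. The function name `missing_packages` is a hypothetical helper, not a DataWorks feature.

```python
import importlib.util


def missing_packages(names):
    """Return the top-level package names that cannot be imported here."""
    return [name for name in names if importlib.util.find_spec(name) is None]
```

For example, `missing_packages(["jieba"])` returns an empty list if Jieba is installed in the image that the node uses.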

Sample configuration

Scenario

You want to perform word segmentation on a text column in a MaxCompute table and store the result in another table for a descendant node to use.

In this case, you can pre-install the word segmentation tool Jieba in a custom image, and then run a PyODPS task that uses the custom image to segment the data in the MaxCompute table and write the result to another table. The descendant node can then read the segmented data directly.
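The core transformation in this scenario is small: each text value is split into tokens, and the tokens are joined with a separator. The following sketch isolates that step so that it can be tried without MaxCompute. The tokenizer is passed in as a parameter. `segment_texts` and `whitespace_cut` are hypothetical helpers, and `whitespace_cut` is only a stand-in for environments where Jieba is not installed.

```python
from typing import Callable, Iterable, List


def segment_texts(texts: Iterable[str], cut: Callable[[str], Iterable[str]]) -> List[str]:
    """Segment each text with `cut` and join the tokens with ' | '."""
    return [" | ".join(cut(text)) for text in texts]


def whitespace_cut(text: str) -> List[str]:
    """Stand-in tokenizer that splits on whitespace (not suitable for Chinese)."""
    return text.split()
```

In a node that runs on the custom image, you would pass `jieba.cut` as the `cut` argument instead of the stand-in.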

Procedure

  1. Create test data.

    1. Add a MaxCompute data source to DataWorks, and associate the MaxCompute data source with DataStudio. For more information about how to add a MaxCompute data source, see Add a MaxCompute data source.

    2. In DataStudio, create an ODPS SQL node, and use the node to create a test table and add test data to the table.

      Note

      In the following example, a scheduling parameter is used. On the Properties tab in the right-side navigation pane of the configuration tab of the node, add a parameter whose name is bday and value is $[yyyymmdd] in the Scheduling Parameter section.

      -- Create a test table.
      CREATE TABLE IF NOT EXISTS custom_img_test_tb
      (
          c_customer_id BIGINT NOT NULL,
          c_customer_text STRING NOT NULL,
          PRIMARY KEY (c_customer_id)
      )
      COMMENT 'TABLE COMMENT'
      PARTITIONED BY (ds STRING COMMENT 'Partition')
      LIFECYCLE 90;
      
      -- Insert test data into the test table.
      INSERT INTO custom_img_test_tb PARTITION (ds='${bday}') (c_customer_id, c_customer_text) VALUES
      (1, '晚来天欲雪,能饮一杯无? '),
      (2, '月落乌啼霜满天,江枫渔火对愁眠。 '),
      (3, '山重水复疑无路,柳暗花明又一村。 '),
      (4, '春眠不觉晓,处处闻啼鸟。 '),
      (5, '静夜思,床前明月光,疑是地上霜。 '),
      (6, '海上生明月,天涯共此时。 '),
      (7, '旧时王谢堂前燕,飞入寻常百姓家。 '),
      (8, '一行白鹭上青天,窗含西岭千秋雪。 '),
      (9, '人生得意须尽欢,莫使金樽空对月。 '),
      (10, '天生我材必有用,千金散尽还复来。 ');

    3. Save and deploy the node.

  2. Create a custom image.

    Follow the instructions that are described in Step 2: Create a custom image in this topic to create a custom image. Configure the key parameters as follows:

    • Image Name/ID: Select dataworks_pyodps_task_pod.

    • Supported Task Type: Select PyODPS 3.

    • Installation Package: Select Python3 and jieba.

  3. Publish the custom image and associate the custom image with a workspace. For more information, see the Step 3: Publish the custom image and Step 4: Associate the custom image with a workspace sections in this topic.

  4. Use the custom image in a scheduling task.

    1. In DataStudio, create and configure a PyODPS 3 node.

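      import jieba
      from odps.models import TableSchema as Schema, Column, Partition

      # DataWorks provides two globals in PyODPS nodes: `o` (the MaxCompute entry object)
      # and `args` (a dict of the scheduling parameters configured for the node).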
      
      # Read data from the test table.
      table = o.get_table('custom_img_test_tb')
      partition_spec = f"ds={args['bday']}"
      with table.open_reader(partition=partition_spec) as reader:
          records = [record for record in reader]
      
      # Segment the extracted text.
      participles = [' | '.join(jieba.cut(record['c_customer_text'])) for record in records]
      
      # Create a destination table.
      if not o.exist_table("participle_tb"):
          schema = Schema(
              columns=[Column(name='word_segment', type='string', comment='Segmentation result')],
              partitions=[Partition(name='ds', type='string', comment='Partition field')],
          )
          o.create_table("participle_tb", schema)
      
      # Write the segmentation result to the destination table.
      # Define an output partition and an output table.
      output_partition = f"ds={args['bday']}"
      output_table = o.get_table("participle_tb")
      
      # If the partition does not exist, create a partition first.
      if not output_table.exist_partition(output_partition):
          output_table.create_partition(output_partition)
      
      # Write the segmentation result to the output table, one new record per row.
      with output_table.open_writer(partition=output_partition, create_partition=True) as writer:
          for participle in participles:
              record = output_table.new_record()
              record['word_segment'] = participle
              writer.write(record)

    2. On the Properties tab in the right-side navigation pane of the configuration tab of the node, configure the following key settings:

      • Add a scheduling parameter whose name is bday and value is $[yyyymmdd] in the Scheduling Parameter section.

      • Select a serverless resource group, which is the test resource group that you used when you published the custom image, as a resource group for scheduling.

      • Select the custom image that is published and associated with the current workspace.

    3. Save and run the node with parameters configured.

    4. Optional. Create an ad hoc query and execute the following SQL statement to check whether the output table contains data:

      SELECT * FROM participle_tb WHERE ds='<Partition date>';

    5. Deploy the PyODPS node to the production environment.

      Note

      The image that is selected in DataStudio cannot be synchronized to the production environment. You must follow the instructions that are described in Deploy nodes to deploy the node to allow the image to take effect in the production environment.
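Both nodes in the procedure above rely on the bday scheduling parameter to name the ds partition. When you test the PyODPS code manually, a malformed value leads to a partition that the descendant node cannot find. The following sketch validates the value before it is used. `partition_spec` is a hypothetical helper, and the eight-digit yyyymmdd format is taken from the parameter definition in this topic.

```python
from datetime import datetime


def partition_spec(bday: str, column: str = "ds") -> str:
    """Validate a yyyymmdd string and return a spec such as 'ds=20241115'."""
    if len(bday) != 8 or not bday.isdigit():
        raise ValueError(f"Expected an 8-digit yyyymmdd value, got {bday!r}")
    # strptime raises ValueError for impossible dates such as 20240230.
    datetime.strptime(bday, "%Y%m%d")
    return f"{column}={bday}"
```

In the PyODPS node, the result of `partition_spec(args['bday'])` could replace the f-strings that build `partition_spec` and `output_partition`.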

References

  • When you use a custom image, you must select a serverless resource group as a resource group for scheduling. For more information about serverless resource groups, see Create and use a serverless resource group.

  • For more information about PyODPS, see Overview.

  • For more information about parameter settings in DataStudio, see Overview.