Image management

Updated at: 2025-02-19 18:06

Image management in DataWorks allows users to create and manage custom runtime environments for task execution. This interface enables the creation of custom images that incorporate necessary development packages and dependencies tailored to specific execution environments. For instance, custom images can be used to install third-party dependencies essential for running PyODPS tasks. This topic describes the process for creating custom images using image management features.

Background information

By default, DataWorks uses the Default standard image when executing tasks and selects the most suitable image based on the task type. Official images are pre-configured base images that provide a standardized runtime environment for each task type. Custom images build on these base images, so you can add the packages and dependencies that your tasks require. DataWorks supports three methods for creating custom images: from a DataWorks official image, from an Alibaba Cloud ACR image, or from a personal development environment instance.

Instructions

  • The image management feature is only available in conjunction with a serverless resource group.

    Note

    If you are operating PyODPS task nodes with a legacy exclusive resource group for scheduling and depend on third-party packages, you can utilize the maintenance assistant. For more information, see how to configure third-party packages for exclusive resource groups for scheduling (not recommended).

  • The maximum number of custom images that can be created depends on the DataWorks edition:

    • Basic and Standard Editions: 10.

    • Professional Edition: 50.

    • Enterprise Edition: 100.

  • Only the Professional Edition and higher support the image building feature.

  • A maximum of two images can be built simultaneously in each region.

  • If EMR-type tasks wait a long time after you select the Default standard image, the image for an older EMR cluster version may not have been initialized. To resolve this, submit a ticket.
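
The limits above can be collected into a small lookup for pre-flight checks in automation scripts. The following is a minimal sketch: IMAGE_QUOTA, can_create_image, and can_build_image are hypothetical names, not part of any DataWorks SDK, and the edition strings are assumed labels.

```python
# Hypothetical helpers that encode the limits listed above.
# They are not part of any DataWorks SDK.

# Maximum number of custom images per DataWorks edition.
IMAGE_QUOTA = {
    "Basic": 10,
    "Standard": 10,
    "Professional": 50,
    "Enterprise": 100,
}

# Editions that support the image building feature.
BUILD_EDITIONS = {"Professional", "Enterprise"}

# Maximum number of images that can be built at the same time in one region.
MAX_CONCURRENT_BUILDS = 2


def can_create_image(edition: str, existing_images: int) -> bool:
    """Return True if one more custom image fits within the edition quota."""
    return existing_images < IMAGE_QUOTA[edition]


def can_build_image(edition: str, builds_in_progress: int) -> bool:
    """Return True if the edition supports builds and a build slot is free."""
    return edition in BUILD_EDITIONS and builds_in_progress < MAX_CONCURRENT_BUILDS


print(can_create_image("Standard", 10))    # False: the quota of 10 is used up
print(can_build_image("Professional", 1))  # True: one of two build slots is free
```

The current image count and the number of running builds are inputs that you would need to obtain yourself, for example from the Image Management page.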

Prerequisites

Step 1: Access image management

  1. Log on to the DataWorks console.

  2. Access the image management page.

    In the left-side navigation pane, click Image Management to access the image management page.

Step 2: Create a custom image

DataWorks supports the creation of custom images using either DataWorks Official Images or Alibaba Cloud ACR Images as the base. The following describes the configuration parameters for different base image types:

Method 1: Create directly based on DataWorks official images

  1. Configure the custom image parameters:

    • Image Name: The name of the custom image.

    • Image Description: The description of the custom image.

    • Reference Data Type: Select DataWorks Official Images.

    • Image Namespace: Fixed as DataWorks Default.

    • Image Repository: Fixed as DataWorks Default.

    • Image Name/ID: The DataWorks official image to use as the base. Supported options:

      • dataworks_shell_task_pod

      • dataworks_pyodps_task_pod

      • dataworks_emr_datalake_5.15.1_task_pod

      • dataworks_pyodps_py311_task_pod

      • dataworks_python_task_pod

      • dataworks_pairec_task_pod

    • Visibility: The visibility of the custom image. Valid values: Visible To Creator Only and Visible To All.

    • Sub-product Usage: Custom images can currently be used only in Data Development.

    • Supported Task Types: The task types depend on the selected official image:

      • DataWorks Shell node official image: Supports the Shell task type.

      • DataWorks PyODPS node official image: Supports the PyODPS 2 and PyODPS 3 task types.

      • DataWorks EMR datalake 5.15.1 version official image: Supports the EMR Spark, EMR Spark SQL, and EMR SHELL task types.

    • Installation Package: Add the third-party packages that you need. The following methods are supported:

      • Quick installation: In the Installation Package drop-down list, select Python2, Python3, or Yum, and then select the packages to install.

      • Manual input: In the Installation Package drop-down list, select Script, and then enter the installation commands in the Script command box. Example commands:

        • pip example command: pip install xx, supports Python2.

        • pip3 example command: /home/tops/bin/pip3 install 'urllib3<2.0', supports Python3.

        • yum example command: yum install -y git.

        • wget example command: wget git.

  2. Click OK.
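
Once a task node runs on the custom image, it can be useful to confirm that the packages added through Installation Package are importable before the task logic starts. The following is a minimal sketch that uses only the Python 3 standard library. missing_packages is a hypothetical helper and the module names are placeholders.

```python
import importlib.util


def missing_packages(module_names):
    """Return the top-level module names that cannot be imported in this environment."""
    return [name for name in module_names if importlib.util.find_spec(name) is None]


# Placeholder module names: replace them with the modules that your task imports.
required = ["json", "urllib3"]
missing = missing_packages(required)
if missing:
    print("Missing modules:", ", ".join(missing))
else:
    print("All required modules are importable.")
```

The check works on top-level module names, which can differ from the package name passed to pip.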

Method 2: Create based on Alibaba Cloud ACR images

Conditions

  • Custom images can be created only from Alibaba Cloud ACR Enterprise Edition instances.

  • Only one VPC can be selected for DataWorks to access the ACR instance.

  • The ACR image can be up to 5 GB in size.

  1. Configure the custom image parameters:

    • Image Name: The name of the custom image.

    • Image Description: The description of the custom image.

    • Reference Data Type: Select Alibaba Cloud ACR Images.

    • Image Instance ID: Select an Enterprise Edition instance created in Alibaba Cloud Container Registry by its instance ID. For more information about creating instances, see Create an Enterprise Edition instance.

    • Image Namespace: Select a namespace in the selected instance. For more information about creating namespaces, see Create a namespace.

    • Image Repository: Select an image repository in the selected instance. For more information about creating image repositories, see Create an image repository.

    • Image Version: Select the version of the image in the selected image repository to use for the custom image.

    • Associated VPC: Select the VPC bound to the image instance. For more information about configuring VPC networks, see Configure access control for virtual private clouds.

    • Visibility: The visibility of the custom image. Valid values: Visible To Creator Only and Visible To All.

    • Sub-product Usage: Custom images can currently be used only in Data Development.

    • Supported Task Types:

      • Shell

      • Python

      • Notebook: To run Notebook tasks in DataWorks with an ACR image, build your ACR image on the Notebook base image provided by DataWorks so that it provides a runtime environment for Notebook tasks. DataWorks provides the Notebook base image: dataworks-notebook:py3.11-ubuntu22.04:py3.11-ubuntu22.04-20241202

      Note
      • To use custom images created from Alibaba Cloud ACR images for Python tasks, confirm that the ACR image contains a Python environment. Otherwise, Python tasks cannot run.

      • To use custom images created from Alibaba Cloud ACR images for Notebook tasks, ensure that the environment used to build the image can access the public network so that it can pull the Notebook base image provided by DataWorks.

  2. Click OK.
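
The conditions for this method can be checked before you fill in the form. The following is a minimal sketch under stated assumptions: acr_image_problems is a hypothetical helper, "5 GB" is interpreted as 5 × 1024³ bytes, and the edition label and VPC ID are placeholder values that you would look up yourself.

```python
# Hypothetical pre-check for the conditions listed in this method.
# Assumption: "5 GB" is interpreted as 5 * 1024**3 bytes.
MAX_IMAGE_BYTES = 5 * 1024**3


def acr_image_problems(instance_edition, image_size_bytes, vpc_ids):
    """Return a list of problems; an empty list means all checks pass."""
    problems = []
    if instance_edition != "Enterprise":
        problems.append("The ACR instance must be an Enterprise Edition instance.")
    if image_size_bytes > MAX_IMAGE_BYTES:
        problems.append("The image exceeds the 5 GB size limit.")
    if len(vpc_ids) != 1:
        problems.append("Exactly one VPC must be selected to access the ACR instance.")
    return problems


# Placeholder values for illustration only.
print(acr_image_problems("Enterprise", 2 * 1024**3, ["vpc-example"]))  # []
```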

Method 3: Create based on personal development environment instances

Data Studio's new data development feature allows you to create a new image from a personal development environment. For more information, see Create an image from a personal development environment.

Step 3: Publish a custom image

  1. On the Custom Images tab, locate the custom image you created.

  2. Click Publish in the Actions column.

  3. Select the Test Resource Group and click Test next to Test Results.

    Note

    Choose a serverless resource group as the test resource group.

  4. Once the test is successful, click Publish.

Note
  • Only images that pass the test can be published.

  • If your custom image retrieves third-party packages from the public network and consistently fails tests, verify that the VPC associated with the Test Resource Group has the capability to access the public network. For details on configuring public network access for VPCs, see Access the Internet using the SNAT feature of the public NAT Gateway.

  • If the test fails, click the icon in the Operation column of the custom image and select Edit to modify the image configuration.

Step 4: Modify the image ownership space

  1. On the Custom Images tab, locate the published custom image.

  2. In the Operation column of the image, click the icon and select Modify Associated Workspace to associate the custom image with a workspace.

Step 5: Build a permanent image

After you complete Step 3, the custom image is ready for use in most business scenarios. However, each time a task node runs, DataWorks redeploys the image environment and downloads the third-party packages again, which can increase node runtime and incur additional compute and traffic costs. To avoid this, you can build the custom image into a permanent image. A permanent image provides the same runtime environment for every task node execution, which saves time and reduces costs.

Note

Building permanent images is only supported for custom images created from official images.

Follow these steps:

  1. Log on to the DataWorks console, switch to the appropriate region, and click Image Management in the left-side navigation pane.

  2. On the Custom Images tab, locate the published custom image.

  3. In the Operation column of the image, click the icon and select Build to start building a permanent image.

  4. In the Select The Resource Group For Building The Image dialog box, select the resource group for image building, then click Continue.

    Note
    • Image building typically takes about 5 to 10 minutes, but the exact time may vary depending on the image size.

    • Building an image will result in computing charges calculated at 0.5 CU × the duration of the build. For more information, see the description of data computing billing.

    • To prevent build failures due to network issues or other reasons, ensure that the Resource Group For Building The Image is the same as the Test Resource Group selected in Step 3: Publish a Custom Image.
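
As a rough worked example of the billing note above, the compute usage of a build is 0.5 CU multiplied by the build duration. The sketch below assumes that the duration is measured in hours so that the result is in CU-hours. The actual billing unit and price are defined in the data computing billing description, and build_cu_hours is a hypothetical helper.

```python
BUILD_CU = 0.5  # CUs consumed while an image is being built (from the note above)


def build_cu_hours(build_minutes: float) -> float:
    """Return the compute usage of one build in CU-hours (assumed billing unit)."""
    return BUILD_CU * build_minutes / 60


# Typical build times from the note above: about 5 to 10 minutes.
for minutes in (5, 10):
    print(f"{minutes} min build -> {build_cu_hours(minutes):.4f} CU-hours")
```

Under this assumption, a typical build of 5 to 10 minutes consumes roughly 0.04 to 0.08 CU-hours.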

What to do next: Use the image

  1. Go to the DataStudio page.

    Log on to the DataWorks console. In the top navigation bar, select the desired region. In the left-side navigation pane, choose Data Development and O&M > Data Development. On the page that appears, select the desired workspace from the drop-down list and click Go to Data Development.

  2. In data development, locate the task node that will use the custom image, click Scheduling Configuration on the right, and set the resource properties:

    • Scheduling Resource Group: Choose a serverless resource group.

      Note
      • To ensure smooth task node operation, the Scheduling Resource Group should match the Test Resource Group used during Image Publishing.

      • If the desired resource group is not listed, check if it's associated with the current workspace. Visit the Resource Group List page, locate the resource group, and click Bind Workspace in the Actions column to bind it.

    • Image: Select the published image.

  3. Save and submit the changes.

    Note

    Modifications made to the image in data development will not automatically synchronize to the production environment. You must publish the task to apply the changes in production.

Example: Use images to perform Chinese word segmentation through PyODPS nodes

If you need to segment Chinese text within a column of a MaxCompute table and store the results in another table for downstream scheduling nodes, you can install the jieba segmentation toolkit in a custom image. Then, use this image to process the segmentation of the Chinese text via PyODPS tasks and save the outcomes in a new table, ensuring smooth integration with the downstream scheduling flow.

  1. Create test data.

    1. Create a MaxCompute data source and bind it within DataWorks data development. For more information, see Create a MaxCompute data source.

    2. Create an ODPS node in data development, establish a test table, and insert test data.

      Note

      The example below utilizes scheduling parameters. Set the parameter name to bday and the value to $[yyyymmdd] in the Scheduling Configuration on the right.

      Create a test table.

      -- Create a test table
      CREATE TABLE IF NOT EXISTS custom_img_test_tb
      (
          c_customer_id BIGINT NOT NULL,
          c_customer_text STRING NOT NULL,
          PRIMARY KEY (c_customer_id)
      )
      COMMENT 'TABLE COMMENT'
      PARTITIONED BY (ds STRING COMMENT 'Partition')
      LIFECYCLE 90;
      
      -- Insert test data into the test table
      INSERT INTO custom_img_test_tb PARTITION (ds='${bday}') (c_customer_id, c_customer_text) VALUES
      (1, '晚来天欲雪,能饮一杯无?'),
      (2, '月落乌啼霜满天,江枫渔火对愁眠。'),
      (3, '山重水复疑无路,柳暗花明又一村。'),
      (4, '春眠不觉晓,处处闻啼鸟。'),
      (5, '静夜思,床前明月光,疑是地上霜。'),
      (6, '海上生明月,天涯共此时。'),
      (7, '旧时王谢堂前燕,飞入寻常百姓家。'),
      (8, '一行白鹭上青天,窗含西岭千秋雪。'),
      (9, '人生得意须尽欢,莫使金樽空对月。'),
      (10, '天生我材必有用,千金散尽还复来。');
    3. Save and publish.

  2. Create a custom image.

    Refer to Step 2: Create a custom image. Key parameters include the following:

    • Image name/ID: Choose dataworks_pyodps_task_pod, the official DataWorks PyODPS node image.

    • Supported task types: Select PyODPS 3.

    • Installation package: Choose Python3 and jieba.

  3. Publish the custom image and associate it with your workspace. For more information, see Step 3: Publish a custom image and Step 4: Modify the image ownership space.

  4. Use the custom image in a scheduling task.

    1. Create and configure a PyODPS 3 node in data development with the following details:

      Use the custom image.

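      import jieba
      from odps.models import TableSchema as Schema, Column

      # `o` (the MaxCompute entry object) and `args` (the scheduling parameters)
      # are provided by the DataWorks PyODPS node runtime and need no import.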
      
      # Read data from the test table
      table = o.get_table('custom_img_test_tb')
      partition_spec = f"ds={args['bday']}"
      with table.open_reader(partition=partition_spec) as reader:
          records = [record for record in reader]
      
      # Segment the extracted text
      participles = [' | '.join(jieba.cut(record['c_customer_text'])) for record in records]
      
      # Create a destination table
      if not o.exist_table("participle_tb"):
          schema = Schema(columns=[Column(name='word_segment', type='string', comment='Segmentation result')], partitions=[Column(name='ds', type='string', comment='Partition field')])
          o.create_table("participle_tb", schema)
      
      # Write the segmentation result to the destination table
      # Define an output partition and an output table
      output_partition = f"ds={args['bday']}"
      output_table = o.get_table("participle_tb")
      
      # If the partition does not exist, create a partition first
      if not output_table.exist_partition(output_partition):
          output_table.create_partition(output_partition)
      
      # Write the segmentation result to the output table
      record = output_table.new_record()
      with output_table.open_writer(partition=output_partition, create_partition=True) as writer:
          for participle in participles:
              record['word_segment'] = participle
              writer.write(record)
    2. On the Properties tab, configure the following key settings:

      • Scheduling parameters: Name bday, value $[yyyymmdd].

      • Scheduling Resource Group: Choose a serverless resource group, the same as the Test Resource Group used when Publishing The Image.

      • Image: Select the published custom image bound to the current workspace.

    3. Save, configure parameters, and run the node.

    4. (Optional) Execute the following SQL statement in an ad hoc query to verify the output table contains data.

      SELECT * FROM participle_tb WHERE ds=<partition date>;

    5. Deploy the PyODPS node to the production environment.

      Note

      The image updated in data development won't sync to the production environment. You must publish the task to apply changes in production.

  5. Build the custom image into a permanent image. For more information, see Step 5: Build a permanent image.
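
The string handling in the PyODPS node above can be tried outside DataWorks before you run the node. The following is a minimal sketch: partition_spec and join_segments are hypothetical helpers, the date is an example value, and a whitespace tokenizer stands in for jieba so that the snippet runs without third-party packages.

```python
from datetime import date


def partition_spec(day: date) -> str:
    """Build the partition spec string used by the node, for example ds=20250219."""
    return f"ds={day:%Y%m%d}"


def join_segments(texts, cut):
    """Apply the tokenizer `cut` to each text and join the tokens with ' | '."""
    return [" | ".join(cut(text)) for text in texts]


# Stand-in tokenizer that splits on whitespace, so the sketch runs without jieba.
print(partition_spec(date(2025, 2, 19)))          # ds=20250219
print(join_segments(["hello world"], str.split))  # ['hello | world']
```

In the node itself, you would pass jieba.cut as the cut argument and build the partition spec from the bday scheduling parameter, as the node code does.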

Appendix: View official images

  1. Log on to the DataWorks console, switch to the region where your DataWorks workspace is located, and click Image Management in the left-side navigation pane.

  2. View the official images available for DataWorks. The following official images are supported:

    • DataWorks Shell node official image: Supports Shell task types.

    • DataWorks PyODPS node official image: Supports PyODPS 2 and PyODPS 3 task types.

    • DataWorks EMR datalake 5.15.1 version official image: Supports EMR Spark, EMR Spark SQL, and EMR SHELL task types.

      Note

      You can use this image to submit tasks to EMR DataLake clusters of version 5.15.1.

    • DataWorks CDH node official image: Supports CDH Hive, CDH Spark, CDH Spark SQL, CDH MR, CDH Presto, and CDH Impala task types.

References

  • When using a custom image, you must select a serverless resource group for scheduling. For more information about serverless resource groups, see Create and use serverless resource groups.

  • To create a custom CDH task runtime environment when developing a custom image, see Develop tasks based on a self-built Hadoop cluster.

  • For additional details on PyODPS, see PyODPS.
