If a specific development environment, such as a third-party dependency, is required for tasks that are run on a serverless resource group, you can use the image management feature to create a custom image that integrates required development packages and dependencies. Then, you can specify resources in the serverless resource group as the execution resources for running the tasks and the custom image as the runtime environment.
Prerequisites
A serverless resource group is created. The image management feature must be used together with a serverless resource group. For more information about serverless resource groups, see Create and use a serverless resource group.
Optional. The virtual private cloud (VPC) with which the serverless resource group is associated has access to the Internet. This prerequisite is required if the environment in which you want to run tasks depends on a third-party package that is deployed over the Internet. For more information, see Use the SNAT feature of an Internet NAT gateway to access the Internet.
The AliyunDataWorksFullAccess policy or a policy that contains the ModifyResourceGroup permission is attached to the account that you want to use. For more information, see Manage permissions on the DataWorks services and the entities in the DataWorks console by using RAM policies.
Limits
The image management feature can be used only together with a serverless resource group.
Note: If a third-party package is required when you use an old-version exclusive resource group for scheduling to run PyODPS nodes, you can use the O&M Assistant feature to install the third-party package. For more information, see Use an exclusive resource group for scheduling to configure a third-party open source package.
The maximum number of custom images that can be created varies based on the DataWorks edition.
DataWorks Basic Edition and Standard Edition: 10
DataWorks Professional Edition: 50
DataWorks Enterprise Edition: 100
Procedure
Step 1: View the official images of DataWorks
Log on to the DataWorks console. In the top navigation bar, select the desired region. Then, click Image Management in the left-side navigation pane.
On the DataWorks Official Images tab, view the official images of DataWorks. The following official images are supported:
dataworks_shell_task_pod: available for Shell tasks
dataworks_pyodps_task_pod: available for PyODPS 2 and PyODPS 3 tasks
dataworks_emr_datalake_5.15.1_task_pod: available for E-MapReduce (EMR) Spark, EMR Spark SQL, and EMR Shell tasks
Note: You can use this image to commit tasks in EMR DataLake clusters of V5.15.1.
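The relationship between official images and task types can be written as a small lookup, which may help when you script checks around image selection. The following is a minimal sketch that uses only the image names and task types listed above. The helper name base_image_for is hypothetical and is not part of any DataWorks API.

```python
# Official images and the task types they support, as listed in this topic.
OFFICIAL_IMAGES = {
    "dataworks_shell_task_pod": ["Shell"],
    "dataworks_pyodps_task_pod": ["PyODPS 2", "PyODPS 3"],
    "dataworks_emr_datalake_5.15.1_task_pod": [
        "EMR Spark",
        "EMR Spark SQL",
        "EMR Shell",
    ],
}


def base_image_for(task_type):
    """Return the official image that supports task_type (hypothetical helper)."""
    for image, task_types in OFFICIAL_IMAGES.items():
        if task_type in task_types:
            return image
    raise ValueError(f"No official image listed for task type: {task_type!r}")


print(base_image_for("PyODPS 3"))
```

A task type that is not in the list raises ValueError, so an unsupported choice fails early.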
Step 2: Create a custom image
The official images serve as pre-configured base images to provide a standardized runtime environment for tasks of specific types. Custom images provide enhanced functionality and flexibility on the basis of the official images. You can expand the base images based on your actual application scenarios to achieve on-demand feature customization. This helps improve the execution efficiency and flexibility of data processing tasks.
Use one of the following methods to go to the entry point for creating a custom image:
On the DataWorks Official Images tab, find the basic official image based on which you want to create a custom image and click Create Custom Image in the Actions column.
On the Custom Images tab, click Create Image.
In the Create Image panel, configure the following parameters.
Image Name: The name of the custom image.
Image Description: The description of the custom image.
Reference Type: The value of this parameter is fixed to DataWorks Official Image, which indicates that you can create custom images based only on DataWorks official images.
Image Namespace: The value of this parameter is fixed to DataWorks Default.
Image Repository: The value of this parameter is fixed to DataWorks Default.
Image Name/ID: The DataWorks official image based on which you want to create the custom image. Valid values:
    dataworks_shell_task_pod
    dataworks_pyodps_task_pod
    dataworks_emr_datalake_5.15.1_task_pod
Visible Scope: The scope in which the custom image is visible. Valid values: Visible Only to Creator and Visible to all.
Module: The service to which the custom image can be applied. This parameter can only be set to DataStudio.
Supported Task Type: The task types to which the custom image can be applied. The available task types depend on the selected official image:
    dataworks_shell_task_pod: available for Shell tasks
    dataworks_pyodps_task_pod: available for PyODPS 2 and PyODPS 3 tasks
    dataworks_emr_datalake_5.15.1_task_pod: available for EMR Spark, EMR Spark SQL, and EMR Shell tasks
Installation Package: The third-party package that you want to install. You can select one of the following methods to install a third-party package:
    Quick installation: Select Python2, Python3, or Yum from the Installation Package drop-down list and then select a desired environment or resource.
    Manual input: Select Script from the Installation Package drop-down list. Then, write commands in the command box to install a desired third-party package. You can run one of the following commands to install a third-party package:
        pip install xx (for Python 2)
        /home/tops/bin/pip3 install 'urllib3<2.0' (for Python 3)
        yum install -y git
        wget git
Click OK.
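If you use the Script method for many packages, it can help to generate the install commands rather than type them by hand, because version specifiers such as urllib3<2.0 must be quoted for the shell. The following is a minimal sketch, assuming that the pip and /home/tops/bin/pip3 commands shown in the preceding examples are the right entry points in your image. The function names are hypothetical.

```python
import shlex

# pip entry points taken from the example commands in this topic.
PIP_COMMANDS = {"python2": "pip", "python3": "/home/tops/bin/pip3"}


def pip_install_lines(requirements, python="python3"):
    """Build one pip install command per requirement, quoting each specifier."""
    pip = PIP_COMMANDS[python]
    return [f"{pip} install {shlex.quote(req)}" for req in requirements]


def yum_install_line(packages):
    """Build a single non-interactive yum command for system packages."""
    return "yum install -y " + " ".join(shlex.quote(p) for p in packages)


# Assemble a script body that can be pasted into the command box.
script = "\n".join(
    pip_install_lines(["urllib3<2.0", "jieba"]) + [yum_install_line(["git"])]
)
print(script)
```

shlex.quote adds quotation marks only where the shell needs them, so plain package names stay unquoted and specifiers that contain characters such as < are wrapped in single quotation marks.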
Step 3: Publish the custom image
On the Custom Images tab, find the created custom image.
Click Publish in the Actions column.
In the Publish Image panel, configure the Test Resource Group parameter and click Test to the right of Test Result.
Note: Select a serverless resource group for Test Resource Group.
After the test succeeds, click Publish.
Only images that pass the test can be published.
If the custom image installs a third-party package from the Internet and the image does not pass the test after a long period of time, check whether the VPC with which the selected test resource group is associated can access the Internet. If the VPC cannot access the Internet, enable Internet access for the VPC. For more information, see Use the SNAT feature of an Internet NAT gateway to access the Internet.
If images fail to pass the test, you can perform the following operations to modify image configurations: Find a desired custom image, move the pointer over the icon in the Actions column, and then select Modify.
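To check Internet access before you rerun a long test, you can try a plain TCP connection to the package source from an environment in the same VPC. The following is a minimal sketch that uses only the Python standard library. The function name can_reach is hypothetical, and where you run the check, for example in a PyODPS 3 node on the test resource group, is an assumption that you should adapt to your setup.

```python
import socket


def can_reach(host, port=443, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures, refused connections, and timeouts.
        return False
```

For example, can_reach("pypi.org") returns False when the VPC has no route to the Internet. The host pypi.org is only an example; use the host from which your packages are downloaded.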
Step 4: Associate the custom image with a workspace
On the Custom Images tab, find the custom image that is published.
Move the pointer over the icon in the Actions column and select Change Workspace to associate the custom image with a workspace.
Step 5: Use the custom image
Go to the DataStudio page.
Log on to the DataWorks console. In the top navigation bar, select the desired region. In the left-side navigation pane, choose . On the page that appears, select the desired workspace from the drop-down list and click Go to Data Development.
Find a desired node on the DataStudio page and double-click the node name to go to the configuration tab of the node. Click Properties in the right-side navigation pane and configure parameters in the Resource Group section.
Resource Group: Select a serverless resource group.
Note: To ensure smooth running of the node, we recommend that you set the Resource Group parameter to the test resource group that you selected when you published the custom image.
Image: Select a custom image that is published and associated with the current workspace.
Save and commit the node.
Note: The image that is selected in DataStudio cannot be synchronized to the production environment. You must follow the instructions that are described in Deploy nodes to deploy the node to allow the image to take effect in the production environment.
Sample configuration
Scenario
You want to segment a column of data in a MaxCompute table on a node and store the segmentation result in another table for a descendant node to use.
In this case, you can pre-install the segmentation tool Jieba in a custom image and run a PyODPS task that uses the custom image to segment data in the MaxCompute table and store the segmentation result in another table. This way, the descendant node can use the segmentation result directly.
Procedure
Create test data.
Add a MaxCompute data source to DataWorks, and associate the MaxCompute data source with DataStudio. For more information about how to add a MaxCompute data source, see Add a MaxCompute data source.
In DataStudio, create an ODPS node, create a test table, and then add test data to the table.
Note: In the following example, a scheduling parameter is used. On the Properties tab in the right-side navigation pane of the configuration tab of the node, add a parameter whose name is bday and value is $[yyyymmdd] in the Scheduling Parameter section.
-- Create a test table.
CREATE TABLE IF NOT EXISTS custom_img_test_tb (
    c_customer_id BIGINT NOT NULL,
    c_customer_text STRING NOT NULL,
    PRIMARY KEY (c_customer_id)
)
COMMENT 'TABLE COMMENT'
PARTITIONED BY (ds STRING COMMENT 'Partition')
LIFECYCLE 90;

-- Insert test data into the test table.
INSERT INTO custom_img_test_tb PARTITION (ds='${bday}') (c_customer_id, c_customer_text) VALUES
    (1, '晚来天欲雪,能饮一杯无? '),
    (2, '月落乌啼霜满天,江枫渔火对愁眠。 '),
    (3, '山重水复疑无路,柳暗花明又一村。 '),
    (4, '春眠不觉晓,处处闻啼鸟。 '),
    (5, '静夜思,床前明月光,疑是地上霜。 '),
    (6, '海上生明月,天涯共此时。 '),
    (7, '旧时王谢堂前燕,飞入寻常百姓家。 '),
    (8, '一行白鹭上青天,窗含西岭千秋雪。 '),
    (9, '人生得意须尽欢,莫使金樽空对月。 '),
    (10, '天生我材必有用,千金散尽还复来。 ');
Save and deploy the node.
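The bday parameter decides which ds partition the test data lands in. The following is a minimal sketch of that relationship, assuming that $[yyyymmdd] resolves to the scheduled date of the node instance in yyyymmdd form. The function names are hypothetical, and the actual value is substituted by DataWorks at run time.

```python
from datetime import datetime


def resolve_bday(scheduled_time):
    """Format a scheduled time the way $[yyyymmdd] is assumed to resolve."""
    return scheduled_time.strftime("%Y%m%d")


def partition_spec(bday):
    """Build the partition spec string for the ds partition."""
    return f"ds={bday}"


bday = resolve_bday(datetime(2024, 1, 5, 0, 30))
print(partition_spec(bday))  # ds=20240105
```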
Create a custom image.
Follow the instructions that are described in Step 2: Create a custom image in this topic to create a custom image. Settings of key parameters:
Image Name/ID: Select dataworks_pyodps_task_pod.
Supported Task Type: Select PyODPS 3.
Installation Package: Select Python3 and jieba.
Publish the custom image and associate the custom image with a workspace. For more information, see the Step 3: Publish the custom image and Step 4: Associate the custom image with a workspace sections in this topic.
Use the custom image in a scheduling task.
In DataStudio, create and configure a PyODPS 3 node.
import jieba
from odps.models import TableSchema as Schema, Column, Partition

# In a DataWorks PyODPS node, the MaxCompute entry object `o` and the
# scheduling parameter dictionary `args` are available as global variables.

# Read data from the test table.
table = o.get_table('custom_img_test_tb')
partition_spec = f"ds={args['bday']}"
with table.open_reader(partition=partition_spec) as reader:
    records = [record for record in reader]

# Segment the extracted text.
participles = [' | '.join(jieba.cut(record['c_customer_text'])) for record in records]

# Create a destination table.
if not o.exist_table("participle_tb"):
    schema = Schema(
        columns=[Column(name='word_segment', type='string', comment='Segmentation result')],
        partitions=[Partition(name='ds', type='string', comment='Partition field')],
    )
    o.create_table("participle_tb", schema)

# Define an output partition and an output table.
output_partition = f"ds={args['bday']}"
output_table = o.get_table("participle_tb")

# If the partition does not exist, create a partition first.
if not output_table.exist_partition(output_partition):
    output_table.create_partition(output_partition)

# Write the segmentation result to the output table.
with output_table.open_writer(partition=output_partition, create_partition=True) as writer:
    for participle in participles:
        record = output_table.new_record()
        record['word_segment'] = participle
        writer.write(record)
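The segmentation step in the preceding code can be tried locally before you run the node, without MaxCompute or the custom image. The following is a minimal sketch in which the tokenizer is passed in as a function, so a stand-in can replace jieba.cut. The names segment_rows and whitespace_cut are hypothetical.

```python
def segment_rows(texts, cut):
    """Join the tokens that cut(text) yields with ' | ', one result per text."""
    return [" | ".join(cut(text)) for text in texts]


def whitespace_cut(text):
    """Stand-in tokenizer for a local dry run; the node uses jieba.cut."""
    return text.split()


print(segment_rows(["hello big world", "data works"], whitespace_cut))
# ['hello | big | world', 'data | works']
```

In the node, segment_rows(texts, jieba.cut) would give the same kind of list that the preceding code stores in participles.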
On the Properties tab in the right-side navigation pane of the configuration tab of the node, configure the following key settings:
Add a scheduling parameter whose name is bday and value is $[yyyymmdd] in the Scheduling Parameter section.
Select a serverless resource group, which is the test resource group that you used when you published the custom image, as a resource group for scheduling.
Select the custom image that is published and associated with the current workspace.
Save and run the node with parameters configured.
Optional. Create an ad hoc query and execute the following SQL statement to check whether the output table contains data:
SELECT * FROM participle_tb WHERE ds=<Partition date>;
Deploy the PyODPS node to the production environment.
Note: The image that is selected in DataStudio cannot be synchronized to the production environment. You must follow the instructions that are described in Deploy nodes to deploy the node to allow the image to take effect in the production environment.
References
When you use a custom image, you must select a serverless resource group as a resource group for scheduling. For more information about serverless resource groups, see Create and use a serverless resource group.
For more information about PyODPS, see Overview.
For more information about parameter settings in DataStudio, see Overview.