The Python Script component provided by Machine Learning Designer allows you to install custom dependencies and invoke custom Python functions. This topic describes how to configure the Python Script component and provides examples of how to use it.
Background information
The Python Script component is placed in the UserDefinedScript folder on the left-side pane of a pipeline details page. To open a pipeline details page, go to the Visualized Modeling (Designer) page in the Platform for AI (PAI) console, select the pipeline that you want to use, and click Open.
Prerequisites
The permissions required to use Deep Learning Containers (DLC) are granted. For more information, see Grant the permissions that are required to use DLC.
The DLC computing resources on which the Python Script component depends are associated with the PAI workspace that you want to use. For more information, see Manage workspaces.
An Object Storage Service (OSS) bucket is created to store code for the Python Script component. For more information, see Create buckets.
Important: The OSS bucket must be created in the same region as Machine Learning Designer and DLC.
The Resource Access Management (RAM) user who manages the Python Script component is assigned the Algorithm Developer role in the workspace. For more information, see Manage members of the workspace. If the RAM user wants to use MaxCompute as a data source, you also need to assign the MaxCompute Developer role to the RAM user.
Configure the component in the PAI console
Input ports
The Python Script component has four input ports that can be used to receive OSS data and MaxCompute table data.
Input ports for OSS data
OSS data from upstream components can be mounted to the Python Script component. The system passes the path of the mounted data as a command-line argument. No manual operations are required. For example, the following command specifies the path of the OSS data that is read by the first OSS input port:
python main.py --input1 /ml/input/data/input1
The files mounted in the /ml/input/data/input1 path can be read in the same way as on-premises files.
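For example, the following minimal sketch lists and reads the files that are mounted to the first input port. The file name data.csv is a hypothetical example and depends on what the upstream component produces:
import argparse
import os

parser = argparse.ArgumentParser()
parser.add_argument("--input1", type=str, default=None)
args, _ = parser.parse_known_args()

# List the files that the upstream component mounted to input port 1.
for name in os.listdir(args.input1):
    print(os.path.join(args.input1, name))

# Read one of the mounted files as a regular local file (hypothetical file name).
with open(os.path.join(args.input1, "data.csv")) as f:
    print(f.read())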
Input ports for MaxCompute tables
MaxCompute tables cannot be directly mounted to the component. Instead, the system converts the table metadata into a Uniform Resource Identifier (URI) and passes the URI to the component as a command-line argument. No manual operations are required. For example, the following command specifies the URI of the MaxCompute table that is read by the first MaxCompute input port:
python main.py --input1 odps://some-project-name/tables/table
You can use the parse_odps_url function from the code template of this component to parse the URI and obtain metadata such as the project name, table name, and partitions. For more information, see the Examples section in this topic.
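The following is a minimal sketch of parsing such a URI with the Python standard library; it mirrors the logic of the parse_odps_url helper shown in the Examples section:
from urllib import parse

table_uri = "odps://some-project-name/tables/table"  # as passed in --input1
parsed = parse.urlparse(table_uri)
project_name = parsed.hostname              # "some-project-name"
parts = parsed.path.split("/", 3)           # ["", "tables", "table"] plus a partition part, if any
table_name = parts[2]                       # "table"
partition = parts[3] if len(parts) > 3 else None
print(project_name, table_name, partition)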
Output ports
The Python Script component has four output ports. OSS Output Port 1 and OSS Output Port 2 are used to export OSS data. Table Output Port 1 and Table Output Port 2 are used to export MaxCompute tables.
Output ports for OSS data
The path specified by the Job Output Path parameter on the Code Config tab is automatically mapped to the /ml/output/ path. OSS Output Port 1 and OSS Output Port 2 correspond to the /ml/output/output1 and /ml/output/output2 paths, respectively. Files can be written to these paths in the same way as on-premises files before they are passed to downstream components.
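For example, the following minimal sketch writes a result file to OSS Output Port 1. The file name metrics.txt is a hypothetical example:
import argparse
import os

parser = argparse.ArgumentParser()
parser.add_argument("--output1", type=str, default=None)
args, _ = parser.parse_known_args()

# Files written under /ml/output/output1 are persisted to the mapped OSS path.
os.makedirs(args.output1, exist_ok=True)
with open(os.path.join(args.output1, "metrics.txt"), "w") as f:  # hypothetical file name
    f.write("accuracy=0.9\n")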
Output ports for MaxCompute tables
If MaxCompute projects are associated with the PAI workspace, a temporary table URI is passed as a command-line argument in the following format:
python main.py --output3 odps://<some-project-name>/tables/<output-table-name>
You can use PyODPS to create a temporary table that corresponds to the URI, write the data that is processed by the component to the table, and pass the table to downstream components. For more information, see the Examples section in this topic.
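The following is a minimal sketch, assuming the init_odps and parse_odps_url helpers from the code template in the Examples section and the parsed command-line arguments (args); the schema and rows are hypothetical:
o = init_odps()  # helper from the Examples section
project_name, table_name, partition = parse_odps_url(args.output3)  # args from parse_args()

# Create the temporary output table that corresponds to the URI and write rows to it.
table = o.create_table(table_name, "id bigint, score double", if_not_exists=True)  # hypothetical schema
with table.open_writer() as writer:
    writer.write([[1, 0.88], [2, 0.12]])  # hypothetical rows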
Component parameters
Code Config
Parameter
Description
Job Output Path
The OSS path to which data is exported. This OSS path is mapped to the /ml/output/ path. Data written to the /ml/output/ path is persisted to the mapped OSS path. OSS Output Port 1 and OSS Output Port 2 correspond to the /ml/output/output1 and /ml/output/output2 paths, respectively. After downstream components are connected to these output ports, they can read data from the mapped paths.
Code Source
The source of the code. Select one of the following options from the drop-down list:
Literal Code
Python Code: the OSS path to store the script that you write in the code editor. By default, the script is saved as an object named main.py.
Important: Before you click Save, make sure that the OSS path that you want to use to store the script does not contain an object that has the same name as the current object. Otherwise, the existing object is overwritten.
Code editor: the Python code editor where sample code is provided by default. For more information, see the Examples section in this topic. You can write code in the code editor.
Specify Git Configuration
Git Repository Address: the address of the Git repository.
Code branch: the branch where the code is stored. Default value: master.
Code Commit: the ID of the code commit that you want to use. This parameter takes precedence over the Code branch parameter. If you specify this parameter, the Code branch parameter is ignored.
Git Username: the Git username. This parameter is required if you want to access a private code repository.
Git Access Token: the access token to the Git repository. This parameter is required if you want to access a private code repository. For more information, see Obtain a GitHub account token.
Select Code Source
Select Code Source Repositories: the code build that you created. For more information, see Code builds.
Code branch: the branch where the code is stored. Default value: master.
Code Commit: the ID of the code commit that you want to use. This parameter takes precedence over the Code branch parameter. If you specify this parameter, the Code branch parameter is ignored.
Select OSS Path
In the OSS Code Path field, you can select the path where the code is stored.
Command
Enter the command that you want to run. Example:
python main.py
Note: The system automatically generates the command based on the script name and the connected ports. No manual operations are required.
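For example, if the first MaxCompute input port and Table Output Port 1 are connected, the generated command might look like the following. The project and table names are placeholders:
python main.py --input1 odps://<project-name>/tables/<input-table-name> --output3 odps://<project-name>/tables/<output-table-name>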
Advanced Option
Third Dependency: the third-party dependencies that you want to install. Specify the dependencies in the same format as a Python requirements.txt file. The following code provides an example. The dependencies are automatically installed before the component runs.
cycler==0.10.0          # via matplotlib
kiwisolver==1.2.0       # via matplotlib
matplotlib==3.2.1
numpy==1.18.5
pandas==1.0.4
pyparsing==2.4.7        # via matplotlib
python-dateutil==2.8.1  # via matplotlib, pandas
pytz==2020.1            # via pandas
scipy==1.4.1            # via seaborn
Whether to enable container monitoring: If you select this option, you can enter parameter configurations in the Error Monitoring Arguments field.
Run Config
Parameter
Description
ResourceGroup
Public Resource Group is supported.
If you select Public Resource Group, set the InstanceType parameter to CPU or GPU and specify the CPU or GPU specifications. Default value: ecs.c6.large.
By default, the resource group that is used by DLC resources of the current workspace is selected.
VPC Settings
You can select an existing virtual private cloud (VPC).
Security Group
You can select an existing security group.
Advanced Option
If you select this parameter, you can configure the following parameters:
Instance Count: the number of instances that you want to create. Specify a value for this parameter based on your business requirements. Default value: 1.
Job Image URI: the URI of the job image that you want to use. By default, open source XGBoost 1.6.0 is used. If you want to use deep learning frameworks, you must change the image.
Job Type: the job type. You need to modify this parameter only if the script is executed in a distributed manner. Valid values:
XGBoost/LightGBM Job
TensorFlow Job
PyTorch Job
MPI Job
Examples
Parse the default sample code
By default, the Python Script component provides the following sample code:
import os
import argparse
import json

"""
Sample code for the Python Script component
"""

# MaxCompute is used in this workspace. The name and endpoint of the MaxCompute project are required.
# To run the code, make sure that a MaxCompute project is associated with the workspace.
# Example: {"endpoint": "http://service.cn.maxcompute.aliyun-inc.com/api", "odpsProject": "lq_test_mc_project"}.
ENV_JOB_MAX_COMPUTE_EXECUTION = "JOB_MAX_COMPUTE_EXECUTION"


def init_odps():
    from odps import ODPS

    # Information about the default MaxCompute project that is associated with the workspace.
    mc_execution = json.loads(os.environ[ENV_JOB_MAX_COMPUTE_EXECUTION])
    o = ODPS(
        access_id="<YourAccessKeyId>",
        secret_access_key="<YourAccessKeySecret>",
        # Use the endpoint of the region in which the MaxCompute project resides.
        # Example: http://service.cn-shanghai.maxcompute.aliyun-inc.com/api.
        endpoint=mc_execution["endpoint"],
        project=mc_execution["odpsProject"],
    )
    return o


def parse_odps_url(table_uri):
    from urllib import parse

    parsed = parse.urlparse(table_uri)
    project_name = parsed.hostname
    # Split into at most four parts so that the partition, if any, is kept in r[3].
    r = parsed.path.split("/", 3)
    table_name = r[2]
    if len(r) > 3:
        partition = r[3]
    else:
        partition = None
    return project_name, table_name, partition


def parse_args():
    parser = argparse.ArgumentParser(description="PythonV2 component script example.")
    parser.add_argument("--input1", type=str, default=None, help="Component input port 1.")
    parser.add_argument("--input2", type=str, default=None, help="Component input port 2.")
    parser.add_argument("--input3", type=str, default=None, help="Component input port 3.")
    parser.add_argument("--input4", type=str, default=None, help="Component input port 4.")
    parser.add_argument("--output1", type=str, default=None, help="Output OSS port 1.")
    parser.add_argument("--output2", type=str, default=None, help="Output OSS port 2.")
    parser.add_argument("--output3", type=str, default=None, help="Output MaxComputeTable 1.")
    parser.add_argument("--output4", type=str, default=None, help="Output MaxComputeTable 2.")
    args, _ = parser.parse_known_args()
    return args


def write_table_example(args):
    # Example: Execute an SQL statement to copy the public table data provided by PAI
    # and write the data to the temporary table of Table Output Port 1.
    output_table_uri = args.output3
    o = init_odps()
    project_name, table_name, partition = parse_odps_url(output_table_uri)
    o.run_sql(f"create table {project_name}.{table_name} as select * from pai_online_project.heart_disease_prediction;")


def write_output1(args):
    # Example: Write data to the subpath of OSS Output Port 1 so that it can be passed
    # to the downstream components that are connected to the port.
    output_path = args.output1
    os.makedirs(output_path, exist_ok=True)
    p = os.path.join(output_path, "result.text")
    with open(p, "w") as f:
        f.write("TestAccuracy=0.88")


if __name__ == "__main__":
    args = parse_args()
    print("Input1={}".format(args.input1))
    print("Output1={}".format(args.output1))
    # write_table_example(args)
    # write_output1(args)
The preceding code includes the following commonly used functions:
init_odps(): initializes a MaxCompute instance that is used to read MaxCompute table data. To initialize the instance, you must enter your AccessKey ID and AccessKey secret. For more information about how to obtain an AccessKey pair, see Create an AccessKey pair.
parse_odps_url(table_uri): parses a MaxCompute table URI and returns the project name, table name, and partition (see the sketch after this list). The URI is in the odps://${your_projectname}/tables/${table_name}/${pt_1}/${pt_2}/ format. Example: odps://test/tables/iris/pa=1/pb=1. In this example, pa=1/pb=1 is a multi-level partition.
parse_args(): parses the arguments that are passed to the script. The arguments specify the input and output data of the script.
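For example, calling parse_odps_url on the preceding multi-level partition URI returns the following values:
project_name, table_name, partition = parse_odps_url("odps://test/tables/iris/pa=1/pb=1")
print(project_name)  # test
print(table_name)    # iris
print(partition)     # pa=1/pb=1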
Example 1: Use Python Script with other components
This example uses the heart disease prediction template to show how to use the Python Script component together with other components. To configure a pipeline, perform the following steps:
Create a pipeline based on the heart disease prediction template and open the pipeline. For more information, see Predict heart disease.
Drag the Python Script component to the canvas, rename the component SMOTE, and then enter the following code.
Important: The imblearn library is not included in the image that is used in this example. You must specify the imblearn library in the Third Dependency field on the Code Config tab. The library is automatically installed before the component runs.
import argparse
import json
import os

from odps.df import DataFrame
from imblearn.over_sampling import SMOTE
from urllib import parse
from odps import ODPS

ENV_JOB_MAX_COMPUTE_EXECUTION = "JOB_MAX_COMPUTE_EXECUTION"


def init_odps():
    # Information about the default MaxCompute project that is associated with the workspace.
    mc_execution = json.loads(os.environ[ENV_JOB_MAX_COMPUTE_EXECUTION])
    o = ODPS(
        access_id="<Replace the value with your AccessKey ID>",
        secret_access_key="<Replace the value with your AccessKey secret>",
        # Use the endpoint of the region in which the MaxCompute project resides.
        # Example: http://service.cn-shanghai.maxcompute.aliyun-inc.com/api.
        endpoint=mc_execution["endpoint"],
        project=mc_execution["odpsProject"],
    )
    return o


def get_max_compute_table(table_uri, odps):
    # Parse the table URI that is passed through the port and return the table object.
    parsed = parse.urlparse(table_uri)
    project_name = parsed.hostname
    table_name = parsed.path.split('/')[2]
    table = odps.get_table(project_name + "." + table_name)
    return table


def run():
    parser = argparse.ArgumentParser(description='PythonV2 component script example.')
    parser.add_argument('--input1', type=str, default=None, help='Component input port 1.')
    parser.add_argument('--output3', type=str, default=None, help='Output MaxCompute table port 1.')
    args, _ = parser.parse_known_args()
    print('Input1={}'.format(args.input1))
    print('output3={}'.format(args.output3))

    o = init_odps()
    imbalanced_table = get_max_compute_table(args.input1, o)
    df = DataFrame(imbalanced_table).to_pandas()

    # Oversample the minority class with SMOTE to balance the dataset.
    sm = SMOTE(random_state=2)
    X_train_res, y_train_res = sm.fit_resample(df, df['ifhealth'].ravel())

    # Write the resampled data to the temporary table of Table Output Port 1.
    new_table = o.create_table(
        get_max_compute_table(args.output3, o).name,
        imbalanced_table.schema,
        if_not_exists=True,
    )
    with new_table.open_writer() as writer:
        writer.write(X_train_res.values.tolist())


if __name__ == '__main__':
    run()
Replace access_id and secret_access_key with your AccessKey ID and AccessKey secret. For more information about how to obtain an AccessKey pair, see Obtain an AccessKey pair.
Connect the SMOTE component as a downstream component of the Split component. The SMOTE component then uses the SMOTE algorithm to oversample the class that contains only a small number of samples in the split dataset and generates new samples to handle class imbalance.
To use the generated samples for training, connect the Logistic Regression for Binary Classification component as a downstream component of the SMOTE component.
Compare the models that are generated from the left and right branches by connecting the Confusion Matrix and Binary Classification Evaluation components as downstream components at the end of the two branches. After you run the pipeline, click the icon to view the evaluation results.
The evaluation results show that oversampling does not significantly improve model performance in this case. This indicates that the original sample distribution and model already perform well.
Example 2: Use Python Script to orchestrate DLC jobs
In Machine Learning Designer, you can connect multiple Python Script components to orchestrate and schedule a pipeline of DLC jobs. For example, you can start four DLC jobs in the sequence shown in the following directed acyclic graph (DAG).
If the DLC job code does not need to read data from upstream components or pass data to downstream components, the connections between the components indicate only the dependencies among the components and the order in which they run.
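In that case, each Python Script component can run a self-contained script that ignores the port arguments. The following minimal sketch stands in for one such job step; the job logic is a hypothetical placeholder:
import argparse

parser = argparse.ArgumentParser()
# The system still passes port arguments; parse_known_args ignores the ones that are not declared.
args, _ = parser.parse_known_args()

# Hypothetical job step. Replace with the actual logic of your DLC job.
print("Job step finished.")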
You can deploy the entire pipeline in Machine Learning Designer to DataWorks to schedule the pipeline as a periodic task. For more information, see Use DataWorks tasks to schedule pipelines in Machine Learning Designer.
Example 3: Pass global variables to Python Script
Configure global variables.
On the pipeline details page in Machine Learning Designer, click the blank area on the canvas and configure global variables on the Global Variables tab in the right-side pane.
Use one of the following methods to pass the configured global variables to the Python Script component:
Click the Python Script component. Select Advanced Option on the Code Config tab and pass the global variables in the Command field.
Modify the Python code to use argparse to parse the parameters.
The following sample code shows how to parse the global variables that you configure in Step 1. Modify the code based on the actual global variables that you configure. After you modify the code, paste it into the code editor on the Code Config tab.
import os
import argparse
import json

"""
Sample code for the Python Script component
"""

ENV_JOB_MAX_COMPUTE_EXECUTION = "JOB_MAX_COMPUTE_EXECUTION"


def init_odps():
    from odps import ODPS

    mc_execution = json.loads(os.environ[ENV_JOB_MAX_COMPUTE_EXECUTION])
    o = ODPS(
        access_id="<YourAccessKeyId>",
        secret_access_key="<YourAccessKeySecret>",
        endpoint=mc_execution["endpoint"],
        project=mc_execution["odpsProject"],
    )
    return o


def parse_odps_url(table_uri):
    from urllib import parse

    parsed = parse.urlparse(table_uri)
    project_name = parsed.hostname
    # Split into at most four parts so that the partition, if any, is kept in r[3].
    r = parsed.path.split("/", 3)
    table_name = r[2]
    if len(r) > 3:
        partition = r[3]
    else:
        partition = None
    return project_name, table_name, partition


def parse_args():
    parser = argparse.ArgumentParser(description="PythonV2 component script example.")
    parser.add_argument("--input1", type=str, default=None, help="Component input port 1.")
    parser.add_argument("--input2", type=str, default=None, help="Component input port 2.")
    parser.add_argument("--input3", type=str, default=None, help="Component input port 3.")
    parser.add_argument("--input4", type=str, default=None, help="Component input port 4.")
    parser.add_argument("--output1", type=str, default=None, help="Output OSS port 1.")
    parser.add_argument("--output2", type=str, default=None, help="Output OSS port 2.")
    parser.add_argument("--output3", type=str, default=None, help="Output MaxComputeTable 1.")
    parser.add_argument("--output4", type=str, default=None, help="Output MaxComputeTable 2.")
    # Add arguments based on the configured global variables.
    parser.add_argument("--arg1", type=str, default=None, help="Argument 1.")
    parser.add_argument("--arg2", type=int, default=None, help="Argument 2.")
    args, _ = parser.parse_known_args()
    return args


def write_table_example(args):
    output_table_uri = args.output3
    o = init_odps()
    project_name, table_name, partition = parse_odps_url(output_table_uri)
    o.run_sql(f"create table {project_name}.{table_name} as select * from pai_online_project.heart_disease_prediction;")


def write_output1(args):
    output_path = args.output1
    os.makedirs(output_path, exist_ok=True)
    p = os.path.join(output_path, "result.text")
    with open(p, "w") as f:
        f.write("TestAccuracy=0.88")


if __name__ == "__main__":
    args = parse_args()
    print("Input1={}".format(args.input1))
    print("Output1={}".format(args.output1))
    # Print the values of the configured global variables.
    print("Argument1={}".format(args.arg1))
    print("Argument2={}".format(args.arg2))
    # write_table_example(args)
    # write_output1(args)