Platform for AI: Submit a training job

Last Updated: Oct 31, 2024

PAI SDK for Python provides an easy-to-use high-level API. You can use the SDK to submit training jobs to Platform for AI (PAI) and run the jobs in the cloud. This topic describes how to prepare a training job script and use the SDK to submit a training job.

Billing

When you submit a training job, the job runs on Deep Learning Containers (DLC) resources and you are charged for the resources. For more information, see DLC billing.

Overview

You can use the Estimator class in the pai.estimator module of PAI SDK for Python to submit training jobs. To submit a training job, perform the following steps:

  • Create an Estimator instance to configure the training job, including the training job script, startup command, hyperparameters, image, and computing resources.

  • Use the Estimator.fit() method to specify the training data and submit the training job.

Sample code:

from pai.estimator import Estimator

# Create an Estimator instance to configure the training job. 
est = Estimator(
    command="<LaunchCommand>",
    source_dir="<SourceCodeDirectory>",
    image_uri="<TrainingImageUri>",
    instance_type="<TrainingInstanceType>",
    hyperparameters={
        "n_estimators": 500,
        "max_depth": 5,
    },
)

# Specify the training data and submit the training job. 
est.fit(
    inputs={
        "train_data": "oss://<YourOssBucket>/path/to/train/data/",
    }
)

# Obtain the path of the output model. 
print(est.model_data())

Prepare a training job script and required dependencies

  • Prepare a training job script

    You can create a training job script in the on-premises environment and submit the script to PAI. PAI configures the cloud environment and runs the script. Sample training job script:

    import argparse
    import os
    import json
    
    def train(hps, train_data, test_data):
        """Add your code for model training."""
        pass
    
    def save_model(model):
        """Save the output model."""
        # Obtain the path where the output model is to be saved by using the PAI_OUTPUT_MODEL environment variable. Default path: /ml/output/model/. 
        output_model_path = os.environ.get("PAI_OUTPUT_MODEL")
    
        # Write the output model to the obtained path. 
        pass
    
    def load_hyperparameters():
        """Read the hyperparameters."""
        # Obtain the path that contains the hyperparameters by using the PAI_CONFIG_DIR environment variable. Default path: /ml/input/config/. 
        hps_path = os.path.join(os.environ.get("PAI_CONFIG_DIR"), "hyperparameters.json")
        with open(hps_path, "r") as f:
            hyperparameters = json.load(f)
        return hyperparameters
    
    def run():
        #1. Load the hyperparameters. 
        hps = load_hyperparameters()
        print("Hyperparameters: ", hps)
    
        #2. Load the input data. 
        # The input data that you specify in the est.fit() method is stored in File Storage NAS (NAS) or Object Storage Service (OSS) and is mounted to the container. 
        # Obtain the path of the input data in the container by using the PAI_INPUT_{CHANNEL_NAME} environment variable. 
        train_data = os.environ.get("PAI_INPUT_TRAIN")
        test_data = os.environ.get("PAI_INPUT_TEST")
    
        #3. Train the model. 
        model = train(hps, train_data, test_data)
    
        #4. Save the output model to the specified path when the training is completed. 
        save_model(model)
    
    
    if __name__ == "__main__":
        run()

    A training job script must follow specific standards to load hyperparameters, load input data, and save the output model. The following section describes the requirements.

    • Load hyperparameters

      After you configure the hyperparameters parameter for the Estimator instance, a hyperparameter file named hyperparameters.json is generated in the path that is specified by the PAI_CONFIG_DIR environment variable. The default path is /ml/input/config/. In the training job script, you can obtain the hyperparameters by reading the {PAI_CONFIG_DIR}/hyperparameters.json file.

      For example, if you specify hyperparameters={"batch_size": 32, "learning_rate": 0.01} for the Estimator instance, the {PAI_CONFIG_DIR}/hyperparameters.json file contains the following content:

      {
        "batch_size": "32",
        "learning_rate": "0.01"
      }
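
Because the values in the generated file are strings, the training job script usually needs to convert them before use. The following is a minimal sketch of one way to do this, not part of the SDK: the helper name load_typed_hyperparameters, the optional config_dir argument, and the fallback to the default path are assumptions made for illustration.

```python
import json
import os


def load_typed_hyperparameters(config_dir=None):
    """Read hyperparameters.json and convert string values to Python types."""
    # Hypothetical helper: fall back to the documented default path
    # if PAI_CONFIG_DIR is not set.
    config_dir = config_dir or os.environ.get("PAI_CONFIG_DIR", "/ml/input/config/")
    with open(os.path.join(config_dir, "hyperparameters.json"), "r") as f:
        raw = json.load(f)

    typed = {}
    for name, value in raw.items():
        try:
            # Values that are JSON literals are converted: "32" becomes 32.
            typed[name] = json.loads(value)
        except (TypeError, ValueError):
            # Other values, such as "adam", are kept unchanged.
            typed[name] = value
    return typed
```

With the preceding file content, the helper returns batch_size as an integer and learning_rate as a float. If a hyperparameter must remain a string even though it looks like a number, handle it separately.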
      
    • Load input data

      You can use the inputs parameter of the Estimator.fit() method to specify the path of the input data. You must specify the path in key-value pair format, where the key is the name of the input data (also called ChannelName) and the value is the storage path of the input data. Sample code:

      estimator.fit(
          inputs={
              "train": "oss://<YourOssBucket>/train/data/train.csv",
              "test": "oss://<YourOssBucket>/test/data/",
          }
      )
      

      The input data is mounted to the /ml/input/data/{ChannelName} path. In the training job script, you can obtain the mount path of the input data by using the PAI_INPUT_{ChannelName} environment variable and then read the data in the same manner you read on-premises files. In the preceding example, you can obtain the mount paths of the input data by using the PAI_INPUT_TRAIN and PAI_INPUT_TEST environment variables.

    • Save the output model

      You need to save the output model to the required path to persist the model. To obtain the path where the model is to be saved, use the PAI_OUTPUT_MODEL environment variable. The default path is /ml/output/model.
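
The save_model() placeholder in the sample script can be filled in along the following lines. This is a sketch under stated assumptions: pickle and the model.pkl file name are chosen only for illustration, and in practice you would typically use the save function of your machine learning framework.

```python
import os
import pickle


def save_model(model, filename="model.pkl"):
    """Serialize the model into the directory given by PAI_OUTPUT_MODEL."""
    # Fall back to the documented default path if the variable is not set.
    output_dir = os.environ.get("PAI_OUTPUT_MODEL", "/ml/output/model/")
    # Create the directory in case it does not exist yet.
    os.makedirs(output_dir, exist_ok=True)
    model_path = os.path.join(output_dir, filename)
    # The file name and the pickle format are assumptions of this sketch.
    with open(model_path, "wb") as f:
        pickle.dump(model, f)
    return model_path
```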

  • Prepare required dependencies

    If your training job script requires Python packages that are not provided in the image that you use, add a requirements.txt file to the directory in which the training job script resides. The packages that are listed in the file are installed in the job environment before your script is run.

You need to store the training job script and related dependency files in a specific directory. For example, if the training job script and dependency files are located in the train_src directory that you created in the on-premises environment, specify source_dir="train_src" when you create the Estimator instance. The content of the train_src directory is then packaged and uploaded to PAI.

|-- train_src # The directory that contains the training job script. 
	|-- requirements.txt # The additional third-party dependencies of the training job script. 
	|-- train.py # The training job script. You can run the script by using the python train.py command.
	`-- utils.py

Obtain PAI images

To submit a training job, specify the image that you want to use to run the job. The image must contain the dependencies of the training job script, including the machine learning framework and third-party libraries. You can use a custom image from Container Registry (ACR) or a prebuilt image from PAI. PAI provides prebuilt images for common machine learning frameworks. You can use the pai.image.retrieve method to obtain PAI images.

Note

For information about the third-party Python libraries that are preinstalled in a PAI image, see Public images.

Sample code:

from pai.image import retrieve, list_images, ImageScope


# Obtain all PAI images for training with PyTorch. 
for image_info in list_images(framework_name="PyTorch"):
    print(image_info)

# Obtain the PAI image for training with TensorFlow 2.3 on CPUs. 
print(retrieve(framework_name="TensorFlow", framework_version="2.3"))

# Obtain the latest PAI image for training with TensorFlow on GPUs. 
# Specify framework_version="latest" to obtain the latest image. 
print(retrieve(framework_name="TensorFlow", framework_version="latest",
               accelerator_type="GPU"))

# Obtain the PAI image for training with PyTorch 1.12 on GPUs. 
print(retrieve(framework_name="PyTorch", framework_version="1.12",
               accelerator_type="GPU"))

Run a training job

Run a training job in PAI

To run a training job in PAI, create an Estimator instance to configure the training job and then call the Estimator.fit() method to submit the job. After you submit the job, the system prints the URL of the job details page and continues to print the job logs until the job status changes to successful, failed, or stopped. You can use the printed URL to view job execution details, job logs, resource usage, and training metrics in the PAI console.

By default, the Estimator.fit() method exits after the job is completed. You can use the estimator.model_data() method to obtain the OSS path of the output model.

Sample code:

from pai.estimator import Estimator
from pai.image import retrieve

# Obtain the PAI image for training with PyTorch 1.12. 
torch_image_uri = retrieve("PyTorch", framework_version="1.12").image_uri

est = Estimator(
    # The startup command of the training job. 
    command="python train.py",
    # The path of the training job script. You can specify a relative or absolute path in the on-premises file system. You can also specify the OSS path of a TAR package. Example: oss://<YourOssBucket>/your-code-path-to/source.tar.gz. 
    # If the requirements.txt file exists in the directory that contains the training job script, the dependencies in the file are automatically installed before the script is run. 
    source_dir="./train_src/",
    # The image that you want to use for training. 
    image_uri=torch_image_uri,
    # The instance type that you want to use for training. 
    instance_type="ecs.c6.large",
    # The hyperparameters for training. 
    hyperparameters={
        "n_estimators": 500,
        "objective": "reg:squarederror",
        "max_depth": 5,
    },
    # The prefix of the training job name. The name is in the {base_job_name}_{submitted-datetime} format. 
    base_job_name="example_train_job",
)

# Submit the training job and print the URL of the job details page. By default, the Estimator.fit() method exits after the job status changes to successful, failed, or stopped. 
est.fit()

# Obtain the path of the output model. 
print(est.model_data())

Run a training job in an on-premises environment

Debugging a job in the cloud environment can be difficult, so you can also run and debug a training job in the on-premises environment. To do so, specify instance_type="local" when you create the Estimator instance. The training job then runs in a Docker container on your machine.

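estimator = Estimator(
    image_uri=image_uri,
    # The startup command and the directory that contains the training job script. 
    command="python train.py",
    source_dir="./train_src/",
    # Run a training job in the on-premises environment. 
    instance_type="local",
)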

estimator.fit(
    inputs={
        # You can use OSS data. The data is downloaded and then mounted to the container. 
        "train": "oss://<BucketName>/path-to-data/",
        # You can also use data on the host machine. The data is mounted to the corresponding directory. 
        "test": "/data/to/test/data"
    }
)

# Obtain the path of the output model. 
print(estimator.model_data())

Appendix

Preset environment variables of training jobs

If you submit a training job to PAI, PAI stores the following information about the job as environment variables: hyperparameters, the path of the input data, and the path of the output model. You can use the predefined environment variables to obtain the information when you configure the training job script or startup command (Estimator.command).

  • PAI_HPS_{HYPERPARAMETER_NAME}

    This environment variable specifies the value of a single hyperparameter. Environment variables can contain only letters, digits, and underscores (_). Other characters in the hyperparameter are replaced with underscores (_).

    For example, if you specify hyperparameters={"epochs": 10, "batch-size": 32, "train.learning_rate": 0.001}, the following environment variables are generated:

    PAI_HPS_EPOCHS=10
    PAI_HPS_BATCH_SIZE=32
    PAI_HPS_TRAIN_LEARNING_RATE=0.001
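
The naming rule can be sketched as a small function. This is an illustration inferred from the rule and the examples above, not the actual implementation in PAI; the function name hyperparameter_env_name is hypothetical, and the conversion to uppercase is an assumption based on the generated names shown.

```python
import re


def hyperparameter_env_name(name):
    """Return the PAI_HPS_* variable name that corresponds to a hyperparameter."""
    # Replace every character that is not a letter, digit, or underscore.
    sanitized = re.sub(r"[^0-9A-Za-z_]", "_", name)
    # Uppercasing is inferred from the documented examples.
    return "PAI_HPS_" + sanitized.upper()


for name in ("epochs", "batch-size", "train.learning_rate"):
    print(hyperparameter_env_name(name))
```

A helper like this can be useful when you build a startup command from a dictionary of hyperparameters and want to reference the matching variables.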
    

    You can use these environment variables in the startup command of the training job. Sample code:

    est = Estimator(
        command="python train.py --epochs $PAI_HPS_EPOCHS --batch-size $PAI_HPS_BATCH_SIZE",
        hyperparameters={
            "epochs": 10,
            "batch-size": 32,
        },
        # more arguments for estimator..
    )
    

    In the training job script (train.py), you can obtain the hyperparameters by using the argparse library to parse the command parameter.
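
A minimal sketch of the argument parsing in train.py for the startup command above might look as follows. The default values and the use of parse_known_args are assumptions made for illustration.

```python
import argparse


def parse_args(argv=None):
    """Parse the hyperparameters that are passed on the command line."""
    parser = argparse.ArgumentParser(description="Training job script")
    # The default values are placeholders chosen for this sketch.
    parser.add_argument("--epochs", type=int, default=1)
    # argparse exposes --batch-size as args.batch_size.
    parser.add_argument("--batch-size", type=int, default=16)
    # parse_known_args ignores arguments that the script does not declare.
    args, _unknown = parser.parse_known_args(argv)
    return args


if __name__ == "__main__":
    args = parse_args()
    print("epochs:", args.epochs, "batch size:", args.batch_size)
```

Declaring a type for each argument converts the values to integers, so the rest of the script does not need to handle strings.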

  • PAI_USER_ARGS

    This environment variable specifies the values of all hyperparameters in the --{hyperparameter_name} {hyperparameter_value} format.

    For example, if you specify hyperparameters={"epochs": 10, "batch-size": 32, "learning-rate": 0.001}, the following environment variable is generated:

    PAI_USER_ARGS="--epochs 10 --batch-size 32 --learning-rate 0.001"
    

    You can use this environment variable in the startup command. In the following example, the actual command is python train.py --epochs 10 --batch-size 32 --learning-rate 0.001.

    est = Estimator(
        command="python train.py $PAI_USER_ARGS",
        hyperparameters={
            "epochs": 10,
            "batch-size": 32,
            "learning-rate": 0.001,
        },
        # more arguments for estimator..
    )

  • PAI_HPS

    This environment variable specifies the values of all hyperparameters in JSON format.

    For example, if you specify hyperparameters={"epochs": 10, "batch-size": 32}, the following environment variable is generated:

    PAI_HPS={"epochs": 10, "batch-size": 32}
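
In a training job script, the value of PAI_HPS can be parsed with the json module. The following sketch assumes that an unset variable should be treated as an empty set of hyperparameters; the function name load_hps_from_env is hypothetical.

```python
import json
import os


def load_hps_from_env():
    """Return all hyperparameters from the PAI_HPS environment variable."""
    # An unset variable is treated as "no hyperparameters" in this sketch.
    return json.loads(os.environ.get("PAI_HPS", "{}"))
```

Because the result is a dictionary, names that contain hyphens are accessed by key, for example load_hps_from_env().get("batch-size").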
    
  • PAI_INPUT_{channel_name}

    This environment variable specifies the input channels of the job. Each channel_name corresponds to a mount path of the input data that is stored in OSS or NAS.

    For example, if you specify est.fit(inputs={"train": "oss://<YourOssBucket>/path-to-data/", "test": "oss://<YourOssBucket>/path-to/data/test.csv"}), the following environment variables are generated:

    PAI_INPUT_TRAIN=/ml/input/data/train/
    PAI_INPUT_TEST=/ml/input/data/test/test.csv
    

    You can read the input data in the mount path in the same manner as you read on-premises files.

    Note

    If you specify an OSS path that ends with a forward slash (/), the environment variable points to a directory. If you specify an OSS path that ends with a file name, the environment variable points to the file.
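
Because an input channel can point to either a file or a directory, a training job script may want to handle both cases in one place. The following is a sketch only: the helper name list_channel_files is hypothetical, and the conversion of the channel name to uppercase follows the examples above.

```python
import os


def list_channel_files(channel_name):
    """Return the files of an input channel, whether it is a file or a directory."""
    # For example, the channel "train" maps to PAI_INPUT_TRAIN.
    path = os.environ["PAI_INPUT_" + channel_name.upper()]
    if os.path.isfile(path):
        # The channel was specified as a single file.
        return [path]
    files = []
    # The channel was specified as a directory: collect all files below it.
    for root, _dirs, names in os.walk(path):
        for name in names:
            files.append(os.path.join(root, name))
    return sorted(files)
```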

  • PAI_OUTPUT_{channel_name}

    This environment variable specifies the output channels of the job. By default, the following output channels are created: MODEL and CHECKPOINTS, where MODEL specifies the path of the output model and CHECKPOINTS specifies the path of the checkpoints. Each channel_name corresponds to a mount path and an OSS URI. You can obtain the file path by using the PAI_OUTPUT_{channel_name} environment variable.

    PAI_OUTPUT_MODEL=/ml/output/model/
    PAI_OUTPUT_CHECKPOINTS=/ml/output/checkpoints/
    

    After you save the output model or checkpoints to the required path, PAI automatically uploads the model or checkpoints to the corresponding OSS path.
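
As an illustration of the CHECKPOINTS channel, the following sketch writes one file per epoch to the path in PAI_OUTPUT_CHECKPOINTS and reads back the newest file in that directory. The helper names, the JSON format, and the epoch-NNNN.json naming scheme are assumptions; a real job would normally use the checkpoint functions of its framework.

```python
import json
import os


def checkpoint_dir():
    """Return the checkpoint directory and create it if necessary."""
    # Fall back to the documented default path if the variable is not set.
    path = os.environ.get("PAI_OUTPUT_CHECKPOINTS", "/ml/output/checkpoints/")
    os.makedirs(path, exist_ok=True)
    return path


def save_checkpoint(epoch, state):
    """Write the training state of one epoch to the checkpoints channel."""
    # Zero-padded epoch numbers keep the file names in sortable order.
    path = os.path.join(checkpoint_dir(), "epoch-{:04d}.json".format(epoch))
    with open(path, "w") as f:
        json.dump({"epoch": epoch, "state": state}, f)
    return path


def latest_checkpoint():
    """Return the newest checkpoint in the directory, or None if there is none."""
    names = sorted(n for n in os.listdir(checkpoint_dir()) if n.startswith("epoch-"))
    if not names:
        return None
    with open(os.path.join(checkpoint_dir(), names[-1]), "r") as f:
        return json.load(f)
```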

Directory structure

Sample directory structure for a training job that runs in PAI:

/ml
|-- usercode # The directory to which your code files are mounted. You can obtain the directory by using the PAI_WORKING_DIR environment variable.
|   |-- requirements.txt
|   `-- train.py
|-- input # The input data and configuration of the job. 
|   |-- config # The directory that contains the configuration of the job. You can obtain the directory by using the PAI_CONFIG_DIR environment variable. 
|   |   `-- hyperparameters.json # The hyperparameters of the training job. 
|   `-- data # The input channels of the job. In this example, the job has two input channels: train_data and test_data. 
|       |-- test_data
|       |   `-- test.csv
|       `-- train_data
|           `-- train.csv
`-- output # The output channels of the job. By default, the MODEL and CHECKPOINTS channels are used. 
    |-- model # You can obtain the path of the output model by using the PAI_OUTPUT_{CHANNEL_NAME} environment variable. 
    `-- checkpoints