Platform for AI: View training jobs

Last Updated: Nov 12, 2024

After you submit a training job, you can view its basic information, configurations, events, resource views, and logs.

View the basic information and configurations of a job

  1. Log on to the PAI console. Select a region and a workspace.

  2. In the left-side navigation pane, choose Model Training > Deep Learning Containers (DLC).

  3. Click the name of the desired job.

  4. On the Overview tab, view the basic information, environment information, and resource information of the job.

View the events of a job

You can view the scheduling events and resource-related events of a job in DLC and troubleshoot issues based on the events.

  • View job events.

    Click the Event tab and view job events.

  • View node events.

    In the Instance section at the bottom of the Overview tab, click Log in the Actions column. In the dialog box that appears, click the Event tab and view node events.

View resource views

Resource views show metrics such as GPU usage, GPU memory usage, CPU utilization, memory usage, and network I/O. You can view the resource usage of the job in real time on the Monitoring tab. This helps you understand the resource requirements of the job and allocate resources in a cost-effective manner.

Go to the Monitoring tab and view resource views of a job.

Metrics are available at the job, pod, and GPU levels.
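
As a rough illustration of how these metrics can inform resource allocation, the following Python sketch summarizes a series of GPU utilization samples and flags a job whose peak usage stays low. The sample values, the summarize_utilization helper, and the 30% threshold are all hypothetical. The sketch does not call any PAI API and assumes that you have noted the values from the Monitoring tab yourself.

```python
from statistics import mean


def summarize_utilization(samples, low_threshold=30.0):
    """Summarize utilization samples given as percentages (0-100).

    low_threshold is a hypothetical cut-off: if even the peak stays
    below it, the job may be over-provisioned.
    """
    if not samples:
        raise ValueError("samples must not be empty")
    peak = max(samples)
    return {
        "average": mean(samples),
        "peak": peak,
        "underused": peak < low_threshold,
    }


# Hypothetical GPU utilization samples (percent), not real job data.
gpu_samples = [12.0, 18.5, 22.0, 15.5]
summary = summarize_utilization(gpu_samples)
print(summary)
```

A job flagged as underused here is only a candidate for a smaller resource specification. Short sampling windows can miss bursts, so compare several runs before you change the allocation.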

View job logs

If a job behaves unexpectedly or you want to review its run history, view the job logs for key information about the run. You can use one of the following methods to view the logs:

  • In the Instance section at the bottom of the Overview tab of a job, click Log in the Actions column to view the output logs of a node.

  • Go to the Log tab of a job and search for logs by keyword. For more information, see the "Search for aggregated logs by keyword" section in the Create and manage container training jobs topic.
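
If you have copied log output out of the console, the same kind of keyword search can be approximated offline. The following Python sketch is a minimal example under that assumption. The search_logs helper and the sample log lines are hypothetical, and the sketch does not reproduce how the Log tab itself performs the search.

```python
def search_logs(lines, keyword, case_sensitive=False):
    """Return (line_number, text) pairs for lines that contain keyword."""
    needle = keyword if case_sensitive else keyword.lower()
    matches = []
    for number, text in enumerate(lines, start=1):
        haystack = text if case_sensitive else text.lower()
        if needle in haystack:
            matches.append((number, text.rstrip("\n")))
    return matches


# Hypothetical log lines standing in for output copied from a job.
sample_log = [
    "INFO  epoch 1 started\n",
    "ERROR CUDA out of memory\n",
    "INFO  epoch 1 finished\n",
]
hits = search_logs(sample_log, "error")
for number, text in hits:
    print(f"{number}: {text}")
```

The search is case-insensitive by default so that a keyword such as error also matches ERROR. Pass case_sensitive=True to match the exact casing only.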

View behavior events

Platform for AI (PAI) is integrated with ActionTrail. You can view and search for the DLC behavior events of your Alibaba Cloud account over the previous 90 days in ActionTrail. For more information, see Use ActionTrail to query behavior events.
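
To check whether an event of interest still falls inside the 90-day period, you can compute the window boundaries. The following Python sketch shows the arithmetic only. The query_window helper is hypothetical, the use of UTC is an assumption, and the sketch does not call ActionTrail.

```python
from datetime import datetime, timedelta, timezone

RETENTION_DAYS = 90  # lookback period stated in the text above


def query_window(now=None):
    """Return (start, end) of the lookback window that ends at now."""
    if now is None:
        now = datetime.now(timezone.utc)  # assumption: times are in UTC
    return now - timedelta(days=RETENTION_DAYS), now


# Fixed example date so that the result is reproducible.
start, end = query_window(datetime(2024, 11, 12, tzinfo=timezone.utc))
print(start.isoformat(), end.isoformat())
```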

References

You can manage a training job based on its status. For more information, see Manage training jobs.