Use TensorBoard to view DLC training results

You can create a TensorBoard instance for a Deep Learning Containers (DLC) job in Platform for AI (PAI) and use it to view a visual analysis report of the model training results. This topic describes how to create and manage TensorBoard instances.

Prerequisites

A DLC job is created and associated with a dataset. For more information, see Submit training jobs.

Limits

You can use TensorBoard to view analysis reports only for training jobs that are associated with a dataset.

Create a TensorBoard instance

Go to the Deep Learning Containers (DLC) page.
1. Log on to the PAI console.
2. In the left-side navigation pane, click Workspaces. Find the workspace that you want to manage and click the workspace ID.
3. In the left-side navigation pane, choose Model Training > Deep Learning Containers (DLC).
Find the desired job and click TensorBoard in the Actions column. In the TensorBoard panel, click Create TensorBoard.

On the Create TensorBoard page, configure the parameters and click OK. The following tables describe the parameters.

Basic information

Parameter	Description
TensorBoard Name	The name of the TensorBoard instance.
TensorBoard Configuration	The following configuration types are supported:

By Dataset
- Dataset: Select the dataset that is created in the workspace.
- Summary Path: Enter the relative path of the summary directory in the dataset.
By Object Storage Service (OSS)
- OSS: Select an OSS storage path.
- Summary Path: Enter the relative path of the summary directory in OSS.
By Task
- DLC Job: Select an existing DLC job.
- Summary Path: Enter the absolute path of the summary directory in the task. For example, if the summary file is in the /tensorboards/summary directory of the dataset and the mount path of the dataset in the DLC job is /mnt/data, the absolute path of the summary file in the DLC job is /mnt/data/tensorboards/summary.

You can click Add to mount multiple summary paths for each TensorBoard instance to compare metrics across multiple jobs.
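The By Task example above builds the absolute summary path by joining the dataset mount path with the summary directory inside the dataset. The following Python sketch illustrates that composition only. The function name summary_path_in_job is hypothetical and is not part of any PAI SDK, and the sketch assumes POSIX-style paths inside the DLC container.

```python
import posixpath


def summary_path_in_job(mount_path: str, summary_dir_in_dataset: str) -> str:
    """Return the absolute summary path as seen inside the DLC job.

    Hypothetical helper for illustration; not part of any PAI SDK.
    """
    # Drop leading slashes so that join() appends the directory to the
    # mount path instead of treating it as a new absolute path.
    relative = summary_dir_in_dataset.lstrip("/")
    # normpath() removes duplicate or trailing slashes from the result.
    return posixpath.normpath(posixpath.join(mount_path, relative))


# Values taken from the example in this topic.
print(summary_path_in_job("/mnt/data", "/tensorboards/summary"))
# prints /mnt/data/tensorboards/summary
```

If you mount several summary paths to compare jobs, you can apply the same composition to each job's mount path.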

Resource configuration

The following table describes the supported resource types.

Resource type	Description
Free Quota	The system provides you with a certain amount of free resources. Each instance can use up to 2 vCPUs and 4 GiB of memory.
Public Resources	If free resources cannot meet your requirements, you can use public resources to start a TensorBoard instance. Public resources use the pay-as-you-go billing method. If the free quota is used up, you can stop TensorBoard instances that use the free quota to release it and make it available again.
Resource Quota	If free resources cannot meet your requirements, you can use a resource quota to create instances. Note: This feature is available only to users on the whitelist. To use this feature, contact your account manager to be added to the whitelist. Configure the following parameters. Resource Quota: select a general computing resource quota or a Lingjun resource quota. For information about how to create a resource quota, see Create a resource quota. If no resource quota is available, click Associate Resource Quota to associate a resource quota with the workspace. Priority: the priority of the TensorBoard instance. Valid values: 1 to 9. A value of 1 indicates the lowest priority. Job Resource: the resources used to run the TensorBoard instance, specified as the number of vCPUs and the memory size in GiB.
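The limits in the preceding table can be checked before you fill in the form. The following Python sketch encodes only the two constraints stated above: the free quota ceiling of 2 vCPUs and 4 GiB of memory per instance, and the priority range of 1 to 9. The class and function names are hypothetical and do not correspond to any PAI API.

```python
from dataclasses import dataclass

# Limits stated in the resource table of this topic.
FREE_QUOTA_MAX_VCPUS = 2
FREE_QUOTA_MAX_MEMORY_GIB = 4
MIN_PRIORITY, MAX_PRIORITY = 1, 9  # 1 is the lowest priority


@dataclass
class TensorBoardResources:
    """Hypothetical container for a resource request; not a PAI type."""
    vcpus: int
    memory_gib: int


def fits_free_quota(request: TensorBoardResources) -> bool:
    """Return True if the request stays within the per-instance free quota."""
    return (request.vcpus <= FREE_QUOTA_MAX_VCPUS
            and request.memory_gib <= FREE_QUOTA_MAX_MEMORY_GIB)


def validate_priority(priority: int) -> int:
    """Return the priority unchanged, or raise ValueError if it is out of range."""
    if not MIN_PRIORITY <= priority <= MAX_PRIORITY:
        raise ValueError(
            f"priority must be between {MIN_PRIORITY} and {MAX_PRIORITY}, got {priority}"
        )
    return priority


print(fits_free_quota(TensorBoardResources(vcpus=2, memory_gib=4)))  # prints True
print(fits_free_quota(TensorBoardResources(vcpus=4, memory_gib=8)))  # prints False
```

A request that does not fit the free quota would need Public Resources or, for whitelisted users, a Resource Quota.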

VPC settings
If you use Public Resources to create a TensorBoard instance, the VPC-related parameters are available.
- If you do not configure a virtual private cloud (VPC), the instance uses an Internet connection. Because Internet bandwidth is limited, the TensorBoard instance may respond slowly when it starts or when you view reports.
- To ensure sufficient network bandwidth and stable performance, we recommend that you configure a VPC.
  Select a VPC, a vSwitch, and a security group in the current region. After you complete the configuration, the cluster in which the TensorBoard instance runs can access the services in the selected VPC and use the security group that you specified to control access.
  Important
  If the TensorBoard instance uses a dataset that requires a VPC, such as a Cloud Parallel File Storage (CPFS) dataset or a NAS dataset that has a mount target in the VPC, you must configure a VPC.

Go to the TensorBoard page to view the analysis report.
1. In the left-side navigation pane of the workspace page, choose AI Computing Asset Management > Jobs.
2. On the TensorBoard tab, if the Status of the TensorBoard instance is Running, click View TensorBoard in the Actions column.
  The TensorBoard page appears.

Manage a TensorBoard instance

To manage a TensorBoard instance, perform the following steps:

Go to the Distributed Training Jobs page.
1. Log on to the PAI console.
2. In the left-side navigation pane, click Workspaces. On the Workspaces page, click the name of the workspace that you want to manage.
3. In the left-side navigation pane, choose AI Computing Asset Management > Jobs to go to the Distributed Training Jobs page.
Manage a TensorBoard instance.
- View the details of a TensorBoard instance.
  On the TensorBoard tab, click the name of the TensorBoard instance. On the TensorBoard Instance Details page, you can view the Basic Information and Configuration Information.
- View associated DLC jobs.
  You can view the number of DLC jobs that are associated with a TensorBoard instance. On the TensorBoard tab, move the pointer over the icon in the Associated Task column to view the ID of the associated DLC job. You can click the ID to go to the details page of the DLC job.
- View associated datasets.
  You can view the number of datasets that are associated with a TensorBoard instance. On the TensorBoard tab, move the pointer over the icon in the Associated Dataset column to view the ID of the associated dataset. You can click the ID to go to the details page of the dataset.
- View the running duration.
  On the TensorBoard tab, view the running duration of the TensorBoard instance in the Running Duration column. The running duration starts when the instance is started and is reset after you stop the instance.
- Stop a TensorBoard instance.
  - Click Stop in the Actions column of the TensorBoard instance.
  - Click Auto-stop Settings in the Actions column of the TensorBoard instance to specify the time at which you want the instance to automatically stop.

References

You can also create a TensorBoard instance for a DLC job on the AI Computing Asset Management > Jobs page. For more information, see Create and manage TensorBoard instances.