Platform for AI: Create and manage TensorBoard instances

Last Updated: Sep 13, 2024

You can create and manage TensorBoard instances on the TensorBoard tab of the Jobs page in the Platform for AI (PAI) console. A TensorBoard instance can be associated with a dataset or a Deep Learning Containers (DLC) job. After the instance starts, you can view visualized analysis reports of the model training results in TensorBoard. This topic describes how to create and manage TensorBoard instances.

Limits

You cannot use the TensorBoard feature for DLC jobs that are created in the Malaysia (Kuala Lumpur) region.

Account and permission requirements

  • Alibaba Cloud account: You can use an Alibaba Cloud account to complete all operations without additional authorization.

  • RAM user: You must add the RAM user to your workspace as a member and assign the member a role that grants the required operation permissions. For more information about the permissions of each role, see Appendix: Roles and permissions.

Create a TensorBoard instance

To create a TensorBoard instance, perform the following steps:

  1. Go to the Distributed Training Jobs page

    1. Log on to the PAI console.

    2. In the left-side navigation pane, click Workspaces. On the Workspaces page, click the name of the workspace that you want to manage.

    3. In the left-side navigation pane, choose AI Computing Asset Management > Jobs to go to the Distributed Training Jobs page.

  2. On the TensorBoard tab, click Create TensorBoard.

  3. On the Create TensorBoard page, configure the parameters and click OK. The following tables describe the parameters.

    • Basic information

      Parameter

      Description

      TensorBoard Name

      The name of the TensorBoard instance.

      TensorBoard Configuration

      The following configuration types are supported:

      • By Dataset

        • Dataset: Select a dataset that was created in the workspace.

        • Summary Path: Enter the relative path of the summary directory in the dataset.

      • By Object Storage Service (OSS)

        • OSS: Select an OSS storage path.

        • Summary Path: Enter the relative path of the summary directory in OSS.

      • By Task

        • DLC Job: Select an existing DLC job.

        • Summary Path: Enter the absolute path of the summary directory in the task. For example, if the summary file is in the /tensorboards/summary directory of the dataset and the mount path of the dataset in the DLC job is /mnt/data, the absolute path of the summary file in the DLC job is /mnt/data/tensorboards/summary.

      You can click Add to mount multiple summary paths for each TensorBoard instance to compare metrics across multiple jobs.

    • Resource configuration

      The following table describes the supported resource types.

      Resource type

      Description

      Free Quota

      The system provides you with a certain amount of free resources. Each instance can use up to 2 vCPUs and 4 GiB of memory.

      Public Resources

      If the free quota cannot meet your requirements, you can use public resources to start a TensorBoard instance. Public resources use the pay-as-you-go billing method. If the free quota is used up, you can stop the TensorBoard instances that use it to release the free resources and then continue to use the free quota.

      Resource Quota

      If free resources cannot meet your requirements, you can use resource quotas to create instances.

      Note

      This feature is available only to users in the whitelist. If you want to use this feature, contact your account manager to configure the whitelist.

      You must configure the following parameters:

      • Resource Quota: Select a general computing resource quota or Lingjun resource quota. For information about how to create a resource quota, see Create a resource quota. If no resource quota is available, you can click Associate Resource Quota to associate a resource quota with the workspace.

      • Priority: the priority of a TensorBoard instance. Valid values: 1 to 9. The value 1 indicates the lowest priority.

      • Job Resource: the resources that you use to run a TensorBoard instance. The resources include the number of vCPUs and the memory. The unit of the memory size is GiB.

    • VPC settings

      If you use Public Resources to create a TensorBoard instance, the VPC-related parameters are available.

      • If you do not configure a virtual private cloud (VPC), the instance uses an Internet connection. Because the bandwidth of the Internet connection is limited, the TensorBoard instance may respond slowly during startup or when you view reports.

      • To ensure sufficient network bandwidth and stable performance, we recommend that you configure a VPC.

        Select a VPC, a vSwitch, and a security group in the current region. After you complete the configuration, the cluster in which the TensorBoard instance runs can access the services in the selected VPC and use the security group that you specified to control access.

        Important

        If the TensorBoard instance uses a dataset that requires a VPC, such as a Cloud Parallel File Storage (CPFS) dataset or a NAS dataset that has a mount target in the VPC, you must configure a VPC.

  4. Find the TensorBoard instance that you created. After the instance enters the Running state, click View TensorBoard in the Actions column.

    The TensorBoard page appears. TensorBoard visualizes the dataset or the summary log files that are generated during training, which helps you understand and debug the training process.
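
Some of the parameter rules in step 3 can be sketched in Python. The following is a minimal illustration, not part of any PAI SDK: the function and constant names are hypothetical, and the code assumes POSIX-style paths inside the DLC job. It shows how the absolute Summary Path for the By Task configuration is derived from the dataset mount path, and how the documented free quota limit (2 vCPUs, 4 GiB of memory) and priority range (1 to 9) could be checked before you fill in the form.

```python
import posixpath

# Limits quoted from the resource configuration table in this topic.
FREE_QUOTA_MAX_VCPUS = 2
FREE_QUOTA_MAX_MEMORY_GIB = 4
PRIORITY_RANGE = range(1, 10)  # valid priorities: 1 (lowest) to 9


def summary_path_in_job(mount_path: str, summary_dir_in_dataset: str) -> str:
    """Return the absolute summary path as seen inside the DLC job.

    Hypothetical helper for illustration only.
    """
    # Drop the leading slash so the dataset-relative directory is appended
    # to the mount path instead of replacing it.
    relative = summary_dir_in_dataset.lstrip("/")
    return posixpath.normpath(posixpath.join(mount_path, relative))


def fits_free_quota(vcpus: float, memory_gib: float) -> bool:
    """Check whether a requested instance size is within the free quota limit."""
    return vcpus <= FREE_QUOTA_MAX_VCPUS and memory_gib <= FREE_QUOTA_MAX_MEMORY_GIB


def check_priority(priority: int) -> int:
    """Return the priority unchanged if it is in the documented range."""
    if priority not in PRIORITY_RANGE:
        raise ValueError(f"priority must be between 1 and 9, got {priority}")
    return priority


# Values taken from the Summary Path example in the parameter table.
print(summary_path_in_job("/mnt/data", "/tensorboards/summary"))
print(fits_free_quota(vcpus=2, memory_gib=4))
print(fits_free_quota(vcpus=4, memory_gib=8))
```

With the values from the Summary Path example, the first call yields /mnt/data/tensorboards/summary, which is the value to enter in the Summary Path field when you select By Task.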

Manage a TensorBoard instance

  • View the details of a TensorBoard instance.

    On the TensorBoard tab, click the name of the TensorBoard instance. On the TensorBoard Instance Details page, you can view the Basic Information and Configuration Information sections.

  • View associated DLC jobs.

    You can view the number of DLC jobs that are associated with a TensorBoard instance. On the TensorBoard tab, move the pointer over the icon in the Associated Task column to view the ID of the associated DLC job. You can click the ID to go to the details page of the DLC job.

  • View associated datasets.

    You can view the number of datasets that are associated with a TensorBoard instance. On the TensorBoard tab, move the pointer over the icon in the Associated Dataset column to view the ID of the associated dataset. You can click the ID to go to the details page of the dataset.

  • View the running duration.

    On the TensorBoard tab, you can view the running duration of a TensorBoard instance in the Running Duration column. The running duration is counted from the time the instance starts and is reset after you stop the instance.

  • Stop a TensorBoard instance.

    • Click Stop in the Actions column of the TensorBoard instance.

    • Click Auto-stop Settings in the Actions column of the TensorBoard instance to specify the time at which you want the instance to automatically stop.

References

You can also create and manage TensorBoard instances on the Deep Learning Containers (DLC) page. For more information, see TensorBoard.