
Platform for AI: DLC FAQ

Last Updated: Dec 11, 2024

This topic provides answers to frequently asked questions about Deep Learning Containers (DLC) of Platform for AI (PAI).

What do I do if the "exited with code 137" error message appears?

If the "exited with code 137" error message appears, you can use instances that have a larger memory size, increase the number of worker nodes, or modify the reserved memory size in your code.

In Linux, error code 137 indicates that the process was forcibly terminated by the SIGKILL signal. The most common cause is excessive memory usage, that is, an out-of-memory (OOM) error. You can identify the cause of insufficient memory from the memory usage of the worker nodes on the job details page and then increase the available memory.
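
If the OOM error comes from your training code rather than the instance type, one common mitigation is to train on smaller micro-batches and accumulate gradients so that the effective batch size stays the same. The following Python sketch is not taken from the DLC documentation; the model, data, and ACCUM_STEPS value are placeholders for illustration only.

import torch

ACCUM_STEPS = 4  # effective batch size = micro-batch size * ACCUM_STEPS

# Placeholder model, optimizer, and micro-batches; replace with your own.
model = torch.nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
micro_batches = [(torch.randn(8, 10), torch.randn(8, 1)) for _ in range(ACCUM_STEPS)]

optimizer.zero_grad()
for inputs, targets in micro_batches:
    loss = torch.nn.functional.mse_loss(model(inputs), targets)
    (loss / ACCUM_STEPS).backward()  # scale so accumulated gradients average out
optimizer.step()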

What do I do if the job status is Failed or Dequeued?

The following list describes the sequence of job execution statuses for each type of DLC resource.

  • Pay-as-you-go resources
      • Lingjun preemptible resources: Creating -> Bidding -> Environment Preparing -> Running -> Succeeded / Failed / Stopped
      • Lingjun resources or general computing public resources: Creating -> Environment Preparing -> Running -> Succeeded / Failed / Stopped
  • Subscription resources: Creating -> Queued -> Environment Preparing -> Running -> Succeeded / Failed / Stopped

  • What do I do if the status is Environment Preparing?

    If a job remains in the Environment Preparing state, a likely cause is that you configured a CPFS dataset without configuring a virtual private cloud (VPC). To resolve this issue, recreate the job and configure both the CPFS dataset and a VPC. Make sure that the configured VPC is the same as the VPC of the CPFS file system. For more information, see Submit training jobs.

  • What do I do if the status is Failed?

    Hover over the status icon on the job details page or view the logs to identify the cause of the job failure. For more information, see View training jobs.

Can I change a job using public resources to exclusive resources?

To change the resources, you must recreate the job. Click Clone in the Actions column of the original job to create a new job with the same configuration, and then change the resources. For more information about billing, see Billing of DLC.

How do I set up multiple nodes and GPUs in DLC?

When you create a DLC job, configure the following command. For more information, see Submit training jobs.

python -m torch.distributed.launch \
    --nproc_per_node=2 \
    --master_addr=${MASTER_ADDR} \
    --master_port=${MASTER_PORT} \
    --nnodes=${WORLD_SIZE} \
    --node_rank=${RANK} \
    train.py --epochs=100
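
For reference, the following is a minimal sketch of what train.py might look like for this launch command. It is not from the DLC documentation; the model, data, and training loop are placeholders. torch.distributed.launch passes --local_rank to the script and sets MASTER_ADDR, MASTER_PORT, RANK, and WORLD_SIZE in the environment, which init_process_group reads by default.

import argparse

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--local_rank", type=int, default=0)  # set by the launcher
    parser.add_argument("--epochs", type=int, default=100)
    args = parser.parse_args()

    # Bind this process to its GPU and join the process group.
    torch.cuda.set_device(args.local_rank)
    dist.init_process_group(backend="nccl")

    # Placeholder model; replace with your own network and data pipeline.
    model = torch.nn.Linear(10, 1).cuda()
    model = DDP(model, device_ids=[args.local_rank])
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

    for epoch in range(args.epochs):
        # Replace with batches from a DistributedSampler-backed DataLoader.
        inputs = torch.randn(32, 10).cuda()
        targets = torch.randn(32, 1).cuda()
        loss = torch.nn.functional.mse_loss(model(inputs), targets)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()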

How do I download a model trained on DLC to my local machine?

Associate a dataset with your DLC job so that the trained model file is saved in the mounted dataset folder. Alternatively, access the underlying file system directly to download the model file to your local machine.
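
For example, if the mounted dataset is backed by Object Storage Service (OSS), you can download the saved model with the oss2 Python SDK. The following sketch assumes the job wrote the model to the dataset mount path; the bucket name, endpoint, credentials, and object key are placeholders.

import oss2

# Placeholder credentials and bucket; replace with your own values.
auth = oss2.Auth("<ACCESS_KEY_ID>", "<ACCESS_KEY_SECRET>")
bucket = oss2.Bucket(auth, "https://oss-cn-hangzhou.aliyuncs.com", "my-bucket")

# Inside the job, the model was saved to the mount path, for example
# torch.save(model.state_dict(), "/mnt/data/checkpoints/model.pt"),
# which corresponds to oss://my-bucket/checkpoints/model.pt here.
bucket.get_object_to_file("checkpoints/model.pt", "./model.pt")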