Platform for AI: DLC FAQ

Last Updated: Jan 06, 2026

This topic provides answers to frequently asked questions about Deep Learning Containers (DLC) of Platform for AI (PAI).

Q: Model training failed with the error: SupportsDistributedTraining false, please set InstanceCount=1

  • Cause: The current training task uses multiple instances (node count greater than 1), but this model does not support distributed training.

  • Solution: Set the number of nodes to 1.


Q: Model training failed with the error: failed to compose dlc job specs, resource limiting triggered, you are trying to use more GPU resources than the threshold.

The training job exceeds the quota of 2 GPUs running concurrently. Wait for the current jobs to complete before starting new ones, or submit a ticket to request a higher quota.

Q: What do I do if the "exited with code 137" error message appears?

If the "exited with code 137" error message appears, you can use instances that have a larger memory size, increase the number of worker nodes, or modify the reserved memory size in your code.


In Linux, exit code 137 indicates that the process was forcibly terminated by the SIGKILL signal. The most common cause is excessive memory usage, which triggers an out-of-memory (OOM) kill. You can identify the source of the memory shortage from the memory usage of the worker nodes on the job details page and then increase the available memory.
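To see which worker is approaching its memory limit, you can also log the resident memory of the training process from inside your code. The following is a minimal sketch that assumes the psutil package is available in your training image; the log_memory helper and the logging interval are illustrative only.

import os

import psutil


def log_memory(tag: str) -> None:
    # Print the current process's resident set size (RSS) so memory growth
    # shows up in the worker logs on the job details page.
    rss_gib = psutil.Process(os.getpid()).memory_info().rss / 1024 ** 3
    print(f"[{tag}] rank={os.environ.get('RANK', '0')} rss={rss_gib:.2f} GiB", flush=True)


# Example usage inside a training loop:
# for step, batch in enumerate(data_loader):
#     train_step(batch)
#     if step % 100 == 0:
#         log_memory(f"step {step}")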

Q: What do I do if the job status is Failed or Dequeued?

The following list describes the sequence of job execution statuses for each type of DLC resource.

  • Pay-as-you-go resources (Lingjun preemptible resources): Creating -> Bidding -> Environment Preparing -> Running -> Succeeded / Failed / Stopped

  • Pay-as-you-go resources (Lingjun resources or general computing public resources): Creating -> Environment Preparing -> Running -> Succeeded / Failed / Stopped

  • Subscription resources: Creating -> Queued -> Environment Preparing -> Running -> Succeeded / Failed / Stopped

  • What to do if the status is Environment Preparing?

    If a job persists in the Environment Preparing state, you might have configured a CPFS dataset without setting up a virtual private cloud (VPC). To resolve this, recreate the job and configure both the CPFS dataset and a VPC. Make sure the configured VPC is the same as the VPC of the CPFS file system. For more information, see Create a training job.

  • What to do if the status is Failed?

    Hover over the status icon on the job details page or view the logs to identify the cause of the failure. For more information, see View training jobs.

Q: Can I change a job using public resources to exclusive resources?

To change the resources, you must recreate the job. Click Clone in the Actions column of the original job to create a new one with the same configuration. Then, you can change the resources. For more information about the billing, see Billing of DLC.

Q: How do I set up multiple nodes and GPUs in DLC?

When creating a DLC job, configure the following command. For more information, see Create a training job.

python -m torch.distributed.launch \
    --nproc_per_node=2 \
    --master_addr=${MASTER_ADDR} \
    --master_port=${MASTER_PORT} \
    --nnodes=${WORLD_SIZE} \
    --node_rank=${RANK} \
    train.py --epochs=100
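For reference, the following is a minimal sketch of what train.py might contain on the training side, assuming a PyTorch DistributedDataParallel setup with GPUs available (the NCCL backend is used). The launcher sets MASTER_ADDR, MASTER_PORT, RANK, and WORLD_SIZE for every worker process; the model, data, and optimizer below are placeholders.

import argparse
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP


def main() -> None:
    parser = argparse.ArgumentParser()
    # The launcher either passes --local_rank/--local-rank or sets the LOCAL_RANK
    # environment variable, depending on the PyTorch version.
    parser.add_argument("--local_rank", "--local-rank", type=int,
                        default=int(os.environ.get("LOCAL_RANK", 0)))
    parser.add_argument("--epochs", type=int, default=100)
    args = parser.parse_args()

    # MASTER_ADDR, MASTER_PORT, RANK, and WORLD_SIZE are read from the environment
    # variables that the launcher sets for every worker process.
    dist.init_process_group(backend="nccl", init_method="env://")
    torch.cuda.set_device(args.local_rank)

    # Placeholder model and synthetic data; replace with your own network and data loading.
    model = DDP(torch.nn.Linear(10, 1).cuda(args.local_rank), device_ids=[args.local_rank])
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

    for _ in range(args.epochs):
        inputs = torch.randn(32, 10).cuda(args.local_rank)
        loss = model(inputs).mean()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    dist.destroy_process_group()


if __name__ == "__main__":
    main()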

Q: How do I download a model trained in DLC to my local machine?

When submitting a DLC job, you can associate a dataset and configure the startup command to save the training results to the mounted dataset folder.

After the training, the model files are automatically saved to the mount path of the dataset. Then, you can access the corresponding file system to download the model files to your local machine.
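For example, the following sketch saves a PyTorch model's weights under the mounted dataset directory. The /mnt/data/output path and the OUTPUT_DIR variable are assumptions for illustration; replace them with the actual mount path configured for your dataset.

import os

import torch

# Assumption: /mnt/data/output is the mount path of the dataset associated with the job.
output_dir = os.environ.get("OUTPUT_DIR", "/mnt/data/output")
os.makedirs(output_dir, exist_ok=True)

model = torch.nn.Linear(10, 1)  # placeholder for your trained model
torch.save(model.state_dict(), os.path.join(output_dir, "model.pt"))
print(f"Model saved to {os.path.join(output_dir, 'model.pt')}")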

Q: How do I use Docker images in DLC?

Q: How do I release completed nodes in a DLC job for other jobs to use?

Problem description:

When a DLC job runs a multi-worker distributed training task, some workers may finish early and exit. This can happen due to factors such as data skew. However, with the default configuration, these completed workers continue to occupy their scheduled nodes.

Solution: Configure the PAI-DLC advanced parameter ReleaseResourcePolicy. By default, this parameter is not configured, and compute resources are released only after the entire job is complete. If you set it to pod-exit, compute resources are released as soon as a worker exits.

Q: Why does a DLC job return the error OSError: [Errno 116] Stale file handle?

Problem description:

Multiple workers fail with the OSError: [Errno 116] Stale file handle error when executing PyTorch's torch.compile for ahead-of-time compilation. This failure is caused by an inability to read a cache file.

Analysis:

This error typically occurs in Network File System (NFS) environments. It is triggered when a process holds a file handle for a file that has been deleted or moved on the server. The handle becomes stale, and the error occurs when the client tries to use this stale handle to access the file.

Root cause:

PyTorch's ahead-of-time (AOT) compilation caches optimized computation graphs to the file system. By default, the cache is stored in /tmp, which is typically a local tmpfs. However, in cluster computing environments, this cache may be stored on CPFS, a distributed file system mounted over NFS. NFS is sensitive to file deletion operations, so a stale file handle can be triggered under several conditions:

  • PyTorch automatically cleans up old cache files, or another process deletes the cache directory.

  • Files are moved, deleted, or have their permissions changed on the NFS server while clients still hold outdated handles.

  • NFS clients cache file attributes by default, which can prevent them from detecting changes made on the server.

The issue is exacerbated by several environmental factors: CPFS is built on NFS, which is prone to stale file handle errors under high concurrency, especially for short-lifecycle files such as temporary cache files. Concurrent access from multiple workers that read, write, or clean the same cache can also introduce race conditions.

Solution:

  • First, verify the diagnosis by forcing the cache onto local tmpfs: set the environment variable TORCHINDUCTOR_CACHE_DIR to /dev/shm/torch_cache, as shown in the sketch below.
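A minimal sketch of this check follows. It assumes your PyTorch version honors the TORCHINDUCTOR_CACHE_DIR environment variable; set it before torch is imported so that compile artifacts never touch the NFS-backed CPFS mount.

import os

# Redirect the TorchInductor compile cache to node-local tmpfs before importing torch,
# so cache files are created and cleaned up locally instead of on CPFS over NFS.
os.environ["TORCHINDUCTOR_CACHE_DIR"] = "/dev/shm/torch_cache"

import torch

model = torch.nn.Linear(10, 1)  # placeholder model
compiled_model = torch.compile(model)  # cached artifacts now land in /dev/shm/torch_cache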

Alternative solutions:

  1. Follow the official PyTorch documentation to use Redis as a shared cache (this requires setting up a Redis service).

  2. Review the CPFS NFS mount configuration. Using the noac mount option may help, but it reduces performance.

  3. Disable caching entirely by setting TORCHINDUCTOR_CACHE_DIR="". However, this sacrifices compilation performance.