Platform for AI: DLC FAQ

Last Updated: Jan 06, 2026

This topic provides answers to frequently asked questions about Deep Learning Containers (DLC) of Platform for AI (PAI).

Q: Model training failed with the error: SupportsDistributedTraining false, please set InstanceCount=1

  • Cause: The current training task uses multiple instances (node count greater than 1), but this model does not support distributed training.

  • Solution: Set the number of nodes to 1.


Q: Model training failed with the error: failed to compose dlc job specs, resource limiting triggered, you are trying to use more GPU resources than the threshold.

The training job exceeds the quota of 2 GPUs running concurrently. Wait for the current jobs to complete before starting new ones, or submit a ticket to request a higher quota.

Q: What do I do if the "exited with code 137" error message appears?

If the "exited with code 137" error message appears, you can use instances that have a larger memory size, increase the number of worker nodes, or modify the reserved memory size in your code.


In Linux, exit code 137 indicates that the process was forcibly terminated by the SIGKILL signal. The most common cause is excessive memory usage, which triggers an out-of-memory (OOM) kill. You can identify the source of the memory shortage from the memory usage of the worker nodes on the job details page and then increase the available memory.
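To see which worker is approaching its memory limit, you can also log the resident memory of the training process from inside your code. The following is a minimal sketch that assumes the psutil package is available in your training image; the log_memory helper and the logging interval are illustrative only.

import os

import psutil


def log_memory(tag: str) -> None:
    # Print the current process's resident set size (RSS) so memory growth
    # shows up in the worker logs on the job details page.
    rss_gib = psutil.Process(os.getpid()).memory_info().rss / 1024 ** 3
    print(f"[{tag}] rank={os.environ.get('RANK', '0')} rss={rss_gib:.2f} GiB", flush=True)


# Example usage inside a training loop:
# for step, batch in enumerate(data_loader):
#     train_step(batch)
#     if step % 100 == 0:
#         log_memory(f"step {step}")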

Q: What do I do if the job status is Failed or Dequeued?

The following list describes the sequence of job execution statuses for each type of DLC resource.

  • Pay-as-you-go resources (Lingjun preemptible resources): Creating -> Bidding -> Environment Preparing -> Running -> Succeeded / Failed / Stopped

  • Pay-as-you-go resources (Lingjun resources or general computing public resources): Creating -> Environment Preparing -> Running -> Succeeded / Failed / Stopped

  • Subscription resources: Creating -> Queued -> Environment Preparing -> Running -> Succeeded / Failed / Stopped

  • What to do if the status is Environment Preparing?

    If a job persists in the Environment Preparing state, you might have configured a CPFS dataset without setting up a virtual private cloud (VPC). To resolve this, recreate the job and configure both the CPFS dataset and a VPC. Make sure the configured VPC is the same as the VPC of the CPFS file system. For more information, see Create a training job.

  • What to do if the status is Failed?

    Hover over the status icon on the job details page or view the logs to identify the cause of the failure. For more information, see View training jobs.

Q: Can I change a job using public resources to exclusive resources?

To change the resources, you must recreate the job. Click Clone in the Actions column of the original job to create a new one with the same configuration. Then, you can change the resources. For more information about the billing, see Billing of DLC.

Q: How do I set up multiple nodes and GPUs in DLC?

When creating a DLC job, configure the following command. For more information, see Create a training job.

python -m torch.distributed.launch \
    --nproc_per_node=2 \
    --master_addr=${MASTER_ADDR} \
    --master_port=${MASTER_PORT} \
    --nnodes=${WORLD_SIZE} \
    --node_rank=${RANK} \
    train.py --epochs=100
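For reference, the following is a minimal sketch of what train.py might contain on the training side, assuming a PyTorch DistributedDataParallel setup with GPUs available (the NCCL backend is used). The launcher sets MASTER_ADDR, MASTER_PORT, RANK, and WORLD_SIZE for every worker process; the model, data, and optimizer below are placeholders.

import argparse
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP


def main() -> None:
    parser = argparse.ArgumentParser()
    # The launcher either passes --local_rank/--local-rank or sets the LOCAL_RANK
    # environment variable, depending on the PyTorch version.
    parser.add_argument("--local_rank", "--local-rank", type=int,
                        default=int(os.environ.get("LOCAL_RANK", 0)))
    parser.add_argument("--epochs", type=int, default=100)
    args = parser.parse_args()

    # MASTER_ADDR, MASTER_PORT, RANK, and WORLD_SIZE are read from the environment
    # variables that the launcher sets for every worker process.
    dist.init_process_group(backend="nccl", init_method="env://")
    torch.cuda.set_device(args.local_rank)

    # Placeholder model and synthetic data; replace with your own network and data loading.
    model = DDP(torch.nn.Linear(10, 1).cuda(args.local_rank), device_ids=[args.local_rank])
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

    for _ in range(args.epochs):
        inputs = torch.randn(32, 10).cuda(args.local_rank)
        loss = model(inputs).mean()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    dist.destroy_process_group()


if __name__ == "__main__":
    main()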

Q: How do I download a model trained in DLC to my local machine?

When submitting a DLC job, you can associate a dataset and configure the startup command to save the training results to the mounted dataset folder.

After the training, the model files are automatically saved to the mount path of the dataset. Then, you can access the corresponding file system to download the model files to your local machine.
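For example, the following sketch saves a PyTorch model's weights under the mounted dataset directory. The /mnt/data/output path and the OUTPUT_DIR variable are assumptions for illustration; replace them with the actual mount path configured for your dataset.

import os

import torch

# Assumption: /mnt/data/output is the mount path of the dataset associated with the job.
output_dir = os.environ.get("OUTPUT_DIR", "/mnt/data/output")
os.makedirs(output_dir, exist_ok=True)

model = torch.nn.Linear(10, 1)  # placeholder for your trained model
torch.save(model.state_dict(), os.path.join(output_dir, "model.pt"))
print(f"Model saved to {os.path.join(output_dir, 'model.pt')}")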

Q: How do I use Docker images in DLC?

Q: How do I release completed nodes in a DLC job for other jobs to use?

Problem description:

When a DLC job runs a multi-worker distributed training task, some workers may finish early and exit. This can happen due to factors such as data skew. However, with the default configuration, these completed workers continue to occupy their scheduled nodes.

Solution: Configure the PAI-DLC advanced parameter ReleaseResourcePolicy. By default, this parameter is not configured, and compute resources are released only after the entire job is complete. If you set it to pod-exit, compute resources are released as soon as a worker exits.

Q: Why does a DLC job return the error OSError: [Errno 116] Stale file handle?

Problem description:

Multiple workers fail with the OSError: [Errno 116] Stale file handle error when executing PyTorch's torch.compile for ahead-of-time compilation. This failure is caused by an inability to read a cache file.

Analysis:

This error typically occurs in Network File System (NFS) environments. It is triggered when a process holds a file handle for a file that has been deleted or moved on the server. The handle becomes stale, and the error occurs when the client tries to use this stale handle to access the file.

Root cause:

PyTorch's ahead-of-time (AOT) compilation caches optimized computation graphs to the file system. By default, the cache is stored in /tmp, which is typically a local tmpfs. However, in cluster computing environments, this cache may be stored on CPFS, a distributed file system mounted over NFS. NFS is sensitive to file deletion operations, so a stale file handle can be triggered under several conditions:

  • PyTorch automatically cleans up old cache files, or another process deletes the cache directory.

  • Files are moved, deleted, or have their permissions changed on the NFS server while clients still hold outdated handles.

  • NFS clients cache file attributes by default, which can prevent them from detecting changes made on the server.

The issue is exacerbated by several environmental factors: CPFS is built on NFS, which is prone to stale file handle errors under high concurrency, especially for short-lifecycle files such as temporary cache files. Concurrent access from multiple workers that read, write, or clean the same cache can also introduce race conditions.

Solution:

  • First, verify the diagnosis by forcing the cache onto local tmpfs: set the environment variable TORCHINDUCTOR_CACHE_DIR to /dev/shm/torch_cache, as shown in the sketch below.
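A minimal sketch of this check follows. It assumes your PyTorch version honors the TORCHINDUCTOR_CACHE_DIR environment variable; set it before torch is imported so that compile artifacts never touch the NFS-backed CPFS mount.

import os

# Redirect the TorchInductor compile cache to node-local tmpfs before importing torch,
# so cache files are created and cleaned up locally instead of on CPFS over NFS.
os.environ["TORCHINDUCTOR_CACHE_DIR"] = "/dev/shm/torch_cache"

import torch

model = torch.nn.Linear(10, 1)  # placeholder model
compiled_model = torch.compile(model)  # cached artifacts now land in /dev/shm/torch_cache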

Alternative solutions:

  1. Follow the official PyTorch documentation to use Redis as a shared cache (this requires setting up a Redis service).

  2. Review the CPFS NFS mount configuration. Using the noac mount option may help, but it reduces performance.

  3. Disable caching entirely by setting TORCHINDUCTOR_CACHE_DIR="". However, this sacrifices compilation performance.