This topic provides answers to some commonly asked questions about GPU-accelerated instances.
What are the driver and CUDA versions of GPU-accelerated instances in Function Compute?
The following items list the versions of the main components of GPU-accelerated instances:
Driver versions: Drivers include kernel-mode drivers (KMDs), such as nvidia.ko, and CUDA user-mode drivers (UMDs), such as libcuda.so. NVIDIA provides the drivers that are used by GPU-accelerated instances in Function Compute. The driver versions used by GPU-accelerated instances may change as a result of feature iteration, new GPU releases, bug fixes, and driver lifecycle expiration. We recommend that you do not add driver-related components to your image. For more information, see Image usage notes.
CUDA Toolkit versions: CUDA Toolkit includes various components, such as CUDA Runtime, cuDNN, and cuFFT. The CUDA Toolkit version is determined by the container image you use.
The GPU driver and the CUDA Toolkit, both released by NVIDIA, must be version-compatible with each other. For more information, see NVIDIA CUDA Toolkit Release Notes.
The current driver version of GPU-accelerated instances in Function Compute is 550.54.15, and the corresponding CUDA UMD version is 12.4. For optimal compatibility, we recommend that you use CUDA Toolkit 11.8 or later, but no later than the CUDA UMD version.
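To confirm which CUDA runtime your image ships and whether the platform's driver is visible to your code, you can run a quick check from inside a function. The following is a minimal sketch that assumes PyTorch is installed in your image:
import torch

# CUDA version that the PyTorch build in your image targets (the toolkit side).
print("CUDA toolkit (runtime) version:", torch.version.cuda)

# Confirms that the NVIDIA driver provided by Function Compute is visible to the container.
print("GPU available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU model:", torch.cuda.get_device_name(0))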
What do I do if "CUFFT_INTERNAL_ERROR" is reported during function execution?
The cuFFT library in CUDA 11.7 has forward compatibility issues. If you encounter this error with newer GPU models, we recommend that you upgrade to at least CUDA 11.8. For more information about GPU models, see Instance Specifications.
Take PyTorch as an example. After the upgrade, you can use the following code snippet for verification. If no errors are reported, the upgrade is successful.
import torch
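# With CUDA 11.8 or later, this GPU FFT call should complete without raising CUFFT_INTERNAL_ERROR.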
out = torch.fft.rfft(torch.randn(1000).cuda())
What do I do if a CUDA GPG error occurs when I build an image?
The following GPG error is reported during the image building process:
W: GPG error: https://developer.download.nvidia.cn/compute/cuda/repos/ubuntu2004/x86_64 InRelease: The following signatures couldn't be verified because the public key is not available: NO_PUBKEY A4B469963BF863CC
E: The repository 'https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64 InRelease' is not signed.
In this case, you can add the following command after the RUN rm line in your Dockerfile and rebuild the image.
RUN apt-key adv --keyserver keyserver.ubuntu.com --recv-keys A4B469963BF863CC
Why is the type of my GPU-accelerated instance g1?
The g1 instance type is equivalent to the fc.gpu.tesla.1 instance type. For more information, see Instance Specifications.
Why do my provisioned GPU-accelerated instances fail to be allocated?
The allocation of provisioned instances may fail due to the following reasons:
What is the limit on the size of a GPU image?
The image size limit applies only to compressed images. You can check the size of a compressed image in the Container Registry console, or run the docker images command to query the size of an image before compression.
In most cases, an uncompressed image smaller than 20 GB can be deployed to Function Compute and will function as expected.
What do I do if a GPU image fails to be converted to an accelerated image?
The time required to convert an image increases with its size, which may cause the conversion to fail. You can re-trigger the conversion of the GPU image by editing and re-saving the function configurations in the Function Compute console. If you want to retain the existing settings, you do not need to actually modify any parameters when you edit the configurations.
Should a model be integrated into or separated from an image?
If your model files are large, undergo frequent iterations, or would exceed the image size limit when published together with the image, we recommend that you separate the model from the image. In such cases, you can store the model in a File Storage NAS (NAS) file system or an Object Storage Service (OSS) bucket.
How do I perform a model warm-up?
We recommend that you warm up your model by using the /initialize method. Production traffic is directed to the instance only after the /initialize-based warm-up is complete. For more information, see the model warm-up topics.
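As an illustration, the warm-up itself usually amounts to running a few dummy inferences so that CUDA contexts and kernels are initialized before real traffic arrives. The following minimal sketch assumes PyTorch and uses a hypothetical load_model() helper in place of your own loading code:
import torch

def load_model():
    # Hypothetical placeholder: load your real model from the image or a NAS mount.
    return torch.nn.Linear(224, 10).cuda().eval()

def warm_up(model, rounds=3):
    # Run a few dummy inferences so that CUDA kernels and caches are initialized.
    dummy = torch.randn(1, 224).cuda()
    with torch.no_grad():
        for _ in range(rounds):
            model(dummy)
    torch.cuda.synchronize()

model = load_model()
warm_up(model)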
What do I do if "[FunctionNotStarted] Function Instance health check failed on port xxx in 120 seconds" is reported when I start a GPU image?
Cause: The AI/GPU application takes too long to start. As a result, the health check of Function Compute fails. In most cases, starting AI/GPU applications is time-consuming due to lengthy model loading times, which can cause the web server startup to time out.
Solutions:
Avoid dynamically loading the model over the Internet during application startup. We recommend that you place the model in an image or a NAS file system and load it from the nearest path.
Place model initialization in the /initialize method and prioritize completing the application startup. In other words, start the web server first and load the model afterwards, as shown in the sketch below.
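The following sketch illustrates this structure with Flask. It assumes that your image includes Flask and PyTorch, that your function is configured to receive the /initialize lifecycle request, and that the listening port (9000 here) and the model-loading code are placeholders for your own setup:
from flask import Flask
import torch

app = Flask(__name__)
model = None  # loaded in /initialize rather than at import time, so the server binds quickly

@app.route("/initialize", methods=["POST"])
def initialize():
    # The lifecycle hook runs before production traffic is routed to the instance,
    # so the slow model load no longer delays the health check on the listening port.
    global model
    model = torch.nn.Linear(224, 10).cuda().eval()  # placeholder for your real model load
    with torch.no_grad():
        model(torch.randn(1, 224).cuda())  # one dummy inference as a warm-up
    return "OK", 200

@app.route("/invoke", methods=["POST"])
def invoke():
    with torch.no_grad():
        out = model(torch.randn(1, 224).cuda())
    return {"output_shape": list(out.shape)}, 200

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=9000)  # bind quickly so the port health check passes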
What do I do if the end-to-end latency of my function is high and fluctuates greatly?
Make sure that the state of image acceleration is Available in the environment information.
Check the type of the NAS file system. If your function needs to read data, such as a model, from a NAS file system, we recommend that you use a Performance NAS file system instead of a Capacity one to ensure optimal performance. For more information, see General-purpose NAS file systems.
What do I do if the system fails to find the NVIDIA driver?
This issue occurs when you use the docker run --gpus all command to start a container and then build an application image by using docker commit. The resulting image contains local NVIDIA driver information, which prevents the driver from being properly mounted after the image is deployed to Function Compute. As a result, the system cannot find the NVIDIA driver.
To solve this issue, we recommend that you use a Dockerfile to build the application image. For more information, see Dockerfile.
In addition, do not add driver-related components to your image. For more information, see Image usage notes.
What do I do if "On-demand invocation of current GPU type is disabled..." is reported on Ada-series GPU-accelerated instances?
The error ResourceExhausted:On-demand invocation of current GPU type is disabled, please provision instances instead usually occurs when the number of incoming requests exceeds the maximum capacity of the current provisioned instances. Since Ada-series GPU-accelerated instances operate only in provisioned mode, we recommend that you increase the number of provisioned instances based on actual business traffic.
What are the usage notes for idle GPU-accelerated instances?
CUDA version
We recommend that you use CUDA 12.2 or an earlier version.
Image permissions
We recommend that you run container images as the default root user.
Instance logon
You cannot log on to an idle GPU-accelerated instance because the GPUs are frozen.
Graceful instance rotation
Function Compute rotates idle GPU-accelerated instances based on the workload. To ensure service quality, we recommend that you add lifecycle hooks to function instances for model warm-up and pre-inference. This way, your inference service can be provided immediately after the launch of a new instance. For more information, see Model service warm-up.
Model warm-up and pre-inference
To reduce the latency of the initial wake-up of an idle GPU-accelerated instance, we recommend that you use the initialize hook in your code to warm up or preload your model. For more information, see Model warm-up.
Provisioned instance configurations
When you turn on the Idle Mode switch, the existing provisioned GPU-accelerated instances for the function are gracefully shut down, and new provisioned instances are allocated shortly after the old ones are released.
Built-in Metrics Server of inference frameworks
To improve the compatibility and performance of idle GPUs, we recommend that you disable the built-in Metrics Server of your inference frameworks, such as NVIDIA Triton Inference Server and TorchServe.
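For example, if you serve your model with NVIDIA Triton Inference Server, you can start it with its metrics endpoint turned off. The following sketch assumes that tritonserver is on the PATH and that your models are stored under a /models directory; --allow-metrics=false disables Triton's built-in Prometheus metrics server:
import subprocess

# Start Triton with its built-in metrics server disabled (paths are assumptions).
subprocess.run(
    [
        "tritonserver",
        "--model-repository=/models",   # assumed model repository location
        "--allow-metrics=false",        # turn off the built-in metrics endpoint
    ],
    check=True,
)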