This topic provides answers to some commonly asked questions about GPU-accelerated instances.
What are the driver and CUDA versions of GPU-accelerated instances in Function Compute?
The following items list the versions of the main components of GPU-accelerated instances:
Driver versions: Drivers include kernel-mode drivers (KMD), such as nvidia.ko, and CUDA user-mode drivers (UMD), such as libcuda.so. The drivers used by GPU-accelerated instances of Function Compute are provided by NVIDIA and deployed by Function Compute. The driver versions used by GPU-accelerated instances may change as a result of iterations, releases of new card models, bug fixes, and driver lifecycle expiration. We recommend that you do not specify a specific driver version in container images. For more information, see Image usage notes.
CUDA Toolkit versions: CUDA Toolkit includes various components, such as CUDA Runtime, cuDNN, and cuFFT. The CUDA Toolkit version is determined by the container image you use.
GPU drivers and CUDA Toolkit are released by NVIDIA and related to each other. For more information, see NVIDIA CUDA Toolkit Release Notes.
The current driver version of GPU-accelerated instances in Function Compute is 550.54.15, and the version of the CUDA module is 12.4. For best compatibility, we recommend that you use CUDA Toolkit 11.8 or later and do not use a CUDA UMD whose version is later than that of the platform.
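As a quick illustration of the version rule above, the following sketch compares the CUDA UMD version bundled in an image against the platform's CUDA 12.4. The helper names are illustrative, not a Function Compute API; the platform values are taken from the text above.

```python
# Hypothetical helper: verify that the CUDA user-mode driver (UMD) version
# in your image does not exceed the platform's CUDA version (12.4 at the
# time of writing, per the text above).

def parse_version(v: str) -> tuple:
    """Turn a dotted version string such as '12.4' into a comparable tuple."""
    return tuple(int(part) for part in v.split("."))

PLATFORM_CUDA = "12.4"  # CUDA version supported by the platform driver

def umd_is_compatible(image_cuda: str) -> bool:
    """A CUDA UMD newer than the platform's CUDA version may fail to load."""
    return parse_version(image_cuda) <= parse_version(PLATFORM_CUDA)

print(umd_is_compatible("11.8"))  # True: 11.8 <= 12.4
print(umd_is_compatible("12.6"))  # False: newer than the platform
```

The tuple comparison handles multi-digit components (for example, 12.10 vs 12.4) correctly, which a plain string comparison would not.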
What do I do if CUFFT_INTERNAL_ERROR is reported during a function execution?
The cuFFT library in CUDA 11.7 has forward compatibility issues. If you encounter this error on GPU card types newer than the Ampere architecture, we recommend that you upgrade CUDA to 11.8 or later. For more information about GPU card types, see Instance types and usage modes.
Take PyTorch as an example. After the upgrade, you can run the following code snippet for verification. If no error is reported, the upgrade took effect.
import torch
out = torch.fft.rfft(torch.randn(1000).cuda())
What do I do if a CUDA GPG error is reported when I build an image?
The following GPG error is reported during the image building process:
W: GPG error: https://developer.download.nvidia.cn/compute/cuda/repos/ubuntu2004/x86_64 InRelease: The following signatures couldn't be verified because the public key is not available: NO_PUBKEY A4B469963BF863CC
E: The repository 'https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64 InRelease' is not signed.
In this case, add the following command after the RUN rm line in your Dockerfile and rebuild the image.
RUN apt-key adv --keyserver keyserver.ubuntu.com --recv-keys A4B469963BF863CC
Why is the instance type of my GPU-accelerated instance g1?
The g1 instance type is the same as fc.gpu.tesla.1. For more information, see Instance specifications.
Why do my provisioned GPU-accelerated instances fail to be allocated?
The allocation of provisioned instances may fail due to the following reasons:
What is the limit on the size of a GPU image?
The image size limit applies only to compressed images. You can view the size of a compressed image in the Container Registry console, and run the docker images command to query the size of an uncompressed image.
In most cases, an uncompressed image that is smaller than 20 GB in size can be deployed to Function Compute and used as expected.
What do I do if a GPU image fails to be converted to an accelerated image?
The time required to convert an image increases as the size of your image grows. A timeout error may cause a conversion failure. You can re-trigger the acceleration conversion of the GPU image by configuring and saving the function configurations in the Function Compute console. You do not need to modify the parameters if you want to retain existing settings.
Should a model be integrated into or separated from an image?
If your model files are large, frequently iterated, or exceed the size limit on images when they are published with an image, we recommend that you separate the model from the image. If the model is small in size, for example, 100 MB, and not frequently changed, you can distribute the model file with the image. For more information about the limits on the size of a platform image, see What is the limit on the size of a GPU image?
To deploy a model separately from an image, you can store the model in a File Storage NAS (NAS) file system or an Object Storage Service (OSS) file system. The model is loaded from the mount target when the application starts. For more information, see Configure a NAS file system and Configure an OSS file system.
We recommend that you store models in a Performance NAS file system, which is compatible with Portable Operating System Interface (POSIX) and has a high initial read bandwidth that helps reduce the time required to load your model. For more information, see General-purpose NAS file systems.
You can also store models in OSS buckets. This way, you can use OSS accelerators to achieve lower latency and higher throughput. If you mount an OSS file system, requests are processed in user mode, which consumes instance memory and temporary storage space. Therefore, we recommend that you use OSS on GPU-accelerated instances with large specifications.
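The separation approach above can be sketched as follows. The mount path and file name are hypothetical stand-ins (the demo writes a dummy file to a temporary directory so the snippet runs anywhere), and a real loader would deserialize the weights, for example with torch.load, instead of reading raw bytes.

```python
import os
import tempfile
import time

def load_model(model_dir: str) -> bytes:
    """Read model weights from a mount target (e.g. a NAS or OSS mount)
    instead of baking them into the container image."""
    path = os.path.join(model_dir, "model.bin")  # illustrative file name
    if not os.path.exists(path):
        raise FileNotFoundError(f"model not found at {path}; check the mount")
    start = time.time()
    with open(path, "rb") as f:
        weights = f.read()  # a real loader would deserialize here
    print(f"loaded {len(weights)} bytes in {time.time() - start:.3f}s")
    return weights

# Demo: a temporary directory stands in for the real mount target.
demo_dir = tempfile.mkdtemp()
with open(os.path.join(demo_dir, "model.bin"), "wb") as f:
    f.write(b"\x00" * 1024)  # dummy weights
weights = load_model(demo_dir)
```

On a real instance you would pass the configured mount path rather than a temporary directory, and the initial read bandwidth of the file system dominates the load time.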
How do I perform a model warm-up?
We recommend that you warm up your model in the /initialize method. Production traffic is directed to the model only after the warm-up performed in /initialize is complete.
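As a sketch of this warm-up flow, the following minimal HTTP handler performs the heavy loading when the platform sends POST /initialize and rejects inference requests until that has happened. MODEL and warm_up are illustrative names, not a platform API.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

MODEL = {"ready": False}  # stand-in for the loaded model object

def warm_up():
    """Load the model and run one dummy inference so that the first real
    request does not pay the loading cost."""
    # e.g. model = torch.load("/mnt/nas/models/model.pt"); model(dummy_input)
    MODEL["ready"] = True

class Handler(BaseHTTPRequestHandler):
    def do_POST(self):
        if self.path == "/initialize":
            warm_up()  # the platform calls /initialize before routing traffic
            self._reply(200, {"status": "initialized"})
        elif not MODEL["ready"]:
            self._reply(503, {"error": "not initialized"})
        else:
            self._reply(200, {"result": "inference output"})

    def _reply(self, code, body):
        data = json.dumps(body).encode()
        self.send_response(code)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(data)))
        self.end_headers()
        self.wfile.write(data)

# To serve for real: HTTPServer(("0.0.0.0", 9000), Handler).serve_forever()
```

Because /initialize returns only after warm_up completes, the first production request already hits a loaded model.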
What do I do if the "[FunctionNotStarted] Function Instance health check failed on port xxx in 120 seconds" error is reported when I start a GPU image?
Cause: AI/GPU applications take a long time to start, which causes the health check in the Function Compute console to fail. In most cases, startup is slow because model loading takes a long time, which in turn causes the startup of the web server to time out.
Solution:
Do not dynamically load the model over the Internet when the application starts. We recommend that you include the model in the image or store it in a NAS file system, and load it from the local or mounted path.
Place model initialization in the /initialize method so that the application itself starts first. Specifically, load the model after the web server has started.
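The "server first, model afterwards" ordering can be sketched with a plain socket and a background thread. The port choice and the 0.1-second sleep are stand-ins; in a real function the slow load would run in the /initialize handler or a background task.

```python
import socket
import threading
import time

model_ready = threading.Event()

def slow_model_load():
    """Stands in for minutes of real model loading."""
    time.sleep(0.1)
    model_ready.set()

def open_server_port() -> socket.socket:
    """Bind and listen immediately: the platform's health check only needs
    the port to be open, not the model to be loaded."""
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.bind(("127.0.0.1", 0))  # 0 = any free port, for this demo
    srv.listen()
    return srv

srv = open_server_port()                         # port open: health check passes
threading.Thread(target=slow_model_load, daemon=True).start()
model_ready.wait(timeout=5)                      # requests block until ready
print("model ready:", model_ready.is_set())
srv.close()
```

Binding the port before loading the model keeps the health check within its time limit even when loading takes minutes.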
What do I do if the end-to-end latency of my function is large and fluctuates greatly?
Make sure that the state of image acceleration is Available in the environment information.
Check the type of the NAS file system. If your function needs to read data, such as a model, from a NAS file system, we recommend that you use a Performance NAS file system instead of a Capacity NAS file system to ensure performance. For more information, see General-purpose NAS file systems.
What do I do if the system fails to find the NVIDIA driver?
This issue occurs when you run the docker run --gpus all command to start a container and then use docker commit to build the application image. As a result, the driver cannot be mounted and the NVIDIA driver cannot be found after the image is deployed to Function Compute.
To resolve the issue, we recommend that you use Dockerfile to build an application image. For more information, see Dockerfile.
Do not specify a specific driver version in a container image. For more information, see Image usage notes.
What do I do if "On-demand invocation of current GPU type is disabled..." is reported on Ada-series GPU-accelerated instances?
Generally, ResourceExhausted:On-demand invocation of current GPU type is disabled, please provision instances instead is reported because the actual number of requests exceeds the maximum number that your provisioned instances can handle. GPU-accelerated instances of the Ada card type support only the provisioned mode. We recommend that you increase the number of provisioned instances based on the actual request volume.
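To estimate how many provisioned instances you need, a back-of-the-envelope calculation based on Little's law can help. The formula and the numbers below are illustrative assumptions, not platform guidance; measure your own peak QPS and latency.

```python
import math

def provisioned_instances(peak_qps: float, avg_latency_s: float,
                          per_instance_concurrency: int) -> int:
    """Little's law: concurrent requests = QPS x latency; divide by how
    many requests one instance can handle at a time, rounding up."""
    concurrent = peak_qps * avg_latency_s
    return math.ceil(concurrent / per_instance_concurrency)

# e.g. 20 requests/s, 1.5 s inference latency, 1 request per instance:
print(provisioned_instances(20, 1.5, 1))  # 30 instances
```

Rounding up and adding headroom for traffic spikes avoids hitting the on-demand-disabled error at peak load.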
What are the usage notes for idle GPU instances?
CUDA version
We recommend that you use CUDA 12.2 or an earlier version.
Image permissions
We recommend that you run container images as the default root user.
Instance logon
You cannot log on to an idle GPU-accelerated instance because the GPU cards are frozen.
Model warmup and pre-inference
To ensure that the latency of the initial wake-up of an idle GPU-accelerated instance meets your business requirements, we recommend that you use the /initialize hook in your business code to warm up or preload your model. For more information, see Model warm-up.