
Function Compute: FAQ about GPU-accelerated instances

Last Updated: Jun 26, 2024

This topic provides answers to some commonly asked questions about GPU-accelerated instances.

What is the version of the driver used by GPU-accelerated instances of Function Compute?

The current driver version is 535.161.08.

NVIDIA provides the drivers that are used by GPU-accelerated instances of Function Compute. The driver version may change in the future because of feature iterations, releases of new card models, bug fixes, or the end of a driver's lifecycle. We recommend that you do not pin a driver version in container images. For more information, see Image usage notes.

What is the CUDA version of GPU-accelerated instances of Function Compute?

The CUDA version varies based on the container image that you use. We recommend that you use CUDA 11.x or later in Function Compute.

What do I do if a CUDA GPG error is reported when I build an image?

The following GPG error is reported during the image building process:

W: GPG error: https://developer.download.nvidia.cn/compute/cuda/repos/ubuntu2004/x86_64  InRelease: The following signatures couldn't be verified because the public key is not available: NO_PUBKEY A4B469963BF863CC
E: The repository 'https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64  InRelease' is not signed.

To resolve the error, add the following command after the RUN rm command line in the Dockerfile, and then rebuild the image:

RUN apt-key adv --keyserver keyserver.ubuntu.com --recv-keys A4B469963BF863CC

Why is the instance type of my GPU-accelerated instance g1?

The g1 instance type is the same as fc.gpu.tesla.1. For more information, see the "Instance types" section of the Instance types and usage modes topic.

Why do my provisioned GPU-accelerated instances fail to be allocated?

The allocation of provisioned instances may fail due to the following reasons:

  • The startup of the provisioned instances times out.

    • Error code: FunctionNotStarted.

    • Error message: Function instance health check failed on port XXX in 120 seconds.

    • Solution: Check the application startup logic for steps that download models from the Internet or load large models (larger than 10 GB). We recommend that you start the web server before you run the model loading logic.

  • The maximum number of instances at the function level or region level is reached.

    • Error code: ResourceThrottled.

    • Error message: Reserve resource exceeded limit.

    • Solution: If you need more physical GPUs, join the DingTalk group (11721331) for technical support.

What is the limit on the size of a GPU image?

The image size limit applies only to compressed images. You can view the size of a compressed image in the Container Registry console. To query the size of an uncompressed image, run the docker images command.

In most cases, an uncompressed image that is smaller than 20 GB in size can be deployed to Function Compute and used as expected.
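The check described above can be scripted. The following is a minimal sketch, not an official tool: it assumes that the Docker CLI is installed locally, treats the 20 GB figure as a rule of thumb rather than a hard limit, and counts 1 GB as 10**9 bytes. The function names are hypothetical.

```python
import subprocess

# Assumption: the 20 GB figure from the text is a guideline, not an enforced limit,
# and 1 GB is counted as 10**9 bytes.
SIZE_GUIDELINE_BYTES = 20 * 10**9


def within_guideline(size_bytes: int, limit_bytes: int = SIZE_GUIDELINE_BYTES) -> bool:
    """Return True if the uncompressed image size is below the guideline."""
    return size_bytes < limit_bytes


def local_image_size(image: str) -> int:
    """Ask the local Docker CLI for the uncompressed size of an image, in bytes."""
    out = subprocess.run(
        ["docker", "image", "inspect", "--format", "{{.Size}}", image],
        check=True,
        capture_output=True,
        text=True,
    ).stdout
    return int(out.strip())
```

For example, within_guideline(local_image_size("my-gpu-app:latest")) reports whether a local image falls under the guideline. The image name is a placeholder.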

What do I do if a GPU image fails to be converted to an accelerated image?

The time required to convert an image increases with the size of the image, so the conversion of a large image may fail. To re-trigger the acceleration conversion of the GPU image, edit and save the function configuration in the Function Compute console. You do not need to change any parameters if you want to retain the existing settings.

Should a model be integrated into or separated from an image?

We recommend that you integrate the model into the image. This way, the model reuses the image cache, which accelerates distribution without generating additional storage costs.

If the model cannot be integrated into the image due to reasons such as an oversized model (larger than 5 GB), we recommend that you store the model in Apsara File Storage NAS (NAS) and load the model when you start applications. We recommend that you use a General-purpose NAS file system of the Performance type, instead of the Capacity type. For more information, see General-purpose NAS file systems.

How do I perform a model warm-up?

We recommend that you warm up your model in the /initialize method. Production traffic is directed to the model only after the /initialize method completes. You can refer to the following topics to learn more about model warm-up:
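As an illustration of what a warm-up step might do, the following sketch runs a few throwaway inferences so that one-time costs are paid before real requests arrive. It assumes that the model exposes a callable that accepts one sample. The warm_up helper and its parameters are hypothetical and are not part of the Function Compute API.

```python
import time
from typing import Any, Callable, List


def warm_up(predict: Callable[[Any], Any], sample: Any, rounds: int = 3) -> List[float]:
    """Run a few throwaway inferences and return the latency of each round in seconds."""
    latencies = []
    for _ in range(rounds):
        start = time.perf_counter()
        predict(sample)  # the result is discarded; only the side effects matter
        latencies.append(time.perf_counter() - start)
    return latencies
```

A handler for /initialize could call warm_up(model, dummy_input) after it loads the model. If the first round is much slower than the later rounds, the warm-up is absorbing one-time initialization work.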

What do I do if the [FunctionNotStarted] Function Instance health check failed on port xxx in 120 seconds error is reported when I start a GPU image?

  • Cause: AI/GPU applications take a long time to start, usually because loading the model is slow. The web server therefore does not start within the timeout period, and the health check performed by Function Compute fails.

  • Solution

    • Do not dynamically load the model from the Internet when the application starts. We recommend that you place the model in an image or in a NAS file system and load the model from the nearest path.

    • Place model initialization in the /initialize method so that the application itself starts first. That is, load the model after the web server is started.

      Note

      For more information about the lifecycle of a function instance, see Function instance lifecycle.
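The ordering recommended above (start the web server first, load the model in /initialize) can be sketched with the Python standard library. This is an illustration under stated assumptions, not the official runtime contract: port 9000, the /invoke path, and the load_model placeholder are assumptions, so adjust them to your function configuration.

```python
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

_model = None                    # populated by /initialize, not at import time
_model_lock = threading.Lock()


def load_model():
    """Hypothetical placeholder: replace with loading from the image or a NAS path."""
    return lambda text: text.upper()


class Handler(BaseHTTPRequestHandler):
    def _reply(self, status: int, body: bytes) -> None:
        self.send_response(status)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def do_POST(self):
        global _model
        length = int(self.headers.get("Content-Length", 0))
        payload = self.rfile.read(length)
        if self.path == "/initialize":
            # The slow work happens here, after the port is already open.
            with _model_lock:
                if _model is None:
                    _model = load_model()
            self._reply(200, b"initialized")
        elif self.path == "/invoke":  # assumed request path
            if _model is None:
                self._reply(503, b"model not loaded")
            else:
                self._reply(200, _model(payload.decode()).encode())
        else:
            self._reply(404, b"not found")

    def log_message(self, *args):  # keep the sketch quiet
        pass


def make_server(port: int = 9000) -> ThreadingHTTPServer:
    """Bind the port immediately; no model work happens here."""
    return ThreadingHTTPServer(("0.0.0.0", port), Handler)
```

A container entry point would call make_server().serve_forever(). Because the port is bound before any model is loaded, the health check can succeed even when load_model is slow.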

What do I do if the end-to-end latency of my function is large and fluctuates greatly?

  1. Make sure that the state of image acceleration is Available in the environment information.

  2. Check the type of the NAS file system. If your function needs to read data, such as a model, from a NAS file system, we recommend that you use a General-purpose NAS file system of the Performance type, instead of the Capacity type, to ensure the performance. For more information, see General-purpose NAS file systems.
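To judge whether storage reads contribute to the latency, you can time a sequential read of the model file from inside the function. The following is a rough sketch. The helper name is hypothetical, and repeated runs may be served from the operating system page cache, so the first run is the most representative.

```python
import time


def read_throughput_mb_s(path: str, chunk_size: int = 8 * 1024 * 1024) -> float:
    """Read a file sequentially and return the observed throughput in MB/s."""
    total = 0
    start = time.perf_counter()
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            total += len(chunk)
    elapsed = time.perf_counter() - start
    # Guard against a zero interval on very small files.
    return (total / 1e6) / elapsed if elapsed > 0 else float("inf")
```

Comparing the result for a file on the NAS mount with the result for a file inside the image gives a first indication of whether the file system type is the bottleneck.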

What do I do if the system fails to find the NVIDIA driver?

This issue occurs when you run the docker run --gpus all command to start a container and then use the docker commit command to build an application image from that container. An image built this way prevents the driver from being mounted, so the NVIDIA driver cannot be found after the image is deployed to Function Compute.

To resolve the issue, we recommend that you use a Dockerfile to build the application image. For more information, see Dockerfile.

Do not pin a driver version in a container image. For more information, see Image usage notes.
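To confirm at startup whether the driver is visible inside the container, an application can query nvidia-smi and log the result. The following is a minimal sketch that assumes nvidia-smi is on the PATH when the driver is mounted. The function name is hypothetical.

```python
import shutil
import subprocess
from typing import Optional


def detect_driver_version() -> Optional[str]:
    """Return the NVIDIA driver version reported by nvidia-smi, or None if unavailable."""
    if shutil.which("nvidia-smi") is None:
        return None  # the tool is not visible, so the driver is probably not mounted
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
        capture_output=True,
        text=True,
    )
    if result.returncode != 0 or not result.stdout.strip():
        return None
    # One line is printed per GPU; the first line is enough for a log message.
    return result.stdout.strip().splitlines()[0]
```

Logging the returned value (or the fact that it is None) at startup makes a missing driver visible in the function logs. Because the driver version can change, the value is best used for diagnostics only, not for version checks in application logic.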