Elastic GPU Service: What do I do if the GPU fails to work because the nvidia-fabricmanager version is inconsistent with the Tesla driver version?

Last Updated: Sep 30, 2024

On an Ubuntu instance of the GPU-accelerated compute-optimized instance family ebmgn7 or ebmgn7e, the apt-daily service may automatically update nvidia-fabricmanager if you installed it by using an installation package. After such an update, the nvidia-fabricmanager version no longer matches the Tesla driver version, so nvidia-fabricmanager fails to start and the GPU fails to work as expected. This topic describes how to resolve and prevent this issue.

Problem description

After you install nvidia-fabricmanager by using an installation package, the following error message appears when you view the service status. In this case, the GPU fails to work as expected.

[Image: error message shown in the nvidia-fabricmanager service status]
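
The service status mentioned above can be queried with systemctl. The following is a minimal sketch, assuming the systemd unit is named nvidia-fabricmanager (verify the unit name on your instance). The helper name fabricmanager_state is illustrative and is not part of any NVIDIA or Alibaba Cloud tool.

```shell
#!/bin/sh
# Print the state of the Fabric Manager service ("active", "failed",
# "inactive", and so on), or "unknown" if systemd cannot be queried.
# Assumption: the systemd unit is named nvidia-fabricmanager.
fabricmanager_state() {
    if ! command -v systemctl >/dev/null 2>&1; then
        echo "unknown"
        return 0
    fi
    # is-active exits non-zero for any state other than "active",
    # so ignore the exit code and keep only the printed state.
    fm_state=$(systemctl is-active nvidia-fabricmanager 2>/dev/null || true)
    echo "${fm_state:-unknown}"
}

echo "nvidia-fabricmanager state: $(fabricmanager_state)"
```

For the full status text and recent log lines, run sudo systemctl status nvidia-fabricmanager or sudo journalctl -u nvidia-fabricmanager.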

Cause

If you install nvidia-fabricmanager by using an installation package on a GPU-accelerated compute-optimized instance that runs Ubuntu, the apt-daily service automatically updates nvidia-fabricmanager. This results in version inconsistency between nvidia-fabricmanager and the Tesla driver. As a result, nvidia-fabricmanager fails to start and the GPU fails to work as expected.
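
To check whether an automatic update actually changed the package, you can look for nvidia-fabricmanager in the apt history log. The following is a sketch under the assumption that the log is at /var/log/apt/history.log, the default location on Ubuntu. The function name fabricmanager_apt_history is hypothetical.

```shell
#!/bin/sh
# Print every apt transaction in a history log that mentions
# nvidia-fabricmanager. Entries in the log are separated by blank lines,
# so awk's paragraph mode (RS = "") returns each transaction as a whole.
# Assumption: the log is at /var/log/apt/history.log unless a path is given.
fabricmanager_apt_history() {
    log_file=${1:-/var/log/apt/history.log}
    if [ ! -r "$log_file" ]; then
        echo "cannot read $log_file" >&2
        return 1
    fi
    awk 'BEGIN { RS = "" } /nvidia-fabricmanager/ { print; print "" }' "$log_file"
}
```

Call fabricmanager_apt_history without arguments on the instance. An Upgrade entry for nvidia-fabricmanager that you did not run yourself suggests that the package was updated automatically. Older entries may be in rotated, compressed copies of the log, which this sketch does not read.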

Solution

The GPU can work as expected only if the nvidia-fabricmanager version is consistent with the Tesla driver version. To prevent or resolve GPU unavailability caused by version inconsistency between nvidia-fabricmanager and the Tesla driver, perform the following steps:

  1. Check the nvidia-fabricmanager version and the Tesla driver version.

    • Run the following command to check the nvidia-fabricmanager version:

      sudo dpkg --list | grep nvidia-fabricmanager

      In this example, the nvidia-fabricmanager version is 550.90.07. nvidia-fabricmanager-550 is the name of the installation package.

      [Image: dpkg output listing the nvidia-fabricmanager-550 package]

    • Run the following command to check the Tesla driver version:

      nvidia-smi

      In this example, the Tesla driver version is 550.90.07.

      [Image: nvidia-smi output showing the Tesla driver version]

  2. Check whether the current nvidia-fabricmanager version is consistent with the Tesla driver version.

    • If the two versions are consistent, proceed to the next step.

    • If the two versions are inconsistent, perform one of the following operations:

      • Upgrade the Tesla driver to ensure that the Tesla driver version is consistent with the nvidia-fabricmanager version. For more information, see Upgrade an NVIDIA Tesla driver.

      • Uninstall and reinstall nvidia-fabricmanager. Then, proceed to the next step.

        Note

        For information about how to uninstall nvidia-fabricmanager, see Step 1: Uninstall nvidia-fabricmanager.

  3. Run the following command to prevent nvidia-fabricmanager from being automatically updated:

    In this example, the installation package nvidia-fabricmanager-550 is used. Replace the installation package name in the command with the actual nvidia-fabricmanager package name.

    sudo apt-mark hold nvidia-fabricmanager-550

    If the following result is displayed, the package is on hold and is no longer updated automatically.

    [Image: result of the apt-mark hold command]

  4. Run the following command to verify that updates to nvidia-fabricmanager are prohibited:

    sudo apt-mark showhold

    If nvidia-fabricmanager-550 appears in the output, updates to nvidia-fabricmanager are prohibited. In this example, the output lists cloud-init and nvidia-fabricmanager-550.

    [Image: output of the apt-mark showhold command]
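
Steps 1 through 4 can be combined into a small script. The following is a sketch, not an official tool. All function names are illustrative. It assumes that the version reported by dpkg differs from the driver version only by an optional epoch or packaging revision (for example, a trailing -1), and it uses nvidia-fabricmanager-550 as the default package name, as in this topic.

```shell
#!/bin/sh
# Strip an optional epoch ("1:") and the packaging revision ("-1",
# "-0ubuntu1") so a dpkg version can be compared with the driver version.
normalize_version() {
    nv=${1#*:}        # drop everything up to the first colon, if any
    echo "${nv%%-*}"  # drop everything from the first hyphen on
}

# Succeed if both arguments denote the same upstream version.
versions_match() {
    [ "$(normalize_version "$1")" = "$(normalize_version "$2")" ]
}

# Read package names on stdin (one per line, as printed by
# "apt-mark showhold") and succeed if $1 is among them.
is_listed() {
    grep -Fxq -- "$1"
}

# Hold the package only if its version matches the Tesla driver version.
# Must run on the GPU instance; needs dpkg-query, nvidia-smi, and sudo.
hold_if_consistent() {
    pkg=${1:-nvidia-fabricmanager-550}   # example package name from this topic
    fm_version=$(dpkg-query -W -f='${Version}' "$pkg") || return 1
    # nvidia-smi prints one line per GPU; the first line is enough.
    driver_version=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1)
    [ -n "$driver_version" ] || return 1
    if ! versions_match "$fm_version" "$driver_version"; then
        echo "mismatch: $pkg is $fm_version, driver is $driver_version" >&2
        return 1
    fi
    sudo apt-mark hold "$pkg" || return 1
    # Confirm that the package now appears in the hold list.
    apt-mark showhold | is_listed "$pkg"
}
```

On the instance, run hold_if_consistent followed by your actual package name. If the function reports a mismatch, resolve it as described in step 2 before you place the hold.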