For an instance that runs Ubuntu and belongs to the GPU-accelerated compute-optimized instance family ebmgn7 or ebmgn7e, the apt-daily service may automatically update nvidia-fabricmanager if you install nvidia-fabricmanager by using an installation package. This results in version inconsistency between nvidia-fabricmanager and the Tesla driver. As a result, nvidia-fabricmanager fails to start and the GPU fails to work as expected. This topic provides the solution to this issue.
Problem description
After you install nvidia-fabricmanager by using an installation package, an error message appears when you view the service status, and the GPU fails to work as expected.
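To view the service status yourself, you can query systemd. The following is a minimal sketch, assuming the service is registered under the unit name nvidia-fabricmanager; adjust the name if your installation differs. The helper fm_service_state and the FM_UNIT variable are names introduced here for illustration only.

```shell
#!/usr/bin/env bash
# Assumption: the Fabric Manager systemd unit is named "nvidia-fabricmanager".
FM_UNIT="${FM_UNIT:-nvidia-fabricmanager}"

# Print the unit state reported by systemd (for example "active" or "failed"),
# or "unknown" if systemctl is unavailable or returns nothing.
fm_service_state() {
  local state=""
  if command -v systemctl >/dev/null 2>&1; then
    state="$(systemctl is-active "$FM_UNIT" 2>/dev/null || true)"
  fi
  printf '%s\n' "${state:-unknown}"
}

echo "${FM_UNIT} state: $(fm_service_state)"

# For the full status and recent log entries, run these manually:
#   sudo systemctl status nvidia-fabricmanager --no-pager
#   sudo journalctl -u nvidia-fabricmanager -n 50 --no-pager
```

A state other than active, combined with GPU errors, suggests that the version check described in the Solution section is worth running.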
Cause
If you install nvidia-fabricmanager by using an installation package on a GPU-accelerated compute-optimized instance that runs Ubuntu, the apt-daily service automatically updates nvidia-fabricmanager. This results in version inconsistency between nvidia-fabricmanager and the Tesla driver. As a result, nvidia-fabricmanager fails to start and the GPU fails to work as expected.
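One way to check whether an automatic update actually touched the package is to search the apt history. The following is a rough sketch, assuming apt writes its history to the Ubuntu default location /var/log/apt/history.log; the function name fm_upgrade_events is hypothetical.

```shell
#!/usr/bin/env bash
# Print every apt history entry (a blank-line separated paragraph) that
# mentions nvidia-fabricmanager.
# Assumption: apt history is stored in /var/log/apt/history.log.
fm_upgrade_events() {
  local log="${1:-/var/log/apt/history.log}"
  if [ ! -r "$log" ]; then
    echo "Cannot read ${log}" >&2
    return 0
  fi
  # An empty RS switches awk to paragraph mode, so each apt transaction
  # is handled as one record and printed in full when it matches.
  awk 'BEGIN { RS = ""; ORS = "\n\n" } /nvidia-fabricmanager/' "$log"
}

fm_upgrade_events
```

An entry that contains an Upgrade line for nvidia-fabricmanager and was not started by a command you ran yourself is consistent with the automatic update described above. Older entries may have been rotated into compressed files next to the log, which this sketch does not read.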
Solution
The GPU can work as expected only if the nvidia-fabricmanager version is consistent with the Tesla driver version. To prevent or resolve GPU unavailability caused by version inconsistency between nvidia-fabricmanager and the Tesla driver, perform the following steps:
Check the nvidia-fabricmanager version and the Tesla driver version.
Run the following command to check the nvidia-fabricmanager version:
sudo dpkg --list | grep nvidia-fabricmanager
In this example, the nvidia-fabricmanager version is 550.90.07, and nvidia-fabricmanager-550 is the name of the installation package.
Run the following command to check the Tesla driver version:
nvidia-smi
In this example, the Tesla driver version is 550.90.07.
Check whether the current nvidia-fabricmanager version is consistent with the Tesla driver version.
If the two versions are consistent, proceed to the next step.
If the two versions are inconsistent, perform one of the following operations:
Upgrade the Tesla driver to ensure that the Tesla driver version is consistent with the nvidia-fabricmanager version. For more information, see Upgrade an NVIDIA Tesla driver.
Uninstall and reinstall nvidia-fabricmanager. Then, proceed to the next step.
Note: For information about how to uninstall nvidia-fabricmanager, see Step 1: Uninstall nvidia-fabricmanager.
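The version comparison in the preceding steps can also be scripted. The following is a minimal sketch, not a definitive implementation. It assumes that the package is named nvidia-fabricmanager-550 as in this example, that the package version reported by dpkg may carry a Debian revision suffix such as -1, and that nvidia-smi is on the PATH. The function names upstream_version and compare_versions and the FM_PACKAGE variable are introduced here for illustration.

```shell
#!/usr/bin/env bash
# Assumption: the package name matches the example used in this topic.
FM_PACKAGE="${FM_PACKAGE:-nvidia-fabricmanager-550}"

# Reduce a dpkg version string to its upstream part, for example
# "1:550.90.07-1" -> "550.90.07" (drops an optional epoch and revision).
upstream_version() {
  local v="$1"
  v="${v#*:}"   # remove "epoch:" if present
  v="${v%-*}"   # remove "-revision" if present
  printf '%s\n' "$v"
}

# Print "consistent" if the normalized package version equals the driver
# version, otherwise "inconsistent" (also when either value is empty).
compare_versions() {
  local fm driver
  fm="$(upstream_version "$1")"
  driver="$2"
  if [ -n "$fm" ] && [ "$fm" = "$driver" ]; then
    echo "consistent"
  else
    echo "inconsistent"
  fi
}

# Only query the system when both tools are available.
if command -v dpkg-query >/dev/null 2>&1 && command -v nvidia-smi >/dev/null 2>&1; then
  fm_raw="$(dpkg-query -W -f='${Version}' "$FM_PACKAGE" 2>/dev/null || true)"
  driver_raw="$(nvidia-smi --query-gpu=driver_version --format=csv,noheader 2>/dev/null | head -n 1 || true)"
  echo "nvidia-fabricmanager: ${fm_raw:-not installed}"
  echo "Tesla driver:         ${driver_raw:-not detected}"
  compare_versions "$fm_raw" "$driver_raw"
fi
```

Set FM_PACKAGE to the actual package name before you run the script if your instance uses a different nvidia-fabricmanager package.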
Run the following command to prevent nvidia-fabricmanager from being automatically updated:
In this example, the installation package nvidia-fabricmanager-550 is used. Replace the installation package name in the command with the actual nvidia-fabricmanager package name.
sudo apt-mark hold nvidia-fabricmanager-550
If the command output indicates that the package is set on hold, nvidia-fabricmanager is prohibited from being updated.
Run the following command to verify that updates to nvidia-fabricmanager are prohibited:
sudo apt-mark showhold
In this example, the output lists cloud-init and nvidia-fabricmanager-550. If nvidia-fabricmanager-550 is listed, updates to nvidia-fabricmanager are prohibited.
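To find the actual package name and confirm the hold in one pass, the two checks can be combined. The following sketch makes some assumptions: the helper names fm_packages and hold_state are hypothetical, and a package counts as installed when the second status character in the dpkg --list output is i (which covers both the ii and hi status codes). The script only reports the state and prints the hold command for review; it does not change anything.

```shell
#!/usr/bin/env bash
# Read `dpkg --list` output on stdin and print the names of installed
# packages whose name starts with nvidia-fabricmanager. The status column
# is "ii" for a normal package and "hi" for a package that is on hold.
fm_packages() {
  awk '$1 ~ /^[ih]i/ && $2 ~ /^nvidia-fabricmanager/ { sub(/:.*$/, "", $2); print $2 }'
}

# Read `apt-mark showhold` output on stdin and print "held" if the package
# given as $1 appears as an exact line, otherwise "not held".
hold_state() {
  if grep -Fxq -- "$1"; then
    echo "held"
  else
    echo "not held"
  fi
}

# Only query the system when both tools are available.
if command -v dpkg >/dev/null 2>&1 && command -v apt-mark >/dev/null 2>&1; then
  held="$(apt-mark showhold 2>/dev/null || true)"
  for pkg in $(dpkg --list 2>/dev/null | fm_packages); do
    state="$(printf '%s\n' "$held" | hold_state "$pkg")"
    echo "${pkg}: ${state}"
    if [ "$state" = "not held" ]; then
      echo "  to hold it, run: sudo apt-mark hold ${pkg}"
    fi
  done
fi
```

The exact-line match in hold_state avoids treating a similarly named package as held.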