NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.
After installing the GPU driver and rebooting the system, running nvidia-smi fails with:

    NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.
Running nvidia-settings fails as well:

    ERROR: NVIDIA driver is not loaded
    ERROR: Unable to load info from any available system
lspci shows that the system does detect the card:

    [root@localhost ~]# lspci | grep -i nvidia
    d8:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 PCIe 16GB] (rev a1)
dkms status shows that the driver module is already known to DKMS:

    dkms status
    nvidia, 430.50: added
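The state at the end of each dkms status line is the part that matters here: DKMS distinguishes added (registered only), built, and installed. The following is a minimal sketch for pulling that state out of the output. The function name dkms_states is hypothetical, and the parsing assumes the state is the first word after the first ": " on each line, which holds for the output shown above but is not guaranteed for every DKMS version.

```shell
#!/bin/sh
# Read `dkms status` lines on stdin and print "<module, version> => <state>".
# Assumption: the state is the first word after the first ": " on the line.
dkms_states() {
    while IFS= read -r line; do
        [ -n "$line" ] || continue   # skip empty lines
        name=${line%%:*}             # everything before the first ":"
        state=${line#*: }            # everything after the first ": "
        state=${state%% *}           # keep only the first word (drops trailing notes)
        printf '%s => %s\n' "$name" "$state"
    done
}

# Usage (requires dkms to be installed):
#   dkms status | dkms_states
```

A state other than installed suggests the module has not been placed into the module tree of the running kernel, which is consistent with nvidia-smi being unable to reach the driver.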
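Solution:
Some people blame a kernel version conflict, and others suggest running nvcc -V. In my case the package I installed matches the kernel version, and running nvcc only reports that the command cannot be found. The nvidia-smi error message itself says to confirm that the driver is installed and running. I had carried out a series of installation steps earlier, so the driver files were certainly on the system. What I could not be sure of was, first, whether the driver was installed correctly, and second, whether the installed driver had actually been loaded into the kernel. dkms shows that the driver module has been added. Note that the state added only means the module is registered with DKMS; it has not yet been built and installed for the running kernel.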
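The second question, whether the driver is actually loaded into the kernel, can be checked directly. Below is a minimal sketch that looks for a module name in /proc/modules, the listing that lsmod formats. The function name module_loaded is hypothetical, and the optional second argument exists only so the check can be tried against a saved listing.

```shell
#!/bin/sh
# Succeed (exit status 0) when the named module appears in a
# /proc/modules-style listing, fail otherwise.
# $1: module name
# $2: listing file (assumed default: /proc/modules)
module_loaded() {
    awk -v m="$1" '$1 == m { found = 1 } END { exit !found }' "${2:-/proc/modules}"
}
```

For example, `module_loaded nvidia && echo loaded || echo "not loaded"` reports the current state. If the module is installed but not loaded, `modprobe nvidia` is the usual way to try loading it by hand.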
1) Find the version number of the NVIDIA driver currently in use

    [root@localhost ~]# ls /usr/src | grep nvidia
    nvidia-440.95.01
2) Reinstall the driver (use your own version number, taken from the output of the previous step)

    dkms install -m nvidia -v 440.95.01
    [root@localhost dev]# dkms install -m nvidia -v 440.95.01

    Kernel preparation unnecessary for this kernel.  Skipping...

    Building module:
    cleaning build area...
    'make' -j2 module SYSSRC=/lib/modules/3.10.0-957.el7.x86_64/build IGNORE_XEN_PRESENCE=1 IGNORE_PREEMPT_RT_PRESENCE=1 IGNORE_CC_MISMATCH=1...................
    cleaning build area...

    DKMS: build completed.

    nvidia.ko.xz:
    Running module version sanity check.
     - Original module
       - No original module exists within this kernel
     - Installation
       - Installing to /lib/modules/3.10.0-957.el7.x86_64/extra/

    nvidia-modeset.ko.xz:
    Running module version sanity check.
     - Original module
       - No original module exists within this kernel
     - Installation
       - Installing to /lib/modules/3.10.0-957.el7.x86_64/extra/

    nvidia-drm.ko.xz:
    Running module version sanity check.
     - Original module
       - No original module exists within this kernel
     - Installation
       - Installing to /lib/modules/3.10.0-957.el7.x86_64/extra/

    nvidia-uvm.ko.xz:
    Running module version sanity check.
     - Original module
       - No original module exists within this kernel
     - Installation
       - Installing to /lib/modules/3.10.0-957.el7.x86_64/extra/
    Adding any weak-modules

    depmod....

    DKMS: install completed.
    [root@localhost dev]# nvidia-smi
    Wed Jun  2 05:53:08 2021
    +-----------------------------------------------------------------------------+
    | NVIDIA-SMI 440.95.01    Driver Version: 440.95.01    CUDA Version: 10.2     |
    |-------------------------------+----------------------+----------------------+
    | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
    |===============================+======================+======================|
    |   0  Tesla V100-PCIE...  Off  | 00000000:D8:00.0 Off |                    0 |
    | N/A   25C    P0    36W / 250W |      0MiB / 16160MiB |      3%      Default |
    +-------------------------------+----------------------+----------------------+

    +-----------------------------------------------------------------------------+
    | Processes:                                                       GPU Memory |
    |  GPU       PID   Type   Process name                             Usage      |
    |=============================================================================|
    |  No running processes found                                                 |
    +-----------------------------------------------------------------------------+
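Steps 1) and 2) can be chained so the version number does not have to be copied by hand. This is only a sketch: the function names nvidia_src_versions and print_dkms_commands are hypothetical, it assumes the driver sources live in directories named nvidia-<version> under /usr/src as in step 1), and it prints the dkms command instead of running it so the result can be reviewed first.

```shell
#!/bin/sh
# Print the version part of every nvidia-<version> directory under a source root.
# $1: source root (assumed default: /usr/src)
nvidia_src_versions() {
    for dir in "${1:-/usr/src}"/nvidia-*/; do
        [ -d "$dir" ] || continue          # the glob matched nothing
        dir=${dir%/}                       # strip the trailing slash
        printf '%s\n' "${dir##*/nvidia-}"  # keep only what follows "nvidia-"
    done
}

# Print (do not run) one dkms install command per version found.
print_dkms_commands() {
    nvidia_src_versions "$1" | while IFS= read -r ver; do
        printf 'dkms install -m nvidia -v %s\n' "$ver"
    done
}

# Usage:
#   print_dkms_commands            # inspect the output, then run it as root
```

Printing first is a deliberate choice: if more than one nvidia-<version> directory is present, it is safer to pick the intended version manually than to install all of them.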
Category: linux
(base) user@ubuntu:/usr/src$ sudo dkms install -m nvidia -v 460.84
Module nvidia/460.84 already installed on kernel 5.4.0-77-generic/x86_64
(base) user@ubuntu:/usr/src$ nvidia-smi
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.
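I followed the steps and everything went smoothly up to this point, but here dkms reports that the module is already installed, and nvidia-smi still does not work.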