Friday, February 5, 2010

Ubuntu fails to load nvidia kernel module

As much a note to myself as a warning to others…

After a kernel upgrade, my Ubuntu Karmic started the X server in low-resolution mode. My Xorg.0.log said:

(II) LoadModule: "nvidia"
(II) Loading /usr/lib/xorg/modules/drivers//nvidia_drv.so
(II) Module nvidia: vendor="NVIDIA Corporation"
        compiled for 4.0.2, module version = 1.0.0
        Module class: X.Org Video Driver
(EE) NVIDIA: Failed to load the NVIDIA kernel module. Please check your
(EE) NVIDIA:     system's kernel log for additional error messages.
(II) UnloadModule: "nvidia"
(II) Unloading /usr/lib/xorg/modules/drivers//nvidia_drv.so
(EE) Failed to load module "nvidia" (module-specific error, 0)
(EE) No drivers available.

I tried modprobe nvidia, but the module did not actually exist. It should be installed by the package nvidia-185-kernel-source, which was present on my system. However, it turns out that the kernel module is compiled on the fly by DKMS (Dynamic Kernel Module Support), which in turn is controlled by a program called jockey.
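In hindsight, a quicker way to confirm that the module was never built would have been to look for the file itself. Here is a minimal sketch, assuming modules live under /lib/modules/<kernel release>; the helper name module_built is made up for illustration. Running dkms status should also show what DKMS believes it has built for each kernel.

```shell
# Hypothetical helper: succeed if a module file called "$1" exists
# anywhere under the module tree "$2" (default: the running kernel's tree).
module_built() {
    name=$1
    tree=${2:-/lib/modules/$(uname -r)}
    # Matches nvidia.ko as well as compressed variants such as nvidia.ko.gz.
    [ -n "$(find "$tree" -name "$name.ko*" 2>/dev/null | head -n 1)" ]
}

if module_built nvidia; then
    echo "nvidia module found for kernel $(uname -r)"
else
    echo "no nvidia module for kernel $(uname -r); it was probably never built"
fi
```

If this reports that the module is missing, modprobe has nothing to load and the problem lies in the build step, not in X.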

It is possible to force a recompile using dpkg-reconfigure:

$ sudo dpkg-reconfigure nvidia-185-kernel-source
Removing all DKMS Modules
Done.
Loading new nvidia-185.18.36 DKMS files...
Building for architecture x86_64
Module build for the currently running kernel was skipped since the
kernel source for this kernel does not seem to be installed.

I need the kernel source, eh? Why the hell is that not a dependency, if the driver package is useless without it? Anyway, let's install the kernel source then:

$ uname -r
2.6.31-19-generic
$ sudo apt-get install linux-source-2.6.31

Installs fine, but makes no difference. Turns out that dpkg-reconfigure was lying: I just need the headers. Here we go:

$ sudo apt-get install linux-headers-2.6.31-19-generic
...
$ sudo dpkg-reconfigure nvidia-185-kernel-source
Removing all DKMS Modules
Done.
Loading new nvidia-185.18.36 DKMS files...
Building for architecture x86_64
Building initial module for 2.6.31-19-generic
Done.

nvidia.ko:
Running module version sanity check.
 - Original module
   - No original module exists within this kernel
 - Installation
   - Installing to /lib/modules/2.6.31-19-generic/updates/dkms/

depmod......

DKMS: install Completed.
$ modprobe nvidia
$

That's better.
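For next time, the check I should have done first: are the headers for the running kernel installed? A small sketch, assuming a Debian-style system with dpkg-query; the helper name headers_package is my own invention, and the package naming simply follows the pattern seen above.

```shell
# Hypothetical helper: build the headers package name for a kernel release,
# e.g. 2.6.31-19-generic -> linux-headers-2.6.31-19-generic.
headers_package() {
    printf 'linux-headers-%s\n' "${1:-$(uname -r)}"
}

pkg=$(headers_package)
# dpkg-query prints the package status; "install ok installed" means present.
if dpkg-query -W -f='${Status}' "$pkg" 2>/dev/null | grep -q 'install ok installed'; then
    echo "$pkg is installed"
else
    echo "$pkg is missing; try: sudo apt-get install $pkg"
fi
```

On a system without dpkg-query this falls through to the "missing" branch, so treat the result as a hint only.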

Several bug reports indicate similar problems, but the current way this is handled is terribly inadequate. The driver package should pull in the kernel headers if it needs them. There was no warning about this when the kernel was upgraded. There was no warning when the module failed to compile on boot. A fix for a problem with the same symptoms was released back in December; another one is in the upcoming Lucid release.

Oh yeah, I ended up rebooting my system. Whatever happened to Ctrl+Alt+Backspace? (Answer: it has been disabled by default since Ubuntu 9.04.)

Update, 2010-08-04: After another kernel upgrade, my display driver was hosed again. After hours of tinkering, I typed sudo dpkg-reconfigure nvidia-current and was greeted with the message gzip: stdout: No space left on device. Apparently, my /boot partition was full of abandoned kernels. Something to check if you run into similar problems! Also, the kernel module appears to have been renamed from nvidia to nvidia-current.
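A quick way to spot the full /boot before losing hours to it is to look at df. The sketch below parses the POSIX output format of df -P; the helper name percent_used and the 90% threshold are arbitrary choices of mine, not anything official.

```shell
# Hypothetical helper: read `df -P` output on stdin and print the use
# percentage (digits only) of the first filesystem listed.
percent_used() {
    awk 'NR == 2 { sub(/%$/, "", $5); print $5 }'
}

pct=$(df -P /boot 2>/dev/null | percent_used) || pct=
case $pct in
    ''|*[!0-9]*)
        echo "could not determine /boot usage" ;;
    *)
        if [ "$pct" -ge 90 ]; then
            echo "/boot is ${pct}% full; old kernels may need removing"
        else
            echo "/boot is ${pct}% full"
        fi ;;
esac
```

If /boot turns out to be nearly full, dpkg -l 'linux-image-*' lists the installed kernel packages, and the ones no longer needed can be removed with apt-get.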