[Nouveau] Struggle with GPU lockups and console deadlock using kernel-space modifications

Mon Oct 5 01:30:13 PDT 2015

Hello.

I have a poorly functioning GeForce 8600 GTS (rev a1) video card
that causes many problems for the box where it’s installed,
primarily GPU lockups (sometimes unprovoked),
several times a day.
Without intervention, a GPU lockup leaves the system console
unusable, keyboard included, because
switching from Xorg to the text console (TUI) is blocked; see below.

Upgrading Linux from 3.16 to 4.3 and improving the cooling
reduced some minor problems (such as snow),
but didn’t prevent the lockups.
So here are some experiments and reflections
on mitigating GPU lockups.

First, I hoped to use the bus hardware reset control, found at
/sys/devices/pci0000:00/0000:00:01.0/0000:01:00.0/reset
in my system.
The #nouveau channel on freenode offered some insights afterwards.
But attempts to unload and reload the nouveau module
after a lockup did not succeed, for several reasons.
I envisaged the following sequence that could recover a computer
from a GPU lockup, with the system console usable again
and applications not disturbed too much.

1. Suspend all applications using the video card.
2. If necessary, perform hardware reset on the bus.
3. If necessary, run initialization functions for the video card.
4. Restore a usable video mode (can be achieved
  by switching virtual consoles Xorg ⇔ TUI, for example).
5. Resume the work.
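
For comparison, steps 2 and 4 can be approximated from user space.
The sketch below is only that: it assumes a root shell, a VT layer
that still responds (which is exactly what a lockup tends to break),
and VT numbers 1 and 7, which are guesses. DRY_RUN, run(), pci_reset()
and vt_round_trip() are names made up for this example. Steps 1, 3
and 5 are left out.

```shell
#!/bin/sh
# Sketch of steps 2 and 4 only, from user space, as root.
# DRY_RUN, run(), pci_reset() and vt_round_trip() are illustration-only names.
DRY_RUN=${DRY_RUN:-1}    # 1 = only print what would be done
DEV=${DEV:-/sys/devices/pci0000:00/0000:00:01.0/0000:01:00.0}

# Run a command, or just print it in dry-run mode.
run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

# Step 2: ask the PCI core to reset the function behind directory $1.
pci_reset() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "would write 1 to $1/reset"
    elif [ -e "$1/reset" ]; then
        echo 1 > "$1/reset"
    else
        echo "no reset attribute under $1" >&2
        return 1
    fi
}

# Step 4: switch to VT $1, then to VT $2 (chvt is from the kbd package).
vt_round_trip() {
    run chvt "$1" && run chvt "$2"
}

pci_reset "$DEV" && vt_round_trip 1 7    # VT numbers are assumptions
```

With the default DRY_RUN=1 the script only prints the actions;
setting DRY_RUN=0 performs them.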

Implementation proved a challenge.
First, loading nouveau with config=NvForcePost=1
doesn’t result in a usable console, either during Linux startup or later.
With my hardware it produces no signal at all.
I tested this with at least three different versions of nouveau,
on both Linux 3.16 and Linux 4.3.

There are major problems with performing step 4 from kernel mode.
The Linux kernel has the «set_console(nr)» function (in vt.c),
but doesn’t export it.
Moreover, in modern kernels (apparently since Linux 3)
even this internal kernel function checks
«vt_dont_switch», hence a deadlock can ensue.
Other functions are not exported either, even such high-level ones
as «suspend_console()» and «resume_console()».
I was even unable to find their real addresses
in memory using /proc/kallsyms: it shows only zeros
(that wasn’t the case for Linux 2.6).
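
A quick way to see what /proc/kallsyms reports for these symbols is
sketched below. lookup_sym and all_zero are helper names made up for
this example. A possible explanation for the zeros, not verified for
this box, is the kernel.kptr_restrict sysctl, which makes the file
show zeroed addresses to readers lacking sufficient privilege.

```shell
#!/bin/sh
# lookup_sym FILE NAME: print the address column of symbol NAME, if listed.
lookup_sym() {
    [ -r "$1" ] || return 0
    awk -v s="$2" '$3 == s { print $1; exit }' "$1"
}

# all_zero ADDR: succeed only if ADDR is a non-empty string of zeros.
all_zero() {
    case $1 in
        ''|*[!0]*) return 1 ;;
    esac
    return 0
}

KALLSYMS=${KALLSYMS:-/proc/kallsyms}
for sym in set_console suspend_console resume_console; do
    addr=$(lookup_sym "$KALLSYMS" "$sym")
    if [ -z "$addr" ]; then
        echo "$sym: not listed"
    elif all_zero "$addr"; then
        echo "$sym: address hidden"
    else
        echo "$sym: $addr"
    fi
done

# Show the sysctl that may be responsible (an assumption, see above).
if [ -r /proc/sys/kernel/kptr_restrict ]; then
    echo "kptr_restrict=$(cat /proc/sys/kernel/kptr_restrict)"
fi
```

Running it once as an ordinary user and once as root shows whether
privilege makes a difference.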

Some partial experiments, performed over remote shell access,
are described below.

With any module and no lockup:
stop Xorg;
echo 0 >/sys/class/vtconsole/vtcon1/bind;
rmmod nouveau;
modprobe nouveau; — success.
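
The sequence above, written out as a script, might look like the
sketch below. It assumes a root shell, that Xorg has already been
stopped by whatever means the system uses, and that vtcon1 is the
framebuffer console (as on my system; the «name» file in each
/sys/class/vtconsole/vtcon* directory tells which one it is).
DRY_RUN, run(), unbind_console() and reload_nouveau() are names
made up for this example.

```shell
#!/bin/sh
# Sketch: detach the framebuffer console and reload nouveau.
# Xorg must already be stopped before this runs.
DRY_RUN=${DRY_RUN:-1}    # 1 = only print what would be done
VTCON=${VTCON:-/sys/class/vtconsole/vtcon1}

# Run a command, or just print it in dry-run mode.
run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

# Detach the console driver so the module's use count can drop.
unbind_console() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "would write 0 to $VTCON/bind"
    else
        echo 0 > "$VTCON/bind"
    fi
}

# reload_nouveau [MODULE_OPTIONS...]
reload_nouveau() {
    unbind_console &&
    run rmmod nouveau &&
    run modprobe nouveau "$@"
}

reload_nouveau
```

Module options can be passed through, for example
«reload_nouveau config=NvForcePost=1».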

With standard nouveau module on Linux 4.3:
 • stop Xorg; reset device; modprobe nouveau; —
   the module won’t initialize.
 • stop Xorg; reset device; reload module (as above) —
   won’t work, symptoms differ from case to case.

With modified nouveau modules, after a lockup:
 • «nvkm_device_init(⧦);» (at ⧦->devinit->post = false) —
   no effect.
 • reset device while Xorg runs —
   system crash or deadlock, nothing in logs.
 • «⧦->devinit->post = true; nvkm_device_init(⧦);» without reset —
   to be tested.
(«⧦» points to the card’s «struct nvkm_device» object.)

My modified nouveau module was derived
from git://people.freedesktop.org/~darktama/nouveau

Any proposals or suggestions?
Please think generally and don’t focus too much on my particular case.
My video card (and, possibly, some related stuff on the motherboard)
almost certainly functions improperly.

Regards, Incnis Mrsi