[Nouveau] nouveau locking up on Debian Jessie.

Ilia Mirkin imirkin at alum.mit.edu
Sat Mar 21 07:35:18 PDT 2015


On Fri, Mar 20, 2015 at 6:02 PM, Megaf <mmegaf at gmail.com> wrote:
> Hi Ladies and Gentlemen.
>
> I'd like to report a possible bug and ask for help in solving it.
>
> I have a Mid 2010 Macbook Pro running Debian Jessie.
> The bug seems to happen at random, but mainly when using a 3D application
> with other apps running, such as Chromium (Chrome) and Iceweasel (Firefox).
> It happens when watching videos with VLC and mplayer2.
>
> I hope someone will manage to help.
> Thanks.
>
>
>
> [*] Here is some information on the GPU itself according to kernel messages.
> ======================================================
>
> Mar 17 23:31:31 MacSam kernel: [    4.116055] nouveau  [
> PTHERM][0000:04:00.0] FAN control: none / external
> Mar 17 23:31:31 MacSam kernel: [    4.116064] nouveau  [
> PTHERM][0000:04:00.0] fan management: automatic
> Mar 17 23:31:31 MacSam kernel: [    4.116068] nouveau  [
> PTHERM][0000:04:00.0] internal sensor: yes
> Mar 17 23:31:31 MacSam kernel: [    4.116092] nouveau  [
> CLK][0000:04:00.0] 03: core 405 MHz shader 405 MHz memory 405 MHz
> Mar 17 23:31:31 MacSam kernel: [    4.116098] nouveau  [
> CLK][0000:04:00.0] 07: core 450 MHz shader 810 MHz memory 450 MHz
> Mar 17 23:31:31 MacSam kernel: [    4.116101] nouveau  [
> CLK][0000:04:00.0] 0e: core 450 MHz shader 810 MHz memory 450 MHz
> Mar 17 23:31:31 MacSam kernel: [    4.116104] nouveau  [
> CLK][0000:04:00.0] 0f: core 450 MHz shader 950 MHz memory 450 MHz
> Mar 17 23:31:31 MacSam kernel: [    4.116119] nouveau  [
> CLK][0000:04:00.0] --: core 405 MHz shader 810 MHz
> Mar 17 23:31:31 MacSam kernel: [    4.116168] nouveau W[
> PCE0][0000:04:00.0] disabled, PCE0=1 to enable
> Mar 17 23:31:31 MacSam kernel: [    4.116301] [TTM] Zone  kernel: Available
> graphics memory: 1897648 kiB
> Mar 17 23:31:31 MacSam kernel: [    4.116303] [TTM] Initializing pool
> allocator
> Mar 17 23:31:31 MacSam kernel: [    4.116309] [TTM] Initializing DMA pool
> allocator
> Mar 17 23:31:31 MacSam kernel: [    4.116321] nouveau  [     DRM] VRAM: 256
> MiB
> Mar 17 23:31:31 MacSam kernel: [    4.116323] nouveau  [     DRM] GART:
> 1048576 MiB
> Mar 17 23:31:31 MacSam kernel: [    4.116327] nouveau  [     DRM] TMDS table
> version 2.0
> Mar 17 23:31:31 MacSam kernel: [    4.116329] nouveau  [     DRM] DCB
> version 4.0
> Mar 17 23:31:31 MacSam kernel: [    4.116332] nouveau  [     DRM] DCB outp
> 00: 01800113 00010030
> Mar 17 23:31:31 MacSam kernel: [    4.116334] nouveau  [     DRM] DCB outp
> 01: 020112a6 0f220010
> Mar 17 23:31:31 MacSam kernel: [    4.116336] nouveau  [     DRM] DCB outp
> 02: 02011262 00020010
> Mar 17 23:31:31 MacSam kernel: [    4.116337] nouveau  [     DRM] DCB conn
> 00: 00000040
> Mar 17 23:31:31 MacSam kernel: [    4.116339] nouveau  [     DRM] DCB conn
> 01: 00101146
> Mar 17 23:31:31 MacSam kernel: [    4.124954] [drm] Supports vblank
> timestamp caching Rev 2 (21.10.2013).
> Mar 17 23:31:31 MacSam kernel: [    4.124956] [drm] Driver supports precise
> vblank timestamp query.
> Mar 17 23:31:31 MacSam kernel: [    4.136978] nouveau  [     DRM] MM: using
> M2MF for buffer copies
> Mar 17 23:31:31 MacSam kernel: [    4.151394] input: HDA NVidia Headphone as
> /devices/pci0000:00/0000:00:08.0/sound/card0/input10
> Mar 17 23:31:31 MacSam kernel: [    4.151527] input: HDA NVidia
> HDMI/DP,pcm=3 as /devices/pci0000:00/0000:00:08.0/sound/card0/input11
> Mar 17 23:31:31 MacSam kernel: [    4.151654] input: HDA NVidia
> HDMI/DP,pcm=7 as /devices/pci0000:00/0000:00:08.0/sound/card0/input12
> Mar 17 23:31:31 MacSam kernel: [    4.151780] input: HDA NVidia
> HDMI/DP,pcm=8 as /devices/pci0000:00/0000:00:08.0/sound/card0/input13
> Mar 17 23:31:31 MacSam kernel: [    4.252701] nouveau  [     DRM] allocated
> 1280x800 fb: 0x50000, bo ffff8800aa383800
> Mar 17 23:31:31 MacSam kernel: [    4.252825] fbcon: nouveaufb (fb0) is
> primary device
> Mar 17 23:31:31 MacSam kernel: [    4.326232] Console: switching to colour
> frame buffer device 160x50
> Mar 17 23:31:31 MacSam kernel: [    4.328317] nouveau 0000:04:00.0: fb0:
> nouveaufb frame buffer device
> Mar 17 23:31:31 MacSam kernel: [    4.328319] nouveau 0000:04:00.0:
> registered panic notifier
> Mar 17 23:31:31 MacSam kernel: [    4.336093] usb 4-5: new low-speed USB
> device number 3 using ohci-pci
> Mar 17 23:31:31 MacSam kernel: [    4.340098] [drm] Initialized nouveau
> 1.1.2 20120801 for 0000:04:00.0 on minor 0
>
> It's an nvidia 320M by the way.
>
> [*] And here is the bug itself.
> ===================
>
> Mar 20 21:18:42 MacSam kernel: [41316.923835] nouveau E[  PGRAPH][0000:04:00.0] DATA_ERROR INVALID_VALUE
> Mar 20 21:18:42 MacSam kernel: [41316.923848] nouveau E[  PGRAPH][0000:04:00.0] ch 3 [0x000fb2a000 Xorg[19317]] subc 2 class 0x502d mthd 0x060c data 0x00044110

The data value (0x00044110) does seem a little high... the code that
produces this is in the ddx's NV50EXASolid (src/nv50_exa.c):

        BEGIN_NV04(push, NV50_2D(DRAW_POINT32_X(0)), 4);
        PUSH_DATA (push, x1);
        PUSH_DATA (push, y1);
        PUSH_DATA (push, x2);
        PUSH_DATA (push, y2);

It's suggesting that the y2 value is 0x44110 (mthd 0x060c is the fourth
word of this command, which starts at 0x0600). This seems awfully high,
unless you have a REALLY hi-dpi screen.

However that value also looks an awful lot like a regular pushbuf
command... this would correspond to

size = 1, subchannel = 2, method = 0x110

which is a SERIALIZE call, which happens at the start of NV50EXACopy.
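
To make the decoding above easy to repeat on other suspicious data
values, here is a minimal sketch. It assumes the same header layout that
BEGIN_NV04 emits, (size << 18) | (subc << 13) | mthd. The field widths
in the masks and the names pushbuf_hdr, decode_hdr and encode_hdr are
mine for illustration, not something taken from the ddx:

```c
#include <stdint.h>

/* Fields of an NV04-style pushbuf command header (illustrative struct). */
struct pushbuf_hdr {
    unsigned size; /* number of data words following the header */
    unsigned subc; /* subchannel the method is sent to */
    unsigned mthd; /* method offset in bytes */
};

/* Split a 32-bit word as if it were a header built as
 * (size << 18) | (subc << 13) | mthd.
 * Assumed widths: 11-bit size, 3-bit subchannel, 13-bit method offset. */
static struct pushbuf_hdr decode_hdr(uint32_t word)
{
    struct pushbuf_hdr h;
    h.size = (word >> 18) & 0x7ff;
    h.subc = (word >> 13) & 0x7;
    h.mthd = word & 0x1ffc;
    return h;
}

/* Inverse of decode_hdr, mirroring what BEGIN_NV04 pushes. */
static uint32_t encode_hdr(unsigned size, unsigned subc, unsigned mthd)
{
    return ((uint32_t)size << 18) | ((uint32_t)subc << 13) | mthd;
}
```

Feeding it 0x00044110 gives size 1, subchannel 2, method 0x110, which
is the reading given above.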

> Mar 20 21:18:42 MacSam kernel: [41316.945939] nouveau E[   PFIFO][0000:04:00.0] DMA_PUSHER - ch 3 [Xorg[19317]] get 0x002002c61c put 0x002002ca80 ib_get 0x000001d5 ib_put 0x000001da state 0x80000024 (err: INVALID_CMD) push 0x003060b0
> Mar 20 21:18:42 MacSam kernel: [41316.956144] nouveau E[  PGRAPH][0000:04:00.0] magic set 0:
> Mar 20 21:18:42 MacSam kernel: [41316.956152] nouveau E[  PGRAPH][0000:04:00.0]     0x00408604: 0x20090d0f
> Mar 20 21:18:42 MacSam kernel: [41316.956155] nouveau E[  PGRAPH][0000:04:00.0]     0x00408608: 0x00206365
> Mar 20 21:18:42 MacSam kernel: [41316.956159] nouveau E[  PGRAPH][0000:04:00.0]     0x0040860c: 0x80000432
> Mar 20 21:18:42 MacSam kernel: [41316.956162] nouveau E[  PGRAPH][0000:04:00.0]     0x00408610: 0x62100003
> Mar 20 21:18:42 MacSam kernel: [41316.956165] nouveau E[  PGRAPH][0000:04:00.0] TRAP_TEXTURE - TP0:  FAULT
> Mar 20 21:18:42 MacSam kernel: [41316.956176] nouveau E[  PGRAPH][0000:04:00.0] ch 3 [0x000fb2a000 Xorg[19317]] subc 2 class 0x502d mthd 0x08dc data 0x00000000

And this is the end of the blit call (0x8dc is the last param). The
simplest explanation is that X called into the DDX's EXA copy and
solid handlers at the same time, and they wrote over each other.
However I didn't think that such a thing would be possible, so perhaps
there's something else going on.

The rest of the pushbuf just reads like errors due to a
misaligned/confused pushbuf, which could easily happen as a result of
the above.

>
>
> [*] dmesg | grep nouveau:
> ======================
>
> [    2.654061] fb: switching to nouveaufb from simple
> [    2.666836] nouveau 0000:04:00.0: enabling device (0006 -> 0007)
> [    2.667790] nouveau  [  DEVICE][0000:04:00.0] BOOT0  : 0x0af000a2
> [    2.667793] nouveau  [  DEVICE][0000:04:00.0] Chipset: MCP89 (NVAF)
> [    2.667795] nouveau  [  DEVICE][0000:04:00.0] Family : NV50
> [    2.667837] nouveau  [   VBIOS][0000:04:00.0] checking PRAMIN for
> image...
> [    2.730295] nouveau  [   VBIOS][0000:04:00.0] ... appears to be valid
> [    2.730298] nouveau  [   VBIOS][0000:04:00.0] using image from PRAMIN
> [    2.730399] nouveau  [   VBIOS][0000:04:00.0] BIT signature found
> [    2.730402] nouveau  [   VBIOS][0000:04:00.0] version 70.89.02.00.00
> [    2.756101] nouveau 0000:04:00.0: irq 45 for MSI/MSI-X
> [    2.756117] nouveau  [     PMC][0000:04:00.0] MSI interrupts enabled
> [    2.756141] nouveau  [     PFB][0000:04:00.0] RAM type: stolen system
> memory
> [    2.756143] nouveau  [     PFB][0000:04:00.0] RAM size: 256 MiB
> [    2.756145] nouveau  [     PFB][0000:04:00.0]    ZCOMP: 0 tags
> [    2.757742] nouveau  [    VOLT][0000:04:00.0] GPU voltage: 810000uv
> [    4.120027] nouveau  [  PTHERM][0000:04:00.0] FAN control: none /
> external
> [    4.120036] nouveau  [  PTHERM][0000:04:00.0] fan management: automatic
> [    4.120040] nouveau  [  PTHERM][0000:04:00.0] internal sensor: yes
> [    4.120057] nouveau  [     CLK][0000:04:00.0] 03: core 405 MHz shader 405
> MHz memory 405 MHz
> [    4.120061] nouveau  [     CLK][0000:04:00.0] 07: core 450 MHz shader 810
> MHz memory 450 MHz
> [    4.120064] nouveau  [     CLK][0000:04:00.0] 0e: core 450 MHz shader 810
> MHz memory 450 MHz
> [    4.120068] nouveau  [     CLK][0000:04:00.0] 0f: core 450 MHz shader 950
> MHz memory 450 MHz
> [    4.120083] nouveau  [     CLK][0000:04:00.0] --: core 405 MHz shader 810
> MHz
> [    4.120130] nouveau W[    PCE0][0000:04:00.0] disabled, PCE0=1 to enable
> [    4.120260] nouveau  [     DRM] VRAM: 256 MiB
> [    4.120262] nouveau  [     DRM] GART: 1048576 MiB
> [    4.120266] nouveau  [     DRM] TMDS table version 2.0
> [    4.120268] nouveau  [     DRM] DCB version 4.0
> [    4.120271] nouveau  [     DRM] DCB outp 00: 01800113 00010030
> [    4.120273] nouveau  [     DRM] DCB outp 01: 020112a6 0f220010
> [    4.120275] nouveau  [     DRM] DCB outp 02: 02011262 00020010
> [    4.120277] nouveau  [     DRM] DCB conn 00: 00000040
> [    4.120279] nouveau  [     DRM] DCB conn 01: 00101146
> [    4.141082] nouveau  [     DRM] MM: using M2MF for buffer copies
> [    4.252704] nouveau  [     DRM] allocated 1280x800 fb: 0x50000, bo
> ffff8800a820e000
> [    4.252820] fbcon: nouveaufb (fb0) is primary device
> [    4.326530] nouveau 0000:04:00.0: fb0: nouveaufb frame buffer device
> [    4.326533] nouveau 0000:04:00.0: registered panic notifier
> [    4.340171] [drm] Initialized nouveau 1.1.2 20120801 for 0000:04:00.0 on
> minor 0
> [    7.847812] nouveau W[    PCE0][0000:04:00.0] disabled, PCE0=1 to enable
> [   57.278684] nouveau W[    PCE0][0000:04:00.0] disabled, PCE0=1 to enable
> [   80.746844] nouveau W[    PCE0][0000:04:00.0] disabled, PCE0=1 to enable
>
>
> Linux MacSam 3.16.0-4-amd64 #1 SMP Debian 3.16.7-ckt7-1 (2015-03-01) x86_64
> GNU/Linux
> /etc/debian_version
> 8.0
>
> # glxinfo
> name of display: :0.0
> display: :0  screen: 0
> direct rendering: Yes
> server glx vendor string: SGI
> server glx version string: 1.4

And I assume at some point it says "Gallium 0.4 on NVAF"?

>
> X.Org X Server 1.16.4
> Release Date: 2014-12-20
> X Protocol Version 11, Revision 0
> Build Operating System: Linux 3.16.0-4-amd64 x86_64 Debian
> Current Operating System: Linux MacSam 3.16.0-4-amd64 #1 SMP Debian
> 3.16.7-ckt7-1 (2015-03-01) x86_64
> Kernel command line: BOOT_IMAGE=/boot/vmlinuz-3.16.0-4-amd64
> root=UUID=cdf52b77-b313-4cd2-b4b8-f16561761833 ro quiet
> Build Date: 11 February 2015  12:32:02AM
> xorg-server 2:1.16.4-1 (http://www.debian.org/support)
> Current version of pixman: 0.32.6
>     Before reporting problems, check http://wiki.x.org
>     to make sure that you have the latest version.

Hm, too bad. I was hoping you had some ancient version of X I could
blame all this on :(

Are you comfortable modifying C code? If so, what happens if you add
locking to the DDX? i.e. explicitly acquire a mutex at the start of
all the Prepare methods, and release it at the end of the Done method
(hm, does that still get called if prepare fails? probably not, so be
careful).
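
A rough sketch of the locking pattern I mean, not a patch. The names
exa_lock, real_prepare_solid, real_done_solid and the locked_* wrappers
are hypothetical stand-ins; the real Prepare/Done hooks in the ddx take
pixmap arguments that are left out here. It also assumes Done is not
called after a failed Prepare, which is why the failure path unlocks:

```c
#include <pthread.h>
#include <stdbool.h>

/* One lock shared by every Prepare/Done pair (hypothetical name). */
static pthread_mutex_t exa_lock = PTHREAD_MUTEX_INITIALIZER;

/* Stand-ins for the driver's existing Prepare/Done bodies. The 'ok'
 * argument only simulates Prepare succeeding or failing. */
static bool real_prepare_solid(bool ok) { return ok; }
static void real_done_solid(void) { }

/* Take the lock before any commands are pushed. */
static bool locked_prepare_solid(bool ok)
{
    pthread_mutex_lock(&exa_lock);
    if (!real_prepare_solid(ok)) {
        /* Assumption: Done will not run after a failed Prepare,
         * so the lock has to be dropped here or it leaks. */
        pthread_mutex_unlock(&exa_lock);
        return false;
    }
    return true; /* lock stays held until the matching Done */
}

/* Release the lock once the operation's commands are all pushed. */
static void locked_done_solid(void)
{
    real_done_solid();
    pthread_mutex_unlock(&exa_lock);
}
```

The same wrapper shape would go around the copy and composite hooks,
all sharing the one lock. If the errors go away with this in place,
that points at the handlers interleaving.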

  -ilia

