Radeon lockup on 3.8.5-201.fc18.x86_64

Tue Apr 23 10:08:20 PDT 2013

On Mon, Apr 22, 2013 at 10:55 PM, Michel Dänzer <michel at daenzer.net> wrote:
> On Mon, 2013-04-22 at 16:19 -0700, Andy Lutomirski wrote:
>> On Thu, Apr 18, 2013 at 2:12 PM, Alex Deucher <alexdeucher at gmail.com> wrote:
>> > On Thu, Apr 18, 2013 at 5:11 PM, Andy Lutomirski <luto at amacapital.net> wrote:
>> >> On Mon, Apr 8, 2013 at 7:01 AM, Alex Deucher <alexdeucher at gmail.com> wrote:
>> >>> On Fri, Apr 5, 2013 at 5:11 PM, Andy Lutomirski <luto at amacapital.net> wrote:
>> >>>> Every day or so, I'll click something and my screens go blank for a
>> >>>> second or two.  dmesg complains about a lockup, and afterwards
>> >>>> everything is painfully slow.  (Even switching focus to other emacs
>> >>>> windows takes a second or two.)  Once this happens, if I restart X, I
>> >>>> get a blank screen, although the mouse still works and I can switch
>> >>>> VTs and use the console.
>> >>>
>> >>> Try disabling hyperZ.  Set env var R600_HYPERZ=0 (mesa 9.1) or
>> >>> R600_DEBUG=nohyperz (mesa git).
>> >>
>> >> It lasted longer.  I have both of those environment variables set on
>> >> the Xorg process but not on clients.  Do I need them everywhere?
>> >
>> > For anything that uses the 3D driver.
>>
>> This didn't appear to fix it, although it may have fixed some
>> graphical glitches in gmail's compose window.
>
> Seems rather unlikely that's directly related to HyperZ, but who knows.
>
>
>> [350788.530966] radeon 0000:08:00.0: GPU lockup CP stall for more than 40769msec
>> [350788.530970] radeon 0000:08:00.0: GPU lockup (waiting for 0x000000000000178f last fence id 0x000000000000178e)
>> [350788.532047] radeon 0000:08:00.0: Saved 103 dwords of commands on ring 0.
>> [350788.532051] radeon 0000:08:00.0: GPU softreset: 0x00000003
>> [350788.547792] radeon 0000:08:00.0:   GRBM_STATUS               = 0xA0003828
>> [350788.547794] radeon 0000:08:00.0:   GRBM_STATUS_SE0           = 0x00000007
>> [350788.547797] radeon 0000:08:00.0:   GRBM_STATUS_SE1           = 0x00000007
>> [350788.547799] radeon 0000:08:00.0:   SRBM_STATUS               = 0x200000C0
>> [350788.547802] radeon 0000:08:00.0:   R_008674_CP_STALLED_STAT1 = 0x00000000
>> [350788.547805] radeon 0000:08:00.0:   R_008678_CP_STALLED_STAT2 = 0x00000000
>> [350788.547807] radeon 0000:08:00.0:   R_00867C_CP_BUSY_STAT     = 0x00000004
>> [350788.547810] radeon 0000:08:00.0:   R_008680_CP_STAT          = 0x80008647
>> [350788.547811] radeon 0000:08:00.0:   GRBM_SOFT_RESET=0x00007F6B
>> [350788.547866] radeon 0000:08:00.0:   GRBM_STATUS               = 0x00003828
>> [350788.547869] radeon 0000:08:00.0:   GRBM_STATUS_SE0           = 0x00000007
>> [350788.547872] radeon 0000:08:00.0:   GRBM_STATUS_SE1           = 0x00000007
>> [350788.547874] radeon 0000:08:00.0:   SRBM_STATUS               = 0x200000C0
>> [350788.547877] radeon 0000:08:00.0:   R_008674_CP_STALLED_STAT1 = 0x00000000
>> [350788.547879] radeon 0000:08:00.0:   R_008678_CP_STALLED_STAT2 = 0x00000000
>> [350788.547882] radeon 0000:08:00.0:   R_00867C_CP_BUSY_STAT     = 0x00000000
>> [350788.547884] radeon 0000:08:00.0:   R_008680_CP_STAT          = 0x00000000
>> [350788.565361] radeon 0000:08:00.0: GPU reset succeeded, trying to resume
>> [350788.583801] [drm] probing gen 2 caps for device 8086:1d1a = 2/0
>> [350788.583807] [drm] enabling PCIE gen 2 link speeds, disable with radeon.pcie_gen2=0
>> [350788.590840] [drm] PCIE GART of 512M enabled (table at 0x0000000000040000).
>> [350788.590976] radeon 0000:08:00.0: WB enabled
>> [350788.590978] radeon 0000:08:00.0: fence driver on ring 0 use gpu addr 0x0000000040000c00 and cpu addr 0xffff880442f58c00
>> [350788.590979] radeon 0000:08:00.0: fence driver on ring 3 use gpu addr 0x0000000040000c0c and cpu addr 0xffff880442f58c0c
>> [350788.607480] [drm] ring test on 0 succeeded in 2 usecs
>> [350788.607560] [drm] ring test on 3 succeeded in 1 usecs
>> [350788.615053] [drm] ib test on ring 0 succeeded in 0 usecs
>> [350788.615133] [drm] ib test on ring 3 succeeded in 1 usecs
>>
>> I'm not convinced there's an actual hang.  40 seconds is a long time,
>> and I've only ever seen this when clicking something, and when this
>> happens, the screen goes blank immediately (not after a 40 second
>> delay).
>
> Hmm, now that you mention this, I notice in your original report it
> claims that the CP stalled for 'more than 5102593msec', which is clearly
> bogus. Looks like something's wrong with the lockup detection.
> Did this start after a kernel update or something like that?

It's recent.  It may have started when F18 switched from the 3.7
kernel to 3.8.

I think there are bugs in the lockup detection and in the lockup
recovery.  Firefox, in particular, is *really* slow afterwards.  Are
interrupts possibly getting dropped or misconfigured during the reset?
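
To make the suspicious stall times easier to compare across reports,
here is a rough sketch that pulls the "CP stall for more than N msec"
figure out of dmesg text and works out when the stall would have had to
begin.  The regex only assumes the message format quoted above; the
function name parse_stalls is made up for illustration, and nothing
here reflects how the driver itself computes the value.

```python
import re

# Matches the radeon lockup message format quoted in this thread.
# Assumption: other kernel versions may word the message differently.
STALL_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+radeon\s+\S+\s+GPU lockup CP stall "
    r"for more than (?P<ms>\d+)msec"
)


def parse_stalls(log_text):
    """Return (report_time_s, stall_s, implied_start_s) per lockup line.

    implied_start_s is the dmesg timestamp minus the claimed stall
    duration, i.e. when the stall must have begun if the claim is true.
    """
    results = []
    for match in STALL_RE.finditer(log_text):
        report_time = float(match.group("ts"))
        stall = int(match.group("ms")) / 1000.0
        results.append((report_time, stall, report_time - stall))
    return results


if __name__ == "__main__":
    sample = (
        "[350788.530966] radeon 0000:08:00.0: "
        "GPU lockup CP stall for more than 40769msec"
    )
    for report_time, stall, start in parse_stalls(sample):
        print(
            f"reported at {report_time:.3f}s, "
            f"claimed stall {stall:.3f}s, "
            f"implied start {start:.3f}s"
        )
```

Feeding it the output of dmesg gives one tuple per lockup message.  If
the implied start is long before the click that blanked the screen,
that would be consistent with the detection timestamp being stale
rather than the GPU actually hanging for that long.  For scale, the
5102593msec figure from my original report works out to roughly 85
minutes.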

--Andy