[Intel-gfx] On SNB: Hangcheck timer elapsed... GPU hung

Mon Feb 14 14:18:53 CET 2011

Apologies if this is a known issue, but I haven't been able to convince
myself that someone is looking after it.  I've been seeing this issue
with Linux kernel 2.6.37, 2.6.38-rc4 and the most recent merge of Linus's
git tree and drm-intel-fixes.  I'm happy to provide more information,
apply patches, run tools, read code, as requested.

I have a Core i7 2600K CPU (yay me!) on a DH67CL motherboard, and I'm trying
to use the on-board graphics.  Typically this set-up works well for about
10-30 minutes.  Then I get an error of the form:

    [drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed... GPU hung

after which my cursor no longer changes shape, anti-aliased fonts misbehave
(mostly they just aren't rendered) and 3-D applications no longer start up.
I haven't found any obvious trigger for this -- I don't need to be
interacting with the machine for it to happen -- but it happens pretty
reliably.

I've just now managed to catch this with the drm.debug parameter set to 7.
Things seem ok initially...

    ... various cmd/nr/dev/auth entries logged ...
    Feb 14 22:47:50 orpheus kernel: [drm:drm_ioctl], pid=2180, cmd=0xc010645b, nr=0x5b, dev 0xe200, auth=1
    Feb 14 22:47:50 orpheus kernel: [drm:drm_ioctl], pid=2180, cmd=0x400c645f, nr=0x5f, dev 0xe200, auth=1
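
For anyone reading the trace, here is a minimal sketch of how the cmd values
above split into fields. It assumes the generic Linux ioctl bit layout
(direction in bits 30-31, size in bits 16-29, type in bits 8-15, number in
bits 0-7). The i915 names in the table are an assumption based on my reading
of i915_drm.h, not something the log itself states.

```python
# Field layout assumed from the generic Linux ioctl encoding
# (asm-generic/ioctl.h): dir[31:30] size[29:16] type[15:8] nr[7:0].
DIRECTIONS = {0: "none", 1: "write", 2: "read", 3: "read/write"}

# Assumption: DRM driver-private ioctls start at 0x40, and these i915
# names come from my reading of i915_drm.h (not from the log).
DRM_COMMAND_BASE = 0x40
I915_NAMES = {0x1B: "GEM_CREATE", 0x1D: "GEM_PWRITE", 0x1F: "GEM_SET_DOMAIN"}


def decode_ioctl(cmd):
    """Split a 32-bit ioctl command number into its encoded fields."""
    nr = cmd & 0xFF
    return {
        "dir": DIRECTIONS[(cmd >> 30) & 0x3],
        "size": (cmd >> 16) & 0x3FFF,      # argument struct size in bytes
        "type": chr((cmd >> 8) & 0xFF),    # 'd' for DRM
        "nr": nr,
        "i915": I915_NAMES.get(nr - DRM_COMMAND_BASE, "unknown"),
    }


if __name__ == "__main__":
    for cmd in (0xC010645B, 0x400C645F, 0x4020645D):
        print(hex(cmd), decode_ioctl(cmd))
```

If that table is right, the ioctl that keeps repeating later in the log
(cmd=0x400c645f) is a 12-byte write-direction call with nr=0x5f, which would
be the i915 set-domain ioctl.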

Then there's this worrying-looking message:

    Feb 14 22:47:50 orpheus kernel: [drm:intel_prepare_page_flip], preparing flip with no unpin work?
    Feb 14 22:47:50 orpheus kernel: [drm:drm_ioctl], ret = fffffe00

The same ioctl (cmd=0x400c645f, nr=0x5f) is then issued over and over for a
couple of seconds, each call returning fffffe00:

    Feb 14 22:47:50 orpheus kernel: [drm:drm_ioctl], pid=2180, cmd=0x400c645f, nr=0x5f, dev 0xe200, auth=1
    Feb 14 22:47:50 orpheus kernel: [drm:drm_ioctl], ret = fffffe00
    Feb 14 22:47:50 orpheus kernel: [drm:drm_ioctl], pid=2180, cmd=0x400c645f, nr=0x5f, dev 0xe200, auth=1
    Feb 14 22:47:50 orpheus kernel: [drm:drm_ioctl], ret = fffffe00
    Feb 14 22:47:50 orpheus kernel: [drm:drm_ioctl], pid=2180, cmd=0x400c645f, nr=0x5f, dev 0xe200, auth=1
    Feb 14 22:47:50 orpheus kernel: [drm:drm_ioctl], ret = fffffe00
    ... repeat many, many times ...
    Feb 14 22:47:52 orpheus kernel: [drm:drm_ioctl], pid=2180, cmd=0x400c645f, nr=0x5f, dev 0xe200, auth=1
    Feb 14 22:47:52 orpheus kernel: [drm:drm_ioctl], ret = fffffe00
    Feb 14 22:47:52 orpheus kernel: [drm:drm_ioctl], pid=2180, cmd=0x400c645f, nr=0x5f, dev 0xe200, auth=1

And finally, the hangcheck timer expires:

    Feb 14 22:47:52 orpheus kernel: [drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed... GPU hung
    Feb 14 22:47:52 orpheus kernel: [drm:drm_ioctl], ret = fffffe00
    Feb 14 22:47:52 orpheus kernel: [drm:drm_ioctl], pid=2180, cmd=0x400c645f, nr=0x5f, dev 0xe200, auth=1
    Feb 14 22:47:52 orpheus kernel: [drm:i915_do_wait_request] *ERROR* i915_do_wait_request returns -11 (awaiting 300039 at 300025, next 300040)
    Feb 14 22:47:52 orpheus kernel: [drm:drm_ioctl], ret = fffffff5
    Feb 14 22:47:52 orpheus kernel: [drm:drm_ioctl], pid=2180, cmd=0x400c645f, nr=0x5f, dev 0xe200, auth=1
    Feb 14 22:47:52 orpheus kernel: [drm:i915_error_work_func], resetting chip
    Feb 14 22:47:52 orpheus kernel: [drm:i915_reset] *ERROR* GPU hanging too fast, declaring wedged!
    Feb 14 22:47:52 orpheus kernel: [drm:i915_reset] *ERROR* Failed to reset chip.
    Feb 14 22:47:52 orpheus kernel: [drm:drm_ioctl], ret = fffffffb
    Feb 14 22:47:52 orpheus kernel: [drm:drm_ioctl], pid=2180, cmd=0x4020645d, nr=0x5d, dev 0xe200, auth=1
    Feb 14 22:47:52 orpheus kernel: [drm:drm_ioctl], ret = fffffffb
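
The "ret = ..." values in the trace are negative errno codes printed as
32-bit hex. Below is a minimal sketch for turning them back into signed
numbers. The small name table is hand-written for the three values seen
here, and ERESTARTSYS (512) is a kernel-internal code that, as far as I
know, is not normally visible to user space.

```python
# Hand-written table covering only the codes that appear in this trace.
# ERESTARTSYS is kernel-internal, so it is absent from Python's errno module.
KERNEL_ERRNO = {5: "EIO", 11: "EAGAIN", 512: "ERESTARTSYS"}


def decode_ret(hex_text):
    """Interpret a 32-bit hex string as a signed (two's complement) int."""
    value = int(hex_text, 16)
    if value >= 0x80000000:
        value -= 0x100000000
    return value


def describe_ret(hex_text):
    """Return (signed value, symbolic name) for a logged ret field."""
    value = decode_ret(hex_text)
    if value >= 0:
        return value, "success"
    return value, KERNEL_ERRNO.get(-value, "unknown")


if __name__ == "__main__":
    for ret in ("fffffe00", "fffffff5", "fffffffb"):
        print(ret, describe_ret(ret))
```

So fffffe00 is -512 (the call was interrupted and is being restarted),
fffffff5 is -11, matching the "returns -11" in the i915_do_wait_request
message, and fffffffb is -5 (EIO), which is what every call gets back once
the chip has been declared wedged.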

After that, the system limps along as described above.

I haven't delved into trying to understand what this means; I'm hoping that
the above trace rings bells for someone.  I can provide a more complete log
if someone reckons that'd be useful.

Thanks for your time!

-Ted