[Intel-gfx] On SNB: Hangcheck timer elapsed... GPU hung
Ted Phelps
phelps at gnusto.com
Mon Feb 14 14:18:53 CET 2011
Apologies if this is a known issue, but I haven't been able to convince
myself that someone is looking after it. I've been seeing this issue
with Linux kernel 2.6.37, 2.6.38-rc4 and the most recent merge of Linus's
git tree and drm-intel-fixes. I'm happy to provide more information,
apply patches, run tools, read code, as requested.
I have a Core i7 2600K CPU (yay me!) on a DH67CL motherboard, and I'm trying
to use the on-board graphics. Typically this set-up works well for about
10-30 minutes. Then I get an error of the form:
[drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed... GPU hung
after which, my cursor no longer changes shape, anti-aliased fonts misbehave
(mostly they just aren't rendered) and 3-D applications no longer start up.
I haven't found any obvious trigger for this -- I don't need to be
interacting with the machine for it to happen -- but it happens pretty
reliably.
I've just now managed to catch this with the drm.debug parameter set to 7.
Things seem ok initially...
... various cmd/nr/dev/auth entries logged ...
Feb 14 22:47:50 orpheus kernel: [drm:drm_ioctl], pid=2180, cmd=0xc010645b, nr=0x5b, dev 0xe200, auth=1
Feb 14 22:47:50 orpheus kernel: [drm:drm_ioctl], pid=2180, cmd=0x400c645f, nr=0x5f, dev 0xe200, auth=1
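For anyone reading the trace, the cmd values can be unpacked using the generic Linux ioctl encoding (direction, argument size, type character, number). The sketch below is an illustration only: decode_ioctl is a hypothetical helper, and the i915 names in I915_NAMES are recalled from i915_drm.h rather than taken from this post, so check them against the headers for the kernel in question.

```python
# Bit layout from asm-generic/ioctl.h (x86 and most other architectures):
# bits 0-7 number, bits 8-15 type character, bits 16-29 size, bits 30-31 direction.
DIRECTIONS = {0: "none", 1: "write", 2: "read", 3: "read/write"}

DRM_COMMAND_BASE = 0x40  # driver-private DRM ioctls start at this number

# Assumption: i915 ioctl indices as recalled from i915_drm.h; verify before relying on them.
I915_NAMES = {0x1B: "GEM_CREATE", 0x1D: "GEM_PWRITE", 0x1F: "GEM_SET_DOMAIN"}


def decode_ioctl(cmd):
    """Split a 32-bit ioctl command word into its encoded fields."""
    nr = cmd & 0xFF
    magic = chr((cmd >> 8) & 0xFF)
    size = (cmd >> 16) & 0x3FFF
    direction = DIRECTIONS[(cmd >> 30) & 0x3]
    name = None
    if magic == "d" and nr >= DRM_COMMAND_BASE:
        # Driver-private range: look up the offset from DRM_COMMAND_BASE.
        name = I915_NAMES.get(nr - DRM_COMMAND_BASE)
    return {"direction": direction, "size": size, "type": magic, "nr": nr, "name": name}


# The three distinct cmd values that appear in the log.
for cmd in (0xC010645B, 0x400C645F, 0x4020645D):
    print(hex(cmd), decode_ioctl(cmd))
```

If the name table is right, the ioctl that keeps repeating (nr=0x5f) would be the i915 set-domain call, which can block waiting for the GPU to finish with a buffer.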
Then there's this worrying-looking message:
Feb 14 22:47:50 orpheus kernel: [drm:intel_prepare_page_flip], preparing flip with no unpin work?
Feb 14 22:47:50 orpheus kernel: [drm:drm_ioctl], ret = fffffe00
The cmd/nr/dev/auth fields are then repeated for a couple of seconds:
Feb 14 22:47:50 orpheus kernel: [drm:drm_ioctl], pid=2180, cmd=0x400c645f, nr=0x5f, dev 0xe200, auth=1
Feb 14 22:47:50 orpheus kernel: [drm:drm_ioctl], ret = fffffe00
Feb 14 22:47:50 orpheus kernel: [drm:drm_ioctl], pid=2180, cmd=0x400c645f, nr=0x5f, dev 0xe200, auth=1
Feb 14 22:47:50 orpheus kernel: [drm:drm_ioctl], ret = fffffe00
Feb 14 22:47:50 orpheus kernel: [drm:drm_ioctl], pid=2180, cmd=0x400c645f, nr=0x5f, dev 0xe200, auth=1
Feb 14 22:47:50 orpheus kernel: [drm:drm_ioctl], ret = fffffe00
... repeat many, many times ...
Feb 14 22:47:52 orpheus kernel: [drm:drm_ioctl], pid=2180, cmd=0x400c645f, nr=0x5f, dev 0xe200, auth=1
Feb 14 22:47:52 orpheus kernel: [drm:drm_ioctl], ret = fffffe00
Feb 14 22:47:52 orpheus kernel: [drm:drm_ioctl], pid=2180, cmd=0x400c645f, nr=0x5f, dev 0xe200, auth=1
And finally, the hangcheck timer expires:
Feb 14 22:47:52 orpheus kernel: [drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed... GPU hung
Feb 14 22:47:52 orpheus kernel: [drm:drm_ioctl], ret = fffffe00
Feb 14 22:47:52 orpheus kernel: [drm:drm_ioctl], pid=2180, cmd=0x400c645f, nr=0x5f, dev 0xe200, auth=1
Feb 14 22:47:52 orpheus kernel: [drm:i915_do_wait_request] *ERROR* i915_do_wait_request returns -11 (awaiting 300039 at 300025, next 300040)
Feb 14 22:47:52 orpheus kernel: [drm:drm_ioctl], ret = fffffff5
Feb 14 22:47:52 orpheus kernel: [drm:drm_ioctl], pid=2180, cmd=0x400c645f, nr=0x5f, dev 0xe200, auth=1
Feb 14 22:47:52 orpheus kernel: [drm:i915_error_work_func], resetting chip
Feb 14 22:47:52 orpheus kernel: [drm:i915_reset] *ERROR* GPU hanging too fast, declaring wedged!
Feb 14 22:47:52 orpheus kernel: [drm:i915_reset] *ERROR* Failed to reset chip.
Feb 14 22:47:52 orpheus kernel: [drm:drm_ioctl], ret = fffffffb
Feb 14 22:47:52 orpheus kernel: [drm:drm_ioctl], pid=2180, cmd=0x4020645d, nr=0x5d, dev 0xe200, auth=1
Feb 14 22:47:52 orpheus kernel: [drm:drm_ioctl], ret = fffffffb
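The "ret =" values in the trace are negative error codes printed as unsigned 32-bit hex. A minimal sketch for converting them follows; decode_ret is a hypothetical helper, and the small name table hard-codes the Linux values for just the three codes seen here.

```python
# Linux errno values for the codes that appear in this log.
# ERESTARTSYS (512) is kernel-internal and normally never reaches userspace.
ERRNO_NAMES = {5: "EIO", 11: "EAGAIN", 512: "ERESTARTSYS"}


def decode_ret(hex_str):
    """Interpret a 32-bit hex string as a signed value and name the errno."""
    value = int(hex_str, 16)
    if value >= 0x80000000:
        value -= 1 << 32  # two's-complement conversion to a negative number
    if value >= 0:
        return value, "OK"
    return value, ERRNO_NAMES.get(-value, "unknown")


for ret in ("fffffe00", "fffffff5", "fffffffb"):
    print(ret, decode_ret(ret))
```

Read this way, fffffe00 is -512 (the wait was interrupted and the ioctl restarted, which would fit the same call repeating), fffffff5 is -11 (matching the "returns -11" line from i915_do_wait_request), and fffffffb is -5, an I/O error, which appears only after the reset fails. This reading is an interpretation of the numbers, not a diagnosis of the hang.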
After that, the system limps along as described above.
I haven't delved into trying to understand what this means; I'm hoping that
the above trace rings bells for someone. I can provide a more complete log
if someone reckons that'd be useful.
Thanks for your time!
-Ted