Hangcheck timer elapsed..

Fri Dec 21 13:04:05 PST 2012

On Fri, Dec 21, 2012 at 4:56 AM, Linus Torvalds
<torvalds at linux-foundation.org> wrote:
> This thing isn't repeatable, and it can go days without happening, but
> it has happened several times now over the last several weeks, to the
> point where it is very annoying.
>
> I get:
>
>   [drm:i915_hangcheck_hung] *ERROR* Hangcheck timer elapsed... GPU hung
>   [drm] capturing error event; look for more information in
> /debug/dri/0/i915_error_state
>   [drm:i915_hangcheck_hung] *ERROR* Hangcheck timer elapsed... GPU hung
>   [drm:i915_reset] *ERROR* GPU hanging too fast, declaring wedged!
>   [drm:i915_reset] *ERROR* Failed to reset chip.
>
> and then I need to reboot, because restarting X just causes it to be
> slow and unaccelerated.
>
> I'm attaching the i915_error_state thing, although I suspect it's
> useless, since I got it after an X restart. But maybe it shows why
> even the X restart doesn't do anything.
>
> This is a Westmere setup: it's a
>
>   Intel(R) Core(TM) i5 CPU         670  @ 3.47GHz
>
> and dmesg doesn't have anything interesting in it at all. Running
> up-to-date Fedora 17.

Yeah, looks like the ilk (Ironlake) death which somehow became much
easier to hit in 3.7. The bug tracking the various reporters is at

https://bugs.freedesktop.org/show_bug.cgi?id=55984

> Any ideas about anything in particular I can do to trigger it and help
> debug it? There's usually nothing special going on when this happens.
> This last one was during a kernel build, but the screen was actually
> locked (and I don't even have a fancy screensaver, it's just a blank
> black screen for me).

As far as we know it requires light gpu load (desktop; a 3d compositor
seems to help to hit it) combined with some form of memory/io load.
Kernel compiles, massive svn checkouts or just filling the pagecache
with crap and cleaning it up again seem to be good at triggering it.
Progress in debugging has been extremely slow, especially after we
disabled rc6 on ilk (3.7 attempted to enable it by default) - without
rc6 none of the local machines we devs have here can reproduce the bug
any more, and the rc6 hangs have a slightly different hang signature,
so there's a decent chance it's a different bug.
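
A rough sketch of the "fill the pagecache and clean it up again" kind
of load, to run next to a light desktop/compositor session. The
function name, its arguments and the default sizes are made up for
illustration, and nothing here is known to reproduce the hang reliably:

```shell
#!/bin/sh
# Sketch of a memory/io load generator: fill the page cache with junk,
# then throw it away again, for a number of rounds.
# churn_pagecache and all of its defaults are hypothetical.
churn_pagecache() {
    dir=${1:-/tmp}        # scratch directory (assumption)
    size_mb=${2:-1024}    # junk file size in MiB (assumption)
    rounds=${3:-10}       # number of fill/clean cycles (assumption)
    junk="$dir/pagecache-junk.$$"
    i=0
    while [ "$i" -lt "$rounds" ]; do
        # Write the junk file, then read it back so its pages are cached.
        dd if=/dev/zero of="$junk" bs=1M count="$size_mb" 2>/dev/null || return 1
        cat "$junk" > /dev/null
        # Deleting the file releases its cached pages again.
        rm -f "$junk"
        i=$((i + 1))
    done
}
```

Calling e.g. "churn_pagecache /var/tmp 4096 20" would write and discard
a 4 GiB file twenty times; pick a size close to the amount of RAM in
the machine so the cache actually comes under pressure.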

We suspect it's an old bug that the changes in 3.7 somehow made much
easier to hit. Chris already has some good evidence that the hw has
stricter alignment requirements than what we implement, but patches
are only just now gathering testing feedback. And after a few weeks of
searching for the loose piece of duct tape so that we can reattach it,
I've finally found a slight change in our shrinker behaviour which
might mitigate things. The patch has been in testing for a few days
and hasn't blown up yet - which is a record thus far. So I'm still
hopeful that I can forge this quick hack into a real patch to reapply
the lost duct tape.

> Other times, it's just normal desktop. Quite often it is during a
> kernel compile, with loads in the 30+ range, so maybe it's triggered
> by high loads resulting in some program not being hugely responsive
> (maybe losing the drm state?) but quite frankly, I do a *lot* of
> kernel compiles especially during the merge window, so the "it
> happened during a kernel compile" is not necessarily indicative of any
> deeper causation - it's just that compiling kernels is what I do ;)
>
> I've gotten hangcheck timers over the years, but it really seems to
> have been getting worse. Please help. If the reset worked and it would
> clear up after I just logged out and back in again, that would already
> be a big thing.

gpu reset state transitions should be much less racy in 3.8 than 3.7,
but that doesn't help if the gpu hangs too quickly. You can try to
resurrect it in that case with

# echo 0 > /sys/kernel/debug/dri/0/i915_wedged

We can't really disable that "hanging too fast" check, since it helps
tremendously in preventing gpu hangs from spiralling into certain
death of the entire system, and so is important for debugging.
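
For scripting the resurrection attempt, the echo command above can be
wrapped with a couple of sanity checks. This is only a sketch:
unwedge_gpu is a hypothetical helper name, and the default path is the
one given above, which assumes debugfs is mounted at /sys/kernel/debug
and the i915 device is dri/0:

```shell
#!/bin/sh
# Sketch: clear the wedged flag through debugfs.
# unwedge_gpu is a hypothetical helper; the default path assumes
# debugfs at /sys/kernel/debug and the i915 device at dri/0.
unwedge_gpu() {
    wedged=${1:-/sys/kernel/debug/dri/0/i915_wedged}
    if [ ! -e "$wedged" ]; then
        echo "unwedge_gpu: $wedged not found (is debugfs mounted?)" >&2
        return 1
    fi
    if [ ! -w "$wedged" ]; then
        echo "unwedge_gpu: $wedged not writable (need root?)" >&2
        return 1
    fi
    # Same write as the plain echo command.
    echo 0 > "$wedged"
}
```

Run it as root with no arguments to use the default path.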

Other workarounds to mitigate things are:
- Don't use a 3d compositor; the ddx should transparently fall back to
sw rendering when the gpu is permanently declared dead.
- Using SNA acceleration instead of UXA in the ddx seems to mitigate
the hangs a lot (Option "AccelMethod" "SNA" in an xorg.conf snippet) -
we suspect it's due to the different layout of the indirect state
objects. UXA has an entire tree of those, SNA packs them into one
single gem object, which seems to tremendously reduce the chances that
the gpu/kernel barfs on them and so hangs the gpu.
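
A minimal xorg.conf snippet for the SNA option could look like the one
below. The Identifier string is arbitrary, and the file location (for
example a file under /etc/X11/xorg.conf.d/) is an assumption that
depends on the distro; the ddx also needs to be built with SNA support
for the option to have any effect:

```
Section "Device"
    Identifier "Intel Graphics"
    Driver     "intel"
    Option     "AccelMethod" "SNA"
EndSection
```

Restart X afterwards and check the Xorg log to see which acceleration
method the driver actually picked.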

Oh, and in case you get the urge to flame your drm/i915 maintainer to
a crisp over this disaster (it's one imo), I'll be offline the next 2
weeks, snowboarding in the Swiss Alps ;-)

Cheers, Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch