[Intel-gfx] 5 bugs

Bryce Harrington bryce at canonical.com
Fri Jun 17 01:54:45 CEST 2011


On Fri, Jun 17, 2011 at 12:12:16AM +0100, Chris Wilson wrote:
> On Thu, 16 Jun 2011 15:46:29 -0700, Bryce Harrington <bryce at canonical.com> wrote:
> > On Thu, Jun 16, 2011 at 12:37:00PM +0100, Chris Wilson wrote:
> > > On Wed, 15 Jun 2011 18:10:29 -0700, Bryce Harrington <bryce at canonical.com> wrote:
> > > >   https://bugs.freedesktop.org/show_bug.cgi?id=36515
> > > 
> > > This looks to be a continuation of the WAIT_EVENT on a dead pipe that we
> > > thought we had beaten into submission. The other reports provide more
> > > circumstantial evidence to suggest that the hang coincides with a hotplug
> > > event. I think the cause is a race between the kernel turning the pipe off
> > > due to the hotplug and reprobing and that uevent reaching the ddx. In the
> > > meantime, we've queued another video frame to execute on the dead pipe.
> > > Worse, we may have queued it up long before the hotplug event and due to
> > > buffering in the GPU command stream it only gets executed afterwards.
> > > 
> > > commit 85345517fe6d4de27b0d6ca19fef9d28ac947c4a
> > > Author: Chris Wilson <chris at chris-wilson.co.uk>
> > > Date:   Sat Nov 13 09:49:11 2010 +0000
> > > 
> > >     drm/i915: Retire any pending operations on the old scanout when switching
> > > 
> > > Handles the case where we are changing modes. Unfortunately, disabling an
> > > output takes a different path. Though, I think we can apply a similar big
> > > hammer approach there as well.
> > 
> > As luck would have it, my own i965 laptop locked up today with I guess
> > this same bug.  IPEHR=0x01820000
> > 
> > Before I restart it, is there any data which could be gathered that
> > would assist you?
> 
> My theory is based upon this still being a WAIT_EVENT on a disabled pipe.
> The error state should support this if the DSP*CNTR is disabled for the
> pipe we are waiting on. But the other observation to make is whether you
> know if a modeset happened at around the same time as the hang.
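
For reference, the IPEHR value quoted above (0x01820000) can be checked
against the WAIT_EVENT theory with a few lines of Python. This is only a
sketch: it assumes the MI command layout of the kernel's MI_INSTR macro
(opcode in bits 28:23, client in bits 31:29, opcode 0x03 being
MI_WAIT_FOR_EVENT). The function names are made up for illustration, and
the low flag bits are left uninterpreted.

```python
# Sketch: split an i915 IPEHR value into command-header fields.
# Assumption: MI commands are encoded as (opcode << 23) | flags, with the
# client in bits 31:29 (0 = MI), following the MI_INSTR macro in i915_reg.h.

MI_CLIENT = 0x0
MI_WAIT_FOR_EVENT_OPCODE = 0x03  # assumed from MI_INSTR(0x03, 0)


def decode_mi(ipehr):
    """Return (client, opcode, flags) for a 32-bit command header."""
    client = (ipehr >> 29) & 0x7
    opcode = (ipehr >> 23) & 0x3F
    flags = ipehr & 0x7FFFFF
    return client, opcode, flags


def is_wait_for_event(ipehr):
    """True if the header looks like an MI_WAIT_FOR_EVENT command."""
    client, opcode, _ = decode_mi(ipehr)
    return client == MI_CLIENT and opcode == MI_WAIT_FOR_EVENT_OPCODE
```

Calling decode_mi(0x01820000) gives client 0, opcode 0x03 and flags
0x020000, which fits a WAIT_FOR_EVENT under the assumed layout.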

The hang occurred while the system was preparing for sleep, triggered by
a lid close event.

From my kern.log:

Jun 14 23:40:40 lynmouth kernel: [511433.780066] tg3 0000:08:00.0: eth0: Link is down
Jun 14 23:40:41 lynmouth kernel: [511434.597257] PM: Syncing filesystems ... done.
Jun 14 23:40:41 lynmouth kernel: [511434.615699] PM: Preparing system for mem sleep
Jun 14 23:40:45 lynmouth kernel: [511439.284049] [drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed... GPU hung
Jun 14 23:40:45 lynmouth kernel: [511439.284823] [drm:i915_do_wait_request] *ERROR* i915_do_wait_request returns -11 (awaiting 1680764 at 1680757, next 1680765)
Jun 14 23:40:46 lynmouth kernel: [511439.788055] [drm:i915_reset] *ERROR* Failed to reset chip.
Jun 16 15:02:15 lynmouth kernel: [511439.916240] Freezing user space processes ... (elapsed 0.01 seconds) done.
Jun 16 15:02:15 lynmouth kernel: [511439.932109] Freezing remaining freezable tasks ... (elapsed 0.01 seconds) done.
Jun 16 15:02:15 lynmouth kernel: [511439.948084] PM: Entering mem sleep

I don't see a modeset event, but it could be that one happens without
producing a log entry.  I'll flip on more debugging output and check.
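
To check older logs for the same pattern, something like the following
could flag GPU hangs that show up shortly after the suspend preparation
message.  It is a rough sketch that only assumes the kern.log line format
shown above; the function name and the 10 second window are arbitrary
choices, not anything the driver defines.

```python
import re

# Matches the bracketed kernel timestamp and the message after it, e.g.
# "... kernel: [511439.284049] [drm:i915_hangcheck_elapsed] *ERROR* ..."
LINE_RE = re.compile(r"kernel: \[\s*(\d+\.\d+)\]\s+(.*)$")

HANG_MARK = "Hangcheck timer elapsed"
SLEEP_MARK = "PM: Preparing system for mem sleep"


def hangs_near_suspend(lines, window=10.0):
    """Return kernel timestamps of GPU hangs logged within `window`
    seconds after a suspend-preparation message."""
    last_sleep = None
    hits = []
    for line in lines:
        m = LINE_RE.search(line)
        if not m:
            continue
        stamp, msg = float(m.group(1)), m.group(2)
        if SLEEP_MARK in msg:
            last_sleep = stamp
        elif HANG_MARK in msg and last_sleep is not None:
            if 0 <= stamp - last_sleep <= window:
                hits.append(stamp)
    return hits
```

It takes any iterable of lines, so an open log file can be passed in
directly (on my system that would be /var/log/kern.log).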

The log shows the system has an uptime of 15 days and has gone through
suspend resume cycles roughly daily.  I do play videos on it from time
to time, although I hadn't been at the time of this suspend/resume
cycle.

The system does occasionally lose its dualhead configuration during
suspend/resume, and comes back mirrored.  I've assumed it to be a
gnome-settings-daemon bug, but it could be a symptom of this problem.  It
does hint that some modeset or output hotplug event occurs during
resume.

> > Otherwise, I can boot and test the patch you posted to the bug.
> 
> I'm confident that that patch closes another window for the bug. I'm
> less confident that that's the only race condition we have.
>
> > One of the difficulties with this type of bug is that it's so
> > intermittent and uncertain to reproduce (and so easily confused with
> > other unrelated freezes), that it's hard to tell for certain if a given
> > patch has definitively helped the situation.  Do you have suggestions on
> > ways of measuring this better, or techniques to help in triggering the
> > bug more reliably?
> 
> If I am right, then we have two paths that cause WAIT_FOR_EVENT:
> windowed swapbuffers (or sub_copy_swap) and video. So playing a number
> of video streams should increase the likelihood of the bug, when run in
> parallel with looping xrandr mode changes - in particular disabling
> outputs.

Awesome, can do.
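
A minimal sketch of the xrandr half of that test, to be left running
while several videos play.  The function names, the delay and the output
name are placeholders of mine; only the `xrandr --output NAME --off` and
`--auto` invocations are real.  The `run` and `sleep` parameters exist
just so the loop can be dry-run without touching the display.

```python
import subprocess
import time


def toggle_commands(output):
    """xrandr invocations that disable and then re-enable one output."""
    return [
        ["xrandr", "--output", output, "--off"],
        ["xrandr", "--output", output, "--auto"],
    ]


def stress(output, cycles, delay=2.0, run=subprocess.call, sleep=time.sleep):
    """Toggle `output` off and on `cycles` times.

    Returns the number of xrandr calls that exited non-zero.
    """
    failures = 0
    for _ in range(cycles):
        for cmd in toggle_commands(output):
            if run(cmd) != 0:
                failures += 1
            # Give the modeset time to complete before the next change.
            sleep(delay)
    return failures
```

For example, stress("VGA1", cycles=100) would toggle a hypothetical VGA1
output 100 times; `xrandr -q` lists the real output names.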

The reason I ask is that, the way Ubuntu's stable updates process
works, if I can demonstrate that a patch improves things in a way
that a non-X person (i.e. the archive admin team) can understand,
I can get the patch released to all Ubuntu users.  If I
can't prove it or demonstrate it in some fashion, it'll get rejected or
significantly delayed.

Bryce


