[Intel-gfx] Possible i915 regression with 4.4-rc

Daniel Vetter daniel at ffwll.ch
Fri Dec 4 08:02:52 PST 2015


On Fri, Dec 04, 2015 at 11:40:59AM +0200, Ville Syrjälä wrote:
> On Fri, Dec 04, 2015 at 10:49:48AM +0200, Jani Nikula wrote:
> > On Thu, 03 Dec 2015, Ville Syrjälä <ville.syrjala at linux.intel.com> wrote:
> > > On Thu, Dec 03, 2015 at 09:00:55PM +0100, Takashi Iwai wrote:
> > >> Hi,
> > >> 
> > >> I've experienced a few graphics issues recently, and I tend to believe
> > >> that it has happened since 4.4-rc.  Namely, after some long time usage
> > >> on my HSW laptop (two or three days), the mouse cursor vanished
> > >> suddenly.  It kept pointing but just became invisible.  Also, after
> > >> some S3 cycles, some glyphs on a console or on Firefox became
> > >> invisible, too.  The windows and graphics were shown well, and X core
> > >> fonts were still shown properly, too.  Switching to VT1 and back
> > >> didn't change the situation.
> > >
> > > I think I have a fix for this *very* annoying problem. I've been cursing
> > > on irc for weeks about it, until I finally got off my arse and debugged
> > > it.
> > >
> > > I pushed out my cursor branch:
> > > git://github.com/vsyrjala/linux.git disappearing_cursor_fix
> > >
> > > It has lots of other junk too, but it should be just these two that fix it:
> > > 59f65fa270fb ("drm/i915: Kill intel_crtc->cursor_bo")
> > > 25651a198d17 ("drm/i915: Drop the broken curcor base==0 special casing")
> > >
> > > Unfortunately I've managed to keep myself busy on other stuff, so didn't
> > > send them out yet. Maybe tomorrow...
> > 
> > So I've hit this too, albeit very rarely, on a Haswell running Debian
> > stable with the stock v3.16 kernel. Haven't seen it on any other
> > machine. It's really too rare to even debug or verify a fix. Is it
> > possible we just happened to make an old bug occur more frequently now?
> 
> The potential for it has definitely been there for a long time.

Oh dear, let's have fun and look at some awful history.

commit e568af1c626031925465a5caaab7cca1303d55c7
Author: Daniel Vetter <daniel.vetter at ffwll.ch>
Date:   Wed Mar 26 20:08:20 2014 +0100

    drm/i915: Undo gtt scratch pte unmapping again

Which essentially reverted

commit 828c79087cec61eaf4c76bb32c222fbe35ac3930
Author: Ben Widawsky <benjamin.widawsky at intel.com>
Date:   Wed Oct 16 09:21:30 2013 -0700

    drm/i915: Disable GGTT PTEs on GEN6+ suspend
    
    Once the machine gets to a certain point in the suspend process, we
    expect the GPU to be idle. If it is not, we might corrupt memory.
    Empirically (with an early version of this patch) we have seen this is
    not the case. We cannot currently explain why the latent GPU writes
    occur.
    
    In the technical sense, this patch is a workaround in that we have an
    issue we can't explain, and the patch indirectly solves the issue.
    However, it's really better than a workaround because we understand why
    it works, and it really should be a safe thing to do in all cases.
    
    The noticeable effect other than the debug messages would be an increase
    in the suspend time. I have not measure how expensive it actually is.
    
    I think it would be good to spend further time to root cause why we're
    seeing these latent writes, but it shouldn't preclude preventing the
    fallout.
    
    NOTE: It should be safe (and makes some sense IMO) to also keep the
    VALID bit unset on resume when we clear_range(). I've opted not to do
    this as properly clearing those bits at some later point would be extra
    work.
    
    v2: Fix bugzilla link
    
    Bugzilla: http://bugs.freedesktop.org/show_bug.cgi?id=65496
    Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=59321
    Tested-by: Takashi Iwai <tiwai at suse.de>
    Tested-by: Paulo Zanoni <paulo.r.zanoni at intel.com>
    Signed-off-by: Ben Widawsky <ben at bwidawsk.net>
    Tested-By: Todd Previte <tprevite at gmail.com>
    Cc: stable at vger.kernel.org
    Signed-off-by: Daniel Vetter <daniel.vetter at ffwll.ch>

This was a regression in a regression right before I ragequit the entire
bug handling deal because no one cared any more and management was all
"why is this important".

Would be interesting if these issues magically disappear when changing that
back again. Doesn't mean that we're any closer to figuring out what's
corrupting what exactly here, but at least we'd have a reason to dig out
this old sob story of mine.

Cheers, Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

