[Intel-gfx] [PATCH 2/2] [v2] drm/i915: Disable GGTT PTEs on GEN6+ suspend

Wed Oct 16 19:06:27 CEST 2013

On Wed, Oct 16, 2013 at 05:58:31PM +0100, Chris Wilson wrote:
> On Wed, Oct 16, 2013 at 09:21:30AM -0700, Ben Widawsky wrote:
> > Once the machine gets to a certain point in the suspend process, we
> > expect the GPU to be idle. If it is not, we might corrupt memory.
> > Empirically (with an early version of this patch) we have seen this is
> > not the case. We cannot currently explain why the latent GPU writes
> > occur.
> > 
> > In the technical sense, this patch is a workaround in that we have an
> > issue we can't explain, and the patch indirectly solves the issue.
> > However, it's really better than a workaround because we understand why
> > it works, and it really should be a safe thing to do in all cases.
> > 
> > The noticeable effect other than the debug messages would be an increase
> > in the suspend time. I have not measured how expensive it actually is.
> > 
> > I think it would be good to spend further time to root cause why we're
> > seeing these latent writes, but it shouldn't preclude preventing the
> > fallout.
> > 
> > NOTE: It should be safe (and makes some sense IMO) to also keep the
> > VALID bit unset on resume when we clear_range(). I've opted not to do
> > this as properly clearing those bits at some later point would be extra
> > work.
> > 
> > v2: Fix bugzilla link
> 
> And the other one?
> 

I'm really amazing. If we move ahead with this patch, Daniel, can you just erase
the extra bugs.freedesktop.org/6549://

> > Bugzilla: http://bugs.freedesktop.org/6549://bugs.freedesktop.org/show_bug.cgi?id=65496

Bugzilla: http://bugs.freedesktop.org/show_bug.cgi?id=65496

> > Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=59321
> > Tested-by: Takashi Iwai <tiwai at suse.de>
> > Tested-by: Paulo Zanoni <paulo.r.zanoni at intel.com>
> > Signed-off-by: Ben Widawsky <ben at bwidawsk.net>
> 
> So clearing the valid bit should result in the GPU reporting errors for
> delayed accesses, but none were reported?
> -Chris
> 

So I can't actually reproduce the problem for some reason. Paulo will
need to answer. One theory is the fault information is lost on suspend.

The original patch put faults both in suspend and resume. After this, I
asked Paulo to wedge the GPU, and there I saw faults.
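
For anyone following along, here is a minimal userspace sketch of the idea
being discussed: rewrite every GGTT entry to point at the scratch page, with
the VALID bit either set (stray accesses land harmlessly in scratch) or
cleared (stray accesses should fault rather than reach memory). This is not
the i915 code. The bit position, the 4K address mask, and every sketch_* name
are assumptions made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption for this sketch: bit 0 of a PTE is the VALID bit. */
#define SKETCH_PTE_VALID ((uint32_t)1 << 0)

/* Hypothetical encoder: page-aligned address plus an optional VALID bit. */
static uint32_t sketch_pte_encode(uint32_t page_addr, int valid)
{
	uint32_t pte = page_addr & ~(uint32_t)0xfff;

	if (valid)
		pte |= SKETCH_PTE_VALID;
	return pte;
}

/*
 * Point entries [first, first + count) at the scratch page.
 * valid != 0: entries stay usable and resolve to the scratch page.
 * valid == 0: entries are marked invalid, so a late access should fault.
 */
static void sketch_clear_range(uint32_t *ptes, size_t first, size_t count,
			       uint32_t scratch_addr, int valid)
{
	size_t i;

	assert(ptes != NULL);
	for (i = 0; i < count; i++)
		ptes[first + i] = sketch_pte_encode(scratch_addr, valid);
}

/* Count the entries in [0, n) that still have VALID set. */
static size_t sketch_count_valid(const uint32_t *ptes, size_t n)
{
	size_t i, valid = 0;

	for (i = 0; i < n; i++)
		if (ptes[i] & SKETCH_PTE_VALID)
			valid++;
	return valid;
}
```

In this model, suspend would call sketch_clear_range() over the whole table
with valid == 0, and resume would call it again with valid != 0 before
rebinding objects. Whether the hardware actually reports a fault for the
invalid entries is exactly the open question above.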

-- 
Ben Widawsky, Intel Open Source Technology Center