[Intel-gfx] [PATCH 2/2] [v2] drm/i915: Disable GGTT PTEs on GEN6+ suspend

Thu Oct 17 11:24:07 CEST 2013

On Thu, Oct 17, 2013 at 09:41:09AM +0200, Takashi Iwai wrote:
> At Wed, 16 Oct 2013 18:27:33 +0100,
> Chris Wilson wrote:
> > 
> > On Wed, Oct 16, 2013 at 10:06:27AM -0700, Ben Widawsky wrote:
> > > On Wed, Oct 16, 2013 at 05:58:31PM +0100, Chris Wilson wrote:
> > > > So clearing the valid bit should result in the GPU reporting errors for
> > > > delayed accesses, but none were reported?
> > > 
> > > So I can't actually reproduce the problem for some reason. Paulo will
> > > need to answer. One theory is the fault information is lost on suspend.
> > > 
> > > The original patch put faults both in suspend, and resume. After this, I
> > > asked Paulo to wedge the GPU, and there I saw faults.
> > 
> > If we can capture the error, and it should be very possible to do so, we
> > should be able to pinpoint the cause quite quickly. If it is just deferred
> > writes, it should also be a problem across module unload - which should
> > be easier for getting debug information out.
> 
> The bug is only about S4, thus it's not so easy to capture anything in
> the resume kernel, as all lost after transition to the restored
> kernel.
> 
> BTW, I also suspect that the similar problem might still happen in
> other cases, e.g. via kexec even with this patch.

How are devices idled (or suspended) prior to hibernate resume or kexec?
From my reading, i915_drm_freeze() should be called before the resume
image is executed. What we can do is to make the first action of
i915_driver_unload() be i915_drm_freeze(), then clear the PTE valid
bits and wait a second or two for a GPU fault before proceeding with an
unload. By doing that we can debug our suspend paths - all that remains
is the possibility of rogue hardware state. And that should show up by
breaking module load.
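
To make the ordering concrete, here is a toy userspace model of that
debug unload path. It is only a sketch, not i915 code: struct fake_gpu,
fake_freeze(), fake_gpu_access() and debug_unload() are hypothetical
stand-ins, the PTE valid bit is assumed to be bit 0, and the "wait a
second or two" is reduced to a single simulated late access.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PTE_VALID (1u << 0)  /* assumption: valid bit is bit 0 of the PTE */
#define NUM_PTES  8

/* Hypothetical stand-in for device state; not a real i915 structure. */
struct fake_gpu {
	uint32_t ggtt[NUM_PTES]; /* toy GGTT: one word per PTE */
	bool frozen;             /* set once the freeze step has run */
	int faults;              /* accesses that hit an invalid PTE */
};

/* Stand-in for the idle/suspend step (i915_drm_freeze() in the driver). */
static void fake_freeze(struct fake_gpu *gpu)
{
	gpu->frozen = true;
}

/* Clear the valid bit in every PTE, leaving the other bits untouched. */
static void clear_pte_valid_bits(struct fake_gpu *gpu)
{
	for (int i = 0; i < NUM_PTES; i++)
		gpu->ggtt[i] &= ~PTE_VALID;
}

/* Model a GPU access: going through an invalid PTE counts as a fault. */
static void fake_gpu_access(struct fake_gpu *gpu, int idx)
{
	assert(idx >= 0 && idx < NUM_PTES);
	if (!(gpu->ggtt[idx] & PTE_VALID))
		gpu->faults++;
}

/*
 * Proposed debug ordering: freeze first, then invalidate the PTEs, then
 * give any deferred access a chance to fault before the unload continues.
 * late_access_idx < 0 means the GPU really was idle after the freeze.
 * Returns the number of faults seen.
 */
static int debug_unload(struct fake_gpu *gpu, int late_access_idx)
{
	fake_freeze(gpu);
	clear_pte_valid_bits(gpu);
	if (late_access_idx >= 0)      /* stand-in for the 1-2 s wait */
		fake_gpu_access(gpu, late_access_idx);
	return gpu->faults;
}
```

A nonzero return from debug_unload() corresponds to the case we want to
catch: something still touched the GGTT after the freeze claimed the
device was idle.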
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre