[Intel-gfx] [PATCH 2/2] [v2] drm/i915: Disable GGTT PTEs on GEN6+ suspend

Thu Oct 17 12:06:12 CEST 2013

At Thu, 17 Oct 2013 10:24:07 +0100,
Chris Wilson wrote:
> 
> On Thu, Oct 17, 2013 at 09:41:09AM +0200, Takashi Iwai wrote:
> > At Wed, 16 Oct 2013 18:27:33 +0100,
> > Chris Wilson wrote:
> > > 
> > > On Wed, Oct 16, 2013 at 10:06:27AM -0700, Ben Widawsky wrote:
> > > > On Wed, Oct 16, 2013 at 05:58:31PM +0100, Chris Wilson wrote:
> > > > > So clearing the valid bit should result in the GPU reporting errors for
> > > > > delayed accesses, but none were reported?
> > > > 
> > > > So I can't actually reproduce the problem for some reason. Paulo will
> > > > need to answer. One theory is the fault information is lost on suspend.
> > > > 
> > > > The original patch put faults both in suspend, and resume. After this, I
> > > > asked Paulo to wedge the GPU, and there I saw faults.
> > > 
> > > If we can capture the error, and it should be very possible to do so, we
> > > should be able to pinpoint the cause quite quickly. If it is just deferred
> > > writes, it should also be a problem across module unload - which should
> > > be easier for getting debug information out.
> > 
> > The bug is only about S4, thus it's not so easy to capture anything in
> > the resume kernel, as all lost after transition to the restored
> > kernel.
> > 
> > BTW, I also suspect that the similar problem might still happen in
> > other cases, e.g. via kexec even with this patch.
> 
> How are devices idled (or suspended) prior to hibernate resume or kexec?
> From my reading, i915_drm_freeze() should be called before the resume
> image is executed.

I also didn't follow the complete (and complex) flow, but from my
understanding,

S4 case:
hibernation_restore() in kernel/power/hibernate.c calls
dpm_suspend_start(PMSG_QUIESCE), which invokes pm->freeze in the end.
Since there is no pm->freeze_noirq, dpm_suspend_end(PMSG_QUIESCE) in
resume_target_kernel() shouldn't matter.
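
To make the S4 case concrete, here is a minimal userspace sketch of how a PM event could be mapped to a driver callback. It is only a toy model, not kernel code: the toy_* names, the EV_* enum and the single-callback ops table are all assumptions made for illustration. The mapping itself (QUIESCE and FREEZE both ending up in the freeze callback, and a missing noirq callback being a no-op) follows the description above.

```c
#include <assert.h>
#include <stddef.h>

/* Toy userspace model; all names here are hypothetical, not kernel API. */
enum toy_pm_event { EV_SUSPEND, EV_FREEZE, EV_QUIESCE, EV_HIBERNATE };

typedef int (*toy_cb)(void);

struct toy_pm_ops {
	toy_cb suspend;
	toy_cb freeze;
	toy_cb freeze_noirq;
	toy_cb poweroff;
};

/* Pick the "normal" phase callback for an event. */
static toy_cb toy_pm_op(const struct toy_pm_ops *ops, enum toy_pm_event ev)
{
	assert(ops != NULL);
	switch (ev) {
	case EV_SUSPEND:
		return ops->suspend;
	case EV_FREEZE:
	case EV_QUIESCE:	/* QUIESCE shares the freeze callback */
		return ops->freeze;
	case EV_HIBERNATE:
		return ops->poweroff;
	}
	return NULL;
}

/* Pick the "noirq" phase callback; only the freeze variant is modelled. */
static toy_cb toy_pm_noirq_op(const struct toy_pm_ops *ops, enum toy_pm_event ev)
{
	assert(ops != NULL);
	if (ev == EV_FREEZE || ev == EV_QUIESCE)
		return ops->freeze_noirq;
	return NULL;
}

/* A missing callback is simply skipped and counts as success. */
static int toy_run(toy_cb cb)
{
	return cb ? cb() : 0;
}

static int toy_freeze_calls;

static int toy_i915_freeze(void)
{
	toy_freeze_calls++;
	return 0;
}

/* Toy driver: provides freeze only, no freeze_noirq. */
static const struct toy_pm_ops toy_i915_pm_ops = {
	.freeze = toy_i915_freeze,
};
```

With this table, running the QUIESCE event through toy_pm_op() calls toy_i915_freeze() once, while the noirq phase finds a NULL callback and does nothing.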

kexec case:
Usually the shutdown ops are called, via kernel_restart_prepare() ->
device_shutdown().  So it's the same as a normal shutdown.  When the
KEXEC_PRESERVE_CONTEXT flag is set (where it works like
suspend/resume), dpm_suspend_start(PMSG_FREEZE) is called instead,
which again invokes pm->freeze.

The i915 driver has no shutdown ops, and that's a good thing (we'd
like to see the messages), but it means the device is still active
during a normal kexec until the very last stage, I'm afraid.
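
The concern about the normal kexec path can be sketched the same way. The following is a toy userspace model under stated assumptions, not the kernel's device_shutdown(): struct toy_dev, toy_quiesce() and toy_device_shutdown() are hypothetical names, and "active" is just a flag standing in for a device that may still be doing DMA. It only shows that a driver without a shutdown callback is skipped and so stays active.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy userspace model; all names here are hypothetical, not kernel API. */
struct toy_dev {
	const char *name;
	bool active;				/* stand-in for "may still touch memory" */
	void (*shutdown)(struct toy_dev *);	/* NULL if the driver has no shutdown op */
};

/* A shutdown op that quiesces the device. */
static void toy_quiesce(struct toy_dev *dev)
{
	dev->active = false;
}

/*
 * Call each device's shutdown op if it has one.
 * Returns how many devices are still active afterwards.
 */
static size_t toy_device_shutdown(struct toy_dev *devs, size_t n)
{
	size_t still_active = 0;

	assert(devs != NULL || n == 0);
	for (size_t i = 0; i < n; i++) {
		if (devs[i].shutdown)
			devs[i].shutdown(&devs[i]);
		if (devs[i].active)
			still_active++;
	}
	return still_active;
}
```

Given one device with toy_quiesce as its shutdown op and one with a NULL shutdown op, toy_device_shutdown() leaves exactly the second one active, which is the situation described for i915 above.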

> What we can do is to make the first action of
> i915_driver_unload() be i915_drm_freeze(), then clear the PTE valid
> bits and wait a second or two for a GPU fault before proceeding with an
> unload. By doing that we can debug our suspend paths - all that remains
> is the possibility of rogue hardware state. And that should show up by
> breaking module load.

Well, I somehow think the problem happens at the transition to the
restored image, where the memory map is completely different from
that of the boot kernel, and this leads to memory corruption in the
/proc dcache or the like.  In the module unload / reload case, the
rest of memory is preserved, so it's a fairly different situation.

Of course, I'm not against testing this at all.  Just trying to
understand what's going on...

thanks,

Takashi