[Intel-gfx] S4 resume breakage with i915 driver

Takashi Iwai tiwai at suse.de
Thu Aug 25 15:54:58 UTC 2016


On Thu, 25 Aug 2016 17:32:41 +0200,
Chris Wilson wrote:
> 
> On Thu, Aug 25, 2016 at 03:11:35PM +0200, Takashi Iwai wrote:
> > Hi,
> > 
> > while debugging our 4.4.x based SLE12-SP2 kernel, we noticed that S4
> > resume is broken on many machines with i915 gfx, even on the upstream
> > 4.7 kernel.  Originally it was reported by Intel about SKL machines,
> > but later on, we found that it hits on many other older chips (at
> > least HSW), too.
> > 
> > This was hard to identify because there have been other S4 resume bugs
> > until recently.  But even after these fixes, when the system is tested
> > on i915 gfx, the system gets memory corruption or kernel Oops sooner
> > or later after a few (usually < 10) S4 cycles.
> > 
> > As the bug happened between 4.2 and 4.3, I bisected and it pointed to
> > the commit:
> > 
> >   4c436d55b279bbc6b02aac02e7dc683fc09f884e
> >     drm/i915: Enable Resource Streamer state save/restore on MI_SET_CONTEXT
> > 
> > Indeed, reverting this on top of our 4.4.x kernel seems to make S4
> > working stably (at least on a test machine).
> > 
> > Does this make any sense to you guys?
> 
> It could explain a HSW issue, but not SKL and not BDW by default (using
> execlists).

Hrm, IIRC, SKL didn't work without the commit.  But it doesn't matter
if this doesn't has a bad influence on SKL, either.

> All machine big core, no Braswell or other !llc device?

Yeah, tested only on SKL, HSW and IVY.  No small boxes.

> > Since the commit message doesn't give a good explanation about this
> > change, I wonder what's the purpose of this commit.  Was it merely
> > optimization?
> 
> It's a step towards enabling the Resource Streamer hardware block in the
> GPU, kind of a glorified prefetch. Userspace also has to notify the GPU
> to enable it for a batch as well.
> 
> The only doubt in mind my about that patch was whether the hw ignored
> the RS_RESTORE_STATE if MI_RESTORE_INHIBIT as expected:
> 
> diff --git a/drivers/gpu/drm/i915/i915_gem_context.c b/drivers/gpu/drm/i915/i915_gem_context.c
> index 3c97f0e7a003..c618bb86aeb9 100644
> --- a/drivers/gpu/drm/i915/i915_gem_context.c
> +++ b/drivers/gpu/drm/i915/i915_gem_context.c
> @@ -574,10 +574,15 @@ mi_set_context(struct drm_i915_gem_request *req, u32 hw_flags)
>  
>         /* These flags are for resource streamer on HSW+ */
>         if (IS_HASWELL(dev_priv) || INTEL_GEN(dev_priv) >= 8)
> -               flags |= (HSW_MI_RS_SAVE_STATE_EN | HSW_MI_RS_RESTORE_STATE_EN);
> +               flags |= HSW_MI_RS_SAVE_STATE_EN;
>         else if (INTEL_GEN(dev_priv) < 8)
> -               flags |= (MI_SAVE_EXT_STATE_EN | MI_RESTORE_EXT_STATE_EN);
> -
> +               flags |= MI_SAVE_EXT_STATE_EN;
> +       if ((flags & MI_RESTORE_INHIBIT) == 0) {
> +               if (IS_HASWELL(dev_priv) || INTEL_GEN(dev_priv) >= 8)
> +                       flags |= HSW_MI_RS_RESTORE_STATE_EN;
> +               else if (INTEL_GEN(dev_priv) < 8)
> +                       flags |= MI_RESTORE_EXT_STATE_EN;
> +       }
>  
> 
> But this code is only executed for Haswell, and we only inhibit restore
> from uninitialised contexts. The biggest fear is that we are restoring
> garbage from userspace and making the GPU do strange things.
> 
> That still doesn't explain stray writes into system memory, that is more
> likely our GTT being broken.

Yeah, that's my expectation.  But the bisection didn't reach it.
So maybe this casually surfaced the hidden GTT or cache issue.

> Could you confirm that bisect has any
> impact on the other machines, and of course double check the result?

You're asking bisection on all machines from the scratch for such a
bug taking so long time to reproduce, and especially for i915 code
path, that is known to be deadly difficult due to various merge
commits?  I sincerely decline the offer :)

Yes, the result was double-checked.  This has a positive effect on all
our tested machines.  Maybe But it's hard to tell exactly whether this is
the 100% culprit.  As said, there have been multiple S4 bugs, so far.
IVY worked without this patch (after x86 fixes), but obviously this
had no negative effect, either.


thanks,

Takashi


More information about the Intel-gfx mailing list