[Intel-gfx] [PATCH v2] drm/i915: fix SFC reset flow

Chris Wilson chris at chris-wilson.co.uk
Thu Sep 19 09:48:45 UTC 2019


Quoting Tvrtko Ursulin (2019-09-19 10:34:15)
> 
> On 19/09/2019 02:53, Daniele Ceraolo Spurio wrote:
> > Our assumption that the we can ask the HW to lock the SFC even if not
> > currently in use does not match the HW commitment. The expectation from
> > the HW is that SW will not try to lock the SFC if the engine is not
> > using it and if we do that the behavior is undefined; on ICL the HW
> > ends up to returning the ack and ignoring our lock request, but this is
> > not guaranteed and we shouldn't expect it going forward.
> > 
> > Also, failing to get the ack while the SFC is in use means that we can't
> > cleanly reset it, so fail the engine reset in that scenario.
> > 
> > v2: drop rmw change, keep the log as debug and handle failure (Chris),
> >      improve comments (Tvrtko).
> > 
> > Reported-by: Owen Zhang <owen.zhang at intel.com>
> > Signed-off-by: Daniele Ceraolo Spurio <daniele.ceraolospurio at intel.com>
> > Cc: Tvrtko Ursulin <tvrtko.ursulin at linux.intel.com>
> > Cc: Chris Wilson <chris at chris-wilson.co.uk>
> > ---
> >   static void gen11_unlock_sfc(struct intel_engine_cs *engine)
> > @@ -430,12 +436,21 @@ static int gen11_reset_engines(struct intel_gt *gt,
> >               for_each_engine_masked(engine, gt->i915, engine_mask, tmp) {
> >                       GEM_BUG_ON(engine->id >= ARRAY_SIZE(hw_engine_mask));
> >                       hw_mask |= hw_engine_mask[engine->id];
> > -                     hw_mask |= gen11_lock_sfc(engine);
> > +                     ret = gen11_lock_sfc(engine, &hw_mask);
> > +                     if (ret)
> > +                             goto sfc_unlock;
> 
> Break on first failure looks unsafe to me. I think it would be more 
> robust to continue, no? Like if we have been asked to reset multiple 
> engines and only one failed, why not do the ones we can?

Any failure means a fallback to a full device reset. It doesn't matter
if we could reset one of two+ engines at that point, it's past the point
of no return.

> >       ret = gen6_hw_domain_reset(gt, hw_mask);
> >   
> > +sfc_unlock:
> > +     /*
> > +      * we unlock the SFC based on the lock status and not the result of
> > +      * gen11_lock_sfc to make sure that we clean properly if something
> > +      * wrong happened during the lock (e.g. lock acquired after timeout
> > +      * expiration).
> > +      */
> >       if (engine_mask != ALL_ENGINES)
> >               for_each_engine_masked(engine, gt->i915, engine_mask, tmp)
> >                       gen11_unlock_sfc(engine);
> > 
> 
> So you decided not to read the register and cross check?

Very meh, that check didn't seem like it would improve our ability to
handle resets. If you wanted to actually check, we should check the
lock_ack_bit is cleared as well. (Or at least check that is clear before
we start the next lock_sfc().) I think skipping an unnecessary
gen11_lock_sfc() is a solid improvement by itself (I'm hopeful that's
enough to make icl more robust...). Tightening up the unlock can be done
separately.
-Chris


More information about the Intel-gfx mailing list