[Intel-gfx] [PATCH] drm/i915: Clear local engine-needs-reset bit if in progress elsewhere

Tue Aug 29 15:17:46 UTC 2017

Quoting Jeff McGee (2017-08-29 16:04:17)
> On Tue, Aug 29, 2017 at 10:07:18AM +0100, Chris Wilson wrote:
> > Quoting Jeff McGee (2017-08-28 21:18:44)
> > > On Mon, Aug 28, 2017 at 08:44:48PM +0100, Chris Wilson wrote:
> > > > Quoting jeff.mcgee at intel.com (2017-08-28 20:25:30)
> > > > > From: Jeff McGee <jeff.mcgee at intel.com>
> > > > > 
> > > > > If someone else is resetting the engine we should clear our own bit as
> > > > > part of skipping that engine. Otherwise we will later believe that it
> > > > > has not been reset successfully and then trigger full gpu reset. If the
> > > > > other guy's reset actually fails, he will trigger the full gpu reset.
> > > > 
> > > > The reason we did continue on to the global reset was to serialise
> > > > i915_handle_error() with the other thread. Not a huge issue, but a
> > > > reasonable property to keep -- and we definitely want a to explain why
> > > > only one reset at a time is important.
> > > > 
> > > > bool intel_engine_lock_reset() {
> > > >       if (!test_and_set_bit(I915_RESET_ENGINE + engine->id,
> > > >                             &engine->i915->gpu_error.flags))
> > > >               return true;
> > > > 
> > > >       intel_engine_wait_for_reset(engine);
> > > The current code doesn't wait for the other thread to finish the reset, but
> > > this would add that wait. 
> > 
> > Pardon? If we can't reset the engine, we go to the full reset which is
> > serialised, both with individual engine resets and other globals.
> > 
> > > Did you intend that as an additional change to
> > > the current code? I don't think it is necessary. Each thread wants to
> > > reset some subset of engines, so it seems the thread can safely exit as soon
> > > as it knows each of those engines has been reset or is being reset as part
> > > of another thread that got the lock first. If any of the threads fail to
> > > reset an engine they "own", then full gpu reset is assured.
> > 
> > It's unexpected for this function to return before the reset.
> > -Chris
> 
> I'm a bit confused, so let's go back to the original code that I was trying
> to fix:
> 
> 
>         /*
>          * Try engine reset when available. We fall back to full reset if
>          * single reset fails.
>          */
>         if (intel_has_reset_engine(dev_priv)) {
>                 for_each_engine_masked(engine, dev_priv, engine_mask, tmp) {
>                         BUILD_BUG_ON(I915_RESET_MODESET >= I915_RESET_ENGINE);
>                         if (test_and_set_bit(I915_RESET_ENGINE + engine->id,
>                                              &dev_priv->gpu_error.flags))
>                                 continue;
> 
>                         if (i915_reset_engine(engine, 0) == 0)
>                                 engine_mask &= ~intel_engine_flag(engine);
> 
>                         clear_bit(I915_RESET_ENGINE + engine->id,
>                                   &dev_priv->gpu_error.flags);
>                         wake_up_bit(&dev_priv->gpu_error.flags,
>                                     I915_RESET_ENGINE + engine->id);
>                 }
>         }
> 
>         if (!engine_mask)
>                 goto out;
> 
>         /* Full reset needs the mutex, stop any other user trying to do so. */
> 
> Let's say that 2 threads are here intending to reset render. #1 gets the lock
> and starts the render engine-only reset. #2 fails to get the lock which implies
> that someone else is in the process of resetting the render engine (with single
> engine reset or full gpu reset). #2 continues on without waiting but doesn't
> clear the render bit in engine_mask. So #2 will proceed to initiate a full
> gpu reset when it may not be necessary. That's the problem I was trying
> to address with my initial patch. Do you agree that #2 must clear this bit
> to avoid always triggering full gpu reset? If the engine-only reset done by
> #1 fails, #1 will do the fallback to full gpu reset, so there is no risk that
> we would miss the full gpu reset if it is really needed.
> 
> Then there is the question of whether #2 should wait around for the
> render engine reset by #1 to complete. It doesn't in current code and I don't
> see why it needs to. But that can be a separate discussion from the above.

It very much does in the current code. If we can not do the per-engine
reset, it falls back to the global reset. The global reset is serialised
with itself and the per-engine resets. Ergo it waits, and that was
intentional.
-Chris