[Intel-gfx] [PATCH] drm/i915: Clear local engine-needs-reset bit if in progress elsewhere

Tue Aug 29 15:22:40 UTC 2017

Quoting Jeff McGee (2017-08-28 20:46:00)
> On Mon, Aug 28, 2017 at 12:41:58PM -0700, Michel Thierry wrote:
> > On 28/08/17 12:25, jeff.mcgee at intel.com wrote:
> > >From: Jeff McGee <jeff.mcgee at intel.com>
> > >
> > >If someone else is resetting the engine we should clear our own bit as
> > >part of skipping that engine. Otherwise we will later believe that it
> > >has not been reset successfully and then trigger full gpu reset. If the
> > >other guy's reset actually fails, he will trigger the full gpu reset.
> > >
> > 
> > Did you hit this by manually setting wedged to 'x' ring repeatedly?
> > 
> I haven't actually reproduced it. Have just been looking at the code a
> lot to try to develop reset for preemption enforcement. The implementation
> will call i915_handle_error from another work item that can run concurrent
> with hangcheck.

Note to hit it in practice is a nasty bug. The assumption is that between
a pair of resets there was sufficient time for the engine to recover,
and so if we reset too quickly we conclude that the reset/recovery
mechanism is broken.

And if you do start playing with fast resets, you very quickly find that
kthread_park is a livelock waiting to happen.
-Chris