[Intel-gfx] [PATCH] drm/i915: Park the breadcrumbs signaler across a GPU reset

Mon Feb 13 10:34:26 UTC 2017

Chris Wilson <chris at chris-wilson.co.uk> writes:

> On Mon, Feb 13, 2017 at 11:56:46AM +0200, Mika Kuoppala wrote:
>> Chris Wilson <chris at chris-wilson.co.uk> writes:
>> 
>> > The signal threads may be running concurrently with the GPU reset. The
>> > completion from the GPU run asynchronous with the reset and two threads
>> > may see different snapshots of the state, and the signaler may mark a
>> > request as complete as we try to reset it. We don't tolerate 2 different
>> > views of the same state and complain if we try to mark a request as
>> > failed if it is already complete. Disable the signal threads during
>> > reset to prevent this conflict (even though the conflict implies that
>> > the state we resetting to is invalid, we have already made our
>> > decision!).
>> >
>> > References: https://bugs.freedesktop.org/show_bug.cgi?id=99671
>> > Signed-off-by: Chris Wilson <chris at chris-wilson.co.uk>
>> > Cc: Mika Kuoppala <mika.kuoppala at intel.com>
>> > Cc: Tvrtko Ursulin <tvrtko.ursulin at intel.com>
>> > ---
>> >  drivers/gpu/drm/i915/i915_gem.c          | 16 +++++++++++++++-
>> >  drivers/gpu/drm/i915/intel_breadcrumbs.c |  3 +++
>> >  2 files changed, 18 insertions(+), 1 deletion(-)
>> >
>> > diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
>> > index 48922ff454e6..95582295b219 100644
>> > --- a/drivers/gpu/drm/i915/i915_gem.c
>> > +++ b/drivers/gpu/drm/i915/i915_gem.c
>> > @@ -35,6 +35,7 @@
>> >  #include "intel_frontbuffer.h"
>> >  #include "intel_mocs.h"
>> >  #include <linux/dma-fence-array.h>
>> > +#include <linux/kthread.h>
>> >  #include <linux/reservation.h>
>> >  #include <linux/shmem_fs.h>
>> >  #include <linux/slab.h>
>> > @@ -2643,6 +2644,17 @@ int i915_gem_reset_prepare(struct drm_i915_private *dev_priv)
>> >  	for_each_engine(engine, dev_priv, id) {
>> >  		struct drm_i915_gem_request *request;
>> >  
>> > +		/* Prevent the signaler thread from updating the request
>> > +		 * state (by calling dma_fence_signal) as we are processing
>> > +		 * the reset. The write from the GPU of the seqno is
>> > +		 * asynchronous and the signaler thread may see a different
>> > +		 * value to us and declare the request complete, even though
>> > +		 * the reset routine have picked that request as the active
>> > +		 * (incomplete) request. This conflict is not handled
>> > +		 * gracefully!
>> > +		 */
>> > +		kthread_park(engine->breadcrumbs.signaler);
>> > +
>> 
>> I was wondering here if it would be best to disable the tasklet first.
>> But stopping the car before we stop the engine seems wise thing to do.
>
> Note that we would have to do two passes over the engines to stop the
> signalers, and even then we still have other sources (other devices)
> that can cause tasklets to be invoked. There is not a foolproof answer
> here, they each serve a different purpose and can be disabled
> orthogonally.
>  
>> Another unrelated thing is that we might be better off just to
>> make find_active_request always seqno coherent and fork a
>> noncoherent version for error capture.
>
> It's slightly more than that. In the worst case irq_barrier is not
> sufficient, and a second call to active_request may return a different
> victim - which again raises some awkwards issues in how we assign blame.
>
> At least we now do the irq_barrier hammer once at the start in reset_prepare,
> so we should be better, but I'm wondering if we want to store the
> request from prepare and then double check in the actual reset.
>

#1 store seqno from hangcheck
#2 get mutex for reset
#3 barrier
#4 find_request (only once) 
#5 on prepare path, check the submachinery
against this req and if inconsistent, queue hangcheck
and return from prepare without resetting.

?
-Mika