[Intel-gfx] [PATCH 5/5] drm/i915: Prevent bonded requests from overtaking each other on preemption

Mon Sep 23 13:53:21 UTC 2019

Quoting Tvrtko Ursulin (2019-09-23 14:48:44)
> 
> On 21/09/2019 10:55, Chris Wilson wrote:
> > Force bonded requests to run on distinct engines so that they cannot be
> > shuffled onto the same engine where timeslicing will reverse the order.
> > A bonded request will often wait on a semaphore signaled by its master,
> > creating an implicit dependency -- if we ignore that implicit dependency
> > and allow the bonded request to run on the same engine and before its
> > master, we will cause a GPU hang. [Whether it will hang the GPU is
> > debatable, we should keep on timeslicing and each timeslice should be
> > "accidentally" counted as forward progress, in which case it should run
> > but at one-half to one-third speed.]
> > 
> > We can prevent this inversion by restricting which engines we allow
> > ourselves to jump to upon preemption, i.e. baking in the arrangement
> > established at first execution. (We should also consider capturing the
> > implicit dependency using i915_sched_add_dependency(), but first we need
> > to think about the constraints that requires on the execution/retirement
> > ordering.)
> > 
> > Fixes: 8ee36e048c98 ("drm/i915/execlists: Minimalistic timeslicing")
> > References: ee1136908e9b ("drm/i915/execlists: Virtual engine bonding")
> > Testcase: igt/gem_exec_balancer/bonded-slice
> > Signed-off-by: Chris Wilson <chris at chris-wilson.co.uk>
> > Cc: Mika Kuoppala <mika.kuoppala at linux.intel.com>
> > Cc: Tvrtko Ursulin <tvrtko.ursulin at intel.com>
> > Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin at intel.com>
> > ---
> >   drivers/gpu/drm/i915/gt/intel_lrc.c | 20 ++++++++++++--------
> >   1 file changed, 12 insertions(+), 8 deletions(-)
> > 
> > diff --git a/drivers/gpu/drm/i915/gt/intel_lrc.c b/drivers/gpu/drm/i915/gt/intel_lrc.c
> > index 3eadd294bcd7..9872bb4c99fc 100644
> > --- a/drivers/gpu/drm/i915/gt/intel_lrc.c
> > +++ b/drivers/gpu/drm/i915/gt/intel_lrc.c
> > @@ -3846,18 +3846,22 @@ static void
> >   virtual_bond_execute(struct i915_request *rq, struct dma_fence *signal)
> >   {
> >       struct virtual_engine *ve = to_virtual_engine(rq->engine);
> > +     intel_engine_mask_t allowed, exec;
> >       struct ve_bond *bond;
> >   
> > +     allowed = ~to_request(signal)->engine->mask;
> > +
> >       bond = virtual_find_bond(ve, to_request(signal)->engine);
> > -     if (bond) {
> > -             intel_engine_mask_t old, new, cmp;
> > +     if (bond)
> > +             allowed &= bond->sibling_mask;
> >   
> > -             cmp = READ_ONCE(rq->execution_mask);
> > -             do {
> > -                     old = cmp;
> > -                     new = cmp & bond->sibling_mask;
> > -             } while ((cmp = cmpxchg(&rq->execution_mask, old, new)) != old);
> > -     }
> > +     /* Restrict the bonded request to run on only the available engines */
> > +     exec = READ_ONCE(rq->execution_mask);
> > +     while (!try_cmpxchg(&rq->execution_mask, &exec, exec & allowed))
> > +             ;
> > +
> > +     /* Prevent the master from being re-run on the bonded engines */
> > +     to_request(signal)->execution_mask &= ~allowed;
> 
> I had an open here whether we should also fix the slave. At re-submit 
> time what is preventing the slave being put on the same engine as the 
> master?

We removed the master engine from the list of allowed engines for the
slave, and we never reset the execution_mask on unsubmit (and we don't
re-execute the bonding on resubmit). That's the hard bit I left open for
the reader.
-Chris