[Intel-gfx] [PATCH] drm/i915: Prevent bonded requests from overtaking each other on preemption

Fri Sep 20 15:11:54 UTC 2019

Quoting Tvrtko Ursulin (2019-09-20 15:51:35)
> 
> On 20/09/2019 13:42, Chris Wilson wrote:
> > Quoting Tvrtko Ursulin (2019-09-20 13:24:47)
> >>
> >> On 20/09/2019 09:36, Chris Wilson wrote:
> >>> Force bonded requests to run on distinct engines so that they cannot be
> >>> shuffled onto the same engine where timeslicing will reverse the order.
> >>> A bonded request will often wait on a semaphore signaled by its master,
> >>> creating an implicit dependency -- if we ignore that implicit dependency
> >>> and allow the bonded request to run on the same engine and before its
> >>> master, we will cause a GPU hang.
> >>>
> >>> We can prevent this inversion by restricting which engines we allow
> >>> ourselves to jump to upon preemption, i.e. baking in the arrangement
> >>> established at first execution. (We should also consider capturing the
> >>> implicit dependency using i915_sched_add_dependency(), but first we need
> >>> to think about the constraints that requires on the execution/retirement
> >>> ordering.)
> >>>
> >>> Fixes: 8ee36e048c98 ("drm/i915/execlists: Minimalistic timeslicing")
> >>> References: ee1136908e9b ("drm/i915/execlists: Virtual engine bonding")
> >>> Signed-off-by: Chris Wilson <chris at chris-wilson.co.uk>
> >>> Cc: Mika Kuoppala <mika.kuoppala at linux.intel.com>
> >>> Cc: Tvrtko Ursulin <tvrtko.ursulin at intel.com>
> >>> ---
> >>>    drivers/gpu/drm/i915/gt/intel_lrc.c | 19 +++++++++++--------
> >>>    1 file changed, 11 insertions(+), 8 deletions(-)
> >>>
> >>> diff --git a/drivers/gpu/drm/i915/gt/intel_lrc.c b/drivers/gpu/drm/i915/gt/intel_lrc.c
> >>> index a99166a2d2eb..7920649e4d87 100644
> >>> --- a/drivers/gpu/drm/i915/gt/intel_lrc.c
> >>> +++ b/drivers/gpu/drm/i915/gt/intel_lrc.c
> >>> @@ -3755,18 +3755,21 @@ static void
> >>>    virtual_bond_execute(struct i915_request *rq, struct dma_fence *signal)
> >>>    {
> >>>        struct virtual_engine *ve = to_virtual_engine(rq->engine);
> >>> +     intel_engine_mask_t allowed, exec;
> >>>        struct ve_bond *bond;
> >>>    
> >>>        bond = virtual_find_bond(ve, to_request(signal)->engine);
> >>> -     if (bond) {
> >>> -             intel_engine_mask_t old, new, cmp;
> >>> +     if (!bond)
> >>> +             return;
> >>>    
> >>> -             cmp = READ_ONCE(rq->execution_mask);
> >>> -             do {
> >>> -                     old = cmp;
> >>> -                     new = cmp & bond->sibling_mask;
> >>> -             } while ((cmp = cmpxchg(&rq->execution_mask, old, new)) != old);
> >>> -     }
> >>> +     /* Restrict the bonded request to run on only the slaved engines */
> >>> +     allowed = bond->sibling_mask & ~to_request(signal)->engine->mask;
> >>
> >> Hmm.. isn't it a miss on the uapi level that we allow master to be
> >> mentioned in the list of bonds? That's the only scenario where this line
> >> does something I think. So should we just forbid this setup on the uapi
> >> level?
> > 
> > That's just a lot of digging!
> 
> It's simple, in set_engines__bond in the bonds loop we just do "if 
> (master == bond) oh_no_you_wont". Am I missing something?

There doesn't even need to be a bond, which is something I forgot
to handle. So I'm looking at something more like

static void
virtual_bond_execute(struct i915_request *rq, struct dma_fence *signal)
{
        struct virtual_engine *ve = to_virtual_engine(rq->engine);
        intel_engine_mask_t allowed, exec;
        struct ve_bond *bond;

        allowed = ~to_request(signal)->engine->mask;

        bond = virtual_find_bond(ve, to_request(signal)->engine);
        if (bond)
                allowed &= bond->sibling_mask;

        /* Restrict the bonded request to run on only the available engines */
        exec = READ_ONCE(rq->execution_mask);
        while (!try_cmpxchg(&rq->execution_mask, &exec, exec & allowed))
                ;

        /* Prevent the master from being re-run on the bonded engines */
        to_request(signal)->execution_mask &= ~allowed;
}

(The joy of trying to write a test case.)
-Chris