[Intel-gfx] [PATCH 6/7] drm/i915/gt: Split the breadcrumb spinlock between global and contexts

Fri Aug 7 12:27:04 UTC 2020

Quoting Tvrtko Ursulin (2020-08-07 12:54:48)
> 
> On 07/08/2020 09:32, Chris Wilson wrote:
> > As we funnel more and more contexts into the breadcrumbs on an engine,
> > the hold time of b->irq_lock grows. As we may then contend with the
> > b->irq_lock during request submission, this increases the burden upon
> > the engine->active.lock and so directly impacts both our execution
> > latency and client latency. If we split the b->irq_lock by introducing a
> > per-context spinlock to manage the signalers within a context, we then
> > only need the b->irq_lock for enabling/disabling the interrupt and can
> > avoid taking the lock for walking the list of contexts within the signal
> > worker. Even with the current setup, this greatly reduces the number of
> > times we have to take and fight for b->irq_lock.
> > 
> > Furthermore, this closes the race between enabling the signaling context
> > while it is in the process of being signaled and removed:
> 
> Where is this race present? Hope not in the patch preceding this one?!

No, unfortunately this patch was fixing a bug as a part of the previous
series so it wasn't detected by CI until after the first pair were merged
(even though that pair were run separately just in case).

So what's happening is that as we remove the request from ce->signals,
we add a request to ce->signals ... on a different engine.

We took b->irq_lock for rq1->engine[vcs0] and then rq2->engine[vcs1].
Since we are holding the b->irq_lock on the active engine for each
request...

Ouch.

> > -     list_for_each_entry_safe(ce, cn, &b->signalers, signal_link) {
> > -             GEM_BUG_ON(list_empty(&ce->signals));
> > +             if (!spin_trylock(&ce->signal_lock))
> 
> Trylock makes me think the solution is not completely elegant even 
> before I figured it all out. So going back to the first question - can 
> you fix the race in scope of the current locking scheme to start with?

The trylock is just because we can have signal_irq_worker running on
multiple cpus simultaneously. It's nothing to do with the locking
requirements, just skipping over a wait before doing no work.
-Chris