[Intel-gfx] [PATCH] drm/i915: Reduce i915_request.lock contention for i915_request_wait
Chris Wilson
chris at chris-wilson.co.uk
Wed Jul 15 14:47:15 UTC 2020
Quoting Tvrtko Ursulin (2020-07-15 13:26:23)
>
> On 15/07/2020 13:06, Tvrtko Ursulin wrote:
> >
> > On 15/07/2020 11:50, Chris Wilson wrote:
> >> Currently, we use i915_request_completed() directly in
> >> i915_request_wait() and follow up with a manual invocation of
> >> dma_fence_signal(). This appears to cause a large number of contentions
> >> on i915_request.lock as when the process is woken up after the fence is
> >> signaled by an interrupt, we will then try and call dma_fence_signal()
> >> ourselves while the signaler is still holding the lock.
> >> dma_fence_is_signaled() has the benefit of checking the
> >> DMA_FENCE_FLAG_SIGNALED_BIT prior to calling dma_fence_signal() and so
> >> avoids most of that contention.
> >>
> >> Signed-off-by: Chris Wilson <chris at chris-wilson.co.uk>
> >> Cc: Matthew Auld <matthew.auld at intel.com>
> >> Cc: Tvrtko Ursulin <tvrtko.ursulin at intel.com>
> >> ---
> >> drivers/gpu/drm/i915/i915_request.c | 12 ++++--------
> >> 1 file changed, 4 insertions(+), 8 deletions(-)
> >>
> >> diff --git a/drivers/gpu/drm/i915/i915_request.c
> >> b/drivers/gpu/drm/i915/i915_request.c
> >> index 0b2fe55e6194..bb4eb1a8780e 100644
> >> --- a/drivers/gpu/drm/i915/i915_request.c
> >> +++ b/drivers/gpu/drm/i915/i915_request.c
> >> @@ -1640,7 +1640,7 @@ static bool busywait_stop(unsigned long timeout,
> >> unsigned int cpu)
> >> return this_cpu != cpu;
> >> }
> >> -static bool __i915_spin_request(const struct i915_request * const rq,
> >> int state)
> >> +static bool __i915_spin_request(struct i915_request * const rq, int
> >> state)
> >> {
> >> unsigned long timeout_ns;
> >> unsigned int cpu;
> >> @@ -1673,7 +1673,7 @@ static bool __i915_spin_request(const struct
> >> i915_request * const rq, int state)
> >> timeout_ns = READ_ONCE(rq->engine->props.max_busywait_duration_ns);
> >> timeout_ns += local_clock_ns(&cpu);
> >> do {
> >> - if (i915_request_completed(rq))
> >> + if (dma_fence_is_signaled(&rq->fence))
> >> return true;
> >> if (signal_pending_state(state, current))
> >> @@ -1766,10 +1766,8 @@ long i915_request_wait(struct i915_request *rq,
> >> * duration, which we currently lack.
> >> */
> >> if (IS_ACTIVE(CONFIG_DRM_I915_MAX_REQUEST_BUSYWAIT) &&
> >> - __i915_spin_request(rq, state)) {
> >> - dma_fence_signal(&rq->fence);
> >> + __i915_spin_request(rq, state))
> >> goto out;
> >> - }
> >> /*
> >> * This client is about to stall waiting for the GPU. In many cases
> >> @@ -1796,10 +1794,8 @@ long i915_request_wait(struct i915_request *rq,
> >> for (;;) {
> >> set_current_state(state);
> >> - if (i915_request_completed(rq)) {
> >> - dma_fence_signal(&rq->fence);
> >> + if (dma_fence_is_signaled(&rq->fence))
> >> break;
> >> - }
> >> intel_engine_flush_submission(rq->engine);
> >>
> >
> > In other words putting some latency back into the waiters, which is
> > probably okay, since sync waits is not our primary model.
> >
> > I have a slight concern about the remaining value of busy spinning if
> > i915_request_completed check is removed from there as well. Of course it
> > doesn't make sense to have different completion criteria between the
> > two.. We could wait a bit longer if real check in busyspin said request
> > is actually completed, just not signal it but wait for the breadcrumbs
> > to do it.
>
> What a load of nonsense.. :)
>
> Okay, I think the only real question is i915_request_completed vs
> dma_fence_signaled in __i915_spin_request. Do we want to burn CPU cycles
> waiting on GPU and breadcrumb irq work, or just the GPU.
dma_fence_is_signaled() {
if (test_bit(SIGNALED_BIT))
return true;
if (i915_request_completed()) {
dma_fence_signal();
return true;
}
return false;
}
with the indirection. So the question is whether the indirection is
worth the extra test bit. Just purely looking at the i915_request.lock
contention suggests that it probably is. For the spinner, burning a few
extra CPU cycles for *vfunc is not an issue, it's the wakeup latency,
and since we are calling dma_fence_signal() upon wakeup we do take the
spinlock without checking for an early return from the SIGNALED_BIT.
So I think it's a net positive. The alternative was to write
if (i915_request_completed()) {
if (!i915_request_is_signaled())
dma_fence_signal();
break;
}
but
if (dma_fence_is_signaled())
break;
does appear simpler, if only by virtue of hiding the details in an
inline.
-Chris
More information about the Intel-gfx
mailing list