[Intel-gfx] [PATCH] drm/i915: Reduce i915_request.lock contention for i915_request_wait

Thu Jul 16 08:41:17 UTC 2020

On 15/07/2020 15:47, Chris Wilson wrote:
> Quoting Chris Wilson (2020-07-15 15:47:15)
>> Quoting Tvrtko Ursulin (2020-07-15 13:26:23)
>>>
>>> On 15/07/2020 13:06, Tvrtko Ursulin wrote:
>>>>
>>>> On 15/07/2020 11:50, Chris Wilson wrote:
>>>>> Currently, we use i915_request_completed() directly in
>>>>> i915_request_wait() and follow up with a manual invocation of
>>>>> dma_fence_signal(). This appears to cause a large number of contentions
>>>>> on i915_request.lock as when the process is woken up after the fence is
>>>>> signaled by an interrupt, we will then try and call dma_fence_signal()
>>>>> ourselves while the signaler is still holding the lock.
>>>>> dma_fence_is_signaled() has the benefit of checking the
>>>>> DMA_FENCE_FLAG_SIGNALED_BIT prior to calling dma_fence_signal() and so
>>>>> avoids most of that contention.
>>>>>
>>>>> Signed-off-by: Chris Wilson <chris at chris-wilson.co.uk>
>>>>> Cc: Matthew Auld <matthew.auld at intel.com>
>>>>> Cc: Tvrtko Ursulin <tvrtko.ursulin at intel.com>
>>>>> ---
>>>>>    drivers/gpu/drm/i915/i915_request.c | 12 ++++--------
>>>>>    1 file changed, 4 insertions(+), 8 deletions(-)
>>>>>
>>>>> diff --git a/drivers/gpu/drm/i915/i915_request.c
>>>>> b/drivers/gpu/drm/i915/i915_request.c
>>>>> index 0b2fe55e6194..bb4eb1a8780e 100644
>>>>> --- a/drivers/gpu/drm/i915/i915_request.c
>>>>> +++ b/drivers/gpu/drm/i915/i915_request.c
>>>>> @@ -1640,7 +1640,7 @@ static bool busywait_stop(unsigned long timeout,
>>>>> unsigned int cpu)
>>>>>        return this_cpu != cpu;
>>>>>    }
>>>>> -static bool __i915_spin_request(const struct i915_request * const rq,
>>>>> int state)
>>>>> +static bool __i915_spin_request(struct i915_request * const rq, int
>>>>> state)
>>>>>    {
>>>>>        unsigned long timeout_ns;
>>>>>        unsigned int cpu;
>>>>> @@ -1673,7 +1673,7 @@ static bool __i915_spin_request(const struct
>>>>> i915_request * const rq, int state)
>>>>>        timeout_ns = READ_ONCE(rq->engine->props.max_busywait_duration_ns);
>>>>>        timeout_ns += local_clock_ns(&cpu);
>>>>>        do {
>>>>> -        if (i915_request_completed(rq))
>>>>> +        if (dma_fence_is_signaled(&rq->fence))
>>>>>                return true;
>>>>>            if (signal_pending_state(state, current))
>>>>> @@ -1766,10 +1766,8 @@ long i915_request_wait(struct i915_request *rq,
>>>>>         * duration, which we currently lack.
>>>>>         */
>>>>>        if (IS_ACTIVE(CONFIG_DRM_I915_MAX_REQUEST_BUSYWAIT) &&
>>>>> -        __i915_spin_request(rq, state)) {
>>>>> -        dma_fence_signal(&rq->fence);
>>>>> +        __i915_spin_request(rq, state))
>>>>>            goto out;
>>>>> -    }
>>>>>        /*
>>>>>         * This client is about to stall waiting for the GPU. In many cases
>>>>> @@ -1796,10 +1794,8 @@ long i915_request_wait(struct i915_request *rq,
>>>>>        for (;;) {
>>>>>            set_current_state(state);
>>>>> -        if (i915_request_completed(rq)) {
>>>>> -            dma_fence_signal(&rq->fence);
>>>>> +        if (dma_fence_is_signaled(&rq->fence))
>>>>>                break;
>>>>> -        }
>>>>>            intel_engine_flush_submission(rq->engine);
>>>>>
>>>>
>>>> In other words putting some latency back into the waiters, which is
>>>> probably okay, since sync waits is not our primary model.
>>>>
>>>> I have a slight concern about the remaining value of busy spinning if
>>>> i915_request_completed check is removed from there as well. Of course it
>>>> doesn't make sense to have different completion criteria between the
>>>> two.. We could wait a bit longer if real check in busyspin said request
>>>> is actually completed, just not signal it but wait for the breadcrumbs
>>>> to do it.
>>>
>>> What a load of nonsense.. :)
>>>
>>> Okay, I think the only real question is i915_request_completed vs
>>> dma_fence_signaled in __i915_spin_request. Do we want to burn CPU cycles
>>> waiting on GPU and breadcrumb irq work, or just the GPU.
>>
>> dma_fence_is_signaled() {
>>          if (test_bit(SIGNALED_BIT))
>>                  return true;
>>          
>>          if (i915_request_completed()) {
>>                  dma_fence_signal();
>>                  return true;
>>          }
>>          
>>          return false;
>> }
>>
>> with the indirection. So the question is whether the indirection is
>> worth the extra test bit. Just purely looking at the i915_request.lock
>> contention suggests that it probably is. For the spinner, burning a few
>> extra CPU cycles for *vfunc is not an issue, it's the wakeup latency,
>> and since we are calling dma_fence_signal() upon wakeup we do take the
>> spinlock without checking for an early return from the SIGNALED_BIT.
>> So I think it's a net positive. The alternative was to write
>>
>>          if (i915_request_completed()) {
>>                  if (!i915_request_is_signaled())
>>                          dma_fence_signal();
>>                  break;
>>          }
>>
>> but
>>
>>          if (dma_fence_is_signaled())
>>                  break;
>>
>> does appear simpler, if only by virtue of hiding the details in an
>> inline.
> 
> Or a patch to add test_bit() to dma_fence_signal, but this looked
> better.

Right I missed dma_fence_is_signaled calls i915_request_completed.

In this case the remaining question is do we care about wait ioctl 
potentially returning before the hypothetical sync fence for the same 
request is signaled? This seems like a slight change in user observable 
behaviour.

Regards,

Tvrtko