[Intel-gfx] [PATCH v10b 4/6] drm/i915: Interrupt driven fences

Tvrtko Ursulin tvrtko.ursulin at linux.intel.com
Tue Jun 21 16:27:05 UTC 2016


On 21/06/16 11:44, Tvrtko Ursulin wrote:
>
> On 17/06/16 12:05, John.C.Harrison at Intel.com wrote:
>> From: John Harrison <John.C.Harrison at Intel.com>
>>
>> The intended usage model for struct fence is that the signalled status
>> should be set on demand rather than polled. That is, there should not
>> be a need for a 'signaled' function to be called every time the status
>> is queried. Instead, 'something' should be done to enable a signal
>> callback from the hardware which will update the state directly. In
>> the case of requests, this is the seqno update interrupt. The idea is
>> that this callback will only be enabled on demand when something
>> actually tries to wait on the fence.
>>
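For illustration, the fence_ops shape implied by this is roughly the
following (a minimal sketch only; to_i915_request(), the completion
test and the interrupt helper are assumptions, not the actual patch
code):

static bool i915_fence_enable_signaling(struct fence *fence)
{
	struct drm_i915_gem_request *req = to_i915_request(fence);

	if (i915_gem_request_completed(req))
		return false;	/* already done, nothing to enable */

	/* Turn on the seqno interrupt; the notify worker will signal us. */
	i915_gem_request_enable_interrupt(req);
	return true;
}

static const struct fence_ops i915_fence_ops = {
	.get_driver_name	= i915_fence_get_driver_name,
	.get_timeline_name	= i915_fence_get_timeline_name,
	.enable_signaling	= i915_fence_enable_signaling,
	/* no .signaled hook: completion is pushed from the interrupt side */
	.wait			= fence_default_wait,
	.release		= i915_fence_release,
};
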
>> This change removes the polling test and replaces it with the callback
>> scheme. Each fence is added to a 'please poke me' list at the start of
>> i915_add_request(). The interrupt handler (via a deferred work queue)
>> then scans through the 'poke me' list when a new seqno pops out and
>> signals any matching fence/request. The fence is then removed from the
>> list so the entire request stack does not need to be scanned every
>> time. The fence is added to the list before the commands to generate
>> the seqno interrupt are added to the ring. Thus the sequence is
>> guaranteed to be race free if the interrupt is already enabled.
>>
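The 'poke me' list handling described above would then look something
like this (again only a sketch; the list, lock and work item names are
assumptions):

static void i915_gem_request_notify_worker(struct work_struct *work)
{
	struct intel_engine_cs *engine =
		container_of(work, struct intel_engine_cs, request_work);
	struct drm_i915_gem_request *req, *next;
	u32 seqno = engine->get_seqno(engine);
	LIST_HEAD(done);

	if (list_empty(&engine->fence_signal_list))
		return;		/* cheap early out before taking the lock */

	spin_lock(&engine->fence_lock);
	list_for_each_entry_safe(req, next, &engine->fence_signal_list,
				 signal_link) {
		if (!req->cancelled && !i915_seqno_passed(seqno, req->seqno))
			continue;	/* not finished yet, leave it alone */

		list_move_tail(&req->signal_link, &done);
	}
	spin_unlock(&engine->fence_lock);

	list_for_each_entry_safe(req, next, &done, signal_link) {
		list_del_init(&req->signal_link);
		fence_signal(&req->fence);	/* outside the engine lock */
		i915_gem_request_unreference(req);
	}

	/* Only wake waiters after their requests have been signalled. */
	wake_up_all(&engine->irq_queue);
}
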
>> Note that the interrupt is only enabled on demand (i.e. when
>> __wait_request() is called). Thus there is still a potential race when
>> enabling the interrupt as the request may already have completed.
>> However, this is simply solved by calling the interrupt processing
>> code immediately after enabling the interrupt and thereby checking for
>> already completed requests.
>>
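That is, the wait path does something along these lines (sketch;
irq_get() follows the existing engine vfunc naming, _notify() and its
lazy_coherency flag are per the changelog below, the rest is assumed):

static void i915_wait_request_arm_irq(struct drm_i915_gem_request *req)
{
	struct intel_engine_cs *engine = req->engine;

	if (WARN_ON(!engine->irq_get(engine)))
		return;

	/*
	 * The request may already have completed before the interrupt was
	 * enabled, so run the notify code once by hand to close the race.
	 */
	i915_gem_request_notify(engine, false /* !lazy_coherency */);
}
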
>> Lastly, the ring clean up code has the possibility to cancel
>> outstanding requests (e.g. because TDR has reset the ring). These
>> requests will never get signalled and so must be removed from the
>> signal list manually. This is done by setting a 'cancelled' flag and
>> then calling the regular notify/retire code path rather than
>> attempting to duplicate the list manipulation and clean up code in
>> multiple places. This also avoids any race condition where the
>> cancellation request might occur after/during the completion interrupt
>> actually arriving.
>>
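The cancellation path then reduces to a flag plus a call into the
normal code, roughly (sketch, reusing the assumed names from above):

static void i915_gem_cancel_signal_list(struct intel_engine_cs *engine)
{
	struct drm_i915_gem_request *req;

	spin_lock(&engine->fence_lock);
	list_for_each_entry(req, &engine->fence_signal_list, signal_link)
		req->cancelled = true;
	spin_unlock(&engine->fence_lock);

	/* The regular notify path now does the list removal and signalling. */
	i915_gem_request_notify(engine, false);
}
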
>> v2: Updated to take advantage of the request unreference no longer
>> requiring the mutex lock.
>>
>> v3: Move the signal list processing around to prevent unsubmitted
>> requests being added to the list. This was occurring on Android
>> because the native sync implementation calls the
>> fence->enable_signaling API immediately on fence creation.
>>
>> Updated after review comments by Tvrtko Ursulin. Renamed list nodes to
>> 'link' instead of 'list'. Added support for returning an error code on
>> a cancelled fence. Update list processing to be more efficient/safer
>> with respect to spinlocks.
>>
>> v5: Made i915_gem_request_submit a static as it is only ever called
>> from one place.
>>
>> Fixed up the low latency wait optimisation. The time delay between the
>> seqno value being written to memory and the driver's ISR running can be
>> significant, at least for the wait request micro-benchmark. This can
>> be greatly improved by explicitly checking for seqno updates in the
>> pre-wait busy poll loop. Also added some documentation comments to the
>> busy poll code.
>>
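The busy poll being referred to is essentially this (illustrative
sketch; the real code has more exit conditions and tuning):

static bool __i915_spin_request(struct drm_i915_gem_request *req, int state)
{
	unsigned long timeout = jiffies + 1;	/* spin for at most ~1 tick */

	do {
		/*
		 * Check the seqno directly: the value can be in memory well
		 * before the ISR and the notify worker have had a chance to
		 * run, so polling here avoids that latency entirely.
		 */
		if (i915_gem_request_completed(req))
			return true;

		if (signal_pending_state(state, current))
			break;

		cpu_relax();
	} while (time_before(jiffies, timeout));

	return false;
}
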
>> Fixed up support for the faking of lost interrupts
>> (test_irq_rings/missed_irq_rings). That is, there is an IGT test that
>> tells the driver to lose interrupts deliberately and then check that
>> everything still works as expected (albeit much slower).
>>
>> Updates from review comments: use non-irqsave spinlocking, early exit
>> on WARN and improved comments (Tvrtko Ursulin).
>>
>> v6: Updated to newer nightly and resolved conflicts around the
>> wait_request busy spin optimisation. Also fixed a race condition
>> between this early exit path and the regular completion path.
>>
>> v7: Updated to newer nightly - lots of ring -> engine renaming plus an
>> interface change on get_seqno(). Also added a list_empty() check
>> before acquiring spinlocks and doing list processing.
>>
>> v8: Updated to newer nightly - changes to request clean up code mean
>> none of the deferred free mess is needed any more.
>>
>> v9: Moved the request completion processing out of the interrupt
>> handler and into a worker thread (Chris Wilson).
>>
>> v10: Changed to an un-ordered work queue to allow parallel processing
>> of different engines. Also set the high priority flag for reduced
>> latency. Removed some unnecessary checks for invalid seqno values.
>> Improved/added some comments and WARNs. Moved a spinlock release a few
>> lines later to make the 'locked' parameter of
>> i915_gem_request_enable_interrupt redundant and removed it. Also
>> shuffled the function around in the file so as to make it static and
>> remove it from the header file. Corrected the use of
>> fence_signal_locked() to fence_signal() in the retire code. Dropped
>> the irq save part of the spin lock calls in the notify code as this is
>> no longer called from the ISR. Changed the call of
>> i915_gem_retire_requests_ring() in the reset cleanup code to
>> i915_gem_request_notify() instead as the former is just duplicating a
>> lot of operations.
>> [Review comments from Maarten Lankhorst & Tvrtko Ursulin]
>>
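For reference, the workqueue change described above corresponds to
something like the following (names are illustrative):

	/* Driver init: un-ordered so different engines can be processed in
	 * parallel, high priority to keep signalling latency down. */
	dev_priv->req_wq = alloc_workqueue("i915-request", WQ_HIGHPRI, 0);
	if (!dev_priv->req_wq)
		return -ENOMEM;

	/* User interrupt handler: kick the per-engine notify worker. */
	queue_work(dev_priv->req_wq, &engine->request_work);
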
>> Made the call to _notify() from _retire_requests_ring() conditional on
>> interrupts not being enabled as it is only a workaround for a race
>> condition in that case. Also re-instated the lazy_coherency flag (but
>> now on the _notify() function) to reduce the overhead of
>> _retire_requests_ring() calling _notify() lots and lots (even with the
>> anti-interrupt check).
>>
>> Updated for yet more nightly changes (u64 for fence context).
>>
>> v10b: Re-ordered the fence signal and IRQ release in _notify() to fix
>> a race condition when disabling interrupts. Also, moved the wake_up_all()
>> call from the IRQ handler to the worker thread to prevent the wake up
>> of waiting threads from overtaking the signalling of the request.
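
In other words, the tail of the notify path is now ordered like this
for each completed request (sketch; req->irq_enabled is an assumed
flag):

	/* Signal the fence first ... */
	list_del_init(&req->signal_link);
	fence_signal(&req->fence);

	/* ... then drop the interrupt reference that was held for it ... */
	if (req->irq_enabled) {
		engine->irq_put(engine);
		req->irq_enabled = false;
	}

	/* ... and wake any i915 waiters last, from the worker rather than
	 * the hard IRQ handler, so the wake-up cannot overtake the
	 * signalling above. */
	wake_up_all(&engine->irq_queue);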
>
> I was concerned by the second part of this change which will increase
> the wake-up latency for the waiters and asked John to do some quick
> low-level (gem_latency) testing.
>
> Preliminary results were a bit strange with small batches experiencing
> the expected slowdown but large ones being significantly faster.
>
> Since he is out this week I will try to run some more benchmarks on this.
>
> To re-iterate, concern is moving the wake_up_all(&engine->irq_queue)
> from notify_ring (hard irq) to the fence notify worker. This adds one
> additional scheduling wakeup latency to the waiters which use the i915 API.

Okay, benchmarking is done and good news first.

SynMark2 and GfxBench show what is probably only noise on almost all
tests. What regresses are OglBatch[567] by 2-5%, OglMultithread by 10%,
OglDrvState by 2% and OglDrvCtx by 30%.

As far as I remember those are not important in general so that is good.

But as expected, low-level testing with gem_latency, using three sets of
batches (zero length, 100 NOPs and 1000 NOPs), shows on average 5-6% less
throughput and around 18% worse wakeup latency.

Regards,

Tvrtko

