[Intel-gfx] [PATCH v10b 4/6] drm/i915: Interrupt driven fences

John Harrison John.C.Harrison at Intel.com
Mon Jun 27 18:28:21 UTC 2016


On 21/06/2016 17:27, Tvrtko Ursulin wrote:
>
> On 21/06/16 11:44, Tvrtko Ursulin wrote:
>>
>> On 17/06/16 12:05, John.C.Harrison at Intel.com wrote:
>>> From: John Harrison <John.C.Harrison at Intel.com>
>>>
>>> The intended usage model for struct fence is that the signalled status
>>> should be set on demand rather than polled. That is, there should not
>>> be a need for a 'signaled' function to be called every time the status
>>> is queried. Instead, 'something' should be done to enable a signal
>>> callback from the hardware which will update the state directly. In
>>> the case of requests, this is the seqno update interrupt. The idea is
>>> that this callback will only be enabled on demand when something
>>> actually tries to wait on the fence.
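
For context, that is the struct fence contract from linux/fence.h (pre
dma_fence rename): the driver supplies an .enable_signaling hook and later
calls fence_signal() itself, rather than the core repeatedly polling a
.signaled hook. A rough sketch of how the request fence hooks into that is
below; it is purely illustrative, and exactly where
i915_gem_request_enable_interrupt() gets called from differs in the patch.

static bool i915_fence_enable_signaling(struct fence *req_fence)
{
        struct drm_i915_gem_request *req =
                container_of(req_fence, struct drm_i915_gem_request, fence);

        /* Arm the user interrupt; completion is then reported via
         * fence_signal() from the notify path instead of being polled. */
        i915_gem_request_enable_interrupt(req);
        return true;
}

static const struct fence_ops i915_fence_ops = {
        .get_driver_name   = i915_fence_get_driver_name,
        .get_timeline_name = i915_fence_get_timeline_name,
        .enable_signaling  = i915_fence_enable_signaling,
        .wait              = fence_default_wait,
};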
>>>
>>> This change removes the polling test and replaces it with the callback
>>> scheme. Each fence is added to a 'please poke me' list at the start of
>>> i915_add_request(). The interrupt handler (via a deferred work queue)
>>> then scans through the 'poke me' list when a new seqno pops out and
>>> signals any matching fence/request. The fence is then removed from the
>>> list so the entire request stack does not need to be scanned every
>>> time. The fence is added to the list before the commands to generate
>>> the seqno interrupt are added to the ring. Thus the sequence is
>>> guaranteed to be race free if the interrupt is already enabled.
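
Schematically the flow is as below. This is only a sketch: fence_lock,
fence_signal_list, signal_link, req_wq and request_work are stand-in names,
and it assumes engine->fence_lock is the lock the request fences were
initialised with.

/* i915_add_request(): queue the request for signalling *before* emitting
 * the seqno write + user interrupt commands into the ring. */
spin_lock(&engine->fence_lock);
list_add_tail(&req->signal_link, &engine->fence_signal_list);
spin_unlock(&engine->fence_lock);

/* The hard interrupt handler only kicks a worker. */
queue_work(dev_priv->req_wq, &engine->request_work);

/* The worker (i915_gem_request_notify()) walks the list whenever a new
 * seqno pops out and signals whatever has completed. */
seqno = engine->get_seqno(engine);

spin_lock(&engine->fence_lock);
list_for_each_entry_safe(req, next, &engine->fence_signal_list, signal_link) {
        if (!i915_seqno_passed(seqno, req->seqno))
                break;          /* per-engine list is in submission order */

        list_del_init(&req->signal_link);
        fence_signal_locked(&req->fence);
}
spin_unlock(&engine->fence_lock);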
>>>
>>> Note that the interrupt is only enabled on demand (i.e. when
>>> __wait_request() is called). Thus there is still a potential race when
>>> enabling the interrupt as the request may already have completed.
>>> However, this is simply solved by calling the interrupt processing
>>> code immediately after enabling the interrupt and thereby checking for
>>> already completed requests.
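
I.e. something along these lines in __wait_request() (sketch only; the
exact irq_get()/notify() signatures and the irq_enabled bookkeeping are
approximations):

/* Arming the interrupt can itself race with the request having already
 * completed, so run the notify path once straight away to pick up
 * anything that finished before the IRQ was enabled. */
if (engine->irq_get(engine)) {
        req->irq_enabled = true;
        i915_gem_request_notify(engine, false /* lazy_coherency */);
}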
>>>
>>> Lastly, the ring clean up code has the possibility to cancel
>>> outstanding requests (e.g. because TDR has reset the ring). These
>>> requests will never get signalled and so must be removed from the
>>> signal list manually. This is done by setting a 'cancelled' flag and
>>> then calling the regular notify/retire code path rather than
>>> attempting to duplicate the list manipulation and clean-up code in
>>> multiple places. This also avoids any race condition where the
>>> cancellation request might occur after/during the completion interrupt
>>> actually arriving.
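
Conceptually (illustrative only; apart from the 'cancelled' flag described
above, the names and the notify() signature are approximate):

/* Ring reset / TDR cleanup: the seqno write will never land, so mark the
 * outstanding requests and push them down the normal signalling path. */
list_for_each_entry(req, &engine->fence_signal_list, signal_link)
        req->cancelled = true;

i915_gem_request_notify(engine, false /* lazy_coherency */);

/* In the notify loop a cancelled request is treated as complete regardless
 * of the current hardware seqno, and its fence can report an error status
 * to any waiters. */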
>>>
>>> v2: Updated to take advantage of the request unreference no longer
>>> requiring the mutex lock.
>>>
>>> v3: Move the signal list processing around to prevent unsubmitted
>>> requests being added to the list. This was occurring on Android
>>> because the native sync implementation calls the
>>> fence->enable_signalling API immediately on fence creation.
>>>
>>> Updated after review comments by Tvrtko Ursulin. Renamed list nodes to
>>> 'link' instead of 'list'. Added support for returning an error code on
>>> a cancelled fence. Update list processing to be more efficient/safer
>>> with respect to spinlocks.
>>>
>>> v5: Made i915_gem_request_submit a static as it is only ever called
>>> from one place.
>>>
>>> Fixed up the low latency wait optimisation. The time delay between the
>>> seqno value being written to memory and the driver's ISR running can be
>>> significant, at least for the wait request micro-benchmark. This can
>>> be greatly improved by explicitly checking for seqno updates in the
>>> pre-wait busy poll loop. Also added some documentation comments to the
>>> busy poll code.
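
The pre-wait busy poll then looks roughly like this (sketch; the real loop
also honours the existing spin-time limit and the interruptible state):

/* Before sleeping on the interrupt, spin for up to ~1 jiffy.  The seqno
 * reaches memory noticeably before the ISR -> worker -> fence_signal()
 * chain has run, so check it directly rather than relying solely on the
 * request's signalled status. */
unsigned long timeout = jiffies + 1;

while (!need_resched()) {
        if (i915_seqno_passed(engine->get_seqno(engine), req->seqno))
                break;
        if (time_after_eq(jiffies, timeout))
                break;
        cpu_relax();
}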
>>>
>>> Fixed up support for the faking of lost interrupts
>>> (test_irq_rings/missed_irq_rings). That is, there is an IGT test that
>>> tells the driver to lose interrupts deliberately and then checks that
>>> everything still works as expected (albeit much slower).
>>>
>>> Updates from review comments: use non IRQ-save spinlocking, early exit
>>> on WARN and improved comments (Tvrtko Ursulin).
>>>
>>> v6: Updated to newer nightly and resolved conflicts around the
>>> wait_request busy spin optimisation. Also fixed a race condition
>>> between this early exit path and the regular completion path.
>>>
>>> v7: Updated to newer nightly - lots of ring -> engine renaming plus an
>>> interface change on get_seqno(). Also added a list_empty() check
>>> before acquiring spinlocks and doing list processing.
>>>
>>> v8: Updated to newer nightly - changes to the request clean-up code
>>> mean none of the deferred-free mess is needed any more.
>>>
>>> v9: Moved the request completion processing out of the interrupt
>>> handler and into a worker thread (Chris Wilson).
>>>
>>> v10: Changed to an un-ordered work queue to allow parallel processing
>>> of different engines. Also set the high priority flag for reduced
>>> latency. Removed some unnecessary checks for invalid seqno values.
>>> Improved/added some comments and WARNs. Moved a spinlock release a few
>>> lines later to make the 'locked' parameter of
>>> i915_gem_request_enable_interrupt redundant and removed it. Also
>>> shuffled the function around in the file so as to make it static and
>>> remove it from the header file. Corrected the use of
>>> fence_signal_locked() to fence_signal() in the retire code. Dropped
>>> the irq save part of the spin lock calls in the notify code as this is
>>> no longer called from the ISR. Changed the call of
>>> i915_gem_retire_requests_ring() in the reset cleanup code to
>>> i915_gem_request_notify() instead as the former is just duplicating a
>>> lot of operations.
>>> [Review comments from Maarten Lankhorst & Tvrtko Ursulin]
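
The workqueue part of that amounts to something like the following
(illustrative; the workqueue name, the max_active value and the per-engine
work item are guesses at the detail):

/* Unordered so that different engines' notify work can run in parallel,
 * and high priority to keep the signalling latency down. */
dev_priv->req_wq = alloc_workqueue("i915-requests", WQ_HIGHPRI, 0);

/* One work item per engine. */
for_each_engine(engine, dev_priv)
        INIT_WORK(&engine->request_work, i915_gem_request_worker);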
>>>
>>> Made the call to _notify() from _retire_requests_ring() conditional on
>>> interrupts not being enabled, as it is only a workaround for a race in
>>> that case. Also re-instated the lazy_coherency flag (but
>>> now on the _notify() function) to reduce the overhead of
>>> _retire_requests_ring() calling _notify() lots and lots (even with the
>>> anti-interrupt check).
>>>
>>> Updated for yet more nightly changes (u64 for fence context).
>>>
>>> v10b: Re-ordered the fence signal and IRQ release in _notify() to fix a
>>> race condition when disabling interrupts. Also, moved the wake_up_all()
>>> call from the IRQ handler to the worker thread to prevent the wake up
>>> of waiting threads from overtaking the signalling of the request.
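
In other words the worker now does, in this order (sketch; request_work is
a stand-in name and the lazy_coherency value is a guess):

static void i915_gem_request_worker(struct work_struct *work)
{
        struct intel_engine_cs *engine =
                container_of(work, struct intel_engine_cs, request_work);

        /* Signal completed fences first, then wake the legacy i915
         * waiters, so a woken waiter can never observe its request as
         * still unsignalled. */
        i915_gem_request_notify(engine, false /* lazy_coherency */);
        wake_up_all(&engine->irq_queue);
}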
>>
>> I was concerned by the second part of this change, which will increase
>> the wake-up latency for the waiters, and have asked John to do some
>> quick low-level (gem_latency) testing.
>>
>> Preliminary results were a bit strange, with small batches experiencing
>> the expected slowdown but large ones being significantly faster.
>>
>> Since he is out this week I will try and run some more benchmarks on 
>> this.
>>
>> To re-iterate, the concern is moving the wake_up_all(&engine->irq_queue)
>> from notify_ring (hard irq) to the fence notify worker. This adds one
>> additional scheduling wake-up to the latency of waiters which use the
>> i915 API.
>
> Okay, benchmarking is done and good news first.
>
> SynMark2 and GfxBench show what is probably only noise on almost all
> tests. What regresses are OglBatch[567] by 2-5%, OglMultithread by 10%,
> OglDrvState by 2% and OglDrvCtx by 30%.
>
> As far as I remember those are not important in general so that is good.
>
> But, as expected, low-level testing with gem_latency, with three sets of
> batches (zero length, 100 NOPs and 1000 NOPs), shows on average 5-6% less
> throughput and around 18% worse wake-up latency.
>
> Regards,
>
> Tvrtko

I have only done a brief test so far, but moving the wake-up earlier again
seems to fix the latency problems.
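
Concretely, the idea would be to keep the fence list processing in the
worker but restore the immediate wake-up in the hard IRQ path, roughly as
below (a trimmed sketch, not final code). That does reintroduce the window
v10b was closing, where a waiter can wake before its request is signalled,
so presumably the wait path still has to tolerate that.

static void notify_ring(struct intel_engine_cs *engine)
{
        if (!intel_engine_initialized(engine))
                return;

        /* Wake __wait_request() sleepers straight from the IRQ again for
         * low latency... */
        wake_up_all(&engine->irq_queue);

        /* ...while the fence signalling still happens in the worker. */
        queue_work(engine->i915->req_wq, &engine->request_work);
}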

Maarten, do you have any other comments to add on this patch? If I 
repost it with Tvrtko's issues resolved, will you be happy to r-b it?

Thanks,
John.

