[Intel-gfx] [PATCH 09/13] drm/i915: Interrupt driven fences
Tvrtko Ursulin
tvrtko.ursulin at linux.intel.com
Fri Dec 11 08:07:22 PST 2015
On 11/12/15 15:30, John Harrison wrote:
> Reply moved from an earlier patch set which has now been superseded by
> this set...
>
> On 11/12/2015 12:17, Tvrtko Ursulin wrote:
>>
>> Hi,
>>
>> Some random comments, mostly from the point of view of solving the
>> thundering herd problem.
>>
>> On 23/11/15 11:34, John.C.Harrison at Intel.com wrote:
>>> From: John Harrison <John.C.Harrison at Intel.com>
>>>
>>> The intended usage model for struct fence is that the signalled status
>>> should be set on demand rather than polled. That is, there should not
>>> be a need for a 'signaled' function to be called every time the status
>>> is queried. Instead, 'something' should be done to enable a signal
>>> callback from the hardware which will update the state directly. In
>>> the case of requests, this is the seqno update interrupt. The idea is
>>> that this callback will only be enabled on demand when something
>>> actually tries to wait on the fence.
>>>
>>> This change removes the polling test and replaces it with the callback
>>> scheme. Each fence is added to a 'please poke me' list at the start of
>>> i915_add_request(). The interrupt handler then scans through the 'poke
>>> me' list when a new seqno pops out and signals any matching
>>> fence/request. The fence is then removed from the list so the entire
>>> request stack does not need to be scanned every time. Note that the
>>> fence is added to the list before the commands to generate the seqno
>>> interrupt are added to the ring. Thus the sequence is guaranteed to be
>>> race free if the interrupt is already enabled.
>>>
>>> Note that the interrupt is only enabled on demand (i.e. when
>>> __wait_request() is called). Thus there is still a potential race when
>>> enabling the interrupt as the request may already have completed.
>>> However, this is simply solved by calling the interrupt processing
>>> code immediately after enabling the interrupt and thereby checking for
>>> already completed requests.
>>>
>>> Lastly, the ring clean up code has the possibility to cancel
>>> outstanding requests (e.g. because TDR has reset the ring). These
>>> requests will never get signalled and so must be removed from the
>>> signal list manually. This is done by setting a 'cancelled' flag and
>>> then calling the regular notify/retire code path rather than
>>> attempting to duplicate the list manipulation and clean up code in
>>> multiple places. This also avoids any race condition where the
>>> cancellation request might occur after/during the completion interrupt
>>> actually arriving.
>>>
>>> v2: Updated to take advantage of the request unreference no longer
>>> requiring the mutex lock.
>>>
>>> v3: Move the signal list processing around to prevent unsubmitted
>>> requests being added to the list. This was occurring on Android
>>> because the native sync implementation calls the
>>> fence->enable_signalling API immediately on fence creation.
>>>
>>> Updated after review comments by Tvrtko Ursulin. Renamed list nodes to
>>> 'link' instead of 'list'. Added support for returning an error code on
>>> a cancelled fence. Update list processing to be more efficient/safer
>>> with respect to spinlocks.
>>>
>>> For: VIZ-5190
>>> Signed-off-by: John Harrison <John.C.Harrison at Intel.com>
>>> Cc: Tvrtko Ursulin <tvrtko.ursulin at intel.com>
>>> ---
>>> drivers/gpu/drm/i915/i915_drv.h | 10 ++
>>> drivers/gpu/drm/i915/i915_gem.c | 187 ++++++++++++++++++++++++++++++--
>>> drivers/gpu/drm/i915/i915_irq.c | 2 +
>>> drivers/gpu/drm/i915/intel_lrc.c | 2 +
>>> drivers/gpu/drm/i915/intel_ringbuffer.c | 2 +
>>> drivers/gpu/drm/i915/intel_ringbuffer.h | 2 +
>>> 6 files changed, 196 insertions(+), 9 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
>>> index fbf591f..d013c6d 100644
>>> --- a/drivers/gpu/drm/i915/i915_drv.h
>>> +++ b/drivers/gpu/drm/i915/i915_drv.h
>>> @@ -2187,7 +2187,12 @@ void i915_gem_track_fb(struct drm_i915_gem_object *old,
>>> struct drm_i915_gem_request {
>>> /** Underlying object for implementing the signal/wait stuff. */
>>> struct fence fence;
>>> + struct list_head signal_link;
>>> + struct list_head unsignal_link;
>>> struct list_head delayed_free_link;
>>> + bool cancelled;
>>> + bool irq_enabled;
>>> + bool signal_requested;
>>>
>>> /** On Which ring this request was generated */
>>> struct drm_i915_private *i915;
>>> @@ -2265,6 +2270,11 @@ int i915_gem_request_alloc(struct intel_engine_cs *ring,
>>> struct drm_i915_gem_request **req_out);
>>> void i915_gem_request_cancel(struct drm_i915_gem_request *req);
>>>
>>> +void i915_gem_request_submit(struct drm_i915_gem_request *req);
>>> +void i915_gem_request_enable_interrupt(struct drm_i915_gem_request *req,
>>> + bool fence_locked);
>>> +void i915_gem_request_notify(struct intel_engine_cs *ring, bool fence_locked);
>>> +
>>> int i915_create_fence_timeline(struct drm_device *dev,
>>> struct intel_context *ctx,
>>> struct intel_engine_cs *ring);
>>> diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
>>> index 171ae5f..2a0b346 100644
>>> --- a/drivers/gpu/drm/i915/i915_gem.c
>>> +++ b/drivers/gpu/drm/i915/i915_gem.c
>>> @@ -1165,6 +1165,8 @@ static int __i915_spin_request(struct drm_i915_gem_request *req)
>>>
>>> timeout = jiffies + 1;
>>> while (!need_resched()) {
>>> + i915_gem_request_notify(req->ring, false);
>>> +
>>
>> This looks a bit heavyweight, hammering on the spinlock and
>> interrupts on-off. Why is it required to call this here instead of
>> relying on the interrupts which have already been enabled?
>
> It isn't required. It is just an attempt to implement the super-fast
> spin wait rather than waiting for the full interrupt latency. Chris W
> seems to think polling for completion for a brief period before going to
> sleep is a significant performance improvement. With interrupt based
> completion, the spin will require running the interrupt processing
> code on each iteration; otherwise you are still waiting for whatever
> latency is in the interrupt path itself.
It is fine to remove it then, I think. The primary point of busy waits is
to avoid the setup cost of turning on interrupts and the scheduling latency
of putting the waiter to sleep.
Your code already enables user interrupts before the busy wait, so you can
spin on i915_gem_request_completed, which will get signalled from the irq
handler at little cost.
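I.e. roughly something like this - an untested sketch of the spin loop from
this patch with the two notify calls dropped, relying on the already-enabled
interrupt to update the completed state:

static int __i915_spin_request(struct drm_i915_gem_request *req)
{
	unsigned long timeout = jiffies + 1;

	while (!need_resched()) {
		/* The irq handler runs i915_gem_request_notify() for us,
		 * so polling the completed state here is enough. */
		if (i915_gem_request_completed(req))
			return 0;

		if (time_after_eq(jiffies, timeout))
			break;

		cpu_relax_lowlatency();
	}

	if (i915_gem_request_completed(req))
		return 0;

	return -EBUSY;
}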
>>
>>> if (i915_gem_request_completed(req))
>>> return 0;
>>>
>>> @@ -1173,6 +1175,9 @@ static int __i915_spin_request(struct drm_i915_gem_request *req)
>>>
>>> cpu_relax_lowlatency();
>>> }
>>> +
>>> + i915_gem_request_notify(req->ring, false);
>>> +
>>> if (i915_gem_request_completed(req))
>>> return 0;
>>>
>>> @@ -1214,9 +1219,14 @@ int __i915_wait_request(struct drm_i915_gem_request *req,
>>>
>>> WARN(!intel_irqs_enabled(dev_priv), "IRQs disabled");
>>>
>>> - if (list_empty(&req->list))
>>> + if (i915_gem_request_completed(req))
>>> return 0;
>>>
>>> + /*
>>> + * Enable interrupt completion of the request.
>>> + */
>>> + fence_enable_sw_signaling(&req->fence);
>>> +
>>
>> There is duplicated user interrupt handling in this function now. Code
>> which does irq get/put directly is still there, but I think it should
>> be removed.
> Not sure exactly what you think is duplicated? The fence function just
> calls back to the i915 code which sets stuff up for interrupt signalling
> (if it hasn't already been done). If the fence is already set up then
> the call is a no-op.
__i915_wait_request does explicit ring->irq_get and irq_put calls after you
have already enabled the interrupt via fence_enable_sw_signaling, so it is
redundant in that respect.
You also probably need to fix up the magical fault injection code which uses
dev_priv->gpu_error.test_irq_rings.
>
>>
>>> if (i915_gem_request_completed(req))
>>> return 0;
>>>
>>> @@ -1377,6 +1387,19 @@ static void i915_gem_request_retire(struct drm_i915_gem_request *request)
>>> list_del_init(&request->list);
>>> i915_gem_request_remove_from_client(request);
>>>
>>> + /* In case the request is still in the signal pending list */
>>
>> When would this happen? (comment hint)
>
> TDR, pre-emption, anything else which can abort a request/batch buffer
> before it has completed. Will update the comment.
>
>>
>>> + if (!list_empty(&request->signal_link)) {
>>> + /*
>>> + * The request must be marked as cancelled and the underlying
>>> + * fence as both failed. NB: There is no explicit fence fail
>>> + * API, there is only a manual poke and signal.
>>> + */
>>> + request->cancelled = true;
>>> + /* How to propagate to any associated sync_fence??? */
>>> + request->fence.status = -EIO;
>>> + fence_signal_locked(&request->fence);
>>
>> Does the cancelled request have to stay on the signal list? If it could
>> be moved off the irq hot list it might be beneficial for list
>> length and would simplify the code in i915_gem_request_notify.
> The point of leaving it on the list is to get all the other processing
> without duplicating any code. E.g. later on the scheduler hooks into the
> notify function as it needs to know when each request has completed.
I was thinking along the lines of keeping the signal_list as short as
possible, so that as little as possible is done in the expensive irq handler.
>
>>
>>> + }
>>> +
>>> i915_gem_request_unreference(request);
>>> }
>>>
>>> @@ -2535,6 +2558,12 @@ void __i915_add_request(struct drm_i915_gem_request *request,
>>> */
>>> request->postfix = intel_ring_get_tail(ringbuf);
>>>
>>> + /*
>>> + * Add the fence to the pending list before emitting the commands to
>>> + * generate a seqno notification interrupt.
>>> + */
>>> + i915_gem_request_submit(request);
>>> +
>>> if (i915.enable_execlists)
>>> ret = ring->emit_request(request);
>>> else {
>>> @@ -2654,25 +2683,135 @@ static void i915_gem_request_free(struct drm_i915_gem_request *req)
>>> i915_gem_context_unreference(ctx);
>>> }
>>>
>>> + if (req->irq_enabled)
>>> + req->ring->irq_put(req->ring);
>>> +
>>> kmem_cache_free(req->i915->requests, req);
>>> }
>>>
>>> -static bool i915_gem_request_enable_signaling(struct fence *req_fence)
>>> +/*
>>> + * The request is about to be submitted to the hardware so add the fence to
>>> + * the list of signalable fences.
>>> + *
>>> + * NB: This does not necessarily enable interrupts yet. That only occurs on
>>> + * demand when the request is actually waited on. However, adding it to the
>>> + * list early ensures that there is no race condition where the interrupt
>>> + * could pop out prematurely and thus be completely lost. The race is merely
>>> + * that the interrupt must be manually checked for after being enabled.
>>> + */
>>> +void i915_gem_request_submit(struct drm_i915_gem_request *req)
>>> {
>>> - /* Interrupt driven fences are not implemented yet.*/
>>> - WARN(true, "This should not be called!");
>>> - return true;
>>> + unsigned long flags;
>>> +
>>> + /*
>>> + * Always enable signal processing for the request's fence object
>>> + * before that request is submitted to the hardware. Thus there is no
>>> + * race condition whereby the interrupt could pop out before the
>>> + * request has been added to the signal list. Hence no need to check
>>> + * for completion, undo the list add and return false.
>>> + */
>>> + i915_gem_request_reference(req);
>>> + spin_lock_irqsave(&req->ring->fence_lock, flags);
>>
>> This only gets called from non-irq context so could use spin_lock_irq.
> The fence_lock is acquired from within i915_gem_request_notify() which
> is called within the IRQ handler and therefore at IRQ context.
Correct, fence_lock needs to be irq safe, but i915_gem_request_submit is
only called from process context, no?
So you don't need to use spin_lock_irqsave; the cheaper spin_lock_irq is
enough.
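I.e. just something like this (sketch of the same list add from below, only
with the lighter lock variant - it still disables interrupts so the lock
stays safe against the irq handler, it just skips saving/restoring flags):

	/* Process context only, so no need to save/restore irq flags: */
	spin_lock_irq(&req->ring->fence_lock);
	WARN_ON(!list_empty(&req->signal_link));
	list_add_tail(&req->signal_link, &req->ring->fence_signal_list);
	spin_unlock_irq(&req->ring->fence_lock);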
>>
>>> + WARN_ON(!list_empty(&req->signal_link));
>>> + list_add_tail(&req->signal_link, &req->ring->fence_signal_list);
>>> + spin_unlock_irqrestore(&req->ring->fence_lock, flags);
>>> +
>>> + /*
>>> + * NB: Interrupts are only enabled on demand. Thus there is still a
>>> + * race where the request could complete before the interrupt has
>>> + * been enabled. Thus care must be taken at that point.
>>> + */
>>> +
>>> + /* Have interrupts already been requested? */
>>> + if (req->signal_requested)
>>
>> Can some code path enable signalling before the execbuf has completed?
> Once the scheduler has arrived, a wait request call can be made long
> before the request has been submitted to the hardware. Likewise, if a
> userland fence is requested for the batch buffer, it is allocated
> before execution begins and the internals of the fence code
> automatically enable signalling.
I forget exactly what I meant there, but it looked strange and still does.
I suppose the question is why it is not sufficient to enable interrupts
in enable_signaling?
>
>>> + i915_gem_request_enable_interrupt(req, false);
>>> +}
>>> +
>>> +/*
>>> + * The request is being actively waited on, so enable interrupt based
>>> + * completion signalling.
>>> + */
>>> +void i915_gem_request_enable_interrupt(struct drm_i915_gem_request *req,
>>> + bool fence_locked)
>>> +{
>>> + if (req->irq_enabled)
>>> + return;
>>> +
>>> + WARN_ON(!req->ring->irq_get(req->ring));
>>> + req->irq_enabled = true;
>>
>> Probably shouldn't set the flag if irq_get failed, so as not to unbalance
>> the irq refcount.
> Will add an if around the WARN.
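Right, something along these lines should do it (untested sketch):

	/* Only mark interrupts as enabled if irq_get actually succeeded,
	 * otherwise a later irq_put would unbalance the refcount: */
	if (!WARN_ON(!req->ring->irq_get(req->ring)))
		req->irq_enabled = true;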
>
>>
>>> +
>>> + /*
>>> + * Because the interrupt is only enabled on demand, there is a race
>>> + * where the interrupt can fire before anyone is looking for it. So
>>> + * do an explicit check for missed interrupts.
>>> + */
>>> + i915_gem_request_notify(req->ring, fence_locked);
>>> }
>>>
>>> -static bool i915_gem_request_is_completed(struct fence *req_fence)
>>> +static bool i915_gem_request_enable_signaling(struct fence *req_fence)
>>> {
>>> struct drm_i915_gem_request *req = container_of(req_fence,
>>> typeof(*req), fence);
>>> +
>>> + /*
>>> + * No need to actually enable interrupt based processing until the
>>> + * request has been submitted to the hardware. At which point
>>> + * 'i915_gem_request_submit()' is called. So only really enable
>>> + * signalling in there. Just set a flag to say that interrupts are
>>> + * wanted when the request is eventually submitted. On the other hand
>>> + * if the request has already been submitted then interrupts do need
>>> + * to be enabled now.
>>> + */
>>> +
>>> + req->signal_requested = true;
>>> +
>>> + if (!list_empty(&req->signal_link))
>>> + i915_gem_request_enable_interrupt(req, true);
>>> +
>>> + return true;
>>> +}
>>> +
>>> +void i915_gem_request_notify(struct intel_engine_cs *ring, bool fence_locked)
>>
>> I think the name could be improved since it made me expect a request in
>> the parameter list.
>>
>> Maybe i915_gem_notify_requests ?
> The idea behind the naming is that all the request code is prefixed with
> 'i915_gem_request_'. This is the notify function of that set of interfaces.
>
>>
>>> +{
>>> + struct drm_i915_gem_request *req, *req_next;
>>> + unsigned long flags;
>>> u32 seqno;
>>>
>>> - seqno = req->ring->get_seqno(req->ring, false/*lazy_coherency*/);
>>> + if (list_empty(&ring->fence_signal_list))
>>> + return;
>>> +
>>> + if (!fence_locked)
>>> + spin_lock_irqsave(&ring->fence_lock, flags);
>>> +
>>> + seqno = ring->get_seqno(ring, false);
>>
>> Could read the seqno outside the spinlock, although not that important
>> since the primary caller of this is (or should be) the irq handler.
> That fix is already done in patch #13 of the new series.
>
>>
>>> +
>>> + list_for_each_entry_safe(req, req_next, &ring->fence_signal_list, signal_link) {
>>> + if (!req->cancelled) {
>>> + if (!i915_seqno_passed(seqno, req->seqno))
>>> + break;
>>> + }
>>> +
>>> + /*
>>> + * Start by removing the fence from the signal list otherwise
>>> + * the retire code can run concurrently and get confused.
>>> + */
>>> + list_del_init(&req->signal_link);
>>> +
>>> + if (!req->cancelled) {
>>> + fence_signal_locked(&req->fence);
>>> + }
>>> +
>>> + if (req->irq_enabled) {
>>> + req->ring->irq_put(req->ring);
>>> + req->irq_enabled = false;
>>> + }
>>> +
>>> + /* Can't unreference here because that might grab fence_lock */
>>> + list_add_tail(&req->unsignal_link, &ring->fence_unsignal_list);
>>> + }
>>>
>>> - return i915_seqno_passed(seqno, req->seqno);
>>> + if (!fence_locked)
>>> + spin_unlock_irqrestore(&ring->fence_lock, flags);
>>> }
>>>
>>> static const char *i915_gem_request_get_driver_name(struct fence *req_fence)
>>> @@ -2712,7 +2851,6 @@ static void i915_gem_request_fence_value_str(struct fence *req_fence, char *str,
>>>
>>> static const struct fence_ops i915_gem_request_fops = {
>>> .enable_signaling = i915_gem_request_enable_signaling,
>>> - .signaled = i915_gem_request_is_completed,
>>> .wait = fence_default_wait,
>>> .release = i915_gem_request_release,
>>> .get_driver_name = i915_gem_request_get_driver_name,
>>> @@ -2795,6 +2933,7 @@ int i915_gem_request_alloc(struct intel_engine_cs *ring,
>>> goto err;
>>> }
>>>
>>> + INIT_LIST_HEAD(&req->signal_link);
>>> fence_init(&req->fence, &i915_gem_request_fops, &ring->fence_lock,
>>> ctx->engine[ring->id].fence_timeline.fence_context,
>>> i915_fence_timeline_get_next_seqno(&ctx->engine[ring->id].fence_timeline));
>>>
>>> @@ -2832,6 +2971,11 @@ void i915_gem_request_cancel(struct drm_i915_gem_request *req)
>>> {
>>> intel_ring_reserved_space_cancel(req->ringbuf);
>>>
>>> + req->cancelled = true;
>>> + /* How to propagate to any associated sync_fence??? */
>>> + req->fence.status = -EINVAL;
>>> + fence_signal_locked(&req->fence);
>>
>> Same question as before: could you just move it to the unsignal list
>> straight away (and drop interrupts)?
> Again, it seemed a much simpler driver if the request completion code is
> only kept inside the notify function and not replicated in multiple places.
>
>>
>>> +
>>> i915_gem_request_unreference(req);
>>> }
>>>
>>> @@ -2925,6 +3069,13 @@ static void i915_gem_reset_ring_cleanup(struct drm_i915_private *dev_priv,
>>> i915_gem_request_retire(request);
>>> }
>>>
>>> + /*
>>> + * Tidy up anything left over. This includes a call to
>>> + * i915_gem_request_notify() which will make sure that any requests
>>> + * that were on the signal pending list get also cleaned up.
>>> + */
>>> + i915_gem_retire_requests_ring(ring);
>>> +
>>> /* Having flushed all requests from all queues, we know that all
>>> * ringbuffers must now be empty. However, since we do not reclaim
>>> * all space when retiring the request (to prevent HEADs colliding
>>> @@ -2974,6 +3125,13 @@ i915_gem_retire_requests_ring(struct intel_engine_cs *ring)
>>>
>>> WARN_ON(i915_verify_lists(ring->dev));
>>>
>>> + /*
>>> + * If no-one has waited on a request recently then interrupts will
>>> + * not have been enabled and thus no requests will ever be marked as
>>> + * completed. So do an interrupt check now.
>>> + */
>>> + i915_gem_request_notify(ring, false);
>>> +
>>> /* Retire requests first as we use it above for the early return.
>>> * If we retire requests last, we may use a later seqno and so clear
>>> * the requests lists without clearing the active list, leading to
>>> @@ -3015,6 +3173,15 @@ i915_gem_retire_requests_ring(struct intel_engine_cs *ring)
>>> i915_gem_request_assign(&ring->trace_irq_req, NULL);
>>> }
>>>
>>> + /* Tidy up any requests that were recently signalled */
>>> + spin_lock_irqsave(&ring->fence_lock, flags);
>>
>> Could also use spin_lock_irq here I think.
>>
>>> + list_splice_init(&ring->fence_unsignal_list, &list_head);
>>> + spin_unlock_irqrestore(&ring->fence_lock, flags);
>>> + list_for_each_entry_safe(req, req_next, &list_head, unsignal_link) {
>>> + list_del(&req->unsignal_link);
>>> + i915_gem_request_unreference(req);
>>> + }
>>> +
>>> /* Really free any requests that were recently unreferenced */
>>> spin_lock_irqsave(&ring->delayed_free_lock, flags);
>>> list_splice_init(&ring->delayed_free_list, &list_head);
>>> @@ -5066,6 +5233,8 @@ init_ring_lists(struct intel_engine_cs *ring)
>>> {
>>> INIT_LIST_HEAD(&ring->active_list);
>>> INIT_LIST_HEAD(&ring->request_list);
>>> + INIT_LIST_HEAD(&ring->fence_signal_list);
>>> + INIT_LIST_HEAD(&ring->fence_unsignal_list);
>>> INIT_LIST_HEAD(&ring->delayed_free_list);
>>> }
>>>
>>> diff --git a/drivers/gpu/drm/i915/i915_irq.c b/drivers/gpu/drm/i915/i915_irq.c
>>> index 68b094b..74f8552 100644
>>> --- a/drivers/gpu/drm/i915/i915_irq.c
>>> +++ b/drivers/gpu/drm/i915/i915_irq.c
>>> @@ -981,6 +981,8 @@ static void notify_ring(struct intel_engine_cs *ring)
>>>
>>> trace_i915_gem_request_notify(ring);
>>>
>>> + i915_gem_request_notify(ring, false);
>>> +
>>> wake_up_all(&ring->irq_queue);
>>
>> This is the big one - I think to solve the thundering herd problem
>> this patch should also include the removal of ring->irq_queue since I
>> don't think it needs to exist any more after it.
>>
>> For example adding a req->wait_queue on which __i915_wait_request
>> would wait instead and doing a wake_up_all on it from
>> i915_gem_request_notify?
>>
>> Because in this form, when I ran this patch series it caused
>> the number of context switches to go up more than six times on one
>> workload. Interrupts also went up but only by 50% in comparison and
>> time spent in irq handlers doubled.
>>
>> But context switches and 10% more cpu time look the most dramatic.
>>
>> So it would be interesting to see if simply moving the wait queue to
>> be per request could fix the majority of that or not.
>
> This idea was already posted in one version of the thundering herd thread.
> I haven't implemented it yet as I haven't had the time to look into
> exactly what is required. But yes, in theory the fence code should mean
> the irq_queue is obsolete. The waiters can simply register a callback on
> the request's underlying fence object. When the interrupt handler
> signals the fence, that fence will process its callback list and wake
> the waiter. All nicely targeted and no-one gets woken unnecessarily.
Cool, I think we need that, if nothing else for the 6x more context switches
per second this patch causes (13k vs 2k for pure nightly with the NBody
test, which is good for stressing wait request, interrupts and busy
spinning).
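Just to sketch what I imagine the waiter side could look like with the fence
callback approach - completely untested, and the i915_fence_wait /
i915_fence_wake names are made up purely for illustration:

struct i915_fence_wait {
	struct fence_cb cb;
	struct task_struct *task;
};

static void i915_fence_wake(struct fence *fence, struct fence_cb *cb)
{
	struct i915_fence_wait *wait = container_of(cb, typeof(*wait), cb);

	/* Only the task waiting on this particular request gets woken. */
	wake_up_process(wait->task);
}

/* In __i915_wait_request(), instead of sleeping on ring->irq_queue: */
	struct i915_fence_wait wait = { .task = current };
	int ret = 0;

	if (fence_add_callback(&req->fence, &wait.cb, i915_fence_wake))
		return 0;	/* fence already signalled */

	for (;;) {
		/* 'interruptible' is the existing wait_request flag */
		set_current_state(interruptible ? TASK_INTERRUPTIBLE :
						  TASK_UNINTERRUPTIBLE);

		if (i915_gem_request_completed(req))
			break;

		if (interruptible && signal_pending(current)) {
			ret = -ERESTARTSYS;
			break;
		}

		schedule();
	}
	__set_current_state(TASK_RUNNING);
	fence_remove_callback(&req->fence, &wait.cb);

That way only the task actually waiting on that particular request gets
woken from the notify/signal path, instead of every sleeper on
ring->irq_queue.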
Regards,
Tvrtko