[PATCH 1/2] drm/i915: Seal races between async GPU cancellation, retirement and signaling

Chris Wilson chris at chris-wilson.co.uk
Wed May 8 20:30:42 UTC 2019


Quoting Daniel Vetter (2019-05-08 13:53:30)
> On Wed, May 8, 2019 at 2:06 PM Chris Wilson <chris at chris-wilson.co.uk> wrote:
> >
> > Currently there is an underlying assumption that i915_request_unsubmit()
> > is synchronous wrt the GPU -- that is the request is no longer in flight
> > as we remove it. In the near future that may change, and this may upset
> > our signaling as we can process an interrupt for that request while it
> > is no longer in flight.
> >
> > CPU0                                    CPU1
> > intel_engine_breadcrumbs_irq
> > (queue request completion)
> >                                         i915_request_cancel_signaling
> > ...                                     ...
> >                                         i915_request_enable_signaling
> > dma_fence_signal
> >
> > Hence in the time it took us to drop the lock to signal the request, a
> > preemption event may have occurred and re-queued the request. In the
> > process, that request would have seen I915_FENCE_FLAG_SIGNAL clear and
> > so reused the rq->signal_link that was in use on CPU0, leading to bad
> > pointer chasing in intel_engine_breadcrumbs_irq.
> >
> > A related issue was that if someone started listening for a signal on a
> > completed but no longer in-flight request, we missed the opportunity to
> > immediately signal that request.
> >
> > Furthermore, as intel_contexts may be immediately released during
> > request retirement, in order to be entirely sure that
> > intel_engine_breadcrumbs_irq may no longer dereference the intel_context
> > (ce->signals and ce->signal_link), we must wait for the irq spinlock.
> >
> > In order to prevent the race, we use a bit in the fence.flags to signal
> > the transfer onto the signal list inside intel_engine_breadcrumbs_irq.
> > For simplicity, we use the DMA_FENCE_FLAG_SIGNALED_BIT as it then
> > quickly signals to any outside observer that the fence is indeed signaled.
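
Concretely, the guarded transfer looks roughly like this (a sketch of the
idea only, not the exact hunk; the helper name and the list handling are
illustrative):

/* runs under the engine's breadcrumb irq lock */
static void signal_completed_requests(struct intel_context *ce,
                                      struct list_head *signal)
{
        struct i915_request *rq, *rn;

        list_for_each_entry_safe(rq, rn, &ce->signals, signal_link) {
                if (!i915_request_completed(rq))
                        break;

                /* no longer tracking this breadcrumb */
                clear_bit(I915_FENCE_FLAG_SIGNAL, &rq->fence.flags);

                /*
                 * Claim the fence: only the winner of the
                 * test_and_set_bit() may reuse rq->signal_link, so a
                 * concurrent dma_fence_signal() from retirement, or a
                 * re-enabling of signaling after preemption, can no
                 * longer touch the link underneath us.
                 */
                if (test_and_set_bit(DMA_FENCE_FLAG_SIGNALED_BIT,
                                     &rq->fence.flags))
                        continue; /* whoever signaled it owns the link */

                i915_request_get(rq); /* keep rq alive past retirement */
                list_move_tail(&rq->signal_link, signal);
        }
}

The requests collected on the local list are only signaled (timestamp set,
callbacks run) after the engine lock has been dropped.
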
> >
> > v2: Sketch out potential dma-fence API for manual signaling
> > v3: And the test_and_set_bit()
> >
> > Fixes: 52c0fdb25c7c ("drm/i915: Replace global breadcrumbs with per-context interrupt tracking")
> > Signed-off-by: Chris Wilson <chris at chris-wilson.co.uk>
> > Cc: Tvrtko Ursulin <tvrtko.ursulin at intel.com>
> > Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin at intel.com>
> 
> Ok, chatted a bit with Chris on irc, and we already allow drivers to
> pass whatever spinlock is suitable for their request/fence
> handling/retiring, so allowing all of this to be split up makes sense,
> I think. Has my ack on the approach.
> 
> I think it'd be good to spell out the optimization this allows
> (synchronous cleanup instead of refcounting all. the. things.) plus
> showcase the link between the fence->lock pointer and the split-up
> __dma_signal functions in the kerneldoc. Definitely for the core
> patch.
> 
> Also not sure why __notify and __timestamp have a double underscore in
> the middle. That color choice confuses me a bit :-)

I like it for subphases. The overall action here is still the dma-fence
'signal', but now we are doing the 'set timestamp', 'notify callbacks'
etc. Otherwise we gain a subject for the verb: 'signal_notify' reads as
'go and signal the notify' rather than 'do the notifications'; for me
the '__' breaks that association. Maybe dma_fence_signal_do_notifies.
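
For reference, the usage I have in mind is roughly the following (a sketch
only; the exact names and signatures of the split-up helpers are precisely
what we are discussing here):

        /*
         * After dropping the engine lock: 'signal' is the local list of
         * requests that won the test_and_set_bit() on
         * DMA_FENCE_FLAG_SIGNALED_BIT in the irq handler, as sketched
         * above.
         */
        struct i915_request *rq, *rn;
        ktime_t timestamp = ktime_get();

        list_for_each_entry_safe(rq, rn, &signal, signal_link) {
                list_del(&rq->signal_link);

                __dma_fence_signal__timestamp(&rq->fence, timestamp);

                /* fence->lock for an i915 request is rq->lock */
                spin_lock(&rq->lock);
                __dma_fence_signal__notify(&rq->fence);
                spin_unlock(&rq->lock);

                i915_request_put(rq);
        }
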
-Chris

