[PATCH 1/2] drm/i915: Seal races between async GPU cancellation, retirement and signaling
Daniel Vetter
daniel at ffwll.ch
Thu May 9 07:59:29 UTC 2019
On Wed, May 08, 2019 at 09:30:42PM +0100, Chris Wilson wrote:
> Quoting Daniel Vetter (2019-05-08 13:53:30)
> > On Wed, May 8, 2019 at 2:06 PM Chris Wilson <chris at chris-wilson.co.uk> wrote:
> > >
> > > Currently there is an underlying assumption that i915_request_unsubmit()
> > > is synchronous wrt the GPU -- that is the request is no longer in flight
> > > as we remove it. In the near future that may change, and this may upset
> > > our signaling as we can process an interrupt for that request while it
> > > is no longer in flight.
> > >
> > > CPU0 CPU1
> > > intel_engine_breadcrumbs_irq
> > > (queue request completion)
> > > i915_request_cancel_signaling
> > > ... ...
> > > i915_request_enable_signaling
> > > dma_fence_signal
> > >
> > > Hence in the time it took us to drop the lock to signal the request, a
> > > preemption event may have occurred and re-queued the request. In the
> > > process, that request would have seen I915_FENCE_FLAG_SIGNAL clear and
> > > so reused the rq->signal_link that was in use on CPU0, leading to bad
> > > pointer chasing in intel_engine_breadcrumbs_irq.
> > >
> > > A related issue was that if someone started listening for a signal on a
> > > completed but no longer in-flight request, we missed the opportunity to
> > > immediately signal that request.
> > >
> > > Furthermore, as intel_contexts may be immediately released during
> > > request retirement, in order to be entirely sure that
> > > intel_engine_breadcrumbs_irq may no longer dereference the intel_context
> > > (ce->signals and ce->signal_link), we must wait for irq spinlock.
> > >
> > > In order to prevent the race, we use a bit in the fence.flags to signal
> > > the transfer onto the signal list inside intel_engine_breadcrumbs_irq.
> > > For simplicity, we use the DMA_FENCE_FLAG_SIGNALED_BIT as it then
> > > quickly signals to any outside observer that the fence is indeed signaled.
> > >
> > > v2: Sketch out potential dma-fence API for manual signaling
> > > v3: And the test_and_set_bit()
> > >
> > > Fixes: 52c0fdb25c7c ("drm/i915: Replace global breadcrumbs with per-context interrupt tracking")
> > > Signed-off-by: Chris Wilson <chris at chris-wilson.co.uk>
> > > Cc: Tvrtko Ursulin <tvrtko.ursulin at intel.com>
> > > Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin at intel.com>
> >
> > Ok chatted a bit with Chris on irc, and we already allow drivers to
> > pass whatever spinlock is suitable for their request/fence
> > handling/retiring, so allowing to split up this all makes sense I
> > think. Has my ack on the approach.
> >
> > I think it'd be good to spell out the optimization this allows
> > (synchronous cleanup instead of refcounting all. the. things.) plus
> > showcase the link between the fence->lock pointer and the split-up
> > __dma_signal functions in the kerneldoc. Definitely for the core
> > patch.
> >
> > Also not sure why __notify and __timestamp has a double underscore in
> > the middle. That color choice confuses me a bit :-)
>
> I like it for subphases. The overall action here is still the dma-fence
> 'signal', but now we are doing the 'set timestamp', 'notify callbacks'
> etc. Otherwise we gain a subject to the verb, 'signal_notify' which says
> to go and signal the notify, rather than do the notifications; for me
> the '__' breaks the association. Maybe dma_fence_signal_do_notifies.
at the cost of 2 more letters in already long function names, I think
_do_step1, _do_step2 are clearer ...
Aside: Could we merge the timestampe and do_notify steps together, maybe
with the spinlock in there? I think materially it doesn't matter whether
we set the timestampe before or in the spinlock protected section, as long
as we don't set it afterwards. And would simplify the interface a bit.
-Daniel
--
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
More information about the dri-devel
mailing list