[Intel-gfx] [PATCH] drm/i915: Close race between processing unpin task and queueing the flip

Sun Dec 2 10:26:01 CET 2012

On Sun, 2 Dec 2012 02:15:23 +0100, Daniel Vetter <daniel at ffwll.ch> wrote:
> On Sat, Dec 1, 2012 at 11:32 PM, Chris Wilson <chris at chris-wilson.co.uk> wrote:
> > On Sat, 1 Dec 2012 21:35:21 +0100, Daniel Vetter <daniel at ffwll.ch> wrote:
> >> On Sat, Dec 01, 2012 at 05:48:50PM +0000, Chris Wilson wrote:
> >> > Before queuing the flip but crucially after attaching the unpin-work to
> >> > the crtc, we continue to set up the unpin-work. However, should the
> >> > hardware fire early, we see the connected unpin-work and queue the task.
> >> > The task then promptly runs and unpins the fb before we finish taking
> >> > the required references or even pinning it... Havoc.
> >> >
> >> > To close the race, we use the flip-pending atomic to indicate when the
> >> > flip is finally set up and enqueued. So during the flip-done processing,
> >> > we can check more accurately whether the flip was expected.
> >> >
> >> > Signed-off-by: Chris Wilson <chris at chris-wilson.co.uk>
> >>
> >> Hm, can't this logic race?
> >>
> >> - emit the MI_FLIP
> >>
> >> - flip irq happens because the gpu is idle and completes it right away
> >> (or our thread is preempted), work->pending increments from 0 -> 1
> >>
> >> - queue_flip sets work->pending to 1
> >
> > -> write RING_TAIL, flush the commands to CS, begin execution of MI_FLIP
> 
> Yeah, that should be the normal course of events where the MI_FLIP
> gets executed after we set work->pending to 1 (and after all the stuff
> has been done). The race I see is that the real MI_FLIP (not a
> spurious one this patch defends against) happens before we set
> work->pending to 1, so that we essentially lose the increment to 2 and
> so block any further flips on this crtc (or modesets for that matter,
> once the finish_fb stuff is fixed) indefinitely.
> 
> Iow I think it's a bit too good at preventing unpins ;-)

There isn't a race with hardware. So are you concerned about the write
ordering, and so want some smp_mb()?
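
The ordering Daniel describes above can be modelled in a few lines of
userspace C. This is only a sketch using C11 atomics, not the i915 code:
struct flip_work and every function name here are hypothetical, and the
"counter must reach 2 before unpinning" rule is an assumption read off
the description in this thread. It runs the two steps sequentially in
both orders, so it shows the lost increment and nothing about memory
barriers.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the unpin work; only the counter is modelled. */
struct flip_work {
	atomic_int pending;
};

void flip_work_init(struct flip_work *work)
{
	atomic_init(&work->pending, 0);
}

/* Software side: mark the flip as set up and queued (plain store of 1). */
void queue_flip_mark_pending(struct flip_work *work)
{
	atomic_store(&work->pending, 1);
}

/* Interrupt side: record a flip-done event by incrementing the counter. */
void flip_done_irq(struct flip_work *work)
{
	atomic_fetch_add(&work->pending, 1);
}

/* Assumed rule: the unpin task may run only once the counter reaches 2. */
bool unpin_allowed(struct flip_work *work)
{
	return atomic_load(&work->pending) >= 2;
}

/* Expected order: mark pending first, then the flip-done event arrives. */
bool simulate_expected_order(void)
{
	struct flip_work work;

	flip_work_init(&work);
	assert(!unpin_allowed(&work));   /* nothing queued yet */
	queue_flip_mark_pending(&work);  /* 0 -> 1 */
	flip_done_irq(&work);            /* 1 -> 2 */
	return unpin_allowed(&work);
}

/* Early event: the increment lands before the store and is overwritten. */
bool simulate_early_irq(void)
{
	struct flip_work work;

	flip_work_init(&work);
	flip_done_irq(&work);            /* 0 -> 1 */
	queue_flip_mark_pending(&work);  /* stores 1, the increment is lost */
	return unpin_allowed(&work);
}
```

In this model simulate_expected_order() ends with the unpin allowed,
while simulate_early_irq() leaves the counter stuck at 1, which is the
"blocks any further flips" outcome raised above.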

> > I'm not happy with the explanation, but I could reliably (100%) hit the
> > race whilst loading a 2+GiB image using eog under compiz on a 965gm
> > with only 2GiB of RAM. As soon as it hit kswapd, the system would OOPS
> > with an unpin leak. Which means that there was a flip pending/done prior
> > to the pinning + MI_FLIP. This patch adds a strong defence against that
> > spurious flip done, but doesn't explain where it came from.
> 
> Hm, I have no idea how that could cause the spurious flip - the most
> likely cause is that something introduces a nice delay somewhere
> (through kswapd), but I don't really see how that can happen. I guess
> I need to write a flip vs. swapping test. Was the swap due to
> unrelated memory pressure, or due to our own gem objects?

eog starts swapping long before it sends the image to X, but at the same
time it continues to render its progress bar.
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre