[Intel-gfx] [PATCH] drm/i915: Wait for completion of pending flips when starved of fences

Mon Jan 20 11:37:42 CET 2014

On Mon, Jan 20, 2014 at 09:49:24AM +0000, Chris Wilson wrote:
> On Sun, Jan 19, 2014 at 10:55:26PM +0100, Daniel Vetter wrote:
> > On Sun, Jan 19, 2014 at 10:20 PM, Chris Wilson <chris at chris-wilson.co.uk> wrote:
> > > On older generations (gen2, gen3) the GPU requires fences for many
> > > operations, such as blits. The display hardware also requires fences for
> > > scanouts and this leads to a situation where an arbitrary number of
> > > fences may be pinned by old scanouts following a pageflip but before we
> > > have executed the unpin workqueue. This is unpredictable by userspace
> > > and leads to random EDEADLK when submitting an otherwise benign
> > > execbuffer. However, we can detect when we have an outstanding flip and
> > > so cause userspace to wait upon their completion before finally
> > > declaring that the system is starved of fences. This is really no worse
> > > than forcing the GPU to stall waiting for older execbuffer to retire and
> > > release their fences before we can reallocate them for the next
> > > execbuffer.
> > >
> > > Reported-and-tested-by: dimon at gmx.net
> > > Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=73696
> > > Signed-off-by: Chris Wilson <chris at chris-wilson.co.uk>
> > 
> > New subtest for kms_flip which submits such a blt buffer while a
> > pageflip is still pending?
> 
> Correct.
> 
> > Also there's a certain chance we'll starve
> > the unpin work, similar to the issues around flushing the unpin work
> > in our pageflip implementation.
> 
> If you mean that we will never run the unpin workqueue, that's what the
> implementation will fix, eventually, after a busy-spin in userspace since
> set_need_resched() was removed. I can teach userspace to yield() after
> an EAGAIN which seems a reasonable compromise (userspace gets a bonus
> for being cooperative rather than penalized for using up its timeslice.)

yield won't help, we need to block on the work-queue draining like we do
in the pageflip code with flush_workqueue. At least we've had bug reports
in the past where someone found it intriguing to run his entire userspace
with rt prio, which ended up starving the sched_normal workqueue and so
livelocked the entire system.

Instead of busy-looping through userspace with -EAGAIN I think we should
keep all the unpin works on a spinlock-protected list and synchronously
unpin the buffers in the get_fence and evict_something paths (after the
flip completed, we've removed the unpin entry from the list and dropped
the spinlock ofc).

The only downside is that we have a notch more complexity since we need to
manually check for gpu hangs and bail out correctly if there is one. Which
means another kms_flip subtest, but that shouldn't be too much fuzz with
the combinatorial testflags we already have.

Since we don't have a test where rt threads starve our workers for the
normal pageflip code I think we can eshew that part here, too. I'll add it
to the i-g-t wishlist though for a rainy afternoon ;-)

Cheers, Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch