[Intel-gfx] [PATCH] drm/i915: Make vm eviction uninterruptible

Mon Apr 7 23:50:06 CEST 2014

On Mon, Apr 07, 2014 at 11:58:28AM -0700, Ben Widawsky wrote:
> On Mon, Apr 07, 2014 at 01:30:04PM +0100, Chris Wilson wrote:
> > On Mon, Apr 07, 2014 at 02:15:00PM +0200, Daniel Vetter wrote:
> > > On Mon, Apr 07, 2014 at 10:42:56AM +0100, Chris Wilson wrote:
> > > > On Sun, Apr 06, 2014 at 11:35:03AM -0700, Ben Widawsky wrote:
> > > > > On Sat, Apr 05, 2014 at 07:45:28PM -0700, Ben Widawsky wrote:
> > > > > > > The issue I was seeing appeared to be coming from sigkill. In such a case,
> > > > > > the process may want to die before the context/work/address space is
> > > > > > freeable. For example:
> > > > > > 1. evict_vm called for whatever reason
> > > > > > 2. wait_seqno because the VMA is still active
> > > > > 
> > > > > hmm something isn't right here. Why did I get to wait_seqno if pin_count
> > > > > was 0? Just FYI, this wasn't hypothetical. I did trace it all the way to
> > > > > exactly ERESTARTSYS from wait_seqno.
> > > > > 
> > > > > By the way, another option in evict would be:
> > > > > > while ((ret = i915_vma_unbind(vma)) == -ERESTARTSYS);
> > > > > WARN_ON(ret);
> > > > > 
> > > > > > > 3. receive signal, break out of wait_seqno
> > > > > > 4. return to evict_vm and the above WARN
> > > > > > 
> > > > > > Our error handling from there just spirals.
> > > > > > 
> > > > > > One issue I have with our current code is I'd really like eviction to
> > > > > > not be able to fail (obviously extreme cases are unavoidable).
> > > > 
> > > > This is unrealistic since we must support X which uses sigtimer.
> > > > 
> > > > > > Perhaps
> > > > > > one other solution would be to make sure the context is idled before
> > > > > > evicting its VM.
> > > > 
> > > > Indeed.
> > > > 
> > > > Anyway, I do concur that wrapping i915_driver_preclose() with
> > > > 
> > > > dev_priv->mm.interruptible = false;
> > > > 
> > > > would make us both happy.
> > > 
> > > Isn't the backtrace just fallout from the lifetime rules being a bit
> > > funny? We didn't uninterruptibly stall for any still active bo when the
> > > drm fd gets closed, why do we suddenly need to do that with ppgtts? Iirc
> > > requests hold a ref on the context, contexts hold a ref on the ppgtt and
> > > so the entire thing should only dissipate once it's really idle.
> > > 
> > > Imo just doing uninterruptible sleeps tastes way too much like duct-tape.
> > > I can be convinced of duct-tape if the tradeoffs really strongly suggest
> > > it's the right thing (e.g. the shrinker lock stealing, even though we've
> > > paid a hefty price in accidental complexity with that one), but that needs
> > > some good justification.
> > 
> > Yes, it is duct-tape. But it should be duct-tape against future unknown
> > bugs (and the currently known bugs) in that the i915_driver_preclose()
> > cannot report failure and so should not allow its callees to fail (which
> > is more or less the contract given by .interruptible=false).
> > 
> > The alternative is to allow preclose() to support an error-code, which
> > has the issue that very few programs check for errors during close() and
> > that EINTR from close() is frowned upon by most.
> > -Chris
> > 
> 
> Do we have consensus? I am good with Chris' idea. I can write and test
> the patch.

If we just make the waits uninterruptible we'll never see the bug reports
again, which is imo bad. And the while loop above will loop a bit too long.
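
To make both objections concrete, here is a minimal userspace sketch. It is
not i915 code: struct fake_vma, fake_vma_unbind() and the three evict_*()
helpers are hypothetical stand-ins, and the "one tick of GPU progress per
retry" rule is an assumption chosen only to keep the model deterministic.
It assumes a pending SIGKILL that never gets cleared, as in the scenario
described above.

```c
#include <stdbool.h>

#define ERESTARTSYS 512 /* value of the kernel-internal errno */

/* Hypothetical stand-in for a VMA the GPU is still using. */
struct fake_vma {
	int ticks_until_idle; /* GPU progress still needed before unbind works */
	bool signal_pending;  /* models a pending SIGKILL that is never cleared */
	int polls;            /* number of unbind attempts made */
};

/* Model of unbind: waits for idle, optionally interruptible by signals. */
static int fake_vma_unbind(struct fake_vma *vma, bool interruptible)
{
	vma->polls++;
	if (vma->ticks_until_idle == 0)
		return 0;
	if (interruptible && vma->signal_pending) {
		/* Assumption: the GPU advances one tick between attempts. */
		vma->ticks_until_idle--;
		return -ERESTARTSYS;
	}
	/* Uninterruptible: sleep until the GPU is idle, however long. */
	vma->ticks_until_idle = 0;
	return 0;
}

/* Current behaviour: one interruptible attempt, error reaches the caller. */
static int evict_once(struct fake_vma *vma)
{
	return fake_vma_unbind(vma, true);
}

/* The retry loop: the signal stays pending, so this busy-spins. */
static int evict_retry(struct fake_vma *vma)
{
	int ret;

	while ((ret = fake_vma_unbind(vma, true)) == -ERESTARTSYS)
		;
	return ret;
}

/* Uninterruptible wait: always succeeds, so nothing is ever reported. */
static int evict_uninterruptible(struct fake_vma *vma)
{
	return fake_vma_unbind(vma, false);
}
```

In this model evict_once() returns -ERESTARTSYS (the WARN we see today),
evict_retry() succeeds only after spinning once per remaining tick, and
evict_uninterruptible() returns 0 on the first call, which is exactly why
the underlying lifetime bug would stop showing up in reports.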

Adding a new special sleep mode which doesn't interrupt but WARNs badly if
any sleeping actually happens is also a bit tricky, and pretty much
guaranteed to blow up. I'm still for a "write testcase, wait for fix to
materialize" approach. Especially for upstream.

The testcase is especially important so that we can track it as a full
ppgtt validation criterion. I'll add this tomorrow as a task to the
relevant JIRA.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch