[Intel-gfx] ✗ Fi.CI.IGT: failure for drm/i915/userptr: Wrap mmu_notifier inside its own rw_semaphore

Tue Mar 27 07:01:17 UTC 2018

On Mon, Mar 26, 2018 at 09:08:33PM +0100, Chris Wilson wrote:
> Quoting Patchwork (2018-03-26 17:53:44)
> > Test gem_userptr_blits:
> >         Subgroup coherency-unsync:
> >                 pass       -> INCOMPLETE (shard-hsw)
> 
> Forgot that obj->userptr.mn may not exist.
> 
> >         Subgroup dmabuf-sync:
> >                 pass       -> DMESG-WARN (shard-hsw)
> 
> But this is the tricky lockdep one, warning of the recursion from gup
> into mmu_invalidate_range, i.e.
> 
> down_read(&i915_mmu_notifier->sem);
> down_read(&mm_struct->mmap_sem);
> 	gup();
> 		down_write(&i915_mmu_notifier->sem);
> 
> That seems a genuine deadlock... So I wonder how we managed to get a
> lockdep splat and not a dead machine. Maybe gup never triggers the
> recursion for our set of flags? Hmm.

Coffee starting to kick in. If we gup a range, it's likely the mm won't
evict that same range, but something else. I guess we'd need a really
huge userptr bo that can't fit entirely in memory to have a reliable
chance at triggering this. Would probably deadlock the box :-/

I think Jerome's recommendation is the sequence counter scheme from kvm
(mmu_notifier_seq/mmu_notifier_count checked via mmu_notifier_retry()),
plus retrying forever on the gup side. That would convert the same
deadlock into a livelock, but well, we can't have it all :-) And I think
once you've killed the task, the gup worker will hopefully realize it's
wasting time and give up.
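A rough sketch of that sequence-counter-plus-retry pattern, modelled loosely on kvm's seq/count pair but simulated in userspace. Everything here is hypothetical: `struct userptr_notifier`, `userptr_get_pages()` and `fake_gup()` are illustrative names, and a mutex stands in for whatever lock the real code would use. The key property is that the notifier side never waits on the gup worker, and the worker never holds the lock across gup; it snapshots the sequence number, pins pages unlocked, then commits only if no invalidation ran in between, otherwise it retries.

```c
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical notifier state, loosely modelled on kvm's seq/count pair. */
struct userptr_notifier {
    pthread_mutex_t lock;
    unsigned long seq;   /* bumped at the end of every invalidation */
    unsigned long count; /* invalidations currently in flight */
};

static void notifier_init(struct userptr_notifier *n)
{
    pthread_mutex_init(&n->lock, NULL);
    n->seq = 0;
    n->count = 0;
}

/* Notifier side: short critical sections, never waits for the gup worker. */
static void invalidate_range_start(struct userptr_notifier *n)
{
    pthread_mutex_lock(&n->lock);
    n->count++;
    /* ... tear down GPU access to the range here ... */
    pthread_mutex_unlock(&n->lock);
}

static void invalidate_range_end(struct userptr_notifier *n)
{
    pthread_mutex_lock(&n->lock);
    n->seq++;
    n->count--;
    pthread_mutex_unlock(&n->lock);
}

/* Caller holds n->lock: did an invalidation race with our gup? */
static bool notifier_retry(const struct userptr_notifier *n, unsigned long seq)
{
    return n->count != 0 || n->seq != seq;
}

/*
 * gup worker side: retry until a gup completes without a racing
 * invalidation. `gup` is a hypothetical stand-in for get_user_pages()
 * and is called WITHOUT n->lock held. Returns the number of attempts.
 */
static int userptr_get_pages(struct userptr_notifier *n,
                             void (*gup)(struct userptr_notifier *))
{
    for (int attempt = 1;; attempt++) {
        pthread_mutex_lock(&n->lock);
        unsigned long seq = n->seq;
        pthread_mutex_unlock(&n->lock);

        gup(n); /* may itself trigger invalidations: no recursion on n->lock */

        pthread_mutex_lock(&n->lock);
        if (!notifier_retry(n, seq)) {
            /* ... publish the pinned pages to the object here ... */
            pthread_mutex_unlock(&n->lock);
            return attempt;
        }
        pthread_mutex_unlock(&n->lock);
        /* ... release the now-stale pages, then loop (livelock risk) ... */
    }
}

/* Test helper: the next `races_left` gup calls race with an invalidation. */
static int races_left;

static void fake_gup(struct userptr_notifier *n)
{
    if (races_left > 0) {
        races_left--;
        invalidate_range_start(n);
        invalidate_range_end(n);
    }
}
```

The unbounded `for` loop is the livelock mentioned above; a real implementation would presumably also bail out when the owning task is being killed.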

For the kvm stuff, look at the #intel-gfx scrollback; we discussed all
the necessary bits there. Jerome also showed some new helpers that would
avoid the hand-rolling.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch