[Intel-gfx] [PATCH] drm/i915/userptr: Disallow wrapping GTT into a userptr

Tue Mar 7 11:18:53 UTC 2017

On Tue, Mar 07, 2017 at 11:03:03AM +0000, Tvrtko Ursulin wrote:
> 
> On 07/03/2017 09:13, Chris Wilson wrote:
> >On Tue, Mar 07, 2017 at 08:42:26AM +0000, Tvrtko Ursulin wrote:
> >>
> >>On 06/03/2017 15:36, Chris Wilson wrote:
> >>>If we allow the user to convert a GTT mmap address into a userptr, we
> >>>may end up in recursion hell, where currently we hit a mutex deadlock
> >>>but other possibilities include use-after-free during the
> >>>unbind/cancel_userptr.
> >>
> >>I thought we already disallowed that and indeed
> >>igt/gem_userptr_blits/invalid-gtt-mapping succeeds here - what am I
> >>missing?
> >
> >Michał presented this deadlock:
> >
> >[  143.203989] gem_userptr_bli D    0   902    898 0x00000000
> >[  143.204054] Call Trace:
> >[  143.204137]  __schedule+0x511/0x1180
> >[  143.204195]  ? pci_mmcfg_check_reserved+0xc0/0xc0
> >[  143.204274]  schedule+0x57/0xe0
> >[  143.204327]  schedule_timeout+0x383/0x670
> >[  143.204374]  ? trace_hardirqs_on_caller+0x187/0x280
> >[  143.204457]  ? trace_hardirqs_on_thunk+0x1a/0x1c
> >[  143.204507]  ? usleep_range+0x110/0x110
> >[  143.204657]  ? irq_exit+0x89/0x100
> >[  143.204710]  ? retint_kernel+0x2d/0x2d
> >[  143.204794]  ? trace_hardirqs_on_caller+0x187/0x280
> >[  143.204857]  ? _raw_spin_unlock_irq+0x33/0x60
> >[  143.204944]  wait_for_common+0x1f0/0x2f0
> >[  143.205006]  ? out_of_line_wait_on_atomic_t+0x170/0x170
> >[  143.205103]  ? wake_up_q+0xa0/0xa0
> >[  143.205159]  ? flush_workqueue_prep_pwqs+0x15a/0x2c0
> >[  143.205237]  wait_for_completion+0x1d/0x20
> >[  143.205292]  flush_workqueue+0x2e9/0xbb0
> >[  143.205339]  ? flush_workqueue+0x163/0xbb0
> >[  143.205418]  ? __schedule+0x533/0x1180
> >[  143.205498]  ? check_flush_dependency+0x1a0/0x1a0
> >[  143.205681]  i915_gem_userptr_mn_invalidate_range_start+0x1c7/0x270 [i915]
> >[  143.205865]  ? i915_gem_userptr_dmabuf_export+0x40/0x40 [i915]
> >[  143.205955]  __mmu_notifier_invalidate_range_start+0xc6/0x120
> >[  143.206044]  ? __mmu_notifier_invalidate_range_start+0x51/0x120
> >[  143.206123]  zap_page_range_single+0x1c7/0x1f0
> >[  143.206171]  ? unmap_single_vma+0x160/0x160
> >[  143.206260]  ? unmap_mapping_range+0xa9/0x1b0
> >[  143.206308]  ? vma_interval_tree_subtree_search+0x75/0xd0
> >[  143.206397]  unmap_mapping_range+0x18f/0x1b0
> >[  143.206444]  ? zap_vma_ptes+0x70/0x70
> >[  143.206524]  ? __pm_runtime_resume+0x67/0xa0
> >[  143.206723]  i915_gem_release_mmap+0x1ba/0x1c0 [i915]
> >[  143.206846]  i915_vma_unbind+0x5c2/0x690 [i915]
> >[  143.206925]  ? __lock_is_held+0x52/0x100
> >[  143.207076]  i915_gem_object_set_tiling+0x1db/0x650 [i915]
> >[  143.207236]  i915_gem_set_tiling_ioctl+0x1d3/0x3b0 [i915]
> >[  143.207377]  ? i915_gem_set_tiling_ioctl+0x5/0x3b0 [i915]
> >[  143.207457]  drm_ioctl+0x36c/0x670
> >[  143.207535]  ? debug_lockdep_rcu_enabled.part.0+0x1a/0x30
> >[  143.207730]  ? i915_gem_object_set_tiling+0x650/0x650 [i915]
> >[  143.207793]  ? drm_getunique+0x120/0x120
> >[  143.207875]  ? __handle_mm_fault+0x996/0x14a0
> >[  143.207939]  ? vm_insert_page+0x340/0x340
> >[  143.208028]  ? up_write+0x28/0x50
> >[  143.208086]  ? vm_mmap_pgoff+0x160/0x190
> >[  143.208163]  do_vfs_ioctl+0x12c/0xa60
> >[  143.208218]  ? debug_lockdep_rcu_enabled+0x35/0x40
> >[  143.208267]  ? ioctl_preallocate+0x150/0x150
> >[  143.208353]  ? __do_page_fault+0x36a/0x6e0
> >[  143.208400]  ? mark_held_locks+0x23/0xc0
> >[  143.208479]  ? up_read+0x1f/0x40
> >[  143.208526]  ? entry_SYSCALL_64_fastpath+0x5/0xc6
> >[  143.208669]  ? __fget_light+0xa7/0xc0
> >[  143.208747]  SyS_ioctl+0x41/0x70
> >[  143.208808]  entry_SYSCALL_64_fastpath+0x23/0xc6
> 
> That would mean another process never exits cancel_userptr, correct?
> Do we have a trace of the other end?

Our worker, who is waiting on struct_mutex (in order to unbind) which we
are holding ourselves as we are in the middle of an unbind. The nasty
part here is that we can recurse into unbind on the same object (let
alone the mutex deadlock).

That's impossible if we prevent the object from activating itself on a
GTT mmap.

ARGH. The deadlock is between a non-userptr and a set of userptr (that
explains how we get to the unbind). It is just that the other userptr
are still in the process of running their workers when the time comes to
cancel the work. (Avoids the dilemma of how we ended up with bound
userptr of the GTT).

Simplest patch is then fun with obj->userptr.work, i.e. something like
	if (xchg(&obj->userptr.work, NULL)) return 0;
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre