[Intel-gfx] [igt-dev] ✓ Fi.CI.IGT: success for tests/gem_userptr_blits: Enhance invalid mapping exercise

Tue Feb 18 09:59:58 UTC 2020

Hi Chris,

Let me share some of my observations on the userptr(mmap_offset) case.

On Wednesday, January 29, 2020 12:24:39 PM CET Chris Wilson wrote:
> Quoting Patchwork (2020-01-23 20:22:25)
> > == Series Details ==
> > 
> > Series: tests/gem_userptr_blits: Enhance invalid mapping exercise
> > URL   : https://patchwork.freedesktop.org/series/72411/
> > State : success
> > 
> > == Summary ==
> > 
> > CI Bug Log - changes from CI_DRM_7794_full -> IGTPW_3969_full
> > ====================================================
> > 
> > Summary
> > -------
> > 
> >   **SUCCESS**
> > 
> >   No regressions found.
> > 
> >   External URL: https://intel-gfx-ci.01.org/tree/drm-tip/IGTPW_3969/index.html
> > 
> > New tests
> > ---------
> > 
> >   New tests have been introduced between CI_DRM_7794_full and IGTPW_3969_full:
> > 
> > ### New IGT tests (3) ###
> > 
> >   * igt at gem_userptr_blits@invalid-uc-mapping:
> >     - Statuses : 7 pass(s)
> >     - Exec time: [0.01, 0.02] s
> > 
> >   * igt at gem_userptr_blits@invalid-wb-mapping:
> >     - Statuses : 7 pass(s)
> >     - Exec time: [0.01, 0.02] s
> > 
> >   * igt at gem_userptr_blits@invalid-wc-mapping:
> >     - Statuses : 7 pass(s)
> >     - Exec time: [0.01, 0.02] s
> 
> Ok, we need a bit more work to trigger the lockdep loop. We need a
> cancel_userptr of the same object that is faulted via mmap-offset.
> 
> We're basically looking for the equivalent of this
> 
> <4> [175.968441] ======================================================
> <4> [175.968444] WARNING: possible circular locking dependency detected
> <4> [175.968447] 5.5.0-CI-CI_DRM_7828+ #1 Tainted: G     U
> <4> [175.968449] ------------------------------------------------------
> <4> [175.968452] gem_userptr_bli/1564 is trying to acquire lock:
> <4> [175.968455] ffff888208f49580 (&vm->mutex){+.+.}, at: i915_vma_unbind+0xae/0x110 [i915]
> <4> [175.968527]
> but task is already holding lock:
> <4> [175.968529] ffffffff82664d40 (mmu_notifier_invalidate_range_start){+.+.}, at: unmap_vmas+0x0/0x150
> <4> [175.968535]
> which lock already depends on the new lock.
> <4> [175.968538]
> the existing dependency chain (in reverse order) is:
> <4> [175.968541]
> -> #2 (mmu_notifier_invalidate_range_start){+.+.}:
> <4> [175.968546]        page_mkclean_one+0xda/0x210
> <4> [175.968548]        rmap_walk_file+0xff/0x260
> <4> [175.968551]        page_mkclean+0x9f/0xb0
> <4> [175.968555]        clear_page_dirty_for_io+0xa2/0x2f0
> <4> [175.968559]        mpage_submit_page+0x1a/0x70
> <4> [175.968561]        mpage_process_page_bufs+0xe7/0x110
> <4> [175.968564]        mpage_prepare_extent_to_map+0x1d2/0x2b0
> <4> [175.968568]        ext4_writepages+0x5ba/0x12b0
> <4> [175.968571]        do_writepages+0x46/0xe0
> <4> [175.968573]        __filemap_fdatawrite_range+0xc6/0x100
> <4> [175.968576]        file_write_and_wait_range+0x3c/0x90
> <4> [175.968579]        ext4_sync_file+0x1a4/0x540
> <4> [175.968582]        do_fsync+0x33/0x60
> <4> [175.968584]        __x64_sys_fsync+0xb/0x10
> <4> [175.968587]        do_syscall_64+0x4f/0x220
> <4> [175.968591]        entry_SYSCALL_64_after_hwframe+0x49/0xbe
> <4> [175.968594]
> -> #1 (&mapping->i_mmap_rwsem){++++}:
> <4> [175.968599]        down_write+0x33/0x70
> <4> [175.968601]        unmap_mapping_pages+0x48/0x130
> <4> [175.968671]        i915_vma_revoke_mmap.part.37+0x66/0x190 [i915]
> <4> [175.968715]        fence_update+0xfd/0x2d0 [i915]
> <4> [175.968759]        __i915_vma_unbind+0x1eb/0x530 [i915]
> <4> [175.968803]        i915_vma_release+0x101/0x220 [i915]
> <4> [175.968843]        __i915_gem_free_objects+0x113/0x530 [i915]
> <4> [175.968886]        i915_gem_create_ioctl+0x12/0x40 [i915]
> <4> [175.968890]        drm_ioctl_kernel+0xad/0xf0
> <4> [175.968893]        drm_ioctl+0x2e1/0x390
> <4> [175.968896]        do_vfs_ioctl+0x9c/0x730
> <4> [175.968899]        ksys_ioctl+0x35/0x60
> <4> [175.968901]        __x64_sys_ioctl+0x11/0x20
> <4> [175.968904]        do_syscall_64+0x4f/0x220
> <4> [175.968906]        entry_SYSCALL_64_after_hwframe+0x49/0xbe
> <4> [175.968909]
> -> #0 (&vm->mutex){+.+.}:
> <4> [175.968914]        __lock_acquire+0x1328/0x15d0
> <4> [175.968916]        lock_acquire+0xa7/0x1c0
> <4> [175.968919]        __mutex_lock+0x9a/0x9c0
> <4> [175.968962]        i915_vma_unbind+0xae/0x110 [i915]
> <4> [175.969004]        i915_gem_object_unbind+0x1dc/0x400 [i915]
> <4> [175.969045]        userptr_mn_invalidate_range_start+0xdd/0x190 [i915]

AFAICU, for that to happen, not only the userptr MMU notifier would have to be 
registered, but at least one userptr object would have to be added to the 
notifier's list of active objects.

For a userptr object to be ever added to the notifier's list of active 
objects,  it would have to be created without I915_USERPTR_UNSYNCHRONIZED flag 
set.  As "invalid-*-mapping" (or dynamic "invalid-mmap-offset" since 
addressing your comments in v3 of my patch) subtests create userptr objects 
with that flag set, we need "-sync" flavors of those subtests in order to have 
a chance for the lockdep loop to be exercised.  If that observation is not 
questionable to you, I'm going to address it in my next version of the patch.

Now, the only way to activate the MMU notifier for a userptr object is when 
__i915_gem_userptr_set_active() is called from i915_gem_userptr_get_pages(). 
That can happen when either all required pages have already been pinned before 
and are returned by __get_user_pages_fast(), or, if some pages are not yet 
pinned, when __i915_gem_userptr_get_pages_schedule() likely succeeds (returns 
-EAGAIN).  In the latter case, __i915_gem_userptr_get_pages_worker() work is 
scheduled.

In case of a userptr object backed by our mmap-offset mapping, 
get_user_pages_remote() called from __i915_gem_userptr_get_pages_worker() 
fails immediately with -EFAULT on (vm_flags & (VM_IO | VM_PFNMAP)) in 
mm/gup.c:check_vma_flags().  As a result, the MMU notifier is immediately 
deactivated for the object.  Then indeed, a time window with the MMU notifier 
being active for the object exists.  However, I still can't find out how 
userspace could trigger the lockdep loop *within* that time window in a 
reproducible manner.  Could you think of a way to do it?

On the other hand, if we could postpone activation of the userptr MMU notifier 
for an object, or otherwise prevent the notifier from doing its job until at 
least one page is successfully acquired, then a userptr object backed with 
mmap-offset mapping would never be able to trigger that lockdep loop, I 
believe.  If you find my conclusions not missing the point, I'm going to 
propose a patch (i915, not IGT).

Thanks,
Janusz

> <4> [175.969049]        __mmu_notifier_invalidate_range_start+0x148/0x250
> <4> [175.969052]        unmap_vmas+0x13e/0x150
> <4> [175.969055]        unmap_region+0xa3/0x100
> <4> [175.969057]        __do_munmap+0x26d/0x490
> <4> [175.969060]        __vm_munmap+0x66/0xc0
> <4> [175.969063]        __x64_sys_munmap+0x12/0x20
> <4> [175.969066]        do_syscall_64+0x4f/0x220
> <4> [175.969068]        entry_SYSCALL_64_after_hwframe+0x49/0xbe
> <4> [175.969071]
> 
> cycle but with obj->mm.lock.
> -Chris
>