[Intel-gfx] [PATCH 28/66] drm/i915/gem: Replace i915_gem_object.mm.mutex with reservation_ww_class

Chris Wilson chris at chris-wilson.co.uk
Wed Aug 5 12:12:33 UTC 2020


Quoting Thomas Hellström (Intel) (2020-07-29 14:44:41)
> 
> On 7/29/20 2:17 PM, Tvrtko Ursulin wrote:
> >
> > On 28/07/2020 12:17, Thomas Hellström (Intel) wrote:
> >> On 7/16/20 5:53 PM, Tvrtko Ursulin wrote:
> >>> On 15/07/2020 16:43, Maarten Lankhorst wrote:
> >>>> On 15-07-2020 at 13:51, Chris Wilson wrote:
> >>>>> Our goal is to pull all memory reservations (next iteration
> >>>>> obj->ops->get_pages()) under a ww_mutex, and to align those 
> >>>>> reservations
> >>>>> with other drivers, i.e. control all such allocations with the
> >>>>> reservation_ww_class. Currently, this is under the purview of the
> >>>>> obj->mm.mutex, and while obj->mm remains an embedded struct we can
> >>>>> "simply" switch to using the reservation_ww_class 
> >>>>> obj->base.resv->lock
> >>>>>
> >>>>> The major consequence is the impact on the shrinker paths as the
> >>>>> reservation_ww_class is used to wrap allocations, and a ww_mutex does
> >>>>> not support subclassing so we cannot do our usual trick of knowing 
> >>>>> that
> >>>>> we never recurse inside the shrinker and instead have to finish the
> >>>>> reclaim with a trylock. This may result in us failing to release the
> >>>>> pages after having released the vma. This will have to do until a 
> >>>>> better
> >>>>> idea comes along.
> >>>>>
> >>>>> However, this step only converts the mutex over and continues to 
> >>>>> treat
> >>>>> everything as a single allocation and pinning the pages. With the
> >>>>> ww_mutex in place we can remove the temporary pinning, as we can then
> >>>>> reserve all storage en masse.
> >>>>>
> >>>>> One last thing to do: kill the implicit page pinning for active vma.
> >>>>> This will require us to invalidate the vma->pages when the backing 
> >>>>> store
> >>>>> is removed (and we expect that while the vma is active, we mark the
> >>>>> backing store as active so that it cannot be removed while the HW is
> >>>>> busy.)
> >>>>>
> >>>>> Signed-off-by: Chris Wilson <chris at chris-wilson.co.uk>
> >>>
> >>> [snip]
> >>>
> >>>>> diff --git a/drivers/gpu/drm/i915/gem/i915_gem_shrinker.c b/drivers/gpu/drm/i915/gem/i915_gem_shrinker.c
> >>>>> index dc8f052a0ffe..4e928103a38f 100644
> >>>>> --- a/drivers/gpu/drm/i915/gem/i915_gem_shrinker.c
> >>>>> +++ b/drivers/gpu/drm/i915/gem/i915_gem_shrinker.c
> >>>>> @@ -47,10 +47,7 @@ static bool unsafe_drop_pages(struct drm_i915_gem_object *obj,
> >>>>>      if (!(shrink & I915_SHRINK_BOUND))
> >>>>>          flags = I915_GEM_OBJECT_UNBIND_TEST;
> >>>>>
> >>>>> -    if (i915_gem_object_unbind(obj, flags) == 0)
> >>>>> -        __i915_gem_object_put_pages(obj);
> >>>>> -
> >>>>> -    return !i915_gem_object_has_pages(obj);
> >>>>> +    return i915_gem_object_unbind(obj, flags) == 0;
> >>>>>  }
> >>>>>
> >>>>>  static void try_to_writeback(struct drm_i915_gem_object *obj,
> >>>>> @@ -199,14 +196,14 @@ i915_gem_shrink(struct drm_i915_private *i915,
> >>>>>              spin_unlock_irqrestore(&i915->mm.obj_lock, flags);
> >>>>>
> >>>>> -            if (unsafe_drop_pages(obj, shrink)) {
> >>>>> -                /* May arrive from get_pages on another bo */
> >>>>> -                mutex_lock(&obj->mm.lock);
> >>>>> +            if (unsafe_drop_pages(obj, shrink) &&
> >>>>> +                i915_gem_object_trylock(obj)) {
> >>>
> >>>> Why trylock? Because of the nesting? In that case, still use ww ctx 
> >>>> if provided please
> >>>
> >>> By "if provided" you mean for code paths where we are calling the 
> >>> shrinker ourselves, as opposed to reclaim, like shmem_get_pages?
> >>>
> >>> That indeed sounds like the right thing to do, since all the 
> >>> get_pages from execbuf are in the reservation phase, collecting a 
> >>> list of GEM objects to lock; the ones to shrink sound like they 
> >>> should be on that list.
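
A rough sketch of the "use the ww ctx if provided" idea: the helper name
and the ww-aware calling convention below are purely illustrative (only
i915_gem_object_trylock() comes from the patch above; everything else
assumes we thread a ww_acquire_ctx down into the shrink path).

static bool shrink_lock_object(struct drm_i915_gem_object *obj,
			       struct ww_acquire_ctx *ww)
{
	/*
	 * Direct reclaim (kswapd, or an allocation made while already
	 * holding another object's reservation): only a trylock is safe.
	 */
	if (!ww)
		return i915_gem_object_trylock(obj);

	/*
	 * Called from our own reservation phase (e.g. get_pages during
	 * execbuf): join the caller's ww transaction instead. Note that
	 * dma_resv_lock() may return -EDEADLK, in which case the caller
	 * has to back off and restart its whole ww transaction.
	 */
	return dma_resv_lock(obj->base.resv, ww) == 0;
}
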
> >>>
> >>>>> +                __i915_gem_object_put_pages(obj);
> >>>>>                  if (!i915_gem_object_has_pages(obj)) {
> >>>>>                      try_to_writeback(obj, shrink);
> >>>>>                      count += obj->base.size >> PAGE_SHIFT;
> >>>>>                  }
> >>>>> -                mutex_unlock(&obj->mm.lock);
> >>>>> +                i915_gem_object_unlock(obj);
> >>>>>              }
> >>>>>
> >>>>>              scanned += obj->base.size >> PAGE_SHIFT;
> >>>>> diff --git a/drivers/gpu/drm/i915/gem/i915_gem_tiling.c b/drivers/gpu/drm/i915/gem/i915_gem_tiling.c
> >>>>> index ff72ee2fd9cd..ac12e1c20e66 100644
> >>>>> --- a/drivers/gpu/drm/i915/gem/i915_gem_tiling.c
> >>>>> +++ b/drivers/gpu/drm/i915/gem/i915_gem_tiling.c
> >>>>> @@ -265,7 +265,6 @@ i915_gem_object_set_tiling(struct drm_i915_gem_object *obj,
> >>>>>       * pages to prevent them being swapped out and causing corruption
> >>>>>       * due to the change in swizzling.
> >>>>>       */
> >>>>> -    mutex_lock(&obj->mm.lock);
> >>>>>      if (i915_gem_object_has_pages(obj) &&
> >>>>>          obj->mm.madv == I915_MADV_WILLNEED &&
> >>>>>          i915->quirks & QUIRK_PIN_SWIZZLED_PAGES) {
> >>>>> @@ -280,7 +279,6 @@ i915_gem_object_set_tiling(struct drm_i915_gem_object *obj,
> >>>>>              obj->mm.quirked = true;
> >>>>>          }
> >>>>>      }
> >>>>> -    mutex_unlock(&obj->mm.lock);
> >>>>>
> >>>>>      spin_lock(&obj->vma.lock);
> >>>>>      for_each_ggtt_vma(vma, obj) {
> >>>>> diff --git a/drivers/gpu/drm/i915/gem/i915_gem_userptr.c b/drivers/gpu/drm/i915/gem/i915_gem_userptr.c
> >>>>> index e946032b13e4..80907c00c6fd 100644
> >>>>> --- a/drivers/gpu/drm/i915/gem/i915_gem_userptr.c
> >>>>> +++ b/drivers/gpu/drm/i915/gem/i915_gem_userptr.c
> >>>>> @@ -129,8 +129,15 @@ userptr_mn_invalidate_range_start(struct mmu_notifier *_mn,
> >>>>>          ret = i915_gem_object_unbind(obj,
> >>>>>                           I915_GEM_OBJECT_UNBIND_ACTIVE |
> >>>>>                           I915_GEM_OBJECT_UNBIND_BARRIER);
> >>>>> -        if (ret == 0)
> >>>>> -            ret = __i915_gem_object_put_pages(obj);
> >>>>> +        if (ret == 0) {
> >>>>> +            /* ww_mutex and mmu_notifier is fs_reclaim tainted */
> >>>>> +            if (i915_gem_object_trylock(obj)) {
> >>>>> +                ret = __i915_gem_object_put_pages(obj);
> >>>>> +                i915_gem_object_unlock(obj);
> >>>>> +            } else {
> >>>>> +                ret = -EAGAIN;
> >>>>> +            }
> >>>>> +        }
> >>>>
> >>>> I'm not sure upstream will agree with this kind of API:
> >>>>
> >>>> 1. It will deadlock when RT tasks are used.
> >>>
> >>> It will or it can? Which part? It will break out of the loop if 
> >>> trylock fails.
> >>>
> >>>>
> >>>> 2. You start throwing -EAGAIN because you don't have the correct 
> >>>> ordering of locking; this needs fixing first.
> >>>
> >>> Is it about correct ordering of locks or something else? If memory 
> >>> allocation is allowed under dma_resv.lock, then the opposite order 
> >>> cannot be taken in any case.
> >>>
> >>> I've had a brief look at the amdgpu solution and maybe I 
> >>> misunderstood something, but it looks like a BKL approach with the 
> >>> device level notifier_lock. Their userptr notifier blocks on that 
> >>> one, not on dma_resv lock, but that also means their command 
> >>> submission (amdgpu_cs_submit) blocks on the same lock while 
> >>> obtaining backing store.
> >>
> >> If I read Christian right, it blocks on that lock only just before 
> >> command submission to validate that sequence number. If there is a 
> >> mismatch, it needs to rerun CS. I'm not sure how common userptr 
> >> buffers are, but if a device-wide mutex hurts too much, There are 
> >> perhaps more fine-grained solutions. (Like an rw semaphore, and 
> >> unlock before the fence wait in the notifier: CS which are unaffected 
> >> shouldn't need to wait...).
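
For reference, the submission-side pattern being described (validate the
notifier seqno under a device-wide lock just before submitting, and retry
on a mismatch) would look very roughly like the sketch below. It follows
the mmu_interval_notifier API; example_device, example_bo,
queue_submission() and the field names are placeholders, not amdgpu's
actual code.

struct example_device {
	struct mutex notifier_lock;	/* device-wide, like amdgpu's */
};

struct example_bo {
	struct example_device *dev;
	struct mmu_interval_notifier notifier;
	unsigned long userptr;
	unsigned long npages;
	struct page **pages;
};

/* Placeholder: publish fences and queue the job for execution. */
static void queue_submission(struct example_device *dev, struct example_bo *bo);

static int submit_userptr(struct example_device *dev, struct example_bo *bo)
{
	unsigned long seq;
	int pinned;

retry:
	seq = mmu_interval_read_begin(&bo->notifier);

	/* Obtain the backing store outside of any reservation lock.
	 * (A real implementation would also handle a short pin.) */
	pinned = pin_user_pages_fast(bo->userptr, bo->npages,
				     FOLL_WRITE | FOLL_LONGTERM, bo->pages);
	if (pinned < 0)
		return pinned;

	mutex_lock(&dev->notifier_lock);
	if (mmu_interval_read_retry(&bo->notifier, seq)) {
		/* Range was invalidated while we were pinning: start over. */
		mutex_unlock(&dev->notifier_lock);
		unpin_user_pages(bo->pages, pinned);
		goto retry;
	}
	/* Pages are known valid; publish fences / submit under the lock. */
	queue_submission(dev, bo);
	mutex_unlock(&dev->notifier_lock);

	return 0;
}
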
> >
> > Right, I wasn't familiar with the mmu notifier seqno business. Hm, i915 
> > and amdgpu seem to use different mmu notifier hooks. Could we use the 
> > same and does it mean "locking order" is irrelevant to the sub-story 
> > of userptr?
> >
> >>
> >>>
> >>> So it looks like a big hammer approach not directly related to the 
> >>> story of dma_resv locking. Maybe we could do the same big hammer 
> >>> approach, although I am not sure how it is deadlock free.
> >>>
> >>> What happens for instance if someone submits an userptr batch which 
> >>> gets unmapped while amdgpu_cs_submit is holding the notifier_lock?
> >>
> >> My understanding is the unmapping operation blocks on the 
> >> notifier_lock in the mmu notifier?
> >
> > Yes, the key will be understanding the difference between
> > invalidate_range_start and invalidate+seqno and whether we can use the 
> > same from i915 to avoid the trylock.
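
The invalidate+seqno side of that interface would then be roughly the
sketch below, reusing the placeholder types from the submission sketch
above; example_wait_for_gpu() is likewise made up.

/* Placeholder: wait on the fences previously published for this bo. */
static void example_wait_for_gpu(struct example_bo *bo);

static bool example_invalidate(struct mmu_interval_notifier *mni,
			       const struct mmu_notifier_range *range,
			       unsigned long cur_seq)
{
	struct example_bo *bo = container_of(mni, struct example_bo, notifier);
	struct example_device *dev = bo->dev;

	/* Non-blockable invalidations (e.g. OOM contexts) may ask us to bail. */
	if (!mmu_notifier_range_blockable(range))
		return false;

	/*
	 * Bump the interval's sequence number under the device-wide lock so
	 * that any concurrent submission sees mmu_interval_read_retry() fail
	 * and restarts with fresh pages.
	 */
	mutex_lock(&dev->notifier_lock);
	mmu_interval_set_seq(mni, cur_seq);
	mutex_unlock(&dev->notifier_lock);

	/* Wait for the GPU to be done with the old pages, e.g. a fence wait. */
	example_wait_for_gpu(bo);

	return true;
}

static const struct mmu_interval_notifier_ops example_notifier_ops = {
	.invalidate = example_invalidate,
};

The point being that the notifier side only ever takes notifier_lock and
waits for fences, never a reservation lock, so there is no trylock
failure mode to paper over with -EAGAIN.
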
> >
> > But given how the ww mutex is fs_reclaim tainted, does that mean we 
> > would also need to ensure no memory allocations under the ww mutex?
> 
> So with fs_reclaim, we can do memory allocations under ww_mutex, but 
> need a trylock in shrinkers and mmu_notifiers. However, get_user_pages() 
> must not run under ww_mutex due to mmap_sem inversion.
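
For completeness, that ordering can be primed into lockdep once at init,
similar in spirit to what the dma-buf core already does for dma_resv; a
simplified, illustrative sketch (prime_resv_lock_ordering() is a made-up
name):

static void prime_resv_lock_ordering(void)
{
	struct ww_acquire_ctx ctx;
	struct dma_resv obj;
	struct mm_struct *mm = mm_alloc();

	if (!mm)
		return;

	dma_resv_init(&obj);

	mmap_read_lock(mm);			/* mmap_sem first (gup side)... */
	ww_acquire_init(&ctx, &reservation_ww_class);
	if (!dma_resv_lock(&obj, &ctx)) {	/* ...then the reservation lock */
		fs_reclaim_acquire(GFP_KERNEL);	/* allocations allowed under it, */
		fs_reclaim_release(GFP_KERNEL);	/* so reclaim must not block on it */
		dma_resv_unlock(&obj);
	}
	ww_acquire_fini(&ctx);
	mmap_read_unlock(mm);

	mmput(mm);
	dma_resv_fini(&obj);
}
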

There's no real reason for the shrinker to call into mmu-notifiers; it
certainly doesn't work for us, where we handle the shrinking via another
shrinker. The debate is between adding a skip for page_maybe_dma_pinned
to the lru shrink and never adding pinned dma pages to the lru list in
the first place.
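
That first option amounts to a check along these lines in the LRU shrink
path (an illustrative sketch against core mm, not an actual patch;
skip_dma_pinned_page() is a made-up helper):

/*
 * Have shrink_page_list()-style reclaim keep pages that look long-term
 * pinned on the LRU rather than churning on them. page_maybe_dma_pinned()
 * is a heuristic; a false positive (any page with a very high refcount)
 * merely costs us a skipped page.
 */
static bool skip_dma_pinned_page(struct page *page)
{
	return page_maybe_dma_pinned(page);
}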
 
> But with the AMD solution I figure we don't really need the ww_mutex in 
> the mmu_notifier at all? That would possibly mean holding on to, and 
> exposing to the GPU, pages that aren't used anymore, but a shrinker 
> would notice the sequence number mismatch and consider the pages 
> purgeable, and on the next CS the stale pages would be released.

You need some lock to coordinate to give the guarantee required by the
mmu-notifier interface we use. HMM is likely the only way around that,
by moving the goalposts entirely.
-Chris

