[Intel-gfx] [PATCH 28/66] drm/i915/gem: Replace i915_gem_object.mm.mutex with reservation_ww_class
Chris Wilson
chris at chris-wilson.co.uk
Wed Aug 5 12:12:33 UTC 2020
Quoting Thomas Hellström (Intel) (2020-07-29 14:44:41)
>
> On 7/29/20 2:17 PM, Tvrtko Ursulin wrote:
> >
> > On 28/07/2020 12:17, Thomas Hellström (Intel) wrote:
> >> On 7/16/20 5:53 PM, Tvrtko Ursulin wrote:
> >>> On 15/07/2020 16:43, Maarten Lankhorst wrote:
> >>>> Op 15-07-2020 om 13:51 schreef Chris Wilson:
> >>>>> Our goal is to pull all memory reservations (next iteration
> >>>>> obj->ops->get_pages()) under a ww_mutex, and to align those
> >>>>> reservations
> >>>>> with other drivers, i.e. control all such allocations with the
> >>>>> reservation_ww_class. Currently, this is under the purview of the
> >>>>> obj->mm.mutex, and while obj->mm remains an embedded struct we can
> >>>>> "simply" switch to using the reservation_ww_class
> >>>>> obj->base.resv->lock
> >>>>>
> >>>>> The major consequence is the impact on the shrinker paths as the
> >>>>> reservation_ww_class is used to wrap allocations, and a ww_mutex does
> >>>>> not support subclassing so we cannot do our usual trick of knowing
> >>>>> that
> >>>>> we never recurse inside the shrinker and instead have to finish the
> >>>>> reclaim with a trylock. This may result in us failing to release the
> >>>>> pages after having released the vma. This will have to do until a
> >>>>> better
> >>>>> idea comes along.
> >>>>>
> >>>>> However, this step only converts the mutex over and continues to
> >>>>> treat
> >>>>> everything as a single allocation and pinning the pages. With the
> >>>>> ww_mutex in place we can remove the temporary pinning, as we can then
> >>>>> reserve all storage en masse.
> >>>>>
> >>>>> One last thing to do: kill the implicit page pinning for active vma.
> >>>>> This will require us to invalidate the vma->pages when the backing
> >>>>> store
> >>>>> is removed (and we expect that while the vma is active, we mark the
> >>>>> backing store as active so that it cannot be removed while the HW is
> >>>>> busy.)
> >>>>>
> >>>>> Signed-off-by: Chris Wilson <chris at chris-wilson.co.uk>
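For anyone skimming, the pattern the series is converting to is the usual
reservation_ww_class dance, roughly like this (a sketch of the dma_resv
flavour of it, not code lifted from the series; the get_pages helper is
just for illustration):

	struct ww_acquire_ctx ww;
	int err;

	ww_acquire_init(&ww, &reservation_ww_class);

	/* obj->base.resv->lock is a reservation_ww_class ww_mutex */
	err = dma_resv_lock_interruptible(obj->base.resv, &ww);
	if (err == -EDEADLK) {
		/* wait-die backoff; nothing else is held here, so just sleep */
		dma_resv_lock_slow(obj->base.resv, &ww);
		err = 0;
	}
	if (!err) {
		/* backing-store allocations now nest inside the ww class */
		err = ____i915_gem_object_get_pages(obj);
		dma_resv_unlock(obj->base.resv);
	}

	ww_acquire_fini(&ww);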
> >>>
> >>> [snip]
> >>>
> >>>>> diff --git a/drivers/gpu/drm/i915/gem/i915_gem_shrinker.c
> >>>>> b/drivers/gpu/drm/i915/gem/i915_gem_shrinker.c
> >>>>> index dc8f052a0ffe..4e928103a38f 100644
> >>>>> --- a/drivers/gpu/drm/i915/gem/i915_gem_shrinker.c
> >>>>> +++ b/drivers/gpu/drm/i915/gem/i915_gem_shrinker.c
> >>>>> @@ -47,10 +47,7 @@ static bool unsafe_drop_pages(struct
> >>>>> drm_i915_gem_object *obj,
> >>>>> if (!(shrink & I915_SHRINK_BOUND))
> >>>>> flags = I915_GEM_OBJECT_UNBIND_TEST;
> >>>>> - if (i915_gem_object_unbind(obj, flags) == 0)
> >>>>> - __i915_gem_object_put_pages(obj);
> >>>>> -
> >>>>> - return !i915_gem_object_has_pages(obj);
> >>>>> + return i915_gem_object_unbind(obj, flags) == 0;
> >>>>> }
> >>>>> static void try_to_writeback(struct drm_i915_gem_object *obj,
> >>>>> @@ -199,14 +196,14 @@ i915_gem_shrink(struct drm_i915_private *i915,
> >>>>> spin_unlock_irqrestore(&i915->mm.obj_lock, flags);
> >>>>> - if (unsafe_drop_pages(obj, shrink)) {
> >>>>> - /* May arrive from get_pages on another bo */
> >>>>> - mutex_lock(&obj->mm.lock);
> >>>>> + if (unsafe_drop_pages(obj, shrink) &&
> >>>>> + i915_gem_object_trylock(obj)) {
> >>>
> >>>> Why trylock? Because of the nesting? In that case, still use ww ctx
> >>>> if provided please
> >>>
> >>> By "if provided" you mean for code paths where we are calling the
> >>> shrinker ourselves, as opposed to reclaim, like shmem_get_pages?
> >>>
> >>> That indeed sounds like the right thing to do, since all the
> >>> get_pages from execbuf are in the reservation phase, collecting a
> >>> list of GEM objects to lock, so the ones to shrink sound like they
> >>> should be on that list.
> >>>
> >>>>> + __i915_gem_object_put_pages(obj);
> >>>>> if (!i915_gem_object_has_pages(obj)) {
> >>>>> try_to_writeback(obj, shrink);
> >>>>> count += obj->base.size >> PAGE_SHIFT;
> >>>>> }
> >>>>> - mutex_unlock(&obj->mm.lock);
> >>>>> + i915_gem_object_unlock(obj);
> >>>>> }
> >>>>> scanned += obj->base.size >> PAGE_SHIFT;
> >>>>> diff --git a/drivers/gpu/drm/i915/gem/i915_gem_tiling.c
> >>>>> b/drivers/gpu/drm/i915/gem/i915_gem_tiling.c
> >>>>> index ff72ee2fd9cd..ac12e1c20e66 100644
> >>>>> --- a/drivers/gpu/drm/i915/gem/i915_gem_tiling.c
> >>>>> +++ b/drivers/gpu/drm/i915/gem/i915_gem_tiling.c
> >>>>> @@ -265,7 +265,6 @@ i915_gem_object_set_tiling(struct
> >>>>> drm_i915_gem_object *obj,
> >>>>> * pages to prevent them being swapped out and causing
> >>>>> corruption
> >>>>> * due to the change in swizzling.
> >>>>> */
> >>>>> - mutex_lock(&obj->mm.lock);
> >>>>> if (i915_gem_object_has_pages(obj) &&
> >>>>> obj->mm.madv == I915_MADV_WILLNEED &&
> >>>>> i915->quirks & QUIRK_PIN_SWIZZLED_PAGES) {
> >>>>> @@ -280,7 +279,6 @@ i915_gem_object_set_tiling(struct
> >>>>> drm_i915_gem_object *obj,
> >>>>> obj->mm.quirked = true;
> >>>>> }
> >>>>> }
> >>>>> - mutex_unlock(&obj->mm.lock);
> >>>>> spin_lock(&obj->vma.lock);
> >>>>> for_each_ggtt_vma(vma, obj) {
> >>>>> diff --git a/drivers/gpu/drm/i915/gem/i915_gem_userptr.c
> >>>>> b/drivers/gpu/drm/i915/gem/i915_gem_userptr.c
> >>>>> index e946032b13e4..80907c00c6fd 100644
> >>>>> --- a/drivers/gpu/drm/i915/gem/i915_gem_userptr.c
> >>>>> +++ b/drivers/gpu/drm/i915/gem/i915_gem_userptr.c
> >>>>> @@ -129,8 +129,15 @@ userptr_mn_invalidate_range_start(struct
> >>>>> mmu_notifier *_mn,
> >>>>> ret = i915_gem_object_unbind(obj,
> >>>>> I915_GEM_OBJECT_UNBIND_ACTIVE |
> >>>>> I915_GEM_OBJECT_UNBIND_BARRIER);
> >>>>> - if (ret == 0)
> >>>>> - ret = __i915_gem_object_put_pages(obj);
> >>>>> + if (ret == 0) {
> >>>>> + /* ww_mutex and mmu_notifier is fs_reclaim tainted */
> >>>>> + if (i915_gem_object_trylock(obj)) {
> >>>>> + ret = __i915_gem_object_put_pages(obj);
> >>>>> + i915_gem_object_unlock(obj);
> >>>>> + } else {
> >>>>> + ret = -EAGAIN;
> >>>>> + }
> >>>>> + }
> >>>>
> >>>> I'm not sure upstream will agree with this kind of API:
> >>>>
> >>>> 1. It will deadlock when RT tasks are used.
> >>>
> >>> It will or it can? Which part? It will break out of the loop if
> >>> trylock fails.
> >>>
> >>>>
> >>>> 2. You start throwing -EAGAIN because you don't have the correct
> >>>> ordering of locking; this needs fixing first.
> >>>
> >>> Is it about correct ordering of locks or something else? If memory
> >>> allocation is allowed under dma_resv.lock, then the opposite order
> >>> cannot be taken in any case.
> >>>
> >>> I've had a brief look at the amdgpu solution and maybe I
> >>> misunderstood something, but it looks like a BKL approach with the
> >>> device level notifier_lock. Their userptr notifier blocks on that
> >>> one, not on dma_resv lock, but that also means their command
> >>> submission (amdgpu_cs_submit) blocks on the same lock while
> >>> obtaining backing store.
> >>
> >> If I read Christian right, it blocks on that lock only just before
> >> command submission to validate that sequence number. If there is a
> >> mismatch, it needs to rerun CS. I'm not sure how common userptr
> >> buffers are, but if a device-wide mutex hurts too much, there are
> >> perhaps more fine-grained solutions. (Like an rw semaphore, and
> >> unlock before the fence wait in the notifier: CS which are unaffected
> >> shouldn't need to wait...).
> >
> > Right, I wasn't familiar with the mmu notifier seqno business. Hm i915
> > and amdgpu seem to use different mmu notifier hooks. Could we use the
> > same and does it mean "locking order" is irrelevant to the sub-story
> > of userptr?
> >
> >>
> >>>
> >>> So it looks like a big hammer approach not directly related to the
> >>> story of dma_resv locking. Maybe we could do the same big hammer
> >>> approach, although I am not sure how it is deadlock free.
> >>>
> >>> What happens for instance if someone submits an userptr batch which
> >>> gets unmapped while amdgpu_cs_submit is holding the notifier_lock?
> >>
> >> My understanding is the unmapping operation blocks on the
> >> notifier_lock in the mmu notifier?
> >
> > Yes, the key will be understanding the difference between
> > invalidate_range_start and invalidate+seqno and whether we can use the
> > same from i915 to avoid the trylock.
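The invalidate+seqno flavour is the mmu_interval_notifier API. The
submission side revalidates its pages against the sequence count under a
driver lock, roughly like so (a sketch; userptr->notifier, dev->notifier_lock
and friends are stand-ins, not existing i915 names):

	unsigned long seq;
	int err;

retry:
	seq = mmu_interval_read_begin(&userptr->notifier);

	/* grab the pages outside of any reservation/ww lock */
	err = pin_user_pages_fast(userptr->ptr, num_pages, FOLL_WRITE, pages);
	if (err < 0)
		return err;

	mutex_lock(&dev->notifier_lock);
	if (mmu_interval_read_retry(&userptr->notifier, seq)) {
		/* invalidated in the meantime: drop the pages and restart */
		mutex_unlock(&dev->notifier_lock);
		unpin_user_pages(pages, err);
		goto retry;
	}
	/* publish the request/fences while the pages are known to be valid */
	mutex_unlock(&dev->notifier_lock);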
> >
> > But given how the ww mutex is fs_reclaim tainted, does that mean we would
> > also need to ensure no memory allocations under the ww mutex?
>
> So with the fs_reclaim, we can do memory allocations under ww_mutex, but
> need trylock in shrinkers and mmu_notifiers. However get_user_pages()
> must not run under ww_mutex due to mmap_sem inversion.
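Spelled out, the lock ordering that leaves us with is roughly:

	dma_resv (reservation_ww_class)
	  -> kmalloc(GFP_KERNEL)
	       -> fs_reclaim
	            -> shrinker / mmu_notifier
	                 -> dma_resv	/* inversion, hence only trylock */

	mmap_sem -> fault -> dma_resv	/* so no get_user_pages() under dma_resv */
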
There's no real reason for the shrinker to call into mmu-notifiers; it
certainly doesn't work for us where we handle the shrinking via another
shrinker. The debate is between adding a skip for page_maybe_dma_pinned()
to the lru shrink and never adding pinned dma pages to the lru list in
the first place.
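The former is a small check in the reclaim path, something along the lines
of (sketch against shrink_page_list(), not an actual patch):

	if (page_maybe_dma_pinned(page))
		goto keep_locked;	/* long-term pinned; not worth scanning */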
> But with the AMD solution I figure we don't really need the ww_mutex in
> the mmu_notifier at all? That would possibly mean holding on to, and
> exposing to the GPU, pages that aren't used anymore, but a shrinker would
> notice the sequence number mismatch and consider the pages purgeable,
> and also at the next CS the stale pages would be released.
You need some lock to coordinate to give the guarantee required by the
mmu-notifier interface we use. HMM is likely the only way around that,
by moving the goalposts entirely.
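For reference, in the seqno scheme that guarantee comes from the invalidate
side taking the same driver lock before bumping the sequence and then
waiting out the GPU, roughly (a sketch; my_userptr and notifier_lock are
made-up names):

static bool userptr_invalidate(struct mmu_interval_notifier *mni,
			       const struct mmu_notifier_range *range,
			       unsigned long cur_seq)
{
	struct my_userptr *userptr = container_of(mni, typeof(*userptr), notifier);

	if (!mmu_notifier_range_blockable(range))
		return false;	/* cannot sleep here; the core will retry */

	mutex_lock(&userptr->dev->notifier_lock);
	mmu_interval_set_seq(mni, cur_seq);	/* fails any read_begin/retry section */
	mutex_unlock(&userptr->dev->notifier_lock);

	/* then wait for the GPU, e.g. on the object's dma_resv fences */
	return true;
}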
-Chris