[PATCH v2 02/15] mm/mmu_notifier: add an interval tree notifier
Kuehling, Felix
Felix.Kuehling at amd.com
Tue Oct 29 22:04:45 UTC 2019
I haven't had enough time to fully understand the deferred logic in this
change. I spotted one problem; see my comments inline.
On 2019-10-28 4:10 p.m., Jason Gunthorpe wrote:
> From: Jason Gunthorpe <jgg at mellanox.com>
>
> Of the 13 users of mmu_notifiers, 8 of them use only
> invalidate_range_start/end() and immediately intersect the
> mmu_notifier_range with some kind of internal list of VAs. 4 use an
> interval tree (i915_gem, radeon_mn, umem_odp, hfi1). 4 use a linked list
> of some kind (scif_dma, vhost, gntdev, hmm)
>
> And the remaining 5 either don't use invalidate_range_start() or do some
> special thing with it.
>
> It turns out that building a correct scheme with an interval tree is
> pretty complicated, particularly if the use case is synchronizing against
> another thread doing get_user_pages(). Many of these implementations have
> various subtle and difficult to fix races.
>
> This approach puts the interval tree as common code at the top of the mmu
> notifier call tree and implements a shareable locking scheme.
>
> It includes:
> - An interval tree tracking VA ranges, with per-range callbacks
> - A read/write locking scheme for the interval tree that avoids
>   sleeping in the notifier path (for OOM killer)
> - A sequence counter based collision-retry locking scheme to tell
>   device page fault that a VA range is being concurrently invalidated.
>
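For what it's worth, the way I think about the third point above (the
collision-retry scheme) is the usual seqcount-style loop on the device
fault path. Something like the sketch below, where every name is made up
for illustration and is not the API added by this patch:

/*
 * Hypothetical driver fault path.  driver_range_read_begin/_retry
 * stand in for whatever the sequence-count helpers end up being
 * called, and driver_lock is the driver's own page table lock.
 */
static int driver_handle_fault(struct driver_range *node,
                               unsigned long *pfns)
{
        unsigned long seq;
        int ret;

again:
        seq = driver_range_read_begin(node);    /* snapshot the seq count */

        /* fault/pin the pages outside of any driver lock */
        ret = driver_populate_pfns(node, pfns);
        if (ret)
                return ret;

        mutex_lock(&node->driver_lock);
        if (driver_range_read_retry(node, seq)) {
                /* an invalidation ran concurrently, drop it all and retry */
                mutex_unlock(&node->driver_lock);
                goto again;
        }
        driver_map_pfns(node, pfns);    /* no invalidation raced us */
        mutex_unlock(&node->driver_lock);
        return 0;
}

Please read that as my mental model of the pattern the sequence counter
has to support, not as the new API itself.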
> This is based on various ideas:
> - hmm accumulates invalidated VA ranges and releases them when all
>   invalidates are done, via active_invalidate_ranges count.
>   This approach avoids having to intersect the interval tree twice (as
>   umem_odp does) at the potential cost of a longer device page fault.
>
> - kvm/umem_odp use a sequence counter to drive the collision retry,
>   via invalidate_seq
>
> - a deferred work todo list on unlock scheme like RTNL, via deferred_list.
>   This makes adding/removing interval tree members more deterministic
>
> - seqlock, except this version makes the seqlock idea multi-holder on the
>   write side by protecting it with active_invalidate_ranges and a spinlock
>
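The deferred_list point is the part I am still working through. The
pattern I believe it describes, reduced to insertions only and with
made-up structure and field names, is roughly:

/* Made-up types and fields, only to illustrate the RTNL-like deferral. */
static void itree_add_node(struct itree *t, struct itree_node *n)
{
        spin_lock(&t->lock);
        if (t->active_invalidate_ranges)
                /* tree is busy, queue the update for later */
                list_add_tail(&n->deferred_item, &t->deferred_list);
        else
                interval_tree_insert(&n->interval, &t->itree);
        spin_unlock(&t->lock);
}

static void itree_invalidate_end(struct itree *t)
{
        struct itree_node *n, *next;

        spin_lock(&t->lock);
        if (--t->active_invalidate_ranges == 0) {
                /* the last invalidation out applies the queued updates */
                list_for_each_entry_safe(n, next, &t->deferred_list,
                                         deferred_item) {
                        list_del(&n->deferred_item);
                        interval_tree_insert(&n->interval, &t->itree);
                }
        }
        spin_unlock(&t->lock);
}

If that is not the idea, then that is probably where I am misreading the
deferred logic.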
> To minimize MM overhead when only the interval tree is being used, the
> entire SRCU and hlist overheads are dropped using some simple
> branches. Similarly the interval tree overhead is dropped when in hlist
> mode.
>
> The overhead from the mandatory spinlock is broadly the same as most of
> existing users which already had a lock (or two) of some sort on the
> invalidation path.
>
> Cc: Andrea Arcangeli <aarcange at redhat.com>
> Cc: Michal Hocko <mhocko at kernel.org>
> Acked-by: Christian König <christian.koenig at amd.com>
> Signed-off-by: Jason Gunthorpe <jgg at mellanox.com>
> ---
>  include/linux/mmu_notifier.h |  98 +++++++
>  mm/Kconfig                   |   1 +
>  mm/mmu_notifier.c            | 533 +++++++++++++++++++++++++++++++++--
>  3 files changed, 607 insertions(+), 25 deletions(-)
>
[snip]
> diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
> index 367670cfd02b7b..d02d3c8c223eb7 100644
> --- a/mm/mmu_notifier.c
> +++ b/mm/mmu_notifier.c
[snip]
>   * because mm->mm_users > 0 during mmu_notifier_register and exit_mmap
> @@ -52,17 +286,24 @@ struct mmu_notifier_mm {
>   * can't go away from under us as exit_mmap holds an mm_count pin
>   * itself.
>   */
> -void __mmu_notifier_release(struct mm_struct *mm)
> +static void mn_hlist_release(struct mmu_notifier_mm *mmn_mm,
> +                             struct mm_struct *mm)
>  {
>          struct mmu_notifier *mn;
>          int id;
>
> +        if (mmn_mm->has_interval)
> +                mn_itree_release(mmn_mm, mm);
> +
> +        if (hlist_empty(&mmn_mm->list))
> +                return;
This seems to duplicate the conditions in __mmu_notifier_release. See my
comments below; I think one of them is wrong. I suspect it's this one,
because __mmu_notifier_release follows the same pattern as the other
notifiers.
> +
>          /*
>           * SRCU here will block mmu_notifier_unregister until
>           * ->release returns.
>           */
>          id = srcu_read_lock(&srcu);
> -        hlist_for_each_entry_rcu(mn, &mm->mmu_notifier_mm->list, hlist)
> +        hlist_for_each_entry_rcu(mn, &mmn_mm->list, hlist)
>                  /*
>                   * If ->release runs before mmu_notifier_unregister it must be
>                   * handled, as it's the only way for the driver to flush all
> @@ -72,9 +313,9 @@ void __mmu_notifier_release(struct mm_struct *mm)
>                  if (mn->ops->release)
>                          mn->ops->release(mn, mm);
>
> -        spin_lock(&mm->mmu_notifier_mm->lock);
> -        while (unlikely(!hlist_empty(&mm->mmu_notifier_mm->list))) {
> -                mn = hlist_entry(mm->mmu_notifier_mm->list.first,
> +        spin_lock(&mmn_mm->lock);
> +        while (unlikely(!hlist_empty(&mmn_mm->list))) {
> +                mn = hlist_entry(mmn_mm->list.first,
>                                   struct mmu_notifier,
>                                   hlist);
>                  /*
> @@ -85,7 +326,7 @@ void __mmu_notifier_release(struct mm_struct *mm)
>                   */
>                  hlist_del_init_rcu(&mn->hlist);
>          }
> -        spin_unlock(&mm->mmu_notifier_mm->lock);
> +        spin_unlock(&mmn_mm->lock);
>          srcu_read_unlock(&srcu, id);
>
>          /*
> @@ -100,6 +341,17 @@ void __mmu_notifier_release(struct mm_struct *mm)
>          synchronize_srcu(&srcu);
>  }
>
> +void __mmu_notifier_release(struct mm_struct *mm)
> +{
> +        struct mmu_notifier_mm *mmn_mm = mm->mmu_notifier_mm;
> +
> +        if (mmn_mm->has_interval)
> +                mn_itree_release(mmn_mm, mm);
If mmn_mm->list is not empty, mn_itree_release will be called a second
time here, because mn_hlist_release duplicates this call.
> +
> +        if (!hlist_empty(&mmn_mm->list))
> +                mn_hlist_release(mmn_mm, mm);
mn_hlist_release checks the same condition itself.
> +}
> +
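To make the suggestion concrete: what I would expect is for the two
checks to live only here in __mmu_notifier_release, with
mn_hlist_release trimmed down to the SRCU walk. Just a sketch of one way
to resolve the duplication, based on the hunks quoted above:

static void mn_hlist_release(struct mmu_notifier_mm *mmn_mm,
                             struct mm_struct *mm)
{
        struct mmu_notifier *mn;
        int id;

        /*
         * No mn_itree_release() call and no hlist_empty() bail-out
         * here; __mmu_notifier_release() above is the only place that
         * checks both conditions before calling in.
         */
        id = srcu_read_lock(&srcu);
        /* ... SRCU walk and hlist teardown exactly as in the hunk above ... */
        srcu_read_unlock(&srcu, id);
}

Either way, only one of the two places should be doing these checks.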
>  /*
>   * If no young bitflag is supported by the hardware, ->clear_flush_young can
>   * unmap the address and return 1 or 0 depending if the mapping previously
[snip]