[PATCH v2 02/15] mm/mmu_notifier: add an interval tree notifier
Kuehling, Felix
Felix.Kuehling at amd.com
Tue Oct 29 22:04:45 UTC 2019
I haven't had enough time to fully understand the deferred logic in this
change. I spotted one problem; see my comments inline.
On 2019-10-28 4:10 p.m., Jason Gunthorpe wrote:
> From: Jason Gunthorpe <jgg at mellanox.com>
>
> Of the 13 users of mmu_notifiers, 8 of them use only
> invalidate_range_start/end() and immediately intersect the
> mmu_notifier_range with some kind of internal list of VAs. 4 use an
> interval tree (i915_gem, radeon_mn, umem_odp, hfi1). 4 use a linked list
> of some kind (scif_dma, vhost, gntdev, hmm)
>
> And the remaining 5 either don't use invalidate_range_start() or do some
> special thing with it.
>
> It turns out that building a correct scheme with an interval tree is
> pretty complicated, particularly if the use case is synchronizing against
> another thread doing get_user_pages(). Many of these implementations have
> various subtle and difficult to fix races.
>
> This approach puts the interval tree as common code at the top of the mmu
> notifier call tree and implements a shareable locking scheme.
>
> It includes:
> - An interval tree tracking VA ranges, with per-range callbacks
> - A read/write locking scheme for the interval tree that avoids
>   sleeping in the notifier path (for OOM killer)
> - A sequence counter based collision-retry locking scheme to tell
>   device page fault that a VA range is being concurrently invalidated.
>
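For what it's worth, the way I think about the third point above (the
collision-retry scheme) is the usual seqcount-style loop on the device
fault path. Something like the sketch below, where every name is made up
for illustration and is not the API added by this patch:

/*
 * Hypothetical driver fault path.  driver_range_read_begin/_retry
 * stand in for whatever the sequence-count helpers end up being
 * called, and driver_lock is the driver's own page table lock.
 */
static int driver_handle_fault(struct driver_range *node,
                               unsigned long *pfns)
{
        unsigned long seq;
        int ret;

again:
        seq = driver_range_read_begin(node);    /* snapshot the seq count */

        /* fault/pin the pages outside of any driver lock */
        ret = driver_populate_pfns(node, pfns);
        if (ret)
                return ret;

        mutex_lock(&node->driver_lock);
        if (driver_range_read_retry(node, seq)) {
                /* an invalidation ran concurrently, drop it all and retry */
                mutex_unlock(&node->driver_lock);
                goto again;
        }
        driver_map_pfns(node, pfns);    /* no invalidation raced us */
        mutex_unlock(&node->driver_lock);
        return 0;
}

Please read that as my mental model of the pattern the sequence counter
has to support, not as the new API itself.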
> This is based on various ideas:
> - hmm accumulates invalidated VA ranges and releases them when all
>   invalidates are done, via active_invalidate_ranges count.
>   This approach avoids having to intersect the interval tree twice (as
>   umem_odp does) at the potential cost of a longer device page fault.
>
> - kvm/umem_odp use a sequence counter to drive the collision retry,
>   via invalidate_seq
>
> - a deferred work todo list on unlock scheme like RTNL, via deferred_list.
>   This makes adding/removing interval tree members more deterministic
>
> - seqlock, except this version makes the seqlock idea multi-holder on the
>   write side by protecting it with active_invalidate_ranges and a spinlock
>
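The deferred_list point is the part I am still working through. The
pattern I believe it describes, reduced to insertions only and with
made-up structure and field names, is roughly:

/* Made-up types and fields, only to illustrate the RTNL-like deferral. */
static void itree_add_node(struct itree *t, struct itree_node *n)
{
        spin_lock(&t->lock);
        if (t->active_invalidate_ranges)
                /* tree is busy, queue the update for later */
                list_add_tail(&n->deferred_item, &t->deferred_list);
        else
                interval_tree_insert(&n->interval, &t->itree);
        spin_unlock(&t->lock);
}

static void itree_invalidate_end(struct itree *t)
{
        struct itree_node *n, *next;

        spin_lock(&t->lock);
        if (--t->active_invalidate_ranges == 0) {
                /* the last invalidation out applies the queued updates */
                list_for_each_entry_safe(n, next, &t->deferred_list,
                                         deferred_item) {
                        list_del(&n->deferred_item);
                        interval_tree_insert(&n->interval, &t->itree);
                }
        }
        spin_unlock(&t->lock);
}

If that is not the idea, then that is probably where I am misreading the
deferred logic.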
> To minimize MM overhead when only the interval tree is being used, the
> entire SRCU and hlist overheads are dropped using some simple
> branches. Similarly the interval tree overhead is dropped when in hlist
> mode.
>
> The overhead from the mandatory spinlock is broadly the same as most of
> existing users which already had a lock (or two) of some sort on the
> invalidation path.
>
> Cc: Andrea Arcangeli <aarcange at redhat.com>
> Cc: Michal Hocko <mhocko at kernel.org>
> Acked-by: Christian König <christian.koenig at amd.com>
> Signed-off-by: Jason Gunthorpe <jgg at mellanox.com>
> ---
>  include/linux/mmu_notifier.h |  98 +++++++
>  mm/Kconfig                   |   1 +
>  mm/mmu_notifier.c            | 533 +++++++++++++++++++++++++++++++++--
>  3 files changed, 607 insertions(+), 25 deletions(-)
>
[snip]
> diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
> index 367670cfd02b7b..d02d3c8c223eb7 100644
> --- a/mm/mmu_notifier.c
> +++ b/mm/mmu_notifier.c
[snip]
>   * because mm->mm_users > 0 during mmu_notifier_register and exit_mmap
> @@ -52,17 +286,24 @@ struct mmu_notifier_mm {
>   * can't go away from under us as exit_mmap holds an mm_count pin
>   * itself.
>   */
> -void __mmu_notifier_release(struct mm_struct *mm)
> +static void mn_hlist_release(struct mmu_notifier_mm *mmn_mm,
> +                             struct mm_struct *mm)
>  {
>          struct mmu_notifier *mn;
>          int id;
>
> +        if (mmn_mm->has_interval)
> +                mn_itree_release(mmn_mm, mm);
> +
> +        if (hlist_empty(&mmn_mm->list))
> +                return;
This seems to duplicate the conditions in __mmu_notifier_release. See my
comments below; I think one of them is wrong. I suspect it's this one,
because __mmu_notifier_release follows the same pattern as the other
notifiers.
> +
>          /*
>           * SRCU here will block mmu_notifier_unregister until
>           * ->release returns.
>           */
>          id = srcu_read_lock(&srcu);
> -        hlist_for_each_entry_rcu(mn, &mm->mmu_notifier_mm->list, hlist)
> +        hlist_for_each_entry_rcu(mn, &mmn_mm->list, hlist)
>                  /*
>                   * If ->release runs before mmu_notifier_unregister it must be
>                   * handled, as it's the only way for the driver to flush all
> @@ -72,9 +313,9 @@ void __mmu_notifier_release(struct mm_struct *mm)
>                  if (mn->ops->release)
>                          mn->ops->release(mn, mm);
>
> -        spin_lock(&mm->mmu_notifier_mm->lock);
> -        while (unlikely(!hlist_empty(&mm->mmu_notifier_mm->list))) {
> -                mn = hlist_entry(mm->mmu_notifier_mm->list.first,
> +        spin_lock(&mmn_mm->lock);
> +        while (unlikely(!hlist_empty(&mmn_mm->list))) {
> +                mn = hlist_entry(mmn_mm->list.first,
>                                   struct mmu_notifier,
>                                   hlist);
>                  /*
> @@ -85,7 +326,7 @@ void __mmu_notifier_release(struct mm_struct *mm)
>                   */
>                  hlist_del_init_rcu(&mn->hlist);
>          }
> -        spin_unlock(&mm->mmu_notifier_mm->lock);
> +        spin_unlock(&mmn_mm->lock);
>          srcu_read_unlock(&srcu, id);
>
>          /*
> @@ -100,6 +341,17 @@ void __mmu_notifier_release(struct mm_struct *mm)
>          synchronize_srcu(&srcu);
>  }
>
> +void __mmu_notifier_release(struct mm_struct *mm)
> +{
> +        struct mmu_notifier_mm *mmn_mm = mm->mmu_notifier_mm;
> +
> +        if (mmn_mm->has_interval)
> +                mn_itree_release(mmn_mm, mm);
If mmn_mm->list is not empty, mn_itree_release will be called a second
time here, because mn_hlist_release duplicates this call.
> +
> +        if (!hlist_empty(&mmn_mm->list))
> +                mn_hlist_release(mmn_mm, mm);
mn_hlist_release checks the same condition itself.
> +}
> +
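To make the suggestion concrete: what I would expect is for the two
checks to live only here in __mmu_notifier_release, with
mn_hlist_release trimmed down to the SRCU walk. Just a sketch of one way
to resolve the duplication, based on the hunks quoted above:

static void mn_hlist_release(struct mmu_notifier_mm *mmn_mm,
                             struct mm_struct *mm)
{
        struct mmu_notifier *mn;
        int id;

        /*
         * No mn_itree_release() call and no hlist_empty() bail-out
         * here; __mmu_notifier_release() above is the only place that
         * checks both conditions before calling in.
         */
        id = srcu_read_lock(&srcu);
        /* ... SRCU walk and hlist teardown exactly as in the hunk above ... */
        srcu_read_unlock(&srcu, id);
}

Either way, only one of the two places should be doing these checks.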
>  /*
>   * If no young bitflag is supported by the hardware, ->clear_flush_young can
>   * unmap the address and return 1 or 0 depending if the mapping previously
[snip]