[RFC v1 1/3] mm/mmu_notifier: Add a new notifier for mapping updates (new pages)

Alistair Popple apopple at nvidia.com
Mon Jul 24 13:36:47 UTC 2023


"Kasireddy, Vivek" <vivek.kasireddy at intel.com> writes:

> Hi Alistair,
>
>> 
>> 
>> "Kasireddy, Vivek" <vivek.kasireddy at intel.com> writes:
>>
>> Yes, although obviously as I think you point out below you wouldn't be
>> able to take any sleeping locks in mmu_notifier_update_mapping().
> Yes, I understand that, but I am not sure how we can prevent any potential
> notifier callback from taking sleeping locks other than adding clear comments.

Oh of course not, but is such a restriction on not taking sleeping locks
acceptable for your implementation of the notifier callback? I notice in
patch 2 update_udmabuf() takes a mutex so I assumed not being able to
sleep in the callback would be an issue.

>> 
>> > In which case I'd need to make a similar change in the shmem path as well.
>> > And, also redo (or eliminate) the locking in udmabuf (patch) which seems a
>> > bit excessive on a second look given our use-case (where reads and writes
>> > do not happen simultaneously due to fence synchronization in the guest
>> > driver).
>> 
>> I'm not at all familiar with the udmabuf use case but that sounds
>> brittle and effectively makes this notifier udmabuf-specific, right?
> Oh, Qemu uses the udmabuf driver to provide Host Graphics components
> (such as Spice, Gstreamer, UI, etc) zero-copy access to Guest created
> buffers. In other words, from a core mm standpoint, udmabuf just
> collects a bunch of pages (associated with buffers) scattered inside
> the memfd (Guest ram backed by shmem or hugetlbfs) and wraps
> them in a dmabuf fd. And, since we provide zero-copy access, we
> use DMA fences to ensure that the components on the Host and
> Guest do not access the buffer simultaneously.

Thanks for the background!

>> contemplated adding a notifier for PTE updates for drivers using
>> hmm_range_fault() as it would save some expensive device faults, and
>> this could be useful for that.
>> 
>> So if we're adding a notifier for PTE updates I think it would be good
>> if it covered all cases and was robust enough to allow mirroring of the
>> correct PTE value (ie. by being called under PTL or via some other
>> synchronisation like hmm_range_fault()).
> Ok; in order to make it clear that the notifier is associated with PTE updates,
> I think it needs to have a more suitable name such as mmu_notifier_update_pte()
> or mmu_notifier_new_pte(). But we already have mmu_notifier_change_pte,
> which IIUC is used mainly for PTE updates triggered by KSM. So, I am inclined
> towards dropping this new notifier and instead adding a new flag to change_pte
> to distinguish between KSM triggered notifications and others. Something along
> the lines of:
> diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
> index 218ddc3b4bc7..6afce2287143 100644
> --- a/include/linux/mmu_notifier.h
> +++ b/include/linux/mmu_notifier.h
> @@ -129,7 +129,8 @@ struct mmu_notifier_ops {
>         void (*change_pte)(struct mmu_notifier *subscription,
>                            struct mm_struct *mm,
>                            unsigned long address,
> -                          pte_t pte);
> +                          pte_t pte,
> +                          bool ksm_update);
> @@ -658,7 +659,7 @@ static inline void mmu_notifier_range_init_owner(
>         unsigned long ___address = __address;                           \
>         pte_t ___pte = __pte;                                           \
>                                                                         \
> -       mmu_notifier_change_pte(___mm, ___address, ___pte);             \
> +       mmu_notifier_change_pte(___mm, ___address, ___pte, true);       \
>
> And replace mmu_notifier_update_mapping(vma->vm_mm, address, pte_pfn(*ptep))
> in the current patch with
> mmu_notifier_change_pte(vma->vm_mm, address, ptep, false);

I wonder if we actually need the flag? IIUC it is already used for more
than just KSM. For example it can be called as part of fault handling by
set_pte_at_notify() in wp_page_copy().

> Would that work for your HMM use-case -- assuming we call change_pte after
> taking PTL?

I suspect being called under the PTL could be an issue. For HMM we use a
combination of sequence numbers and a mutex to synchronise PTEs. To
avoid calling the notifier while holding PTL we might be able to record
the sequence number (subscriptions->invalidate_seq) while holding PTL,
release the PTL and provide that sequence number to the notifier
callback along with the PTE.

Some form of mmu_interval_read_retry() could then be used to detect if
the PTE has changed between dropping the PTL and calling the
update_pte()/change_pte() notifier.

Of course if your design can handle being called while holding the PTL
then the above is probably unnecessarily complex for your use-case.

My primary issue with this patch is that the notifier is called without the
PTL while providing a PTE value. Without some form of synchronisation it
isn't safe to use the result of e.g. pte_page(pte) or pte_write(pte) in
the notifier callback. Based on your comments it seems udmabuf might
have some other synchronisation that makes it safe, but being external
to the notifier calls makes it hard to reason about.

 - Alistair

> Thanks,
> Vivek
>
>> 
>> Thanks.
>> 
>> > Thanks,
>> > Vivek
>> >
>> >>
>> >> > +		return ret;
>> >> >
>> >> >  	ret = 0;
>> >> >
>> >> > @@ -6223,6 +6227,9 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
>> >> >  	 */
>> >> >  	if (need_wait_lock)
>> >> >  		folio_wait_locked(folio);
>> >> > +	if (!ret)
>> >> > +		mmu_notifier_update_mapping(vma->vm_mm, address,
>> >> > +					    pte_pfn(*ptep));
>> >> >  	return ret;
>> >> >  }
>> >> >
>> >> > diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
>> >> > index 50c0dde1354f..6421405334b9 100644
>> >> > --- a/mm/mmu_notifier.c
>> >> > +++ b/mm/mmu_notifier.c
>> >> > @@ -441,6 +441,23 @@ void __mmu_notifier_change_pte(struct mm_struct *mm, unsigned long address,
>> >> >  	srcu_read_unlock(&srcu, id);
>> >> >  }
>> >> >
>> >> > +void __mmu_notifier_update_mapping(struct mm_struct *mm, unsigned long address,
>> >> > +				   unsigned long pfn)
>> >> > +{
>> >> > +	struct mmu_notifier *subscription;
>> >> > +	int id;
>> >> > +
>> >> > +	id = srcu_read_lock(&srcu);
>> >> > +	hlist_for_each_entry_rcu(subscription,
>> >> > +				 &mm->notifier_subscriptions->list, hlist,
>> >> > +				 srcu_read_lock_held(&srcu)) {
>> >> > +		if (subscription->ops->update_mapping)
>> >> > +			subscription->ops->update_mapping(subscription, mm,
>> >> > +							  address, pfn);
>> >> > +	}
>> >> > +	srcu_read_unlock(&srcu, id);
>> >> > +}
>> >> > +
>> >> >  static int mn_itree_invalidate(struct mmu_notifier_subscriptions *subscriptions,
>> >> >  			       const struct mmu_notifier_range *range)
>> >> >  {
>> >> > diff --git a/mm/shmem.c b/mm/shmem.c
>> >> > index 2f2e0e618072..e59eb5fafadb 100644
>> >> > --- a/mm/shmem.c
>> >> > +++ b/mm/shmem.c
>> >> > @@ -77,6 +77,7 @@ static struct vfsmount *shm_mnt;
>> >> >  #include <linux/fcntl.h>
>> >> >  #include <uapi/linux/memfd.h>
>> >> >  #include <linux/rmap.h>
>> >> > +#include <linux/mmu_notifier.h>
>> >> >  #include <linux/uuid.h>
>> >> >
>> >> >  #include <linux/uaccess.h>
>> >> > @@ -2164,8 +2165,12 @@ static vm_fault_t shmem_fault(struct vm_fault *vmf)
>> >> >  				  gfp, vma, vmf, &ret);
>> >> >  	if (err)
>> >> >  		return vmf_error(err);
>> >> > -	if (folio)
>> >> > +	if (folio) {
>> >> >  		vmf->page = folio_file_page(folio, vmf->pgoff);
>> >> > +		if (ret == VM_FAULT_LOCKED)
>> >> > +			mmu_notifier_update_mapping(vma->vm_mm, vmf->address,
>> >> > +						    page_to_pfn(vmf->page));
>> >> > +	}
>> >> >  	return ret;
>> >> >  }


