[PATCH v8 5/8] mm: Device exclusive memory access

Tue May 18 17:27:42 UTC 2021

Alistair,

While I got one reply below to your previous email, I also looked at the rest
code (majorly restore and fork sides) today and added a few more comments.

On Tue, May 18, 2021 at 11:19:05PM +1000, Alistair Popple wrote:

[...]

> > > diff --git a/mm/memory.c b/mm/memory.c
> > > index 3a5705cfc891..556ff396f2e9 100644
> > > --- a/mm/memory.c
> > > +++ b/mm/memory.c
> > > @@ -700,6 +700,84 @@ struct page *vm_normal_page_pmd(struct vm_area_struct
> > > *vma, unsigned long addr,> 
> > >  }
> > >  #endif
> > > 
> > > +static void restore_exclusive_pte(struct vm_area_struct *vma,
> > > +                               struct page *page, unsigned long address,
> > > +                               pte_t *ptep)
> > > +{
> > > +     pte_t pte;
> > > +     swp_entry_t entry;
> > > +
> > > +     pte = pte_mkold(mk_pte(page, READ_ONCE(vma->vm_page_prot)));
> > > +     if (pte_swp_soft_dirty(*ptep))
> > > +             pte = pte_mksoft_dirty(pte);
> > > +
> > > +     entry = pte_to_swp_entry(*ptep);
> > > +     if (pte_swp_uffd_wp(*ptep))
> > > +             pte = pte_mkuffd_wp(pte);
> > > +     else if (is_writable_device_exclusive_entry(entry))
> > > +             pte = maybe_mkwrite(pte_mkdirty(pte), vma);
> > > +
> > > +     set_pte_at(vma->vm_mm, address, ptep, pte);
> > > +
> > > +     /*
> > > +      * No need to take a page reference as one was already
> > > +      * created when the swap entry was made.
> > > +      */
> > > +     if (PageAnon(page))
> > > +             page_add_anon_rmap(page, vma, address, false);
> > > +     else
> > > +             page_add_file_rmap(page, false);

This seems to be another leftover; maybe WARN_ON_ONCE(!PageAnon(page))?

> > > +
> > > +     if (vma->vm_flags & VM_LOCKED)
> > > +             mlock_vma_page(page);
> > > +
> > > +     /*
> > > +      * No need to invalidate - it was non-present before. However
> > > +      * secondary CPUs may have mappings that need invalidating.
> > > +      */
> > > +     update_mmu_cache(vma, address, ptep);
> > > +}

[...]

> > >  /*
> > >  
> > >   * copy one vm_area from one task to the other. Assumes the page tables
> > >   * already present in the new task to be cleared in the whole range
> > > 
> > > @@ -781,6 +859,12 @@ copy_nonpresent_pte(struct mm_struct *dst_mm, struct
> > > mm_struct *src_mm,> 
> > >                               pte = pte_swp_mkuffd_wp(pte);
> > >                       
> > >                       set_pte_at(src_mm, addr, src_pte, pte);
> > >               
> > >               }
> > > 
> > > +     } else if (is_device_exclusive_entry(entry)) {
> > > +             /* COW mappings should be dealt with by removing the entry
> > > */

Here the comment says "removing the entry" however I think it didn't remove the
pte, instead it keeps it (as there's no "return", so set_pte_at() will be
called below), so I got a bit confused.

> > > +             VM_BUG_ON(is_cow_mapping(vm_flags));

Also here, if PageAnon() is the only case to support so far, doesn't that
easily satisfy is_cow_mapping()? Maybe I missed something..

I also have a pure and high level question regarding a process fork() when
there're device exclusive ptes: would the two processes then own the device
together?  Is this a real usecase?

Indeed it'll be odd for a COW page since for COW page then it means after
parent/child writting to the page it'll clone into two, then it's a mistery on
which one will be the one that "exclusived owned" by the device..

> > > +             page = pfn_swap_entry_to_page(entry);
> > > +             get_page(page);
> > > +             rss[mm_counter(page)]++;
> > > 
> > >       }
> > >       set_pte_at(dst_mm, addr, dst_pte, pte);
> > >       return 0;
> > > 
> > > @@ -947,6 +1031,7 @@ copy_pte_range(struct vm_area_struct *dst_vma, struct
> > > vm_area_struct *src_vma,> 
> > >       int rss[NR_MM_COUNTERS];
> > >       swp_entry_t entry = (swp_entry_t){0};
> > >       struct page *prealloc = NULL;
> > > 
> > > +     struct page *locked_page = NULL;
> > > 
> > >  again:
> > >       progress = 0;
> > > 
> > > @@ -980,13 +1065,36 @@ copy_pte_range(struct vm_area_struct *dst_vma,
> > > struct vm_area_struct *src_vma,> 
> > >                       continue;
> > >               
> > >               }
> > >               if (unlikely(!pte_present(*src_pte))) {
> > > 
> > > -                     entry.val = copy_nonpresent_pte(dst_mm, src_mm,
> > > -                                                     dst_pte, src_pte,
> > > -                                                     src_vma, addr, rss);
> > > -                     if (entry.val)
> > > -                             break;
> > > -                     progress += 8;
> > > -                     continue;
> > > +                     swp_entry_t swp_entry = pte_to_swp_entry(*src_pte);

(Just a side note to all of us: this will be one more place that I'll need to
 look after in my uffd-wp series if this series lands first, as after that
 series we can only call pte_to_swp_entry after a pte_has_swap_entry check, as
 sometimes non-present pte won't contain a swap entry at all)

> > > +
> > > +                     if (unlikely(is_cow_mapping(src_vma->vm_flags) &&
> > > +                         is_device_exclusive_entry(swp_entry))) {
> > > +                             /*
> > > +                              * Normally this would require sending mmu
> > > +                              * notifiers, but copy_page_range() has
> > > already +                              * done that for COW mappings.
> > > +                              */
> > > +                             ret = try_restore_exclusive_pte(src_mm,
> > > src_pte, +                                                            
> > > src_vma, addr, +                                                         
> > >    &locked_page); +                             if (ret == -EBUSY)
> > > +                                     break;

Would it be possible that we put all the handling of device exclusive ptes into
copy_nonpresent_pte()?  As IMHO all device exclusive ptes should still be one
kind of non-present pte.  Splitting the cases really make it (at least to
me...) even harder to read.

Maybe you wanted to avoid the rework of copy_nonpresent_pte() as it currently
returns a entry.val which is indeed not straightforward already.. I wanted to
clean that up but not yet.

An easier option is perhaps failing the fork() directly when trylock_page()
failed when restoring the pte? So the userspace could try again the whole
fork(). However that'll also depend on my previous question on whether this is
a valid scenario after all.  If "maintaining fork correctness" is the only
thing we persue for, maybe still worth to consider?

> > > +                     } else {
> > > +                             entry.val = copy_nonpresent_pte(dst_mm,
> > > src_mm, +                                                            
> > > dst_pte, src_pte, +                                                      
> > >       src_vma, addr, +                                                   
> > >          rss); +                             if (entry.val)
> > > +                                     break;
> > > +                             progress += 8;
> > > +                             continue;
> > > +                     }
> > > +             }
> > > +             /* a non-present pte became present after dropping the ptl
> > > */
> > > +             if (unlikely(locked_page)) {
> > > +                     unlock_page(locked_page);
> > > +                     put_page(locked_page);
> > > +                     locked_page = NULL;
> > > 
> > >               }
> > >               /* copy_present_pte() will clear `*prealloc' if consumed */
> > >               ret = copy_present_pte(dst_vma, src_vma, dst_pte, src_pte,
> > > 

[...]

> > > +static vm_fault_t remove_device_exclusive_entry(struct vm_fault *vmf)
> > > +{
> > > +     struct page *page = vmf->page;
> > > +     struct vm_area_struct *vma = vmf->vma;
> > > +     struct page_vma_mapped_walk pvmw = {
> > > +             .page = page,
> > > +             .vma = vma,
> > > +             .address = vmf->address,
> > > +             .flags = PVMW_SYNC,
> > > +     };
> > > +     vm_fault_t ret = 0;
> > > +     struct mmu_notifier_range range;
> > > +
> > > +     if (!lock_page_or_retry(page, vma->vm_mm, vmf->flags))
> > > +             return VM_FAULT_RETRY;
> > > +     mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, vma,
> > > vma->vm_mm,
> > > +                             vmf->address & PAGE_MASK,
> > > +                             (vmf->address & PAGE_MASK) + PAGE_SIZE);
> > > +     mmu_notifier_invalidate_range_start(&range);
> > 
> > I looked at MMU_NOTIFIER_CLEAR document and it tells me:
> > 
> >  * @MMU_NOTIFY_CLEAR: clear page table entry (many reasons for this like
> >  * madvise() or replacing a page by another one, ...).
> > 
> > Does MMU_NOTIFIER_CLEAR suite for this case?  Normally I think for such a
> > case (existing pte is invalid) we don't need to notify at all.  However
> > from what I read from the whole series, this seems to be a critical point
> > where we'd like to kick the owner/driver to synchronously stop doing atomic
> > operations from the device.  Not sure whether we'd like a new notifier
> > type, or maybe at least comment on why to use CLEAR?
> 
> Right, notifying the owner/driver when it no longer has exclusive access to 
> the page and allowing it to stop atomic operations is the critical point and 
> why it notifies when we ordinarily wouldn't (ie. invalid -> valid).
> 
> I did consider adding a new type, but in the driver implementation it ends up 
> being treated the same as a CLEAR notification anyway so didn't think it was 
> necessary. But I suppose adding a different type would allow other listening 
> notifiers to filter these which might be worthwhile.

Sounds good to me.

[...]

> > > +             /*
> > > +              * Check that our target page is still mapped at the
> > > expected
> > > +              * address.
> > > +              */
> > > +             if (ttp->mm == mm && ttp->address == address &&
> > > +                 pte_write(pteval))
> > > +                     ttp->valid = true;
> > 
> > I think I get the point of doing this (as after GUP the pte could have been
> > changed to point to another page), however it smells a bit odd to me (or
> > it's also possible that I'm not familiar enough with the code base..). 
> > IIUC this is the _only_ reason that we passed in "address" into
> > try_to_protect() too, and further into the ttp_args.
> 
> Yes, this is why "address" is passed up to ttp_args.
> 
> > The odd part is the remote GUP should have walked the page table already, so
> > since the target here is the vaddr to replace, the 1st page table walk
> > should be able to both trylock/lock the page, then modify the pte with
> > pgtable lock held, return the locked page, then walk the rmap again to
> > remove all the rest of the ptes that are mapping to this page.  In that
> > case before we call the rmap_walk() we know this must be the page we want
> > to take care of, and no one will be able to restore the original mm pte
> > either (as we're with the page lock).  Then we don't need this check,
> > neither do we need ttp->address.
> 
> If I am understanding you correctly I think this would be similar to the 
> approach that was taken in v2. However it pretty much ended up being just an 
> open-coded version of gup which is useful anyway to fault the page in.

I see.  For easier reference this is v2 patch 1:

https://lore.kernel.org/lkml/20210219020750.16444-2-apopple@nvidia.com/

Indeed that looks like it, it's just that instead of grabbing the page only in
hmm_exclusive_pmd() we can do the pte modification along the way to seal the
whole thing (address/pte & page).  I saw Christoph and Jason commented in that
patch, but not regarding to this approach.  So is there a reason that you
switched?  Do you think it'll work?

I have no strong opinion either, it's just not crystal clear why we'd need that
ttp->address at all for a rmap walk along with that "valid" field. Meanwhile it
should be slightly less efficient too to go with current approach, especially
when the page array gets huge, I think: since there'll be longer time we do GUP
before doing the rmap walk, so higher possibility that the GUPed pages got
replaced for whatever reason.  Then the call to make_device_exclusive_range()
will fail as a whole just for a single page replacement within the range.

Thanks,

-- 
Peter Xu