[PATCH v1 05/11] mm: convert VM_PFNMAP tracking to pfnmap_track() + pfnmap_untrack()
David Hildenbrand
david at redhat.com
Mon Apr 28 16:16:21 UTC 2025
On 28.04.25 18:08, Peter Xu wrote:
> On Fri, Apr 25, 2025 at 10:36:55PM +0200, David Hildenbrand wrote:
>> On 25.04.25 22:23, Peter Xu wrote:
>>> On Fri, Apr 25, 2025 at 10:17:09AM +0200, David Hildenbrand wrote:
>>>> Let's use our new interface. In remap_pfn_range(), we'll now decide
>>>> whether we have to track (the full VMA is covered) or only sanitize the
>>>> pgprot (only part of the VMA is covered).
>>>>
>>>> Remember what we have to untrack by linking it from the VMA. When
>>>> duplicating VMAs (e.g., splitting, mremap, fork), we'll handle it
>>>> similarly to anon VMA names and use a kref to share the tracking.
>>>>
>>>> Once the last VMA drops its reference to our tracking data, we'll do the
>>>> untracking, which simplifies things a lot and should sort out the various
>>>> issues we saw recently, for example, when partially unmapping/zapping a
>>>> tracked VMA.
>>>>
>>>> This change implies that we'll keep tracking the original PFN range even
>>>> after splitting + partially unmapping it: not too bad, because it was
>>>> not working reliably before. The only thing that kind of worked before
>>>> was shrinking such a mapping using mremap(): we managed to adjust the
>>>> reservation in a hacky way; now we won't adjust the reservation but
>>>> leave it around until all involved VMAs are gone.
>>>>
>>>> Signed-off-by: David Hildenbrand <david at redhat.com>
>>>> ---
>>>> include/linux/mm_inline.h | 2 +
>>>> include/linux/mm_types.h | 11 ++++++
>>>> kernel/fork.c | 54 ++++++++++++++++++++++++--
>>>> mm/memory.c | 81 +++++++++++++++++++++++++++++++--------
>>>> mm/mremap.c | 4 --
>>>> 5 files changed, 128 insertions(+), 24 deletions(-)
>>>>
>>>> diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h
>>>> index f9157a0c42a5c..89b518ff097e6 100644
>>>> --- a/include/linux/mm_inline.h
>>>> +++ b/include/linux/mm_inline.h
>>>> @@ -447,6 +447,8 @@ static inline bool anon_vma_name_eq(struct anon_vma_name *anon_name1,
>>>> #endif /* CONFIG_ANON_VMA_NAME */
>>>> +void pfnmap_track_ctx_release(struct kref *ref);
>>>> +
>>>> static inline void init_tlb_flush_pending(struct mm_struct *mm)
>>>> {
>>>> atomic_set(&mm->tlb_flush_pending, 0);
>>>> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
>>>> index 56d07edd01f91..91124761cfda8 100644
>>>> --- a/include/linux/mm_types.h
>>>> +++ b/include/linux/mm_types.h
>>>> @@ -764,6 +764,14 @@ struct vma_numab_state {
>>>> int prev_scan_seq;
>>>> };
>>>> +#ifdef __HAVE_PFNMAP_TRACKING
>>>> +struct pfnmap_track_ctx {
>>>> + struct kref kref;
>>>> + unsigned long pfn;
>>>> + unsigned long size;
>>>> +};
>>>> +#endif
>>>> +
>>>> /*
>>>> * This struct describes a virtual memory area. There is one of these
>>>> * per VM-area/task. A VM area is any part of the process virtual memory
>>>> @@ -877,6 +885,9 @@ struct vm_area_struct {
>>>> struct anon_vma_name *anon_name;
>>>> #endif
>>>> struct vm_userfaultfd_ctx vm_userfaultfd_ctx;
>>>> +#ifdef __HAVE_PFNMAP_TRACKING
>>>> + struct pfnmap_track_ctx *pfnmap_track_ctx;
>>>> +#endif
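
To make the lifecycle described in the commit message above more concrete, here is a minimal sketch of what the kref-based sharing could look like. Only pfnmap_track_ctx_release() and the struct fields are taken from the quoted hunks; the exact pfnmap_untrack() signature and the dup helper below are illustrative assumptions, not the actual patch hunks.

#include <linux/kref.h>
#include <linux/mm_types.h>
#include <linux/slab.h>

/*
 * Sketch only: pfnmap_untrack() is the interface introduced earlier in
 * this series; its exact signature is assumed here.
 */
void pfnmap_track_ctx_release(struct kref *ref)
{
	struct pfnmap_track_ctx *ctx = container_of(ref,
					struct pfnmap_track_ctx, kref);

	/* The last VMA referencing this range is gone: undo the tracking. */
	pfnmap_untrack(ctx->pfn, ctx->size);
	kfree(ctx);
}

/*
 * Hypothetical helper: on VMA duplication (split, mremap, fork), the new
 * VMA shares the existing tracking context instead of re-tracking.
 */
static void vma_pfnmap_track_ctx_dup(struct vm_area_struct *orig,
				     struct vm_area_struct *new)
{
	if (orig->pfnmap_track_ctx) {
		kref_get(&orig->pfnmap_track_ctx->kref);
		new->pfnmap_track_ctx = orig->pfnmap_track_ctx;
	}
}
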
>>>
>>> So this was originally the small concern (or is it small?) that this will
>>> grow every vma on x86, am I right?
>>
>> Yeah, and last time I looked into this, it would have grown the VMA such
>> that it would require a bigger slab. Right now:
>
> Probably due to the config you have. E.g., when I look at mine, it's
> much bigger, already consuming 256B, but that's because I enabled more
> things (userfaultfd, lockdep, etc.).
Note that I enabled everything that you would expect on a production
system (incl. userfaultfd, mempolicy, per-VMA locks), so I didn't
enable lockdep.
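
In case anyone wants to double-check their own config, a trivial (purely illustrative) debug module that prints the current size is enough to see whether the extra pointer pushes struct vm_area_struct across a slab-size boundary:

#include <linux/module.h>
#include <linux/mm_types.h>

/* Hypothetical debug helper: print the VMA size for the running config. */
static int __init vma_size_init(void)
{
	pr_info("sizeof(struct vm_area_struct) = %zu\n",
		sizeof(struct vm_area_struct));
	return 0;
}
module_init(vma_size_init);
MODULE_LICENSE("GPL");
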
Thanks for verifying!
--
Cheers,
David / dhildenb