[PATCH v6 3/5] mm/gup: Introduce memfd_pin_user_pages() for pinning memfd pages (v6)

David Hildenbrand david at redhat.com
Thu Dec 7 09:44:14 UTC 2023


On 07.12.23 06:09, Kasireddy, Vivek wrote:
> Hi David,
> 
>> On 05.12.23 06:35, Vivek Kasireddy wrote:
>>> For drivers that would like to longterm-pin the pages associated
>>> with a memfd, the pin_user_pages_fd() API provides an option to
>>> not only pin the pages via FOLL_PIN but also to check and migrate
>>> them if they reside in movable zone or CMA block. This API
>>> currently works with memfds but it should work with any files
>>> that belong to either shmemfs or hugetlbfs. Files belonging to
>>> other filesystems are rejected for now.
>>>
>>> The pages need to be located first before pinning them via FOLL_PIN.
>>> If they are found in the page cache, they can be immediately pinned.
>>> Otherwise, they need to be allocated using the filesystem specific
>>> APIs and then pinned.
>>>
>>> v2:
>>> - Drop gup_flags and improve comments and commit message (David)
>>> - Allocate a page if we cannot find in page cache for the hugetlbfs
>>>     case as well (David)
>>> - Don't unpin pages if there is a migration related failure (David)
>>> - Drop the unnecessary nr_pages <= 0 check (Jason)
>>> - Have the caller of the API pass in file * instead of fd (Jason)
>>>
>>> v3: (David)
>>> - Enclose the huge page allocation code with #ifdef CONFIG_HUGETLB_PAGE
>>>     (Build error reported by kernel test robot <lkp at intel.com>)
>>> - Don't forget memalloc_pin_restore() on non-migration related errors
>>> - Improve the readability of the cleanup code associated with
>>>     non-migration related errors
>>> - Augment the comments by describing FOLL_LONGTERM like behavior
>>> - Include the R-b tag from Jason
>>>
>>> v4:
>>> - Remove the local variable "page" and instead use 3 return statements
>>>     in alloc_file_page() (David)
>>> - Add the R-b tag from David
>>>
>>> v5: (David)
>>> - For hugetlb case, ensure that we only obtain head pages from the
>>>     mapping by using __filemap_get_folio() instead of find_get_page_flags()
>>> - Handle -EEXIST when two or more potential users try to simultaneously
>>>     add a huge page to the mapping by forcing them to retry on failure
>>>
>>> v6: (Christoph)
>>> - Rename this API to memfd_pin_user_pages() to make it clear that it
>>>     is intended for memfds
>>> - Move the memfd page allocation helper from gup.c to memfd.c
>>> - Fix indentation errors in memfd_pin_user_pages()
>>> - For contiguous ranges of folios, use a helper such as
>>>     filemap_get_folios_contig() to lookup the page cache in batches
>>>
>>> Cc: David Hildenbrand <david at redhat.com>
>>> Cc: Christoph Hellwig <hch at infradead.org>
>>> Cc: Daniel Vetter <daniel.vetter at ffwll.ch>
>>> Cc: Mike Kravetz <mike.kravetz at oracle.com>
>>> Cc: Hugh Dickins <hughd at google.com>
>>> Cc: Peter Xu <peterx at redhat.com>
>>> Cc: Gerd Hoffmann <kraxel at redhat.com>
>>> Cc: Dongwon Kim <dongwon.kim at intel.com>
>>> Cc: Junxiao Chang <junxiao.chang at intel.com>
>>> Suggested-by: Jason Gunthorpe <jgg at nvidia.com>
>>> Reviewed-by: Jason Gunthorpe <jgg at nvidia.com> (v2)
>>> Reviewed-by: David Hildenbrand <david at redhat.com> (v3)
>>> Signed-off-by: Vivek Kasireddy <vivek.kasireddy at intel.com>
>>> ---
>>>    include/linux/memfd.h |   5 +++
>>>    include/linux/mm.h    |   2 +
>>>    mm/gup.c              | 102 ++++++++++++++++++++++++++++++++++++++++++
>>>    mm/memfd.c            |  34 ++++++++++++++
>>>    4 files changed, 143 insertions(+)
>>>
>>> diff --git a/include/linux/memfd.h b/include/linux/memfd.h
>>> index e7abf6fa4c52..6fc0d1282151 100644
>>> --- a/include/linux/memfd.h
>>> +++ b/include/linux/memfd.h
>>> @@ -6,11 +6,16 @@
>>>
>>>    #ifdef CONFIG_MEMFD_CREATE
>>>    extern long memfd_fcntl(struct file *file, unsigned int cmd, unsigned int arg);
>>> +extern struct page *memfd_alloc_page(struct file *memfd, pgoff_t idx);
>>>    #else
>>>    static inline long memfd_fcntl(struct file *f, unsigned int c, unsigned int a)
>>>    {
>>>    	return -EINVAL;
>>>    }
>>> +static inline struct page *memfd_alloc_page(struct file *memfd, pgoff_t idx)
>>> +{
>>> +	return ERR_PTR(-EINVAL);
>>> +}
>>>    #endif
>>>
>>>    #endif /* __LINUX_MEMFD_H */
>>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>>> index 418d26608ece..ac69db45509f 100644
>>> --- a/include/linux/mm.h
>>> +++ b/include/linux/mm.h
>>> @@ -2472,6 +2472,8 @@ long get_user_pages_unlocked(unsigned long start, unsigned long nr_pages,
>>>    		    struct page **pages, unsigned int gup_flags);
>>>    long pin_user_pages_unlocked(unsigned long start, unsigned long nr_pages,
>>>    		    struct page **pages, unsigned int gup_flags);
>>> +long memfd_pin_user_pages(struct file *file, pgoff_t start,
>>> +			  unsigned long nr_pages, struct page **pages);
>>>
>>>    int get_user_pages_fast(unsigned long start, int nr_pages,
>>>    			unsigned int gup_flags, struct page **pages);
>>> diff --git a/mm/gup.c b/mm/gup.c
>>> index 231711efa390..eb93d1ec9dc6 100644
>>> --- a/mm/gup.c
>>> +++ b/mm/gup.c
>>> @@ -5,6 +5,7 @@
>>>    #include <linux/spinlock.h>
>>>
>>>    #include <linux/mm.h>
>>> +#include <linux/memfd.h>
>>>    #include <linux/memremap.h>
>>>    #include <linux/pagemap.h>
>>>    #include <linux/rmap.h>
>>> @@ -17,6 +18,7 @@
>>>    #include <linux/hugetlb.h>
>>>    #include <linux/migrate.h>
>>>    #include <linux/mm_inline.h>
>>> +#include <linux/pagevec.h>
>>>    #include <linux/sched/mm.h>
>>>    #include <linux/shmem_fs.h>
>>>
>>> @@ -3410,3 +3412,103 @@ long pin_user_pages_unlocked(unsigned long start, unsigned long nr_pages,
>>>    				     &locked, gup_flags);
>>>    }
>>>    EXPORT_SYMBOL(pin_user_pages_unlocked);
>>> +
>>> +/**
>>> + * memfd_pin_user_pages() - pin user pages associated with a memfd
>>> + * @memfd:      the memfd whose pages are to be pinned
>>> + * @start:      starting memfd offset
>>> + * @nr_pages:   number of pages from start to pin
>>> + * @pages:      array that receives pointers to the pages pinned.
>>> + *              Should be at least nr_pages long.
>>> + *
>>> + * Attempt to pin pages associated with a memfd; given that a memfd is either
>>> + * backed by shmem or hugetlb, the pages can either be found in the page cache
>>> + * or need to be allocated if necessary. Once the pages are located, they are
>>> + * all pinned via FOLL_PIN. And, these pinned pages need to be released either
>>> + * using unpin_user_pages() or unpin_user_page().
>>> + *
>>> + * It must be noted that the pages may be pinned for an indefinite amount
>>> + * of time. And, in most cases, how long they stay pinned is controlled by
>>> + * userspace. This behavior is effectively the same as using FOLL_LONGTERM
>>> + * with other GUP APIs.
>>> + *
>>> + * Returns number of pages pinned. This would be equal to the number of
>>> + * pages requested. If no pages were pinned, it returns -errno.
>>> + */
>>> +long memfd_pin_user_pages(struct file *memfd, pgoff_t start,
>>> +			  unsigned long nr_pages, struct page **pages)
>>> +{
>>> +	pgoff_t start_idx, end_idx = start + nr_pages - 1;
>>> +	unsigned int flags, nr_folios, i, j;
>>> +	struct folio_batch fbatch;
>>> +	struct page *page = NULL;
>>> +	struct folio *folio;
>>> +	long ret;
>>> +
>>> +	if (!nr_pages)
>>> +		return -EINVAL;
>>> +
>>> +	if (!memfd)
>>> +		return -EINVAL;
>>> +
>>> +	if (!shmem_file(memfd) && !is_file_hugepages(memfd))
>>> +		return -EINVAL;
>>> +
>>> +	flags = memalloc_pin_save();
>>> +	do {
>>> +		folio_batch_init(&fbatch);
>>> +		start_idx = start;
>>> +		i = 0;
>>> +
>>> +		while (start_idx <= end_idx) {
>>> +			/*
>>> +			 * In most cases, we should be able to find the page
>>> +			 * in the page cache. If we cannot find it for some
>>> +			 * reason, we try to allocate one and add it to the
>>> +			 * page cache.
>>> +			 */
>>> +			nr_folios = filemap_get_folios_contig(memfd->f_mapping,
>>> +							      &start_idx,
>>> +							      end_idx,
>>> +							      &fbatch);
>>> +			if (page) {
>>> +				put_page(page);
>>> +				page = NULL;
>>> +			}
>>> +			for (j = 0; j < nr_folios; j++) {
>>> +				folio = fbatch.folios[j];
>>> +				ret = try_grab_page(&folio->page, FOLL_PIN);
>>> +				if (unlikely(ret)) {
>>> +					folio_batch_release(&fbatch);
>>> +					goto err;
>>> +				}
>>> +
>>> +				pages[i++] = &folio->page;
>>> +			}
>>
>> I might be wrong, but that interface is still inconsistent. I think your
>> intention is to always return folios (head pages), but why are we
>> returning pages from this interface then?
>>
>> It would be more consistent with the other GUP interfaces to return
>> the actual tail pages that fit the given "pgoff_t start". So if you
>> punch in "nr_pages" you expect to get "nr_pages" pages, and not some
>> other number of folios.
>>
>> Otherwise, this interface is highly confusing.
>>
>> If you always want to return folios, then better name it
>> "memfd_pin_user_folios" (or just "memfd_pin_folios") and pass in a range
>> (instead of a nr_pages parameter), and somehow indicate to the caller
>> how many folios were in that range, and if that range was fully covered.
> I think it makes sense to return folios from this interface; and considering my
> use case, I'd like to have this API return an error if it cannot pin (or allocate)
> the exact number of folios the caller requested.

Okay, then better use folios.
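
Just to sketch what I had in mind (the name and exact parameters are only
a suggestion, nothing worked out):

/*
 * Rough signature sketch only: pin the folios backing the memfd range
 * [start, end], store up to max_folios of them in @folios, and return
 * how many were pinned so the caller can tell whether the range was
 * fully covered (or -errno on failure).
 */
long memfd_pin_folios(struct file *memfd, pgoff_t start, pgoff_t end,
		      struct folio **folios, unsigned int max_folios);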

Assume a caller puts in "start = X" and gets some large folio back.
How is the caller supposed to know at which offset to look into that
folio (IOW, which subpage)? For "pages" it was obvious (you get the
actual subpages), but as soon as we return a large folio, some
information is missing for the caller.

How can the caller figure that out?
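
To make that concrete, this is roughly what I'd expect every caller to
have to open-code (sketch only, the helper name is made up; it relies on
the folio still being a pagecache folio of that memfd, so that
folio->index is meaningful):

#include <linux/mm.h>		/* folio_page() */
#include <linux/pagemap.h>	/* folio_pgoff() */

/* Hypothetical caller-side helper, not part of this patch. */
static struct page *memfd_folio_subpage(struct folio *folio, pgoff_t pgoff)
{
	/* folio_pgoff() is the memfd offset of the folio's first page */
	return folio_page(folio, pgoff - folio_pgoff(folio));
}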

> 
>>
>> Or am I missing something?
> I can make the udmabuf driver use folios instead of pages too, but the function
> check_and_migrate_movable_pages() in GUP still takes a list of pages. Do you
> think it is ok to use a local variable to collect all the head pages for this?

I think you can simply pass in the head page, because only whole folios
can be migrated. At some point we should convert that function to use
folios as well.
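
Until then, collecting the head pages could look roughly like this
(untested sketch, helper name made up; it would have to live in mm/gup.c
because check_and_migrate_movable_pages() is static there, and it needs
<linux/slab.h> for kmalloc_array()):

static long check_and_migrate_pinned_folios(struct folio **folios,
					    unsigned long nr_folios)
{
	struct page **pages;
	unsigned long i;
	long ret;

	pages = kmalloc_array(nr_folios, sizeof(*pages), GFP_KERNEL);
	if (!pages)
		return -ENOMEM;

	/* Only head pages: whole folios get migrated, never subpages. */
	for (i = 0; i < nr_folios; i++)
		pages[i] = &folios[i]->page;

	ret = check_and_migrate_movable_pages(nr_folios, pages);
	kfree(pages);
	return ret;
}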

-- 
Cheers,

David / dhildenb


