[PATCH v6 3/5] mm/gup: Introduce memfd_pin_user_pages() for pinning memfd pages (v6)
David Hildenbrand
david at redhat.com
Thu Dec 7 09:44:14 UTC 2023
On 07.12.23 06:09, Kasireddy, Vivek wrote:
> Hi David,
>
>> On 05.12.23 06:35, Vivek Kasireddy wrote:
>>> For drivers that would like to longterm-pin the pages associated
>>> with a memfd, the pin_user_pages_fd() API provides an option to
>>> not only pin the pages via FOLL_PIN but also to check and migrate
>>> them if they reside in movable zone or CMA block. This API
>>> currently works with memfds but it should work with any files
>>> that belong to either shmemfs or hugetlbfs. Files belonging to
>>> other filesystems are rejected for now.
>>>
>>> The pages need to be located first before pinning them via FOLL_PIN.
>>> If they are found in the page cache, they can be immediately pinned.
>>> Otherwise, they need to be allocated using the filesystem specific
>>> APIs and then pinned.
>>>
>>> v2:
>>> - Drop gup_flags and improve comments and commit message (David)
>>> - Allocate a page if we cannot find in page cache for the hugetlbfs
>>> case as well (David)
>>> - Don't unpin pages if there is a migration related failure (David)
>>> - Drop the unnecessary nr_pages <= 0 check (Jason)
>>> - Have the caller of the API pass in file * instead of fd (Jason)
>>>
>>> v3: (David)
>>> - Enclose the huge page allocation code with #ifdef
>> CONFIG_HUGETLB_PAGE
>>> (Build error reported by kernel test robot <lkp at intel.com>)
>>> - Don't forget memalloc_pin_restore() on non-migration related errors
>>> - Improve the readability of the cleanup code associated with
>>> non-migration related errors
>>> - Augment the comments by describing FOLL_LONGTERM like behavior
>>> - Include the R-b tag from Jason
>>>
>>> v4:
>>> - Remove the local variable "page" and instead use 3 return statements
>>> in alloc_file_page() (David)
>>> - Add the R-b tag from David
>>>
>>> v5: (David)
>>> - For hugetlb case, ensure that we only obtain head pages from the
>>> mapping by using __filemap_get_folio() instead of find_get_page_flags()
>>> - Handle -EEXIST when two or more potential users try to simultaneously
>>> add a huge page to the mapping by forcing them to retry on failure
>>>
>>> v6: (Christoph)
>>> - Rename this API to memfd_pin_user_pages() to make it clear that it
>>> is intended for memfds
>>> - Move the memfd page allocation helper from gup.c to memfd.c
>>> - Fix indentation errors in memfd_pin_user_pages()
>>> - For contiguous ranges of folios, use a helper such as
>>> filemap_get_folios_contig() to lookup the page cache in batches
>>>
>>> Cc: David Hildenbrand <david at redhat.com>
>>> Cc: Christoph Hellwig <hch at infradead.org>
>>> Cc: Daniel Vetter <daniel.vetter at ffwll.ch>
>>> Cc: Mike Kravetz <mike.kravetz at oracle.com>
>>> Cc: Hugh Dickins <hughd at google.com>
>>> Cc: Peter Xu <peterx at redhat.com>
>>> Cc: Gerd Hoffmann <kraxel at redhat.com>
>>> Cc: Dongwon Kim <dongwon.kim at intel.com>
>>> Cc: Junxiao Chang <junxiao.chang at intel.com>
>>> Suggested-by: Jason Gunthorpe <jgg at nvidia.com>
>>> Reviewed-by: Jason Gunthorpe <jgg at nvidia.com> (v2)
>>> Reviewed-by: David Hildenbrand <david at redhat.com> (v3)
>>> Signed-off-by: Vivek Kasireddy <vivek.kasireddy at intel.com>
>>> ---
>>> include/linux/memfd.h | 5 +++
>>> include/linux/mm.h | 2 +
>>> mm/gup.c | 102 ++++++++++++++++++++++++++++++++++++++++++
>>> mm/memfd.c | 34 ++++++++++++++
>>> 4 files changed, 143 insertions(+)
>>>
>>> diff --git a/include/linux/memfd.h b/include/linux/memfd.h
>>> index e7abf6fa4c52..6fc0d1282151 100644
>>> --- a/include/linux/memfd.h
>>> +++ b/include/linux/memfd.h
>>> @@ -6,11 +6,16 @@
>>>
>>> #ifdef CONFIG_MEMFD_CREATE
>>> extern long memfd_fcntl(struct file *file, unsigned int cmd, unsigned int
>> arg);
>>> +extern struct page *memfd_alloc_page(struct file *memfd, pgoff_t idx);
>>> #else
>>> static inline long memfd_fcntl(struct file *f, unsigned int c, unsigned int a)
>>> {
>>> return -EINVAL;
>>> }
>>> +static inline struct page *memfd_alloc_page(struct file *memfd, pgoff_t
>> idx)
>>> +{
>>> + return ERR_PTR(-EINVAL);
>>> +}
>>> #endif
>>>
>>> #endif /* __LINUX_MEMFD_H */
>>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>>> index 418d26608ece..ac69db45509f 100644
>>> --- a/include/linux/mm.h
>>> +++ b/include/linux/mm.h
>>> @@ -2472,6 +2472,8 @@ long get_user_pages_unlocked(unsigned long
>> start, unsigned long nr_pages,
>>> struct page **pages, unsigned int gup_flags);
>>> long pin_user_pages_unlocked(unsigned long start, unsigned long
>> nr_pages,
>>> struct page **pages, unsigned int gup_flags);
>>> +long memfd_pin_user_pages(struct file *file, pgoff_t start,
>>> + unsigned long nr_pages, struct page **pages);
>>>
>>> int get_user_pages_fast(unsigned long start, int nr_pages,
>>> unsigned int gup_flags, struct page **pages);
>>> diff --git a/mm/gup.c b/mm/gup.c
>>> index 231711efa390..eb93d1ec9dc6 100644
>>> --- a/mm/gup.c
>>> +++ b/mm/gup.c
>>> @@ -5,6 +5,7 @@
>>> #include <linux/spinlock.h>
>>>
>>> #include <linux/mm.h>
>>> +#include <linux/memfd.h>
>>> #include <linux/memremap.h>
>>> #include <linux/pagemap.h>
>>> #include <linux/rmap.h>
>>> @@ -17,6 +18,7 @@
>>> #include <linux/hugetlb.h>
>>> #include <linux/migrate.h>
>>> #include <linux/mm_inline.h>
>>> +#include <linux/pagevec.h>
>>> #include <linux/sched/mm.h>
>>> #include <linux/shmem_fs.h>
>>>
>>> @@ -3410,3 +3412,103 @@ long pin_user_pages_unlocked(unsigned long
>> start, unsigned long nr_pages,
>>> &locked, gup_flags);
>>> }
>>> EXPORT_SYMBOL(pin_user_pages_unlocked);
>>> +
>>> +/**
>>> + * memfd_pin_user_pages() - pin user pages associated with a memfd
>>> + * @memfd: the memfd whose pages are to be pinned
>>> + * @start: starting memfd offset
>>> + * @nr_pages: number of pages from start to pin
>>> + * @pages: array that receives pointers to the pages pinned.
>>> + * Should be at-least nr_pages long.
>>> + *
>>> + * Attempt to pin pages associated with a memfd; given that a memfd is
>> either
>>> + * backed by shmem or hugetlb, the pages can either be found in the page
>> cache
>>> + * or need to be allocated if necessary. Once the pages are located, they
>> are
>>> + * all pinned via FOLL_PIN. And, these pinned pages need to be released
>> either
>>> + * using unpin_user_pages() or unpin_user_page().
>>> + *
>>> + * It must be noted that the pages may be pinned for an indefinite amount
>>> + * of time. And, in most cases, the duration of time they may stay pinned
>>> + * would be controlled by the userspace. This behavior is effectively the
>>> + * same as using FOLL_LONGTERM with other GUP APIs.
>>> + *
>>> + * Returns number of pages pinned. This would be equal to the number of
>>> + * pages requested. If no pages were pinned, it returns -errno.
>>> + */
>>> +long memfd_pin_user_pages(struct file *memfd, pgoff_t start,
>>> + unsigned long nr_pages, struct page **pages)
>>> +{
>>> + pgoff_t start_idx, end_idx = start + nr_pages - 1;
>>> + unsigned int flags, nr_folios, i, j;
>>> + struct folio_batch fbatch;
>>> + struct page *page = NULL;
>>> + struct folio *folio;
>>> + long ret;
>>> +
>>> + if (!nr_pages)
>>> + return -EINVAL;
>>> +
>>> + if (!memfd)
>>> + return -EINVAL;
>>> +
>>> + if (!shmem_file(memfd) && !is_file_hugepages(memfd))
>>> + return -EINVAL;
>>> +
>>> + flags = memalloc_pin_save();
>>> + do {
>>> + folio_batch_init(&fbatch);
>>> + start_idx = start;
>>> + i = 0;
>>> +
>>> + while (start_idx <= end_idx) {
>>> + /*
>>> + * In most cases, we should be able to find the page
>>> + * in the page cache. If we cannot find it for some
>>> + * reason, we try to allocate one and add it to the
>>> + * page cache.
>>> + */
>>> + nr_folios = filemap_get_folios_contig(memfd-
>>> f_mapping,
>>> + &start_idx,
>>> + end_idx,
>>> + &fbatch);
>>> + if (page) {
>>> + put_page(page);
>>> + page = NULL;
>>> + }
>>> + for (j = 0; j < nr_folios; j++) {
>>> + folio = fbatch.folios[j];
>>> + ret = try_grab_page(&folio->page, FOLL_PIN);
>>> + if (unlikely(ret)) {
>>> + folio_batch_release(&fbatch);
>>> + goto err;
>>> + }
>>> +
>>> + pages[i++] = &folio->page;
>>> + }
>>
>> I might be wrong, but that interface is still inconsistent. I think your
>> intention is to always return folios (head pages), but why are we
>> returning pages from this interface then?
>>
>> It would be more consistent regarding the other GUP interfaces to return
>> the actual tail pages that fit the given "pgoff_t start". So if you
>> punch in "nr_pages" you expect to get "nr_pages" pages, and not some
>> other number of folios.
>>
>> Otherwise, this interface is highly confusing.
>>
>> If you always want to return folios, then better name it
>> "memfd_pin_user_folios" (or just "memfd_pin_folios") and pass in a range
>> (instead of a nr_pages parameter), and somehow indicate to the caller
>> how many folio were in that range, and if that range was fully covered.
> I think it makes sense to return folios from this interface; and considering my
> use-case, I'd like have this API return an error if it cannot pin (or allocate)
> the exact number of folios the caller requested.
Okay, then better use folios.
Assuming a caller puts in "start = X" and gets some large folio back.
How is the caller supposed to know at which offset to look into that
folio (IOW< which subpage)? For "pages" it was obvious (you get the
actual subpages), but as soon as we return a large folio, some
information is missing for the caller.
How can the caller figure that out?
>
>>
>> Or am I missing something?
> I can make the udmabuf driver use folios instead of pages too but the function
> check_and_migrate_movable_pages() in GUP still takes a list of pages. Do you
> think it is ok to use a local variable to collect all the head pages for this?
I think you can simply pass in the head page, because only whole folios
can be converted. At some point we should convert that one to use folios
as well.
--
Cheers,
David / dhildenb
More information about the dri-devel
mailing list