[PATCH v7 3/6] mm/gup: Introduce memfd_pin_folios() for pinning memfd folios (v7)
David Hildenbrand
david at redhat.com
Tue Dec 12 12:13:57 UTC 2023
On 12.12.23 08:38, Vivek Kasireddy wrote:
> For drivers that would like to longterm-pin the folios associated
> with a memfd, the memfd_pin_folios() API provides an option to
> not only pin the folios via FOLL_PIN but also to check and migrate
> them if they reside in movable zone or CMA block. This API
> currently works with memfds but it should work with any files
> that belong to either shmemfs or hugetlbfs. Files belonging to
> other filesystems are rejected for now.
>
> The folios need to be located first before pinning them via FOLL_PIN.
> If they are found in the page cache, they can be immediately pinned.
> Otherwise, they need to be allocated using the filesystem specific
> APIs and then pinned.
>
> v2:
> - Drop gup_flags and improve comments and commit message (David)
> - Allocate a page if we cannot find in page cache for the hugetlbfs
> case as well (David)
> - Don't unpin pages if there is a migration related failure (David)
> - Drop the unnecessary nr_pages <= 0 check (Jason)
> - Have the caller of the API pass in file * instead of fd (Jason)
>
> v3: (David)
> - Enclose the huge page allocation code with #ifdef CONFIG_HUGETLB_PAGE
> (Build error reported by kernel test robot <lkp at intel.com>)
> - Don't forget memalloc_pin_restore() on non-migration related errors
> - Improve the readability of the cleanup code associated with
> non-migration related errors
> - Augment the comments by describing FOLL_LONGTERM like behavior
> - Include the R-b tag from Jason
>
> v4:
> - Remove the local variable "page" and instead use 3 return statements
> in alloc_file_page() (David)
> - Add the R-b tag from David
>
> v5: (David)
> - For hugetlb case, ensure that we only obtain head pages from the
> mapping by using __filemap_get_folio() instead of find_get_page_flags()
> - Handle -EEXIST when two or more potential users try to simultaneously
> add a huge page to the mapping by forcing them to retry on failure
>
> v6: (Christoph)
> - Rename this API to memfd_pin_user_pages() to make it clear that it
> is intended for memfds
> - Move the memfd page allocation helper from gup.c to memfd.c
> - Fix indentation errors in memfd_pin_user_pages()
> - For contiguous ranges of folios, use a helper such as
> filemap_get_folios_contig() to lookup the page cache in batches
>
> v7:
> - Rename this API to memfd_pin_folios() and make it return folios
> and offsets instead of pages (David)
> - Don't continue processing the folios in the batch returned by
> filemap_get_folios_contig() if they do not have correct next_idx
> - Add the R-b tag from Christoph
>
Sorry, I'm still not happy about the current state, because (1) the
folio vs. pages handling is still mixed, and (2) we're returning+pinning
a large folio multiple times.

See below for what might be an easier way to clean this up.
> @@ -5,6 +5,7 @@
> #include <linux/spinlock.h>
>
> #include <linux/mm.h>
> +#include <linux/memfd.h>
> #include <linux/memremap.h>
> #include <linux/pagemap.h>
> #include <linux/rmap.h>
> @@ -17,6 +18,7 @@
> #include <linux/hugetlb.h>
> #include <linux/migrate.h>
> #include <linux/mm_inline.h>
> +#include <linux/pagevec.h>
> #include <linux/sched/mm.h>
> #include <linux/shmem_fs.h>
>
> @@ -3410,3 +3412,156 @@ long pin_user_pages_unlocked(unsigned long start, unsigned long nr_pages,
> 				     &locked, gup_flags);
> }
> EXPORT_SYMBOL(pin_user_pages_unlocked);
> +
> +/**
> + * memfd_pin_folios() - pin folios associated with a memfd
> + * @memfd: the memfd whose folios are to be pinned
> + * @start: starting memfd offset
> + * @nr_pages: number of pages from start to pin
We're not pinning pages. An inclusive range [start, end] would be clearer.
> + * @folios: array that receives pointers to the folios pinned.
> + * Should be at-least nr_pages long.
> + * @offsets: array that receives offsets of pages in their folios.
> + * Should be at-least nr_pages long.
See below; I'm wondering if this is really required once we return each folio
only once.
> + *
> + * Attempt to pin folios associated with a memfd; given that a memfd is
> + * either backed by shmem or hugetlb, the folios can either be found in
> + * the page cache or need to be allocated if necessary. Once the folios
> + * are located, they are all pinned via FOLL_PIN and the @offsets array
> + * is populated with offsets of the pages in their respective folios.
> + * Therefore, for each page the caller requested, there will be a
> + * corresponding entry in both @folios and @offsets. And, eventually,
> + * these pinned folios need to be released either using unpin_user_pages()
> + * or unpin_user_page().
Oh, we don't have folio functions for that yet? They should be easy to add, and
we should really add them.
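Something like the following, as a rough sketch (the unpin_folio()/unpin_folios()
names are just my suggestion, nothing like that exists yet):

/* Sketch: folio counterparts to unpin_user_page()/unpin_user_pages(). */
void unpin_folio(struct folio *folio)
{
	gup_put_folio(folio, 1, FOLL_PIN);
}
EXPORT_SYMBOL_GPL(unpin_folio);

void unpin_folios(struct folio **folios, unsigned long nr_folios)
{
	unsigned long i;

	/* Drop exactly one pin on each folio in the array. */
	for (i = 0; i < nr_folios; i++)
		unpin_folio(folios[i]);
}
EXPORT_SYMBOL_GPL(unpin_folios);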
> + *
> + * It must be noted that the folios may be pinned for an indefinite amount
> + * of time. And, in most cases, the duration of time they may stay pinned
> + * would be controlled by the userspace. This behavior is effectively the
> + * same as using FOLL_LONGTERM with other GUP APIs.
> + *
> + * Returns number of folios pinned. This would be equal to the number of
> + * pages requested. If no folios were pinned, it returns -errno.
> + */
> +long memfd_pin_folios(struct file *memfd, unsigned long start,
> +		      unsigned long nr_pages, struct folio **folios,
> +		      pgoff_t *offsets)
> +{
> +	unsigned long end = start + (nr_pages << PAGE_SHIFT) - 1;
> +	unsigned int max_pgs, pgoff, pgshift = PAGE_SHIFT;
> +	pgoff_t start_idx, end_idx, next_idx;
> +	unsigned int flags, nr_folios, i, j;
> +	struct folio *folio = NULL;
> +	struct folio_batch fbatch;
> +	struct page **pages;
> +	struct hstate *h;
> +	long ret;
> +
> +	if (!nr_pages)
> +		return -EINVAL;
> +
> +	if (!memfd)
> +		return -EINVAL;
> +
> +	if (!shmem_file(memfd) && !is_file_hugepages(memfd))
> +		return -EINVAL;
> +
> +	pages = kmalloc_array(nr_pages, sizeof(*pages), GFP_KERNEL);
> +	if (!pages)
> +		return -ENOMEM;
> +
> +	if (is_file_hugepages(memfd)) {
> +		h = hstate_file(memfd);
> +		pgshift = huge_page_shift(h);
> +	}
> +
> +	flags = memalloc_pin_save();
> +	do {
> +		i = 0;
> +		start_idx = start >> pgshift;
> +		end_idx = end >> pgshift;
> +		if (is_file_hugepages(memfd)) {
> +			start_idx <<= huge_page_order(h);
> +			end_idx <<= huge_page_order(h);
> +		}
> +
> +		folio_batch_init(&fbatch);
> +		while (start_idx <= end_idx) {
> +			/*
> +			 * In most cases, we should be able to find the folios
> +			 * in the page cache. If we cannot find them for some
> +			 * reason, we try to allocate them and add them to the
> +			 * page cache.
> +			 */
> +			nr_folios = filemap_get_folios_contig(memfd->f_mapping,
> +							      &start_idx,
> +							      end_idx,
> +							      &fbatch);
> +			if (folio) {
> +				folio_put(folio);
> +				folio = NULL;
> +			}
> +
> +			next_idx = 0;
> +			for (j = 0; j < nr_folios; j++) {
> +				if (next_idx &&
> +				    next_idx != folio_index(fbatch.folios[j]))
> +					continue;
> +
> +				folio = try_grab_folio(&fbatch.folios[j]->page,
> +						       1, FOLL_PIN);
> +				if (!folio) {
> +					folio_batch_release(&fbatch);
> +					kfree(pages);
> +					goto err;
> +				}
> +
> +				max_pgs = folio_nr_pages(folio);
> +				if (i == 0) {
> +					pgoff = offset_in_folio(folio, start);
> +					pgoff >>= PAGE_SHIFT;
> +				}
> +
> +				do {
> +					folios[i] = folio;
> +					offsets[i] = pgoff << PAGE_SHIFT;
> +					pages[i] = folio_page(folio, 0);
> +					folio_add_pin(folio);
> +
> +					pgoff++;
> +					i++;
> +				} while (pgoff < max_pgs && i < nr_pages);
> +
> +				pgoff = 0;
> +				next_idx = folio_next_index(folio);
> +				gup_put_folio(folio, 1, FOLL_PIN);
> +			}
> +
> +			folio = NULL;
> +			folio_batch_release(&fbatch);
> +			if (!nr_folios) {
> +				folio = memfd_alloc_folio(memfd, start_idx);
> +				if (IS_ERR(folio)) {
> +					ret = PTR_ERR(folio);
> +					if (ret != -EEXIST) {
> +						kfree(pages);
> +						goto err;
> +					}
> +				}
> +			}
> +		}
> +
> +		ret = check_and_migrate_movable_pages(nr_pages, pages);
Having a folio variant would avoid having to mess with pages here at all.
Further, we're now returning+pinning the same folio multiple times, instead of
just once like the folio batching variant would.
I'm wondering if the following wouldn't make more sense, assuming we add
check_and_migrate_movable_folios(), which should be pretty easy to add (see
the sketch after the function below).

Obviously untested, just to express what I have in mind:
/**
 * memfd_pin_folios() - pin folios associated with a memfd
 * @memfd: the memfd whose folios are to be pinned
 * @start: the starting memfd offset
 * @end: the final memfd offset (inclusive)
 * @folios: array that receives pointers to the folios pinned
 * @max_folios: the number of entries in the @folios array
 * @offset: receives the offset of @start into the first folio
 *
 * Attempt to pin folios associated with a memfd; given that a memfd is
 * either backed by shmem or hugetlb, the folios can either be found in
 * the page cache or need to be allocated if necessary. Once the folios
 * are located, they are all pinned via FOLL_PIN and @offset is populated
 * with the offset into the first folio.
 *
 * Pinned folios must be released using unpin_folio() or unpin_folios().
 *
 * It must be noted that the folios may be pinned for an indefinite amount
 * of time. And, in most cases, the duration of time they may stay pinned
 * would be controlled by the userspace. This behavior is effectively the
 * same as using FOLL_LONGTERM with other GUP APIs.
 *
 * Returns the number of folios pinned, which might be less than
 * @max_folios only if the whole range was pinned. If no folios were
 * pinned, it returns -errno.
 */
long memfd_pin_folios(struct file *memfd, unsigned long start,
		      unsigned long end, struct folio **folios,
		      unsigned int max_folios, unsigned long *offset)
{
	unsigned int pgshift = PAGE_SHIFT;
	unsigned int flags, nr_folios, cur_folios, i;
	pgoff_t start_idx, end_idx;
	struct folio_batch fbatch;
	struct folio *folio;
	struct hstate *h;
	long ret;

	if (start > end || !max_folios)
		return -EINVAL;

	if (!memfd)
		return -EINVAL;

	if (!shmem_file(memfd) && !is_file_hugepages(memfd))
		return -EINVAL;

	if (is_file_hugepages(memfd)) {
		h = hstate_file(memfd);
		pgshift = huge_page_shift(h);
	}

	flags = memalloc_pin_save();
	folio_batch_init(&fbatch);
	do {
		nr_folios = 0;
		start_idx = start >> pgshift;
		end_idx = end >> pgshift;
		if (is_file_hugepages(memfd)) {
			start_idx <<= huge_page_order(h);
			end_idx <<= huge_page_order(h);
		}

		while (start_idx <= end_idx) {
			/*
			 * In most cases, we should be able to find the folios
			 * in the page cache. If we cannot find them for some
			 * reason, we try to allocate them and add them to the
			 * page cache.
			 */
			folio_batch_release(&fbatch);
			cur_folios = filemap_get_folios_contig(memfd->f_mapping,
							       &start_idx,
							       end_idx,
							       &fbatch);
			if (!cur_folios) {
				folio = memfd_alloc_folio(memfd, start_idx);
				if (IS_ERR(folio)) {
					ret = PTR_ERR(folio);
					if (ret != -EEXIST)
						goto err;
				} else {
					/* Drop our allocation ref; retry the lookup. */
					folio_put(folio);
				}
				continue;
			}

			/* Let's pin each folio, which shouldn't really fail. */
			for (i = 0; i < cur_folios; i++) {
				folio = try_grab_folio(&fbatch.folios[i]->page,
						       1, FOLL_PIN);
				if (!folio) {
					ret = -EINVAL;
					goto err;
				}

				if (!nr_folios)
					*offset = offset_in_folio(folio, start);
				folios[nr_folios++] = folio;

				if (max_folios == nr_folios)
					break;
			}
			if (max_folios == nr_folios)
				break;
		}

		folio_batch_release(&fbatch);
		ret = check_and_migrate_movable_folios(nr_folios, folios);
	} while (ret == -EAGAIN);

	memalloc_pin_restore(flags);
	return ret ? ret : nr_folios;

err:
	folio_batch_release(&fbatch);
	memalloc_pin_restore(flags);
	while (nr_folios-- > 0)
		gup_put_folio(folios[nr_folios], 1, FOLL_PIN);
	return ret;
}
EXPORT_SYMBOL_GPL(memfd_pin_folios);
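And check_and_migrate_movable_folios() could initially just be a dumb
wrapper around the existing pages-based helper until its internals are
converted to folios; again untested, just a sketch:

/*
 * Sketch: wrap check_and_migrate_movable_pages() until its internals
 * operate on folios directly. Each folio appears exactly once in the
 * array, represented by its head page.
 */
static long check_and_migrate_movable_folios(unsigned long nr_folios,
					     struct folio **folios)
{
	struct page **pages;
	unsigned long i;
	long ret;

	pages = kmalloc_array(nr_folios, sizeof(*pages), GFP_KERNEL);
	if (!pages)
		return -ENOMEM;

	for (i = 0; i < nr_folios; i++)
		pages[i] = &folios[i]->page;

	ret = check_and_migrate_movable_pages(nr_folios, pages);
	kfree(pages);
	return ret;
}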
I'm still wondering about the offset handling, though. Could it happen,
while we are repeatedly calling filemap_get_folios_contig(), that we would
need offset != 0 on any folio besides the first one? My current
understanding (from looking at filemap_get_folios_contig()) is: no.
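FWIW, if that holds, a caller would only ever have to apply the single
offset to the first folio; hypothetical usage (use_folio() and the array
size are made up, error handling trimmed):

	unsigned long offset;
	struct folio *folios[16];
	long i, nr;

	/* Pin the folios backing [start, end] of the memfd. */
	nr = memfd_pin_folios(memfd, start, end, folios, 16, &offset);
	if (nr < 0)
		return nr;

	for (i = 0; i < nr; i++)
		use_folio(folios[i], i == 0 ? offset : 0);

	unpin_folios(folios, nr);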
I'm primarily concerned about concurrent fallocate(PUNCH_HOLE) and THP collapse/splitting.
--
Cheers,
David / dhildenb