[PATCH v7 3/6] mm/gup: Introduce memfd_pin_folios() for pinning memfd folios (v7)

Tue Dec 12 12:13:57 UTC 2023

On 12.12.23 08:38, Vivek Kasireddy wrote:
> For drivers that would like to longterm-pin the folios associated
> with a memfd, the memfd_pin_folios() API provides an option to
> not only pin the folios via FOLL_PIN but also to check and migrate
> them if they reside in movable zone or CMA block. This API
> currently works with memfds but it should work with any files
> that belong to either shmemfs or hugetlbfs. Files belonging to
> other filesystems are rejected for now.
> 
> The folios need to be located first before pinning them via FOLL_PIN.
> If they are found in the page cache, they can be immediately pinned.
> Otherwise, they need to be allocated using the filesystem specific
> APIs and then pinned.
> 
> v2:
> - Drop gup_flags and improve comments and commit message (David)
> - Allocate a page if we cannot find in page cache for the hugetlbfs
>    case as well (David)
> - Don't unpin pages if there is a migration related failure (David)
> - Drop the unnecessary nr_pages <= 0 check (Jason)
> - Have the caller of the API pass in file * instead of fd (Jason)
> 
> v3: (David)
> - Enclose the huge page allocation code with #ifdef CONFIG_HUGETLB_PAGE
>    (Build error reported by kernel test robot <lkp at intel.com>)
> - Don't forget memalloc_pin_restore() on non-migration related errors
> - Improve the readability of the cleanup code associated with
>    non-migration related errors
> - Augment the comments by describing FOLL_LONGTERM like behavior
> - Include the R-b tag from Jason
> 
> v4:
> - Remove the local variable "page" and instead use 3 return statements
>    in alloc_file_page() (David)
> - Add the R-b tag from David
> 
> v5: (David)
> - For hugetlb case, ensure that we only obtain head pages from the
>    mapping by using __filemap_get_folio() instead of find_get_page_flags()
> - Handle -EEXIST when two or more potential users try to simultaneously
>    add a huge page to the mapping by forcing them to retry on failure
> 
> v6: (Christoph)
> - Rename this API to memfd_pin_user_pages() to make it clear that it
>    is intended for memfds
> - Move the memfd page allocation helper from gup.c to memfd.c
> - Fix indentation errors in memfd_pin_user_pages()
> - For contiguous ranges of folios, use a helper such as
>    filemap_get_folios_contig() to lookup the page cache in batches
> 
> v7:
> - Rename this API to memfd_pin_folios() and make it return folios
>    and offsets instead of pages (David)
> - Don't continue processing the folios in the batch returned by
>    filemap_get_folios_contig() if they do not have correct next_idx
> - Add the R-b tag from Christoph
> 

Sorry, I'm still not happy about the current state, because (1) the
folio vs. pages handling is still mixed (2) we're returning+pinning a
large folio multiple times.

See below if there is an easy way to clean this up.

>> @@ -5,6 +5,7 @@
>   #include <linux/spinlock.h>
>   
>   #include <linux/mm.h>
> +#include <linux/memfd.h>
>   #include <linux/memremap.h>
>   #include <linux/pagemap.h>
>   #include <linux/rmap.h>
> @@ -17,6 +18,7 @@
>   #include <linux/hugetlb.h>
>   #include <linux/migrate.h>
>   #include <linux/mm_inline.h>
> +#include <linux/pagevec.h>
>   #include <linux/sched/mm.h>
>   #include <linux/shmem_fs.h>
>   
> @@ -3410,3 +3412,156 @@ long pin_user_pages_unlocked(unsigned long start, unsigned long nr_pages,
>   				     &locked, gup_flags);
>   }
>   EXPORT_SYMBOL(pin_user_pages_unlocked);
> +
> +/**
> + * memfd_pin_folios() - pin folios associated with a memfd
> + * @memfd:      the memfd whose folios are to be pinned
> + * @start:      starting memfd offset
> + * @nr_pages:   number of pages from start to pin

We're not pinning pages. An inclusive range [start, end] would be clearer.

> + * @folios:     array that receives pointers to the folios pinned.
> + *              Should be at-least nr_pages long.
> + * @offsets:    array that receives offsets of pages in their folios.
> + *              Should be at-least nr_pages long.

See below, I'm wondering if this is really required once we return each folio
only once.

> + *
> + * Attempt to pin folios associated with a memfd; given that a memfd is
> + * either backed by shmem or hugetlb, the folios can either be found in
> + * the page cache or need to be allocated if necessary. Once the folios
> + * are located, they are all pinned via FOLL_PIN and the @offsets array
> + * is populated with offsets of the pages in their respective folios.
> + * Therefore, for each page the caller requested, there will be a
> + * corresponding entry in both @folios and @offsets. And, eventually,
> + * these pinned folios need to be released either using unpin_user_pages()
> + * or unpin_user_page().

Oh, we don't have a folio function yet? Should be easy to add, and we'd really
add them.

> + *
> + * It must be noted that the folios may be pinned for an indefinite amount
> + * of time. And, in most cases, the duration of time they may stay pinned
> + * would be controlled by the userspace. This behavior is effectively the
> + * same as using FOLL_LONGTERM with other GUP APIs.
> + *
> + * Returns number of folios pinned. This would be equal to the number of
> + * pages requested. If no folios were pinned, it returns -errno.
> + */
> +long memfd_pin_folios(struct file *memfd, unsigned long start,
> +		      unsigned long nr_pages, struct folio **folios,
> +		      pgoff_t *offsets)
> +{
> +	unsigned long end = start + (nr_pages << PAGE_SHIFT) - 1;
> +	unsigned int max_pgs, pgoff, pgshift = PAGE_SHIFT;
> +	pgoff_t start_idx, end_idx, next_idx;
> +	unsigned int flags, nr_folios, i, j;
> +	struct folio *folio = NULL;
> +	struct folio_batch fbatch;
> +	struct page **pages;
> +	struct hstate *h;
> +	long ret;
> +
> +	if (!nr_pages)
> +		return -EINVAL;
> +
> +	if (!memfd)
> +		return -EINVAL;
> +
> +	if (!shmem_file(memfd) && !is_file_hugepages(memfd))
> +		return -EINVAL;
> +
> +	pages = kmalloc_array(nr_pages, sizeof(*pages), GFP_KERNEL);
> +	if (!pages)
> +		return -ENOMEM;
> +
> +	if (is_file_hugepages(memfd)) {
> +		h = hstate_file(memfd);
> +		pgshift = huge_page_shift(h);
> +	}
> +
> +	flags = memalloc_pin_save();
> +	do {
> +		i = 0;
> +		start_idx = start >> pgshift;
> +		end_idx = end >> pgshift;
> +		if (is_file_hugepages(memfd)) {
> +			start_idx <<= huge_page_order(h);
> +			end_idx <<= huge_page_order(h);
> +		}
> +
> +		folio_batch_init(&fbatch);
> +		while (start_idx <= end_idx) {
> +			/*
> +			 * In most cases, we should be able to find the folios
> +			 * in the page cache. If we cannot find them for some
> +			 * reason, we try to allocate them and add them to the
> +			 * page cache.
> +			 */
> +			nr_folios = filemap_get_folios_contig(memfd->f_mapping,
> +							      &start_idx,
> +							      end_idx,
> +							      &fbatch);
> +			if (folio) {
> +				folio_put(folio);
> +				folio = NULL;
> +			}
> +
> +			next_idx = 0;
> +			for (j = 0; j < nr_folios; j++) {
> +				if (next_idx &&
> +				    next_idx != folio_index(fbatch.folios[j]))
> +					continue;
> +
> +				folio = try_grab_folio(&fbatch.folios[j]->page,
> +						       1, FOLL_PIN);
> +				if (!folio) {
> +					folio_batch_release(&fbatch);
> +					kfree(pages);
> +					goto err;
> +				}
> +
> +				max_pgs = folio_nr_pages(folio);
> +				if (i == 0) {
> +					pgoff = offset_in_folio(folio, start);
> +					pgoff >>= PAGE_SHIFT;
> +				}
> +
> +				do {
> +					folios[i] = folio;
> +					offsets[i] = pgoff << PAGE_SHIFT;
> +					pages[i] = folio_page(folio, 0);
> +					folio_add_pin(folio);
> +
> +					pgoff++;
> +					i++;
> +				} while (pgoff < max_pgs && i < nr_pages);
> +
> +				pgoff = 0;
> +				next_idx = folio_next_index(folio);
> +				gup_put_folio(folio, 1, FOLL_PIN);
> +			}
> +
> +			folio = NULL;
> +			folio_batch_release(&fbatch);
> +			if (!nr_folios) {
> +				folio = memfd_alloc_folio(memfd, start_idx);
> +				if (IS_ERR(folio)) {
> +					ret = PTR_ERR(folio);
> +					if (ret != -EEXIST) {
> +						kfree(pages);
> +						goto err;
> +					}
> +				}
> +			}
> +		}
> +
> +		ret = check_and_migrate_movable_pages(nr_pages, pages);

Having a folio variant would avoid having to mess with pages here at all.
Further, we're now returning+pinning the same folio multiple times, instead of
just once like the folio batching variant would.

I'm wondering if the following wouldn't make more sense, assuming we add
check_and_migrate_movable_folios(), which should be pretty easy to add.

Obviously untested, just to express what I have in mind:

/**
  * memfd_pin_folios() - pin folios associated with a memfd
  * @memfd:      the memfd whose folios are to be pinned
  * @start:      the starting memfd offset
  * @end:        the final memfd offset (inclusive)
  * @folios:     array that receives pointers to the folios pinned
  * @max_folios: the number of entries in the array for folios
  * @offsets:    the offset into the first folio
  *
  * Attempt to pin folios associated with a memfd; given that a memfd is
  * either backed by shmem or hugetlb, the folios can either be found in
  * the page cache or need to be allocated if necessary. Once the folios
  * are located, they are all pinned via FOLL_PIN and @offset is populated
  * with the offset into the first folio.
  *
  * Pinned folios must be released using unpin_folio() or unpin_folios().
  *
  * It must be noted that the folios may be pinned for an indefinite amount
  * of time. And, in most cases, the duration of time they may stay pinned
  * would be controlled by the userspace. This behavior is effectively the
  * same as using FOLL_LONGTERM with other GUP APIs.
  *
  * Returns number of folios pinned, which might be less than @max_folios
  * only if the whole range was pinned. If no folios were pinned, it returns
  * -errno.
  */
long memfd_pin_folios(struct file *memfd, unsigned long start,
		      unsigned long end, struct folio **folios,
		      unsigned int max_folios, unsigned long *offset)
{
	unsigned int pgshift = PAGE_SHIFT;
	unsigned int flags, nr_folios, cur_folios, i;
	pgoff_t start_idx, end_idx;
	struct folio_batch fbatch;
	struct folio *folio;
	struct hstate *h;
	long ret;

	if (start > end || !max_folios)
		return -EINVAL;

	if (!memfd)
		return -EINVAL;

	if (!shmem_file(memfd) && !is_file_hugepages(memfd))
		return -EINVAL;

	if (is_file_hugepages(memfd)) {
		h = hstate_file(memfd);
		pgshift = huge_page_shift(h);
	}

	flags = memalloc_pin_save();
	folio_batch_init(&fbatch);
	do {
		nr_folios = 0;
		start_idx = start >> pgshift;
		end_idx = end >> pgshift;
		if (is_file_hugepages(memfd)) {
			start_idx <<= huge_page_order(h);
			end_idx <<= huge_page_order(h);
		}

		while (start_idx <= end_idx) {
			/*
			 * In most cases, we should be able to find the folios
			 * in the page cache. If we cannot find them for some
			 * reason, we try to allocate them and add them to the
			 * page cache.
			 */
			folio_batch_release(&fbatch);
			cur_folios = filemap_get_folios_contig(memfd->f_mapping,
							       &start_idx,
							       end_idx,
							       &fbatch);
			if (!cur_folios) {
				folio = memfd_alloc_folio(memfd, start_idx);
				if (IS_ERR(folio)) {
					ret = PTR_ERR(folio);
					if (ret != -EEXIST)
						goto err;
				}
				folio_put(folio);
				continue;
			}

			/* Let's pin each folio, which shouldn't really fail. */
			for (i = 0; i < cur_folios; i++) {
				folio = try_grab_folio(&fbatch.folios[i]->page,
						       1, FOLL_PIN);
				if (!folio)
					goto err;

				if (!nr_folios)
					*offset = offset_in_folio(folio, start);
				folios[nr_folios++] = folio;

				if (max_folios == nr_folios)
					break;
			}
			if (max_folios == nr_folios)
				break;
		}
		folio_batch_release(&fbatch);

		ret = check_and_migrate_movable_folios(nr_folios, folios);
	} while (ret == -EAGAIN);

	memalloc_pin_restore(flags);
	return ret ? ret : nr_folios;
err:
	folio_batch_release(&fbatch);
	memalloc_pin_restore(flags);
	while (i-- > 0)
		if (folios[i])
			gup_put_folio(folios[i], 1, FOLL_PIN);

	return ret;
}
EXPORT_SYMBOL_GPL(memfd_pin_folios);

I'm still wondering about the  offset handling, though. Could it happen that why we are
repeatedly calling filemap_get_folios_contig(), that we would need offset!=0 on any of
the other folios besides the first one? My current understanding (and looking at
filemap_get_folios_contig()) is: no.

I'm primarily concerned about concurrent fallocate(PUNCH_HOLE) and THP collapse/splitting.

-- 
Cheers,

David / dhildenb