[PATCH v7 3/6] mm/gup: Introduce memfd_pin_folios() for pinning memfd folios (v7)

Wed Dec 13 15:15:33 UTC 2023

Hi,

>> Sorry, I'm still not happy about the current state, because (1) the
>> folio vs. pages handling is still mixed (2) we're returning+pinning a
>> large folio multiple times.
> I can address (1) in a follow-up series and as far as (2) is concerned, my
> understanding is that we need to increase the folio's refcount as and
> when the folio's tail pages are used. Is this not the case? It appears
> this is what unpin_user_pages() expects as well. Do you see any
> concern with this?

If you'd just pin the folio once, you'd also only have to unpin it once.

But that raises a question: is it a requirement for the user of this
interface to be able to unmap+unpin each individual page?

If you really want to handle each subpage possibly individually, then 
subpage or folio+offset makes more sense, agreed.
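
To make the difference in bookkeeping concrete, here is a rough userspace
sketch. It is not kernel code: struct toy_folio and the toy_*() helpers
are made up for illustration and only model a pin counter. It shows that
pinning once per subpage requires the same number of unpins, while
pinning the folio once requires a single unpin.

```c
#include <assert.h>

/* Toy stand-in for a large folio; NOT the kernel's struct folio. */
struct toy_folio {
	unsigned int nr_pages;	/* number of subpages in the folio */
	unsigned int pincount;	/* outstanding pins */
};

/* Take one pin on the folio. */
static void toy_pin(struct toy_folio *f)
{
	f->pincount++;
}

/* Drop one pin; dropping more pins than were taken is a bug. */
static void toy_unpin(struct toy_folio *f)
{
	assert(f->pincount > 0);
	f->pincount--;
}

/* Per-subpage scheme: one pin per subpage used. Returns pins taken. */
static unsigned int toy_pin_per_subpage(struct toy_folio *f,
					unsigned int nr_used)
{
	unsigned int i;

	assert(nr_used <= f->nr_pages);
	for (i = 0; i < nr_used; i++)
		toy_pin(f);
	return nr_used;
}

/* Per-folio scheme: a single pin, however many subpages are used. */
static unsigned int toy_pin_once(struct toy_folio *f)
{
	toy_pin(f);
	return 1;
}
```

With the per-subpage scheme the caller has to remember how many subpages
it used in order to drop exactly that many pins; with the per-folio
scheme the count is always one.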

> 
>>
>> See below if there is an easy way to clean this up.
>>
>>>> @@ -5,6 +5,7 @@
>>>    #include <linux/spinlock.h>
>>>
>>>    #include <linux/mm.h>
>>> +#include <linux/memfd.h>
>>>    #include <linux/memremap.h>
>>>    #include <linux/pagemap.h>
>>>    #include <linux/rmap.h>
>>> @@ -17,6 +18,7 @@
>>>    #include <linux/hugetlb.h>
>>>    #include <linux/migrate.h>
>>>    #include <linux/mm_inline.h>
>>> +#include <linux/pagevec.h>
>>>    #include <linux/sched/mm.h>
>>>    #include <linux/shmem_fs.h>
>>>
>>> @@ -3410,3 +3412,156 @@ long pin_user_pages_unlocked(unsigned long start, unsigned long nr_pages,
>>>    				     &locked, gup_flags);
>>>    }
>>>    EXPORT_SYMBOL(pin_user_pages_unlocked);
>>> +
>>> +/**
>>> + * memfd_pin_folios() - pin folios associated with a memfd
>>> + * @memfd:      the memfd whose folios are to be pinned
>>> + * @start:      starting memfd offset
>>> + * @nr_pages:   number of pages from start to pin
>>
>> We're not pinning pages. An inclusive range [start, end] would be clearer.
> Ok, I'll make this change in the next version.
> 
>>
>>> + * @folios:     array that receives pointers to the folios pinned.
>>> + *              Should be at-least nr_pages long.
>>> + * @offsets:    array that receives offsets of pages in their folios.
>>> + *              Should be at-least nr_pages long.
>>
>> See below, I'm wondering if this is really required once we return each folio
>> only once.
> The offsets can be calculated by the caller (udmabuf) as well but doing so
> in this interface would prevent special handling in the caller for the hugetlb
> case. Please look at patch 5 in this series (udmabuf: Pin the pages using
> memfd_pin_folios() API (v5)) for more details as to what I mean.
> 

I'll have a look later to be reminded about the target use case :)

>>
>>> + *
>>> + * It must be noted that the folios may be pinned for an indefinite amount
>>> + * of time. And, in most cases, the duration of time they may stay pinned
>>> + * would be controlled by the userspace. This behavior is effectively the
>>> + * same as using FOLL_LONGTERM with other GUP APIs.
>>> + *
>>> + * Returns number of folios pinned. This would be equal to the number of
>>> + * pages requested. If no folios were pinned, it returns -errno.
>>> + */
>>> +long memfd_pin_folios(struct file *memfd, unsigned long start,
>>> +		      unsigned long nr_pages, struct folio **folios,
>>> +		      pgoff_t *offsets)
>>> +{
>>> +	unsigned long end = start + (nr_pages << PAGE_SHIFT) - 1;
>>> +	unsigned int max_pgs, pgoff, pgshift = PAGE_SHIFT;
>>> +	pgoff_t start_idx, end_idx, next_idx;
>>> +	unsigned int flags, nr_folios, i, j;
>>> +	struct folio *folio = NULL;
>>> +	struct folio_batch fbatch;
>>> +	struct page **pages;
>>> +	struct hstate *h;
>>> +	long ret;
>>> +
>>> +	if (!nr_pages)
>>> +		return -EINVAL;
>>> +
>>> +	if (!memfd)
>>> +		return -EINVAL;
>>> +
>>> +	if (!shmem_file(memfd) && !is_file_hugepages(memfd))
>>> +		return -EINVAL;
>>> +
>>> +	pages = kmalloc_array(nr_pages, sizeof(*pages), GFP_KERNEL);
>>> +	if (!pages)
>>> +		return -ENOMEM;
>>> +
>>> +	if (is_file_hugepages(memfd)) {
>>> +		h = hstate_file(memfd);
>>> +		pgshift = huge_page_shift(h);
>>> +	}
>>> +
>>> +	flags = memalloc_pin_save();
>>> +	do {
>>> +		i = 0;
>>> +		start_idx = start >> pgshift;
>>> +		end_idx = end >> pgshift;
>>> +		if (is_file_hugepages(memfd)) {
>>> +			start_idx <<= huge_page_order(h);
>>> +			end_idx <<= huge_page_order(h);
>>> +		}
>>> +
>>> +		folio_batch_init(&fbatch);
>>> +		while (start_idx <= end_idx) {
>>> +			/*
>>> +			 * In most cases, we should be able to find the folios
>>> +			 * in the page cache. If we cannot find them for some
>>> +			 * reason, we try to allocate them and add them to the
>>> +			 * page cache.
>>> +			 */
>>> +			nr_folios = filemap_get_folios_contig(memfd->f_mapping,
>>> +							      &start_idx,
>>> +							      end_idx,
>>> +							      &fbatch);
>>> +			if (folio) {
>>> +				folio_put(folio);
>>> +				folio = NULL;
>>> +			}
>>> +
>>> +			next_idx = 0;
>>> +			for (j = 0; j < nr_folios; j++) {
>>> +				if (next_idx &&
>>> +				    next_idx != folio_index(fbatch.folios[j]))
>>> +					continue;
>>> +
>>> +				folio = try_grab_folio(&fbatch.folios[j]->page,
>>> +						       1, FOLL_PIN);
>>> +				if (!folio) {
>>> +					folio_batch_release(&fbatch);
>>> +					kfree(pages);
>>> +					goto err;
>>> +				}
>>> +
>>> +				max_pgs = folio_nr_pages(folio);
>>> +				if (i == 0) {
>>> +					pgoff = offset_in_folio(folio, start);
>>> +					pgoff >>= PAGE_SHIFT;
>>> +				}
>>> +
>>> +				do {
>>> +					folios[i] = folio;
>>> +					offsets[i] = pgoff << PAGE_SHIFT;
>>> +					pages[i] = folio_page(folio, 0);
>>> +					folio_add_pin(folio);
>>> +
>>> +					pgoff++;
>>> +					i++;
>>> +				} while (pgoff < max_pgs && i < nr_pages);
>>> +
>>> +				pgoff = 0;
>>> +				next_idx = folio_next_index(folio);
>>> +				gup_put_folio(folio, 1, FOLL_PIN);
>>> +			}
>>> +
>>> +			folio = NULL;
>>> +			folio_batch_release(&fbatch);
>>> +			if (!nr_folios) {
>>> +				folio = memfd_alloc_folio(memfd, start_idx);
>>> +				if (IS_ERR(folio)) {
>>> +					ret = PTR_ERR(folio);
>>> +					if (ret != -EEXIST) {
>>> +						kfree(pages);
>>> +						goto err;
>>> +					}
>>> +				}
>>> +			}
>>> +		}
>>> +
>>> +		ret = check_and_migrate_movable_pages(nr_pages, pages);
>>
>> Having a folio variant would avoid having to mess with pages here at all.
>> Further, we're now returning+pinning the same folio multiple times, instead
>> of just once like the folio batching variant would.
> It should be possible to pin the folio only once but I don't see any problem with
> pinning it multiple times -- once per each subpage used -- as long as it is unpinned
> correctly the same number of times. Is this not ok?

You can, but that partially defeats the benefit of using folios?

Instead of "large folio + offset" you have "folio+offset1, folio+offset2 
..." essentially for each subpage. But again, maybe that really is 
required for the target use case.

It's not necessarily wrong to do that, but staring just at the interface 
it's the opposite of what other folio-handling functions like batching do.
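
If the interface returned each folio only once, a caller that still
wants PAGE_SIZE segments could do the expansion itself. A minimal
userspace sketch of that expansion, assuming 4 KiB pages;
toy_expand_offsets() and TOY_PAGE_SIZE are hypothetical names, not
existing kernel helpers:

```c
#include <assert.h>
#include <stddef.h>

#define TOY_PAGE_SIZE 4096UL	/* assumed base page size */

/*
 * Expand one "large folio + offset" entry into per-page byte offsets.
 * folio_size: size of the folio in bytes
 * first_off:  page-aligned byte offset into the folio where the range begins
 * offs/max:   output array and its capacity
 * Returns the number of PAGE_SIZE segments written to offs[].
 */
static size_t toy_expand_offsets(unsigned long folio_size,
				 unsigned long first_off,
				 unsigned long *offs, size_t max)
{
	unsigned long off;
	size_t n = 0;

	assert(first_off % TOY_PAGE_SIZE == 0 && first_off < folio_size);
	for (off = first_off; off < folio_size && n < max; off += TOY_PAGE_SIZE)
		offs[n++] = off;
	return n;
}
```

For a 4-page folio entered at offset 4096 this produces the three
offsets 4096, 8192 and 12288, i.e. the "folio+offset1, folio+offset2 ..."
form derived from a single "large folio + offset" entry.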

> 
>>
>> I'm wondering if the following wouldn't make more sense, assuming we add
>> check_and_migrate_movable_folios(), which should be pretty easy to add.
>>
>> Obviously untested, just to express what I have in mind:
> Thank you for taking the time to do this!
> 
>>
>>
>>
>> /**
>>    * memfd_pin_folios() - pin folios associated with a memfd
>>    * @memfd:      the memfd whose folios are to be pinned
>>    * @start:      the starting memfd offset
>>    * @end:        the final memfd offset (inclusive)
>>    * @folios:     array that receives pointers to the folios pinned
>>    * @max_folios: the number of entries in the array for folios
>>    * @offset:     the offset into the first folio
> Given that my goal is to do the following in udmabuf driver:
>          ret = sg_alloc_table(sg, ubuf->pagecount, GFP_KERNEL);
>          for_each_sg(sg->sgl, sgl, ubuf->pagecount, i)
>                  sg_set_folio(sgl, ubuf->folios[i], PAGE_SIZE, ubuf->offsets[i]);
> 
>          ret = dma_map_sgtable(dev, sg, direction, 0);
> 
> That is, populate a scatterlist with ubuf->pagecount number of entries,
> where each segment is of size PAGE_SIZE, in order to be consistent and
> support a wide variety of DMA importers that may not be able to handle
> segments that are larger than PAGE_SIZE.
> 
> Therefore, in the hugetlb case, there would be multiple entries pointing to
> the same folio with different offsets. The question really is whether these
> entries associated with @folios and @offsets would need to be populated
> by the caller (udmabuf) or the API (memfd_pin_folios). I have tried both of
> these approaches in the earlier versions and they all work fine but I think
> populating the entries in memfd_pin_folios() seems to be cleaner as the
> caller does not need to do any special handling (hugetlb vs shmem).
> 
>>    *
>>    * Attempt to pin folios associated with a memfd; given that a memfd is
>>    * either backed by shmem or hugetlb, the folios can either be found in
>>    * the page cache or need to be allocated if necessary. Once the folios
>>    * are located, they are all pinned via FOLL_PIN and @offset is populated
>>    * with the offset into the first folio.
>>    *
>>    * Pinned folios must be released using unpin_folio() or unpin_folios().
>>    *
>>    * It must be noted that the folios may be pinned for an indefinite amount
>>    * of time. And, in most cases, the duration of time they may stay pinned
>>    * would be controlled by the userspace. This behavior is effectively the
>>    * same as using FOLL_LONGTERM with other GUP APIs.
>>    *
>>    * Returns number of folios pinned, which might be less than @max_folios
>>    * only if the whole range was pinned. If no folios were pinned, it returns
>>    * -errno.
>>    */
>> long memfd_pin_folios(struct file *memfd, unsigned long start,
>> 		      unsigned long end, struct folio **folios,
>> 		      unsigned int max_folios, unsigned long *offset)
>> {
>> 	unsigned int pgshift = PAGE_SHIFT;
>> 	unsigned int flags, nr_folios, cur_folios, i;
>> 	pgoff_t start_idx, end_idx;
>> 	struct folio_batch fbatch;
>> 	struct folio *folio;
>> 	struct hstate *h;
>> 	long ret;
>>
>> 	if (start > end || !max_folios)
>> 		return -EINVAL;
>>
>> 	if (!memfd)
>> 		return -EINVAL;
>>
>> 	if (!shmem_file(memfd) && !is_file_hugepages(memfd))
>> 		return -EINVAL;
>>
>> 	if (is_file_hugepages(memfd)) {
>> 		h = hstate_file(memfd);
>> 		pgshift = huge_page_shift(h);
>> 	}
>>
>> 	flags = memalloc_pin_save();
>> 	folio_batch_init(&fbatch);
>> 	do {
>> 		nr_folios = 0;
>> 		start_idx = start >> pgshift;
>> 		end_idx = end >> pgshift;
>> 		if (is_file_hugepages(memfd)) {
>> 			start_idx <<= huge_page_order(h);
>> 			end_idx <<= huge_page_order(h);
>> 		}
>>
>> 		while (start_idx <= end_idx) {
>> 			/*
>> 			 * In most cases, we should be able to find the folios
>> 			 * in the page cache. If we cannot find them for some
>> 			 * reason, we try to allocate them and add them to the
>> 			 * page cache.
>> 			 */
>> 			folio_batch_release(&fbatch);
>> 			cur_folios = filemap_get_folios_contig(memfd->f_mapping,
>> 							       &start_idx,
>> 							       end_idx,
>> 							       &fbatch);
>> 			if (!cur_folios) {
>> 				folio = memfd_alloc_folio(memfd, start_idx);
>> 				if (IS_ERR(folio)) {
>> 					ret = PTR_ERR(folio);
>> 					if (ret != -EEXIST)
>> 						goto err;
>> 					/* -EEXIST: no reference to drop, redo the lookup. */
>> 					continue;
>> 				}
>> 				folio_put(folio);
>> 				continue;
>> 			}
>>
>> 			/* Let's pin each folio, which shouldn't really fail. */
>> 			for (i = 0; i < cur_folios; i++) {
>> 				folio = try_grab_folio(&fbatch.folios[i]->page,
>> 						       1, FOLL_PIN);
>> 				if (!folio)
>> 					goto err;
>>
>> 				if (!nr_folios)
>> 					*offset = offset_in_folio(folio, start);
>> 				folios[nr_folios++] = folio;
>>
>> 				if (max_folios == nr_folios)
>> 					break;
>> 			}
>> 			if (max_folios == nr_folios)
>> 				break;
>> 		}
>> 		folio_batch_release(&fbatch);
>>
>> 		ret = check_and_migrate_movable_folios(nr_folios, folios);
>> 	} while (ret == -EAGAIN);
>>
>> 	memalloc_pin_restore(flags);
>> 	return ret ? ret : nr_folios;
>> err:
>> 	folio_batch_release(&fbatch);
>> 	memalloc_pin_restore(flags);
>> 	while (nr_folios-- > 0)
>> 		if (folios[nr_folios])
>> 			gup_put_folio(folios[nr_folios], 1, FOLL_PIN);
>>
>> 	return ret;
>> }
>> EXPORT_SYMBOL_GPL(memfd_pin_folios);
>>
>>
>>
>> I'm still wondering about the offset handling, though. Could it happen that,
>> while we are repeatedly calling filemap_get_folios_contig(), we would need
>> offset != 0 on any of the other folios besides the first one? My current
>> understanding (and looking at filemap_get_folios_contig()) is: no.

> I am not entirely sure but while testing this series with Qemu master + kernel
> snapshot of drm-tip which is 6.7 RC1, I noticed strange behavior of
> filemap_get_folios_contig() and the batches it returns particularly for the
> hugetlb folios. Assuming we have order-9 folios in the memfd (my test-case),
> and if the range [start, end] cuts across more than one folio: let's say start is
> at subpage 490 (in folio-0) and end is at subpage 520 (in folio-1), then start_idx
> would be 0 and end_idx would be 512. In this case, I would have expected

That is weird. Shouldn't you get start_idx = 0 and end_idx = 1 with
hugetlb, where the idx differs? Maybe that's the problem.
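
To spell out the arithmetic for the example above (start at subpage 490,
end at subpage 520), here is a small userspace sketch. It assumes 4 KiB
base pages and order-9 folios; the TOY_* constants and toy_*() helpers
are illustrative names only and mirror the shifts in the quoted patch:

```c
#include <assert.h>

#define TOY_PAGE_SHIFT	12u	/* assumed 4 KiB base pages */
#define TOY_HPAGE_ORDER	9u	/* assumed order-9 (2 MiB) folios */

/* Index in units of huge pages: byte offset >> huge page shift. */
static unsigned long toy_hpage_idx(unsigned long byte_off)
{
	return byte_off >> (TOY_PAGE_SHIFT + TOY_HPAGE_ORDER);
}

/* The same index converted back to base-page units, as the patch does. */
static unsigned long toy_base_idx(unsigned long byte_off)
{
	return toy_hpage_idx(byte_off) << TOY_HPAGE_ORDER;
}
```

In huge page units the range is [0, 1]; after shifting by the huge page
order it becomes [0, 512], which matches the values reported above.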

> filemap_get_folios_contig() to return two entries in the batch that included
> folio-0 and folio-1. However, it returned a batch with 15 entries (max batch size)
> with all the entries pointing to folio-0. This is why I added the check:
> 	if (next_idx &&
> 	    next_idx != folio_index(fbatch.folios[j]))
> 		continue;
> 
> Anyway, based on the code you wrote, I have realized that we both have a
> different view on how many entries need to be there in the @folios array
> for a given range [start, end] in the hugetlb case.

Oh, yes, ideally the interface should behave the same for hugetlb and shmem.

> 
> I have assumed that it is highly desirable to have a segment length of
> PAGE_SIZE for consistency and interoperability reasons but I guess it might
> be ok to do:
> sg_set_folio(sgl, ubuf->folios[i], nr_tails * PAGE_SIZE, ubuf->offsets[i]);
> 
> I'll run some experiments to see if this would work in most cases or not.
> 
>>
>> I'm primarily concerned about concurrent fallocate(PUNCH_HOLE) and THP
>> collapse/splitting.
> Could you please elaborate on what the issue would be in this case?

I'm not sure if this can happen, but assume the following (shouldn't 
happen as long as shmem does not support 1m folios):

Assume the file looks like this:

[    1m    ][ 512k ]
^0          ^256    ^384

Assume we call filemap_get_folios_contig() and get back the first folio
and get start_idx=256.

Then, someone does fallocate(PUNCH_HOLE) on the whole range and
re-populates the whole range with a 2m folio.

[          2m          ]
^0          ^256    ^384

If we call filemap_get_folios_contig() with 256, we get another "large
folio with offset".

Of course, we can detect that, and simply fail/retry. Just wondering if
that could happen.
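
Roughly, the detection could look like the following userspace sketch.
struct toy_folio_range and toy_needs_retry() are hypothetical names used
only to model the check; indices are in base-page units as in the
diagrams above:

```c
#include <assert.h>
#include <stdbool.h>

/* Toy description of a folio in the page cache, in base-page units. */
struct toy_folio_range {
	unsigned long index;	/* index of the folio's first page */
	unsigned long nr_pages;	/* number of pages in the folio */
};

/*
 * For every folio after the first one, the lookup index is expected to
 * be the folio's first page. Returns true if the folio instead starts
 * before lookup_idx, i.e. we got a "large folio with offset" and the
 * caller should fail or retry.
 */
static bool toy_needs_retry(const struct toy_folio_range *f,
			    unsigned long lookup_idx)
{
	assert(lookup_idx >= f->index &&
	       lookup_idx < f->index + f->nr_pages);
	return f->index != lookup_idx;
}
```

With the layout above, the original 512k folio at index 256 passes the
check, while the re-populated 2m folio starting at index 0 is flagged
when looked up with 256.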

-- 
Cheers,

David / dhildenb