[PATCH v4 3/5] mm/gup: Introduce pin_user_pages_fd() for pinning shmem/hugetlbfs file pages (v4)
Kasireddy, Vivek
vivek.kasireddy at intel.com
Tue Nov 21 06:54:50 UTC 2023
Hi David,
>
> On 18.11.23 07:32, Vivek Kasireddy wrote:
> > For drivers that would like to longterm-pin the pages associated
> > with a file, the pin_user_pages_fd() API provides an option to
> > not only pin the pages via FOLL_PIN but also to check and migrate
> > them if they reside in movable zone or CMA block. This API
> > currently works with files that belong to either shmem or hugetlbfs.
> > Files belonging to other filesystems are rejected for now.
> >
> > The pages need to be located first before pinning them via FOLL_PIN.
> > If they are found in the page cache, they can be immediately pinned.
> > Otherwise, they need to be allocated using the filesystem specific
> > APIs and then pinned.
> >
> > v2:
> > - Drop gup_flags and improve comments and commit message (David)
> > - Allocate a page if we cannot find in page cache for the hugetlbfs
> > case as well (David)
> > - Don't unpin pages if there is a migration related failure (David)
> > - Drop the unnecessary nr_pages <= 0 check (Jason)
> > - Have the caller of the API pass in file * instead of fd (Jason)
> >
> > v3: (David)
> > - Enclose the huge page allocation code with #ifdef CONFIG_HUGETLB_PAGE
> >   (Build error reported by kernel test robot <lkp at intel.com>)
> > - Don't forget memalloc_pin_restore() on non-migration related errors
> > - Improve the readability of the cleanup code associated with
> > non-migration related errors
> > - Augment the comments by describing FOLL_LONGTERM like behavior
> > - Include the R-b tag from Jason
> >
> > v4:
> > - Remove the local variable "page" and instead use 3 return statements
> > in alloc_file_page() (David)
> > - Add the R-b tag from David
> >
> > Cc: David Hildenbrand <david at redhat.com>
> > Cc: Daniel Vetter <daniel.vetter at ffwll.ch>
> > Cc: Mike Kravetz <mike.kravetz at oracle.com>
> > Cc: Hugh Dickins <hughd at google.com>
> > Cc: Peter Xu <peterx at redhat.com>
> > Cc: Gerd Hoffmann <kraxel at redhat.com>
> > Cc: Dongwon Kim <dongwon.kim at intel.com>
> > Cc: Junxiao Chang <junxiao.chang at intel.com>
> > Suggested-by: Jason Gunthorpe <jgg at nvidia.com>
> > Reviewed-by: Jason Gunthorpe <jgg at nvidia.com> (v2)
> > Reviewed-by: David Hildenbrand <david at redhat.com> (v3)
> > Signed-off-by: Vivek Kasireddy <vivek.kasireddy at intel.com>
> > ---
>
>
> [...]
>
>
> > +static struct page *alloc_file_page(struct file *file, pgoff_t idx)
> > +{
> > +#ifdef CONFIG_HUGETLB_PAGE
> > + struct folio *folio;
> > + int err;
> > +
> > + if (is_file_hugepages(file)) {
> > + folio = alloc_hugetlb_folio_nodemask(hstate_file(file),
> > + NUMA_NO_NODE,
> > + NULL,
> > + GFP_USER);
> > + if (folio && folio_try_get(folio)) {
> > + err = hugetlb_add_to_page_cache(folio,
> > + file->f_mapping,
> > + idx);
> > + if (err) {
> > + folio_put(folio);
> > + free_huge_folio(folio);
> > + return ERR_PTR(err);
> > + }
> > + return &folio->page;
>
> While looking at the user of pin_user_pages_fd(), I realized something:
>
> Assume idx is not aligned to the hugetlb page size.
> find_get_page_flags() would always return a tail page in that case, but
> you'd be returning the head page here.
>
> See pagecache_get_page()->folio_file_page(folio, index);
Thank you for catching this. Looking at how udmabuf uses this API for the hugetlb case:
	hpstate = hstate_file(memfd);
	mapidx = list[i].offset >> huge_page_shift(hpstate);
	do {
		nr_pages = shmem_file(memfd) ? pgcnt : 1;
		ret = pin_user_pages_fd(memfd, mapidx, nr_pages,
					ubuf->pages + pgbuf);
Since the raw file offset is translated into huge-page-size units, represented by
mapidx, I was expecting find_get_page_flags() to return a head page. I did not
realize that find_get_page_flags() now returns tail pages; IIRC, it had returned
head pages in the previous kernel versions I tested. As my goal is to grab only
the head pages, __filemap_get_folio() seems like the right API to use instead of
find_get_page_flags(). With this change, the hugetlb subtest (which I had not
tested with kernels >= 6.7) that fails with 6.7-rc1 now works as expected.
>
> > + }
> > + return ERR_PTR(-ENOMEM);
> > + }
> > +#endif
> > + return shmem_read_mapping_page(file->f_mapping, idx);
> > +}
> > +
> > +/**
> > + * pin_user_pages_fd() - pin user pages associated with a file
> > + * @file: the file whose pages are to be pinned
> > + * @start: starting file offset
> > + * @nr_pages: number of pages from start to pin
> > + * @pages: array that receives pointers to the pages pinned.
> > + * Should be at least nr_pages long.
> > + *
> > + * Attempt to pin pages associated with a file that belongs to either shmem
> > + * or hugetlb. The pages are either found in the page cache or allocated if
> > + * necessary. Once the pages are located, they are all pinned via FOLL_PIN.
> > + * And, these pinned pages need to be released either using unpin_user_pages()
> > + * or unpin_user_page().
> > + *
> > + * It must be noted that the pages may be pinned for an indefinite amount
> > + * of time. And, in most cases, the duration of time they may stay pinned
> > + * would be controlled by the userspace. This behavior is effectively the
> > + * same as using FOLL_LONGTERM with other GUP APIs.
> > + *
> > + * Returns number of pages pinned. This would be equal to the number of
> > + * pages requested. If no pages were pinned, it returns -errno.
> > + */
> > +long pin_user_pages_fd(struct file *file, pgoff_t start,
> > + unsigned long nr_pages, struct page **pages)
> > +{
> > + struct page *page;
> > + unsigned int flags, i;
> > + long ret;
> > +
> > + if (start < 0)
> > + return -EINVAL;
> > +
> > + if (!file)
> > + return -EINVAL;
> > +
> > + if (!shmem_file(file) && !is_file_hugepages(file))
> > + return -EINVAL;
> > +
> > + flags = memalloc_pin_save();
> > + do {
> > + for (i = 0; i < nr_pages; i++) {
> > + /*
> > + * In most cases, we should be able to find the page
> > + * in the page cache. If we cannot find it, we try to
> > + * allocate one and add it to the page cache.
> > + */
> > + page = find_get_page_flags(file->f_mapping,
> > + start + i,
> > + FGP_ACCESSED);
> > + if (!page) {
> > + page = alloc_file_page(file, start + i);
> > + if (IS_ERR(page)) {
> > + ret = PTR_ERR(page);
> > + goto err;
>
> While looking at above, I do wonder: what if two parties tried to alloc
> the page at the same time? I suspect we'd want to handle -EEXIST a bit
> nicer here, right?
At least with the udmabuf use-case, there cannot be multiple entities allocating
and adding a page at a given offset in the memfd at the same time. However, I
can make the following changes to protect against this potential failure mode:
do {
for (i = 0; i < nr_pages; i++) {
+retry:
+ page = NULL;
/*
* In most cases, we should be able to find the page
* in the page cache. If we cannot find it, we try to
* allocate one and add it to the page cache.
*/
- page = find_get_page_flags(file->f_mapping,
- start + i,
- FGP_ACCESSED);
+ folio = __filemap_get_folio(file->f_mapping,
+ start + i,
+ FGP_ACCESSED, 0);
+ if (!IS_ERR(folio))
+ page = &folio->page;
+
if (!page) {
page = alloc_file_page(file, start + i);
if (IS_ERR(page)) {
ret = PTR_ERR(page);
+ if (ret == -EEXIST)
+ goto retry;
goto err;
}
Thanks,
Vivek
>
>
> --
> Cheers,
>
> David / dhildenb