[PATCH v4 3/5] mm/gup: Introduce pin_user_pages_fd() for pinning shmem/hugetlbfs file pages (v4)
Kasireddy, Vivek
vivek.kasireddy at intel.com
Tue Nov 21 06:54:50 UTC 2023
Hi David,
>
> On 18.11.23 07:32, Vivek Kasireddy wrote:
> > For drivers that would like to longterm-pin the pages associated
> > with a file, the pin_user_pages_fd() API provides an option to
> > not only pin the pages via FOLL_PIN but also to check and migrate
> > them if they reside in movable zone or CMA block. This API
> > currently works with files that belong to either shmem or hugetlbfs.
> > Files belonging to other filesystems are rejected for now.
> >
> > The pages need to be located first before pinning them via FOLL_PIN.
> > If they are found in the page cache, they can be immediately pinned.
> > Otherwise, they need to be allocated using the filesystem specific
> > APIs and then pinned.
> >
> > v2:
> > - Drop gup_flags and improve comments and commit message (David)
> > - Allocate a page if we cannot find in page cache for the hugetlbfs
> > case as well (David)
> > - Don't unpin pages if there is a migration related failure (David)
> > - Drop the unnecessary nr_pages <= 0 check (Jason)
> > - Have the caller of the API pass in file * instead of fd (Jason)
> >
> > v3: (David)
> > - Enclose the huge page allocation code with #ifdef CONFIG_HUGETLB_PAGE
> >   (Build error reported by kernel test robot <lkp at intel.com>)
> > - Don't forget memalloc_pin_restore() on non-migration related errors
> > - Improve the readability of the cleanup code associated with
> > non-migration related errors
> > - Augment the comments by describing FOLL_LONGTERM like behavior
> > - Include the R-b tag from Jason
> >
> > v4:
> > - Remove the local variable "page" and instead use 3 return statements
> > in alloc_file_page() (David)
> > - Add the R-b tag from David
> >
> > Cc: David Hildenbrand <david at redhat.com>
> > Cc: Daniel Vetter <daniel.vetter at ffwll.ch>
> > Cc: Mike Kravetz <mike.kravetz at oracle.com>
> > Cc: Hugh Dickins <hughd at google.com>
> > Cc: Peter Xu <peterx at redhat.com>
> > Cc: Gerd Hoffmann <kraxel at redhat.com>
> > Cc: Dongwon Kim <dongwon.kim at intel.com>
> > Cc: Junxiao Chang <junxiao.chang at intel.com>
> > Suggested-by: Jason Gunthorpe <jgg at nvidia.com>
> > Reviewed-by: Jason Gunthorpe <jgg at nvidia.com> (v2)
> > Reviewed-by: David Hildenbrand <david at redhat.com> (v3)
> > Signed-off-by: Vivek Kasireddy <vivek.kasireddy at intel.com>
> > ---
>
>
> [...]
>
>
> > +static struct page *alloc_file_page(struct file *file, pgoff_t idx)
> > +{
> > +#ifdef CONFIG_HUGETLB_PAGE
> > + struct folio *folio;
> > + int err;
> > +
> > + if (is_file_hugepages(file)) {
> > + folio = alloc_hugetlb_folio_nodemask(hstate_file(file),
> > + NUMA_NO_NODE,
> > + NULL,
> > + GFP_USER);
> > + if (folio && folio_try_get(folio)) {
> > + err = hugetlb_add_to_page_cache(folio,
> > + file->f_mapping,
> > + idx);
> > + if (err) {
> > + folio_put(folio);
> > + free_huge_folio(folio);
> > + return ERR_PTR(err);
> > + }
> > + return &folio->page;
>
> While looking at the user of pin_user_pages_fd(), I realized something:
>
> Assume idx is not aligned to the hugetlb page size.
> find_get_page_flags() would always return a tail page in that case, but
> you'd be returning the head page here.
>
> See pagecache_get_page()->folio_file_page(folio, index);
Thank you for catching this. Looking at how udmabuf uses this API for the hugetlb case:
	hpstate = hstate_file(memfd);
	mapidx = list[i].offset >> huge_page_shift(hpstate);
	do {
		nr_pages = shmem_file(memfd) ? pgcnt : 1;
		ret = pin_user_pages_fd(memfd, mapidx, nr_pages,
					ubuf->pages + pgbuf);
Since the raw file offset is translated into huge-page-size units, represented by
mapidx, I was expecting find_get_page_flags() to return a head page. I did not
realize that find_get_page_flags() now returns tail pages; IIRC, it had returned
head pages in the previous kernel versions I tested. As my goal is to grab only
the head pages, __filemap_get_folio() seems like the right API to use instead of
find_get_page_flags(). With this change, the hugetlb subtest (which I had not
tested with kernels >= 6.7) that fails with 6.7-rc1 now works as expected.
>
> > + }
> > + return ERR_PTR(-ENOMEM);
> > + }
> > +#endif
> > + return shmem_read_mapping_page(file->f_mapping, idx);
> > +}
> > +
> > +/**
> > + * pin_user_pages_fd() - pin user pages associated with a file
> > + * @file: the file whose pages are to be pinned
> > + * @start: starting file offset
> > + * @nr_pages: number of pages from start to pin
> > + * @pages: array that receives pointers to the pages pinned.
> > + * Should be at least nr_pages long.
> > + *
> > + * Attempt to pin pages associated with a file that belongs to either shmem
> > + * or hugetlb. The pages are either found in the page cache or allocated if
> > + * necessary. Once the pages are located, they are all pinned via FOLL_PIN.
> > + * And, these pinned pages need to be released either using unpin_user_pages()
> > + * or unpin_user_page().
> > + *
> > + * It must be noted that the pages may be pinned for an indefinite amount
> > + * of time. And, in most cases, the duration of time they may stay pinned
> > + * would be controlled by the userspace. This behavior is effectively the
> > + * same as using FOLL_LONGTERM with other GUP APIs.
> > + *
> > + * Returns number of pages pinned. This would be equal to the number of
> > + * pages requested. If no pages were pinned, it returns -errno.
> > + */
> > +long pin_user_pages_fd(struct file *file, pgoff_t start,
> > + unsigned long nr_pages, struct page **pages)
> > +{
> > + struct page *page;
> > + unsigned int flags, i;
> > + long ret;
> > +
> > + if (start < 0)
> > + return -EINVAL;
> > +
> > + if (!file)
> > + return -EINVAL;
> > +
> > + if (!shmem_file(file) && !is_file_hugepages(file))
> > + return -EINVAL;
> > +
> > + flags = memalloc_pin_save();
> > + do {
> > + for (i = 0; i < nr_pages; i++) {
> > + /*
> > + * In most cases, we should be able to find the page
> > + * in the page cache. If we cannot find it, we try to
> > + * allocate one and add it to the page cache.
> > + */
> > + page = find_get_page_flags(file->f_mapping,
> > + start + i,
> > + FGP_ACCESSED);
> > + if (!page) {
> > + page = alloc_file_page(file, start + i);
> > + if (IS_ERR(page)) {
> > + ret = PTR_ERR(page);
> > + goto err;
>
> While looking at above, I do wonder: what if two parties tried to alloc
> the page at the same time? I suspect we'd want to handle -EEXIST a bit
> nicer here, right?
At least with the udmabuf use-case, there cannot be multiple entities allocating
and adding a page at a given offset in the memfd at the same time. However, I
can make the following changes to protect against this potential failure mode:
do {
for (i = 0; i < nr_pages; i++) {
+retry:
+ page = NULL;
/*
* In most cases, we should be able to find the page
* in the page cache. If we cannot find it, we try to
* allocate one and add it to the page cache.
*/
- page = find_get_page_flags(file->f_mapping,
- start + i,
- FGP_ACCESSED);
+ folio = __filemap_get_folio(file->f_mapping,
+ start + i,
+ FGP_ACCESSED, 0);
+ if (!IS_ERR(folio))
+ page = &folio->page;
+
if (!page) {
page = alloc_file_page(file, start + i);
if (IS_ERR(page)) {
ret = PTR_ERR(page);
+ if (ret == -EEXIST)
+ goto retry;
goto err;
}
Thanks,
Vivek
>
>
> --
> Cheers,
>
> David / dhildenb