[PATCH] mm/memremap: Introduce pgmap_request_folio() using pgmap offsets

Thu Dec 1 00:22:50 UTC 2022

Jason Gunthorpe wrote:
> On Thu, Oct 20, 2022 at 02:56:39PM -0700, Dan Williams wrote:
> > A 'struct dev_pagemap' (pgmap) represents a collection of ZONE_DEVICE
> > pages. The pgmap is a reference counted object that serves a similar
> > role as a 'struct request_queue'. Live references are obtained for each
> > in flight request / page, and once a page's reference count drops to
> > zero the associated pin of the pgmap is dropped as well. While a page is
> > idle nothing should be accessing it because that is effectively a
> > use-after-free situation. Unfortunately, all current ZONE_DEVICE
> > implementations deploy a layering violation to manage requests to
> > activate pages owned by a pgmap. Specifically, they take steps like walk
> > the pfns that were previously assigned at memremap_pages() time and use
> > pfn_to_page() to recall metadata like page->pgmap, or make use of other
> > data like page->zone_device_data.
> > 
> > The first step towards correcting that situation is to provide a
> > API to get access to a pgmap page that does not require the caller to
> > know the pfn, nor access any fields of an idle page. Ideally this API
> > would be able to support dynamic page creation instead of the current
> > status quo of pre-allocating and initializing pages.
> > 
> > On a prompt from Jason, introduce pgmap_request_folio() that operates on
> > an offset into a pgmap. It replaces the shortlived
> > pgmap_request_folios() that was continuing the layering violation of
> > assuming pages are available to be consulted before asking the pgmap to
> > make them available.
> > 
> > For now this only converts the callers to lookup the pgmap and generate
> > the pgmap offset, but it does not do the deeper cleanup of teaching
> > those call sites to generate those arguments without walking the page
> > metadata. For next steps it appears the DEVICE_PRIVATE implementations
> > could plumb the pgmap into the necessary callsites and switch to using
> > gen_pool_alloc() to track which offsets of a pgmap are allocated. For
> > DAX, dax_direct_access() could switch from returning pfns to returning
> > the associated @pgmap and @pgmap_offset. Those changes are saved for
> > follow-on work.
> 
> I like it, though it would be nice to see drivers converted away from
> pfn_to_pgmap_offset()..

I think since there is no urgent need for this series to move forward in
v6.2 I can take the time to kill the need for pfn_to_pgmap_offset() and
circle back for this in v6.3.

The urgency in my mind is not there because:

1/ Physical memory-device-hotplug is not common, that does not arrive
   until CXL 2.0 hosts are shipping in volume. Note that's distinct from
   ACPI hotplug that is platform firmware coordinated.

2/ Beyond the initial growing pains with Folios and DAX-pages there are
   no additional collisions on the v6.2 horizon.

3/ I have not seen any MEMORY_DEVICE_PRIVATE users attempt the
   pfn_to_pgmap_offset() conversion, so no patches are dependent on this
   moving forward in v6.2.

If someone sees an urgency reason I missed then let me know.

> >  /**
> > - * pgmap_request_folios - activate an contiguous span of folios in @pgmap
> > - * @pgmap: host page map for the folio array
> > - * @folio: start of the folio list, all subsequent folios have same folio_size()
> > + * pgmap_request_folio - activate a folio of a given order in @pgmap
> > + * @pgmap: host page map of the folio to activate
> > + * @pgmap_offset: page-offset into the pgmap to request
> > + * @order: expected folio_order() of the folio
> >   *
> >   * Caller is responsible for @pgmap remaining live for the duration of
> > - * this call. Caller is also responsible for not racing requests for the
> > - * same folios.
> > + * this call. The order (size) of the folios in the pgmap are assumed
> > + * stable before this call.
> >   */
> 
> I would probably add some discussion here that this enables
> refcounting on the folio and the pgmap_ops page free will be called
> once the folio is no longer being used.
> 
> And explain that the pgmap user is responsible for tracking which
> pgmap_offsets are requested and which have been returned by free. It
> would be nice to say that this can only be called on free'd folios.

Ok.

> 
> > -bool pgmap_request_folios(struct dev_pagemap *pgmap, struct folio *folio,
> > -			  int nr_folios)
> > +struct folio *pgmap_request_folio(struct dev_pagemap *pgmap,
> > +				  pgoff_t pgmap_offset, int order)
> 
> unsigned int order?

Sure.

> 
> >  {
> > -	struct folio *iter;
> > -	int i;
> > +	unsigned long pfn = pgmap_offset_to_pfn(pgmap, pgmap_offset);
> > +	struct page *page = pfn_to_page(pfn);
> > +	struct folio *folio;
> > +	int v;
> >  
> > -	/*
> > -	 * All of the WARNs below are for catching bugs in future
> > -	 * development that changes the assumptions of:
> > -	 * 1/ uniform folios in @pgmap
> > -	 * 2/ @pgmap death does not race this routine.
> > -	 */
> > -	VM_WARN_ON_ONCE(!folio_span_valid(pgmap, folio, nr_folios));
> > +	if (WARN_ON_ONCE(page->pgmap != pgmap))
> > +		return NULL;
> 
> Checking that pgmap_offset is not bigger than pgmap length is also a
> good assertion.. At that point if pgmap is not right then the struct
> page has been corrupted.

Ok.

> >  	if (WARN_ON_ONCE(percpu_ref_is_dying(&pgmap->ref)))
> > -		return false;
> > +		return NULL;
> >  
> > -	for (iter = folio_next(folio), i = 1; i < nr_folios;
> > -	     iter = folio_next(folio), i++)
> > -		if (WARN_ON_ONCE(folio_order(iter) != folio_order(folio)))
> > -			return false;
> > +	folio = page_folio(page);
> > +	if (WARN_ON_ONCE(folio_order(folio) != order))
> > +		return NULL;
> 
> Do you see a blocker to simply restructuring the pages into head/tail
> here? If the refcounts are all zero it should be safe?

I do not think all callers are good about avoiding a new request if they
are already holding a reference.

> 
> > +	v = folio_ref_inc_return(folio);
> > +	if (v > 1)
> > +		return folio;
> 
> IMHO, ideally, this should require the foilio to have a 0 refcount and
> this should set it to 1.

That would certainly be a nice property, let me take some time to see
how that could be made a requirement.

> > +	if (WARN_ON_ONCE(!percpu_ref_tryget(&pgmap->ref))) {
> 
> This should not be a warn on, there should be races where the dying
> check could miss but the refcounts all reached zero anyhow.

Ok, makes sense.