[PATCH v8 07/15] mm/gup: migrate device coherent pages when pinning instead of failing

Alistair Popple <apopple at nvidia.com>
Thu Jul 14 05:39:49 UTC 2022


David Hildenbrand <david at redhat.com> writes:

> On 07.07.22 21:03, Alex Sierra wrote:
>> From: Alistair Popple <apopple at nvidia.com>
>>
>> Currently any attempts to pin a device coherent page will fail. This is
>> because device coherent pages need to be managed by a device driver, and
>> pinning them would prevent a driver from migrating them off the device.
>>
>> However this is no reason to fail pinning of these pages. They are
>> coherent and accessible from the CPU, so they can be migrated just as
>> we do when pinning ZONE_MOVABLE pages. So instead of failing all
>> attempts to pin them, first try migrating them out of ZONE_DEVICE.
>>
>> Signed-off-by: Alistair Popple <apopple at nvidia.com>
>> Acked-by: Felix Kuehling <Felix.Kuehling at amd.com>
>> [hch: rebased to the split device memory checks,
>>       moved migrate_device_page to migrate_device.c]
>> Signed-off-by: Christoph Hellwig <hch at lst.de>
>> ---
>>  mm/gup.c            | 47 +++++++++++++++++++++++++++++++++++-----
>>  mm/internal.h       |  1 +
>>  mm/migrate_device.c | 53 +++++++++++++++++++++++++++++++++++++++++++++
>>  3 files changed, 96 insertions(+), 5 deletions(-)
>>
>> diff --git a/mm/gup.c b/mm/gup.c
>> index b65fe8bf5af4..9b6b9923d22d 100644
>> --- a/mm/gup.c
>> +++ b/mm/gup.c
>> @@ -1891,9 +1891,43 @@ static long check_and_migrate_movable_pages(unsigned long nr_pages,
>>  			continue;
>>  		prev_folio = folio;
>>
>> -		if (folio_is_longterm_pinnable(folio))
>> +		/*
>> +		 * Device private pages will get faulted in during gup so it
>> +		 * shouldn't be possible to see one here.
>> +		 */
>> +		if (WARN_ON_ONCE(folio_is_device_private(folio))) {
>> +			ret = -EFAULT;
>> +			goto unpin_pages;
>> +		}
>
> I'd just drop that. Device private pages are never part of a present PTE. So if we
> could actually grab one via GUP we would be in bigger trouble ...
> already before this patch.

Fair.

>> +
>> +		/*
>> +		 * Device coherent pages are managed by a driver and should not
>> +		 * be pinned indefinitely as it prevents the driver moving the
>> +		 * page. So when trying to pin with FOLL_LONGTERM instead try
>> +		 * to migrate the page out of device memory.
>> +		 */
>> +		if (folio_is_device_coherent(folio)) {
>> +			WARN_ON_ONCE(PageCompound(&folio->page));
>
> Maybe that belongs in migrate_device_page()?

Ok (noting Matthew's comment there as well).

>> +
>> +			/*
>> +			 * Migration will fail if the page is pinned, so convert
>
> [...]
>
>>  /*
>>   * mm/gup.c
>> diff --git a/mm/migrate_device.c b/mm/migrate_device.c
>> index cf9668376c5a..5decd26dd551 100644
>> --- a/mm/migrate_device.c
>> +++ b/mm/migrate_device.c
>> @@ -794,3 +794,56 @@ void migrate_vma_finalize(struct migrate_vma *migrate)
>>  	}
>>  }
>>  EXPORT_SYMBOL(migrate_vma_finalize);
>> +
>> +/*
>> + * Migrate a device coherent page back to normal memory.  The caller should have
>> + * a reference on page which will be copied to the new page if migration is
>> + * successful or dropped on failure.
>> + */
>> +struct page *migrate_device_page(struct page *page, unsigned int gup_flags)
>
> Function name should most probably indicate that we're dealing with coherent pages here?

Ok.

>> +{
>> +	unsigned long src_pfn, dst_pfn = 0;
>> +	struct migrate_vma args;
>> +	struct page *dpage;
>> +
>> +	lock_page(page);
>> +	src_pfn = migrate_pfn(page_to_pfn(page)) | MIGRATE_PFN_MIGRATE;
>> +	args.src = &src_pfn;
>> +	args.dst = &dst_pfn;
>> +	args.cpages = 1;
>> +	args.npages = 1;
>> +	args.vma = NULL;
>> +	migrate_vma_setup(&args);
>> +	if (!(src_pfn & MIGRATE_PFN_MIGRATE))
>> +		return NULL;
>
> Wow, these refcount and page locking/unlocking rules with this migrate_* API are
> confusing now. And the usage here of sometimes returning and sometimes falling
> through doesn't make it particularly easier to understand.
>
> I'm not 100% happy about reusing migrate_vma_setup() when there *is no VMA*.
> That's just absolutely confusing, because usually migrate_vma_setup() itself
> would do the collection step and ref+lock pages. :/
>
> In general, I can see why/how we're reusing the migrate_vma_* API here, but there
> is absolutely no VMA ... not sure what to improve besides providing a second API
> that does a simple single-page migration. But that can be changed later ...

Yeah, as noted in your other response I think it should be ok to just
call migrate_vma_unmap() directly from migrate_device_page(), which I
assume would adequately deal with this.
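
In other words, something like the below (untested sketch; this assumes
migrate_vma_unmap() can safely be called on a manually initialised
struct migrate_vma with args.vma == NULL, which is what I need to
verify):

	int migrate_device_coherent_page(struct page *page)
	{
		unsigned long src_pfn, dst_pfn = 0;
		struct migrate_vma args = {
			.src = &src_pfn,
			.dst = &dst_pfn,
			.cpages = 1,
			.npages = 1,
			.vma = NULL,
		};
		struct page *dpage;

		WARN_ON_ONCE(PageCompound(page));

		/*
		 * We already hold a reference to the source page and know
		 * its pfn, so there is nothing for migrate_vma_setup() to
		 * collect. Lock the page and unmap it directly.
		 */
		lock_page(page);
		src_pfn = migrate_pfn(page_to_pfn(page)) | MIGRATE_PFN_MIGRATE;
		migrate_vma_unmap(&args);
		if (!(src_pfn & MIGRATE_PFN_MIGRATE))
			return -EBUSY;

		dpage = alloc_page(GFP_USER | __GFP_NOWARN);
		if (dpage) {
			lock_page(dpage);
			dst_pfn = migrate_pfn(page_to_pfn(dpage));
		}

		/* Copy the data over and tear down the migration state. */
		migrate_vma_pages(&args);
		if (src_pfn & MIGRATE_PFN_MIGRATE)
			copy_highpage(dpage, page);
		migrate_vma_finalize(&args);

		return (src_pfn & MIGRATE_PFN_MIGRATE) ? 0 : -EBUSY;
	}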

>> +
>> +	dpage = alloc_pages(GFP_USER | __GFP_NOWARN, 0);
>> +
>
> alloc_page()
>
>> +	/*
>> +	 * get/pin the new page now so we don't have to retry gup after
>> +	 * migrating. We already have a reference so this should never fail.
>> +	 */
>> +	if (dpage && WARN_ON_ONCE(!try_grab_page(dpage, gup_flags))) {
>> +		__free_pages(dpage, 0);
>
> __free_page()
>
>> +		dpage = NULL;
>> +	}
>
> Hm, this means that we are not pinning via the PTE at hand, but via something
> we expect migration to put into the PTE. I'm not really happy about this.
>
> Ideally, we'd make the pinning decision only on the actual GUP path, not in here.
> Just like in the migrate_pages() case, where we end up dropping all refs/pins
> and looking up again via GUP from the PTE.
>
> For example, I wonder if something nasty could happen if the PTE got mapped
> R/O in the meantime and you're pinning R/W here ...
>
> TBH, all this special casing on gup_flags here is nasty. Please, let's just do
> it like migrate_pages() and do another GUP walk. Absolutely no need to optimize.

The only reason to pass gup_flags is to check FOLL_PIN vs. FOLL_GET so
that we can take the right kind of reference on the destination page. I
did the optimisation because we already have the destination page with a
reference, and GUP/PUP does not make any guarantees about the current
PTE state anyway.
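
For reference, the flag only changes which kind of reference
try_grab_page() takes on dpage; paraphrasing from memory, it boils down
to something like:

	if (flags & FOLL_GET) {
		/* plain reference */
		return try_get_page(page);
	} else if (flags & FOLL_PIN) {
		/* FOLL_PIN reference: bias the refcount by
		 * GUP_PIN_COUNTING_BIAS (or use the compound pincount) so
		 * page_maybe_dma_pinned() can later detect the pin */
		...
	}
	return false;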

However I noticed there might be a race here: during migration we
replace present PTEs with migration entries. On fork these get copied
via copy_nonpresent_pte() and made read-only. However, we don't check
whether the page a migration entry points to is pinned. For an ordinary
PTE copy_present_pte() would copy the page for a COW mapping, but this
won't happen if the page happens to be undergoing migration (even though
the migration will ultimately fail due to the pin).
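
Just to illustrate where the missing check would sit (hypothetical and
untested, and the error plumbing back through copy_pte_range() would
need more thought):

	/* copy_nonpresent_pte(), migration entry handling */
	} else if (is_migration_entry(entry)) {
		page = pfn_swap_entry_to_page(entry);

		/*
		 * Hypothetical: in a COW mapping a pinned anonymous page
		 * must not end up read-only shared with the child, so the
		 * copy would need to bail out and be retried once migration
		 * has completed, much like copy_present_pte() does for
		 * pinned pages.
		 */
		if (is_cow_mapping(vm_flags) && PageAnon(page) &&
		    page_maybe_dma_pinned(page))
			return -EBUSY;

		rss[mm_counter(page)]++;
		...
	}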

Anyway I don't think this patch currently makes that any worse, but if
we fix the above it will, because there is a brief period during which
the page we're pinning won't look like a pinned page.

So I will go with the suggestion to do another GUP walk.
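
That should fall out fairly naturally, because __gup_longterm_locked()
already retries whenever check_and_migrate_movable_pages() returns 0,
roughly (from mm/gup.c):

	do {
		rc = __get_user_pages_locked(mm, start, nr_pages, pages,
					     vmas, NULL, gup_flags);
		if (rc <= 0)
			break;
		rc = check_and_migrate_movable_pages(rc, pages, gup_flags);
	} while (!rc);

So after migrating the device coherent pages and dropping all the
references we can simply return 0, and the next iteration will pick up
the migrated pages through the PTEs, making the pinning decision on the
actual GUP path as you suggest.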

> [...]
>
>
>
> I'd go with something like the following on top (which does not touch on the
> general semantic issue with migrate_vma_*). Note that I most probably messed
> up some refcount/lock handling and that it's broken.
> Just to give you an idea of what I think could be cleaner.

Thanks! At a glance it looks roughly right, but I will check and respin
it to incorporate the comments.

> diff --git a/mm/gup.c b/mm/gup.c
> index 9b6b9923d22d..17041b3e605e 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -1881,7 +1881,7 @@ static long check_and_migrate_movable_pages(unsigned long nr_pages,
>  	unsigned long isolation_error_count = 0, i;
>  	struct folio *prev_folio = NULL;
>  	LIST_HEAD(movable_page_list);
> -	bool drain_allow = true;
> +	bool drain_allow = true, any_device_coherent = false;
>  	int ret = 0;
>
>  	for (i = 0; i < nr_pages; i++) {
> @@ -1891,15 +1891,6 @@ static long check_and_migrate_movable_pages(unsigned long nr_pages,
>  			continue;
>  		prev_folio = folio;
>
> -		/*
> -		 * Device private pages will get faulted in during gup so it
> -		 * shouldn't be possible to see one here.
> -		 */
> -		if (WARN_ON_ONCE(folio_is_device_private(folio))) {
> -			ret = -EFAULT;
> -			goto unpin_pages;
> -		}
> -
>  		/*
>  		 * Device coherent pages are managed by a driver and should not
>  		 * be pinned indefinitely as it prevents the driver moving the
> @@ -1907,7 +1898,12 @@ static long check_and_migrate_movable_pages(unsigned long nr_pages,
>  		 * to migrate the page out of device memory.
>  		 */
>  		if (folio_is_device_coherent(folio)) {
> -			WARN_ON_ONCE(PageCompound(&folio->page));
> +			/*
> +			 * We always want a new GUP lookup with device coherent
> +			 * pages.
> +			 */
> +			any_device_coherent = true;
> +			pages[i] = NULL;
>
>  			/*
>  			 * Migration will fail if the page is pinned, so convert
> @@ -1918,11 +1914,12 @@ static long check_and_migrate_movable_pages(unsigned long nr_pages,
>  				unpin_user_page(&folio->page);
>  			}
>
> -			pages[i] = migrate_device_page(&folio->page, gup_flags);
> -			if (!pages[i]) {
> -				ret = -EBUSY;
> +			ret = migrate_device_coherent_page(&folio->page);
> +			if (ret)
>  				goto unpin_pages;
> -			}
> +			/* The reference to our folio is stale now. */
> +			prev_folio = NULL;
> +			folio = NULL;
>  			continue;
>  		}
>
> @@ -1953,7 +1950,8 @@ static long check_and_migrate_movable_pages(unsigned long nr_pages,
>  				    folio_nr_pages(folio));
>  	}
>
> -	if (!list_empty(&movable_page_list) || isolation_error_count)
> +	if (!list_empty(&movable_page_list) || isolation_error_count ||
> +	    any_device_coherent)
>  		goto unpin_pages;
>
>  	/*
> @@ -1963,14 +1961,19 @@ static long check_and_migrate_movable_pages(unsigned long nr_pages,
>  	return nr_pages;
>
>  unpin_pages:
> -	for (i = 0; i < nr_pages; i++) {
> -		if (!pages[i])
> -			continue;
> +	/* We have to be careful if we stumbled over device coherent pages. */
> +	if (unlikely(any_device_coherent || !(gup_flags & FOLL_PIN))) {
> +		for (i = 0; i < nr_pages; i++) {
> +			if (!pages[i])
> +				continue;
>
> -		if (gup_flags & FOLL_PIN)
> -			unpin_user_page(pages[i]);
> -		else
> -			put_page(pages[i]);
> +			if (gup_flags & FOLL_PIN)
> +				unpin_user_page(pages[i]);
> +			else
> +				put_page(pages[i]);
> +		}
> +	} else {
> +		unpin_user_pages(pages, nr_pages);
>  	}
>
>  	if (!list_empty(&movable_page_list)) {
> diff --git a/mm/internal.h b/mm/internal.h
> index eeab4ee7a4a3..899dab512c5a 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -853,7 +853,7 @@ int numa_migrate_prep(struct page *page, struct vm_area_struct *vma,
>  		      unsigned long addr, int page_nid, int *flags);
>
>  void free_zone_device_page(struct page *page);
> -struct page *migrate_device_page(struct page *page, unsigned int gup_flags);
> +int migrate_device_coherent_page(struct page *page);
>
>  /*
>   * mm/gup.c
> diff --git a/mm/migrate_device.c b/mm/migrate_device.c
> index 5decd26dd551..dfb78ea3d326 100644
> --- a/mm/migrate_device.c
> +++ b/mm/migrate_device.c
> @@ -797,53 +797,40 @@ EXPORT_SYMBOL(migrate_vma_finalize);
>
>  /*
>   * Migrate a device coherent page back to normal memory.  The caller should have
> - * a reference on page which will be copied to the new page if migration is
> - * successful or dropped on failure.
> + * a reference on page, which will be dropped on return.
>   */
> -struct page *migrate_device_page(struct page *page, unsigned int gup_flags)
> +int migrate_device_coherent_page(struct page *page)
>  {
>  	unsigned long src_pfn, dst_pfn = 0;
> -	struct migrate_vma args;
> +	struct migrate_vma args = {
> +		.src = &src_pfn,
> +		.dst = &dst_pfn,
> +		.cpages = 1,
> +		.npages = 1,
> +		.vma = NULL,
> +	};
>  	struct page *dpage;
>
> +	VM_WARN_ON_ONCE(PageCompound(page));
> +
>  	lock_page(page);
>  	src_pfn = migrate_pfn(page_to_pfn(page)) | MIGRATE_PFN_MIGRATE;
> -	args.src = &src_pfn;
> -	args.dst = &dst_pfn;
> -	args.cpages = 1;
> -	args.npages = 1;
> -	args.vma = NULL;
> -	migrate_vma_setup(&args);
> -	if (!(src_pfn & MIGRATE_PFN_MIGRATE))
> -		return NULL;
> -
> -	dpage = alloc_pages(GFP_USER | __GFP_NOWARN, 0);
> -
> -	/*
> -	 * get/pin the new page now so we don't have to retry gup after
> -	 * migrating. We already have a reference so this should never fail.
> -	 */
> -	if (dpage && WARN_ON_ONCE(!try_grab_page(dpage, gup_flags))) {
> -		__free_pages(dpage, 0);
> -		dpage = NULL;
> -	}
>
> -	if (dpage) {
> -		lock_page(dpage);
> -		dst_pfn = migrate_pfn(page_to_pfn(dpage));
> +	migrate_vma_setup(&args);
> +	if (src_pfn & MIGRATE_PFN_MIGRATE) {
> +		dpage = alloc_page(GFP_USER | __GFP_NOWARN);
> +		if (dpage) {
> +			dst_pfn = migrate_pfn(page_to_pfn(dpage));
> +			lock_page(dpage);
> +		}
>  	}
>
>  	migrate_vma_pages(&args);
>  	if (src_pfn & MIGRATE_PFN_MIGRATE)
>  		copy_highpage(dpage, page);
>  	migrate_vma_finalize(&args);
> -	if (dpage && !(src_pfn & MIGRATE_PFN_MIGRATE)) {
> -		if (gup_flags & FOLL_PIN)
> -			unpin_user_page(dpage);
> -		else
> -			put_page(dpage);
> -		dpage = NULL;
> -	}
>
> -	return dpage;
> +	if (src_pfn & MIGRATE_PFN_MIGRATE)
> +		return 0;
> +	return -EBUSY;
>  }
> --
> 2.35.3

