[RFC 00/11] THP support for zone device pages
Francois Dugast
francois.dugast at intel.com
Fri Jul 4 13:52:46 UTC 2025
Hi,
On Fri, Mar 07, 2025 at 10:20:30AM +1100, Balbir Singh wrote:
> On 3/7/25 10:08, Matthew Brost wrote:
> > On Thu, Mar 06, 2025 at 03:42:28PM +1100, Balbir Singh wrote:
> >
> > This is an exciting series to see. As of today, we have just merged this
> > series into the DRM subsystem / Xe [2], which adds very basic SVM
> > support. One of the performance bottlenecks we quickly identified was
> > the lack of THP for device pages—I believe our profiling showed that 96%
> > of the time spent on 2M page GPU faults was within the migrate_vma_*
> > functions. Presumably, this will help significantly.
> >
> > We will likely attempt to pull this code into GPU SVM / Xe fairly soon.
> > I believe we will encounter a conflict since [2] includes these patches
> > [3] [4], but we should be able to resolve that. These patches might make
> > it into the 6.15 PR — TBD but I can get back to you on that.
> >
> > I have one question—does this series contain all the required core MM
> > changes for us to give it a try? That is, do I need to include any other
> > code from the list to test this out?
> >
>
> Thank you. The patches are built on top of mm-everything-2025-03-04-05-51, which
> includes changes by Alistair to fix fs/dax reference counting and changes
> by Zi Yan (folio split changes). The series builds on top of those, but the
> patches are not dependent on the folio split changes, IIRC.
>
> Please do report bugs/issues that you come across.
>
> Balbir
>
Thanks for sharing. We used your series to experimentally enable THP migration
of zone device pages in DRM GPU SVM and Xe. Here is an early draft [1] rebased
on 6.16-rc1. It is still hacky, but I wanted to share some findings and questions:
- Is there an updated version of your series?
- In hmm_vma_walk_pmd(), when the device private pages are owned by the caller,
do they need to be faulted in, or could execution simply continue and handle
the PMD? A first sketch of what we have in mind is included below.
- When __drm_gpusvm_migrate_to_ram() is called from the CPU fault handler, the
faulting folio is already locked by the time migrate_vma_collect_huge_pmd() runs,
so folio_trylock() fails and collection is skipped. As this case seems valid,
collection should probably be skipped only when the folio is not the faulting
folio; see the second sketch below.
- Something seems odd with the folio reference count in folio_migrate_mapping():
it does not match the expected count in our runs. This has not been root-caused
yet; the expected-count logic we compare against is recalled in the third
snippet below.
- The expectation for the HMM internals is a speedup, since the walk should find
a single THP instead of 512 device pages as before. However, we noticed
slowdowns, for example in hmm_range_fault(), which increase
drm_gpusvm_range_get_pages() execution time. We are still investigating; this
may be caused by leftover hacks in my patches, but is the above expectation
correct? Have you also observed such side effects? The last snippet below shows
the PMD path we are reasoning about.
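
Regarding the hmm_vma_walk_pmd() question, this is roughly what we have in
mind. It is only a sketch under assumptions: PMD-level device private entries
come from your series, and hmm_is_device_private_owned() is a hypothetical
helper standing in for the owner check the PTE path already does against
range->dev_private_owner.

	/* Inside hmm_vma_walk_pmd(), for a non-present huge entry: */
	if (is_swap_pmd(pmd)) {
		swp_entry_t entry = pmd_to_swp_entry(pmd);

		/*
		 * Hypothetical: a device private THP owned by the caller,
		 * fill the pfns directly instead of faulting it back to
		 * system memory. Order/compound flags omitted for brevity.
		 */
		if (is_device_private_entry(entry) &&
		    hmm_is_device_private_owned(range, entry)) {
			unsigned long pfn = swp_offset_pfn(entry);
			unsigned long npages = (end - start) >> PAGE_SHIFT;
			unsigned long cpu_flags = HMM_PFN_VALID;
			unsigned long i;

			if (is_writable_device_private_entry(entry))
				cpu_flags |= HMM_PFN_WRITE;
			for (i = 0; i < npages; ++i)
				hmm_pfns[i] = (pfn + i) | cpu_flags;
			return 0;
		}
		/* otherwise fall back to the existing fault path */
	}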
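
For the folio_trylock() point, what we are experimenting with looks like the
following. Again a sketch: migrate_vma_collect_huge_pmd() comes from your
series, and the fault_page handling simply mirrors what
migrate_vma_collect_pmd() already does for base pages via the fault_page
field of struct migrate_vma.

	/* Inside migrate_vma_collect_huge_pmd(), once the folio is known: */
	struct folio *fault_folio = migrate->fault_page ?
				    page_folio(migrate->fault_page) : NULL;

	/*
	 * The CPU fault handler already holds the lock on the faulting
	 * folio, so only trylock folios other than the faulting one.
	 */
	if (folio != fault_folio && !folio_trylock(folio))
		return migrate_vma_collect_skip(start, end, walk);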
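
On the reference count mismatch, the value we compare against is the expected
count computed by folio_migrate_mapping() in mainline, roughly as below
(paraphrased from mm/migrate.c; exact helper names may differ in 6.16-rc1).

	/*
	 * Anonymous folios have no mapping, so the expected count is only
	 * the reference held by the migration caller plus extra_count; a
	 * file-backed folio additionally has one reference per subpage,
	 * i.e. folio_nr_pages() == 512 for a 2M THP.
	 */
	int expected_count = folio_expected_refs(mapping, folio) + extra_count;

	if (!mapping) {
		if (folio_ref_count(folio) != expected_count)
			return -EAGAIN;	/* the failure we appear to hit */
		...
	}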
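
On the hmm_range_fault() expectation, for context the existing PMD path in
mm/hmm.c (hmm_vma_handle_pmd()) does roughly the following once the walk
resolves a present huge entry (paraphrased, not verbatim). The walk itself
becomes cheaper with a single THP, but the output is still one HMM pfn per
4K page, so a consumer such as drm_gpusvm_range_get_pages() still iterates
512 entries per 2M range.

	/* hmm_vma_handle_pmd(), roughly: */
	unsigned long npages = (end - addr) >> PAGE_SHIFT;	/* 512 for 2M */
	unsigned long cpu_flags = pmd_to_hmm_pfn_flags(range, pmd);
	unsigned long pfn = pmd_pfn(pmd) + ((addr & ~PMD_MASK) >> PAGE_SHIFT);
	unsigned long i;

	for (i = 0; i < npages; ++i, ++pfn)
		hmm_pfns[i] = pfn | cpu_flags;	/* one entry per 4K page */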
Thanks,
Francois
[1] https://gitlab.freedesktop.org/ifdu/kernel/-/tree/svm-thp-device