[PATCH v1 03/14] mm: add iomem vma selection for memory migration

Thu Sep 9 04:02:10 UTC 2021

On 2021-09-02 at 4:18 a.m., Christoph Hellwig wrote:
> On Wed, Sep 01, 2021 at 11:40:43AM -0400, Felix Kuehling wrote:
>>>>> It looks like I'm totally misunderstanding what you are adding here
>>>>> then.  Why do we need any special treatment at all for memory that
>>>>> has normal struct pages and is part of the direct kernel map?
>>>> The pages are like normal memory for purposes of mapping them in CPU
>>>> page tables and for coherent access from the CPU.
>>> That's the user page tables.  What about the kernel direct map?
>>> If there is a normal kernel struct page backing there really should
>>> be no need for the pgmap.
>> I'm not sure. The physical address ranges are in the UEFI system address
>> map as special-purpose memory. Does Linux create the struct pages and
>> kernel direct map for that without a pgmap call? I didn't see that last
>> time I went digging through that code.
> So doing some googling finds a patch from Dan that claims to hand EFI
> special purpose memory to the device dax driver.  But when I try to
> follow the version that got merged it looks like it is treated simply as an
> MMIO region to be claimed by drivers, which would not get a struct page.
>
> Dan, did I misunderstand how E820_TYPE_SOFT_RESERVED works?
>
>>>> From an application
>>>> perspective, we want file-backed and anonymous mappings to be able to
>>>> use DEVICE_PUBLIC pages with coherent CPU access. The goal is to
>>>> optimize performance for GPU heavy workloads while minimizing the need
>>>> to migrate data back-and-forth between system memory and device memory.
>>> I don't really understand that part.  file backed pages are always
>>> allocated by the file system using the pagecache helpers, that is
>>> using the page allocator.  Anonymous memory also always comes from
>>> the page allocator.
>> I'm coming at this from my experience with DEVICE_PRIVATE. Both
>> anonymous and file-backed pages should be migrateable to DEVICE_PRIVATE
>> memory by the migrate_vma_* helpers for more efficient access by our
>> GPU. (*) It's part of the basic premise of HMM as I understand it. I
>> would expect the same thing to work for DEVICE_PUBLIC memory.
> Ok, so you want to migrate to and from them.  Not use DEVICE_PUBLIC
> for the actual page cache pages.  That makes a lot more sense.
>
>> I see DEVICE_PUBLIC as an improved version of DEVICE_PRIVATE that allows
>> the CPU to map the device memory coherently to minimize the need for
>> migrations when CPU and GPU access the same memory concurrently or
>> alternately. But we're not going as far as putting that memory
>> entirely under the management of the Linux memory manager and VM
>> subsystem. Our (and HPE's) system architects decided that this memory is
>> not suitable to be used like regular NUMA system memory by the Linux
>> memory manager.
> So yes.  It is a Memory Mapped I/O region, which unlike the PCIe BARs
> that people typically deal with is fully cache coherent.  I think this
> does make more sense as a description.
>
> But to go back to what started this discussion:  If this is memory
> mapped I/O, pfn_valid should generally not return true for it.

As I understand it, pfn_valid should be true for any pfn that's part of
the kernel's physical memory map, i.e. is returned by page_to_pfn or
works with pfn_to_page. Both the hmm_range_fault and the migrate_vma_*
APIs use pfns to refer to regular system memory and ZONE_DEVICE pages
(even DEVICE_PRIVATE). Therefore I believe pfn_valid should be true for
ZONE_DEVICE pages as well.
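
To make that argument concrete, here is a toy user-space sketch. It is
not kernel code: every toy_* name, the flat page array and the pfn
layout are assumptions made up for illustration. It only models the
claim that a pfn is "valid" exactly when a struct page exists for it, so
that the pfn <-> page round trip works for system RAM and ZONE_DEVICE
ranges alike, but not for plain MMIO.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical range types; only loosely modelled on the discussion. */
enum toy_range_type {
    TOY_SYSTEM_RAM,      /* regular memory from the page allocator */
    TOY_DEVICE_PRIVATE,  /* ZONE_DEVICE, not CPU-mappable */
    TOY_DEVICE_PUBLIC,   /* ZONE_DEVICE, coherently CPU-mappable */
    TOY_MMIO             /* plain MMIO region without struct pages */
};

struct toy_page {
    unsigned long flags; /* placeholder for per-page state */
};

struct toy_range {
    unsigned long start_pfn;
    unsigned long nr_pages;
    enum toy_range_type type;
    bool has_struct_pages;
};

/* Flat page array indexed by pfn, similar in spirit to a flat mem_map. */
#define TOY_MAX_PFN 0x400UL
static struct toy_page toy_mem_map[TOY_MAX_PFN];

/* Made-up physical layout: three ranges with pages, one without. */
static const struct toy_range toy_ranges[] = {
    { 0x000, 16, TOY_SYSTEM_RAM,     true  },
    { 0x100,  8, TOY_DEVICE_PRIVATE, true  },
    { 0x200,  8, TOY_DEVICE_PUBLIC,  true  },
    { 0x300,  8, TOY_MMIO,           false },
};

/* Return the range containing pfn, or NULL if no range covers it. */
static const struct toy_range *toy_find_range(unsigned long pfn)
{
    size_t n = sizeof(toy_ranges) / sizeof(toy_ranges[0]);

    for (size_t i = 0; i < n; i++) {
        const struct toy_range *r = &toy_ranges[i];

        if (pfn >= r->start_pfn && pfn - r->start_pfn < r->nr_pages)
            return r;
    }
    return NULL;
}

/* A pfn is valid only if it lies in a range that has struct pages. */
bool toy_pfn_valid(unsigned long pfn)
{
    const struct toy_range *r = toy_find_range(pfn);

    return r != NULL && r->has_struct_pages;
}

/* Only meaningful for valid pfns, hence the assertion. */
struct toy_page *toy_pfn_to_page(unsigned long pfn)
{
    assert(toy_pfn_valid(pfn));
    return &toy_mem_map[pfn];
}

/* Inverse of toy_pfn_to_page(): the index into the flat page array. */
unsigned long toy_page_to_pfn(const struct toy_page *page)
{
    return (unsigned long)(page - toy_mem_map);
}
```

In this model toy_pfn_valid() is true for the device-private and
device-public ranges just as for system RAM, and false for the MMIO
range and for holes between ranges. Whether the real pfn_valid should
behave this way for coherent device memory is exactly the open question
in this thread.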

Regards,
  Felix

>
> And as you already pointed out in reply to Alex we need to tighten the
> selection criteria one way or another.