Need to support mixed memory mappings with HMM

Thu Mar 25 16:10:47 UTC 2021

On 25.03.21 at 17:03, Felix Kuehling wrote:
> Hi,
>
> This is a long one with a proposal for a pretty significant redesign of
> how we handle migrations and VRAM management with HMM. This is based on
> my own debugging and reading of the migrate_vma helpers, as well as
> Alex's problems with migrations on A+A. I hope we can discuss this next
> Monday after you've had some time to digest it.
>
> I did some debugging yesterday and found that migrations to VRAM can
> fail for some pages. The current migration helpers have many corner
> cases where a page cannot be migrated. Some of them may be fixable
> (adding support for THP), others are not (locked pages are skipped to
> avoid deadlocks). Therefore I think our current code is too inflexible
> when it assumes that a range is entirely in one place.
>
> Alex also ran into some funny issues with COW on A+A where some pages
> get faulted back to system memory. I think a lot of the problems here
> will get easier once we support mixed mappings.
>
> Mixed GPU mappings
> ==================
>
> The idea is to remove any assumption that an entire svm_range is in
> one place. Instead hmm_range_fault gives us a list of pages, some of
> which are system memory and others are device_private or device_generic.
>
> We will need an amdgpu_vm interface that lets us map mixed page arrays
> where different pages use different PTE flags. We can have at least 3
> different types of pages in one mapping:
>
>   1. System memory (S-bit set)
>   2. Local memory (S-bit cleared, MTYPE for local memory)
>   3. Remote XGMI memory (S-bit cleared, MTYPE+C for remote memory)
>
> My idea is to change the amdgpu_vm_update_mapping interface to use some
> high bits of each pages_addr entry to indicate the page type. For the
> default page type (0) nothing really changes for the callers. The
> "flags" parameter needs to become a pointer to an array that gets
> indexed by the high bits from the pages_addr array. For existing callers
> it's as easy as changing flags to &flags (array of size 1). For HMM we
> would pass a pointer to a real array.
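
To make the quoted proposal concrete, here is a minimal standalone sketch of the tagging scheme. It is not amdgpu code: the bit position, the type names, the helper names, and the "address OR flags" PTE encoding are all assumptions made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout: the top 2 bits of each address entry hold a page type. */
#define PAGE_TYPE_SHIFT 62
#define PAGE_TYPE_MASK  (UINT64_C(3) << PAGE_TYPE_SHIFT)

enum page_type {
	PAGE_TYPE_SYSTEM = 0,	/* default type: untagged addresses land here */
	PAGE_TYPE_LOCAL  = 1,
	PAGE_TYPE_REMOTE = 2,
	PAGE_TYPE_COUNT
};

/* Store the page type in the high bits of an address entry. */
static inline uint64_t tag_addr(uint64_t addr, enum page_type type)
{
	assert((addr & PAGE_TYPE_MASK) == 0);
	return addr | ((uint64_t)type << PAGE_TYPE_SHIFT);
}

static inline unsigned int addr_type(uint64_t entry)
{
	return (unsigned int)(entry >> PAGE_TYPE_SHIFT);
}

static inline uint64_t addr_value(uint64_t entry)
{
	return entry & ~PAGE_TYPE_MASK;
}

/*
 * Build one placeholder PTE per page. "flags" is an array indexed by the
 * page type taken from each entry, so it must have one element for every
 * type that occurs in pages_addr.
 */
static void build_ptes(const uint64_t *pages_addr, size_t npages,
		       const uint64_t *flags, uint64_t *ptes)
{
	for (size_t i = 0; i < npages; i++)
		ptes[i] = addr_value(pages_addr[i]) |
			  flags[addr_type(pages_addr[i])];
}
```

A caller that only ever uses the default type can keep a single flags value and pass its address, which is the "flags to &flags" change described in the quoted text.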

Yeah, I've thought about stuff like that as well for a while.

Problem is that this won't work that easily. We assume in many places
that the flags don't change for the range in question.

We would somehow need to change that to get the flags directly from the 
low bits of the DMA address or something instead.
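
Roughly what I mean, as a standalone sketch only: with 4 KiB aligned addresses the low 12 bits are free, so they could carry per-page flags, and code that needs uniform flags could walk the array in runs. The names, the 12-bit width, and the run helper below are assumptions for illustration, not existing amdgpu interfaces.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumes 4 KiB aligned DMA addresses, leaving the low 12 bits unused. */
#define SKETCH_PAGE_SHIFT 12
#define SKETCH_LOW_MASK   ((UINT64_C(1) << SKETCH_PAGE_SHIFT) - 1)

/* Combine a page-aligned DMA address with per-page flag bits. */
static inline uint64_t pack_dma(uint64_t dma_addr, uint64_t low_flags)
{
	assert((dma_addr & SKETCH_LOW_MASK) == 0);
	assert((low_flags & ~SKETCH_LOW_MASK) == 0);
	return dma_addr | low_flags;
}

static inline uint64_t dma_addr_of(uint64_t entry)
{
	return entry & ~SKETCH_LOW_MASK;
}

static inline uint64_t dma_flags_of(uint64_t entry)
{
	return entry & SKETCH_LOW_MASK;
}

/*
 * Length of the leading run of entries that share the flags of entries[0].
 * Each such run can still be handled by code that expects one flags value
 * for the whole range.
 */
static size_t same_flags_run(const uint64_t *entries, size_t count)
{
	size_t len = 1;

	if (count == 0)
		return 0;
	while (len < count &&
	       dma_flags_of(entries[len]) == dma_flags_of(entries[0]))
		len++;
	return len;
}
```

Splitting into runs keeps the "flags are constant over the range" assumption valid inside each run, at the cost of more, smaller updates when page types are interleaved.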

Christian.

>
> Once this is done, it leads to a number of opportunities for
> simplification and better efficiency in kfd_svm:
>
>    * Support migration when cpages != npages
>    * Migrate a part of an svm_range without splitting it. No more
>      splitting of ranges in CPU page faults
>    * Migrate a part of an svm_range in GPU page fault handler. No more
>      migrating the whole range for a single page fault
>    * Simplified VRAM management (see below)
>
> With that, svm_range will no longer have an "actual_loc" field. If we're
> not sure where the data is, we need to call migrate. If it's already in
> the right place, then cpages will be 0 and we can skip all the steps
> after migrate_vma_setup.
>
> Simplified VRAM management
> ==========================
>
> VRAM BOs are no longer associated with pranges. Instead they are
> "free-floating", allocated during migration to VRAM, with reference
> count for each page that uses the BO. Ref is released in page-release
> callback. When the ref count drops to 0, free the BO.
>
> VRAM BO size should match the migration granularity, 2MB by default.
> That way the BO can be freed when memory gets migrated out. If migration
> of some pages fails the BO may not be fully occupied. Also some pages
> may be released individually on A+A due to COW or other events.
>
> Eviction needs to migrate all the pages still using the BO. If the BO
> struct keeps an array of page pointers, that's basically the migrate.src
> for the eviction. Migration calls "try_to_unmap", which has the best
> chance of freeing the BO, even when shared by multiple processes.
>
> If we cannot guarantee eviction of pages, we cannot use TTM for VRAM
> allocations. Need to use amdgpu_vram_mgr. Need a way to detect memory
> pressure so we can start evicting memory.
>
> Regards,
>    Felix
>
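
The per-page reference counting described under "Simplified VRAM management" could look roughly like the userspace sketch below. The struct and function names are hypothetical, and a real implementation would presumably use kernel primitives such as struct kref and proper locking rather than a bare counter.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical free-floating VRAM BO, not tied to any range. */
struct vram_bo_sketch {
	size_t npages;		/* capacity, e.g. 512 pages of 4 KiB for 2MB */
	size_t refcount;	/* pages currently backed by this BO */
};

/* Allocate a BO sized to the migration granularity. */
static struct vram_bo_sketch *bo_alloc(size_t npages)
{
	struct vram_bo_sketch *bo = calloc(1, sizeof(*bo));

	if (bo)
		bo->npages = npages;
	return bo;
}

/* Take one reference for each page successfully migrated into the BO. */
static void bo_page_get(struct vram_bo_sketch *bo)
{
	assert(bo->refcount < bo->npages);
	bo->refcount++;
}

/*
 * Stand-in for the page-release callback: drop one reference and free the
 * BO when the last page is gone. Returns true if the BO was freed.
 */
static bool bo_page_put(struct vram_bo_sketch *bo)
{
	assert(bo->refcount > 0);
	if (--bo->refcount > 0)
		return false;
	free(bo);
	return true;
}
```

A BO that ends up partially occupied because some pages failed to migrate is still freed correctly, since only the pages that actually took a reference have to release one.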