Need to support mixed memory mappings with HMM
Felix Kuehling
felix.kuehling at amd.com
Thu Mar 25 16:14:44 UTC 2021
On 2021-03-25 at 12:10 p.m., Christian König wrote:
>
>
> On 2021-03-25 at 17:03, Felix Kuehling wrote:
>> Hi,
>>
>> This is a long one with a proposal for a pretty significant redesign of
>> how we handle migrations and VRAM management with HMM. This is based on
>> my own debugging and reading of the migrate_vma helpers, as well as
>> Alex's problems with migrations on A+A. I hope we can discuss this next
>> Monday after you've had some time to digest it.
>>
>> I did some debugging yesterday and found that migrations to VRAM can
>> fail for some pages. The current migration helpers have many corner
>> cases where a page cannot be migrated. Some of them may be fixable
>> (adding support for THP), others are not (locked pages are skipped to
>> avoid deadlocks). Therefore I think our current code is too inflexible
>> when it assumes that a range is entirely in one place.
>>
>> Alex also ran into some funny issues with COW on A+A where some pages
>> get faulted back to system memory. I think a lot of the problems here
>> will get easier once we support mixed mappings.
>>
>> Mixed GPU mappings
>> ==================
>>
>> The idea is to remove any assumption that an entire svm_range is in
>> one place. Instead hmm_range_fault gives us a list of pages, some of
>> which are system memory and others are device_private or device_generic.
>>
>> We will need an amdgpu_vm interface that lets us map mixed page arrays
>> where different pages use different PTE flags. We can have at least 3
>> different types of pages in one mapping:
>>
>> 1. System memory (S-bit set)
>> 2. Local memory (S-bit cleared, MTYPE for local memory)
>> 3. Remote XGMI memory (S-bit cleared, MTYPE+C for remote memory)
>>
>> My idea is to change the amdgpu_vm_update_mapping interface to use some
>> high bits in the pages_addr array to indicate the page type. For the
>> default page type (0) nothing really changes for the callers. The
>> "flags" parameter needs to become a pointer to an array that gets
>> indexed by the high bits from the pages_addr array. For existing callers
>> it's as easy as changing flags to &flags (array of size 1). For HMM we
>> would pass a pointer to a real array.
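A minimal userspace sketch of this encoding, assuming a 2-bit page type in the top bits of each pages_addr entry; the macro and function names here are hypothetical, not the real amdgpu interface:

```c
#include <stdint.h>

/* Hypothetical encoding: bits 62-63 of each pages_addr entry select
 * the page type, which indexes the flags array passed by the caller. */
#define SVM_PAGE_TYPE_SHIFT  62
#define SVM_PAGE_TYPE_MASK   (3ULL << SVM_PAGE_TYPE_SHIFT)
#define SVM_PAGE_SYSTEM      0ULL /* system memory, S-bit set */
#define SVM_PAGE_LOCAL       1ULL /* local VRAM, MTYPE for local memory */
#define SVM_PAGE_REMOTE_XGMI 2ULL /* remote XGMI memory, MTYPE+C */

static inline uint64_t svm_encode_addr(uint64_t dma_addr, uint64_t type)
{
	/* Store the type in the high bits of the DMA address. */
	return (dma_addr & ~SVM_PAGE_TYPE_MASK) |
	       (type << SVM_PAGE_TYPE_SHIFT);
}

static inline uint64_t svm_pte_flags(const uint64_t *flags,
				     uint64_t pages_addr_entry)
{
	/* Existing callers pass &flags, i.e. an array of size 1,
	 * so type 0 behaves exactly as before. */
	unsigned int idx = (unsigned int)((pages_addr_entry &
			   SVM_PAGE_TYPE_MASK) >> SVM_PAGE_TYPE_SHIFT);
	return flags[idx];
}
```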
>
> Yeah, I've thought about stuff like that as well for a while.
>
> Problem is that this won't work that easily. We assume in many places
> that the flags don't change for the range in question.
I think some lower-level functions assume that the flags stay the same
for physically contiguous ranges. But if you use the high bits to encode
the page type, entries with different types no longer look contiguous.
So the flags only need to stay constant within each contiguous run.
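To illustrate the point, here is a sketch of how a lower-level walker might count contiguous runs when the page type lives in the high bits: a type change ends a run just like an address gap does. The function name and the 2-bit encoding are assumptions for illustration, not existing amdgpu code:

```c
#include <stdint.h>
#include <stddef.h>

/* Assumed encoding: 2 type bits in bits 62-63, 4KiB pages. */
#define TYPE_MASK (3ULL << 62)

static size_t count_contiguous_runs(const uint64_t *pages_addr, size_t n)
{
	size_t runs = n ? 1 : 0;

	for (size_t i = 1; i < n; i++) {
		uint64_t prev = pages_addr[i - 1];
		uint64_t cur = pages_addr[i];
		/* Same type bits and the address advances by one page? */
		int same_type = (prev & TYPE_MASK) == (cur & TYPE_MASK);
		int contig = (cur & ~TYPE_MASK) ==
			     (prev & ~TYPE_MASK) + 0x1000;

		if (!same_type || !contig)
			runs++; /* type change or gap starts a new run */
	}
	return runs;
}
```

Within each run the flags are constant, so per-run flag lookup is enough.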
Regards,
Felix
>
> We would somehow need to change that to get the flags directly from
> the low bits of the DMA address or something instead.
>
> Christian.
>
>>
>> Once this is done, it leads to a number of opportunities for
>> simplification and better efficiency in kfd_svm:
>>
>> * Support migration when cpages != npages
>> * Migrate a part of an svm_range without splitting it. No more
>> splitting of ranges in CPU page faults
>> * Migrate a part of an svm_range in GPU page fault handler. No more
>> migrating the whole range for a single page fault
>> * Simplified VRAM management (see below)
>>
>> With that, svm_range will no longer have an "actual_loc" field. If we're
>> not sure where the data is, we need to call migrate. If it's already in
>> the right place, then cpages will be 0 and we can skip all the steps
>> after migrate_vma_setup.
>>
>> Simplified VRAM management
>> ==========================
>>
>> VRAM BOs are no longer associated with pranges. Instead they are
>> "free-floating", allocated during migration to VRAM, with a reference
>> taken for each page that uses the BO. The reference is released in the
>> page-release callback. When the ref count drops to 0, the BO is freed.
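A userspace sketch of such a free-floating, per-page refcounted BO. All names are hypothetical, and the real implementation would use kernel kref/refcount primitives and TTM-style allocation rather than calloc:

```c
#include <stdlib.h>

struct page; /* opaque here; the real kernel struct in the driver */

/* Hypothetical VRAM BO: not tied to any prange, lives as long as
 * at least one migrated page still uses it. */
struct svm_vram_bo {
	unsigned int refcount;  /* one ref per page using this BO */
	unsigned int npages;    /* migration granularity, e.g. 512 */
	struct page **pages;    /* becomes migrate.src on eviction */
};

static struct svm_vram_bo *svm_vram_bo_create(unsigned int npages)
{
	struct svm_vram_bo *bo = calloc(1, sizeof(*bo));

	if (!bo)
		return NULL;
	bo->pages = calloc(npages, sizeof(*bo->pages));
	if (!bo->pages) {
		free(bo);
		return NULL;
	}
	bo->npages = npages;
	return bo;
}

/* Taken when a page is migrated into the BO. */
static void svm_vram_bo_get(struct svm_vram_bo *bo)
{
	bo->refcount++;
}

/* Dropped from the page-release callback; returns 1 if the BO
 * was freed because the last page went away. */
static int svm_vram_bo_put(struct svm_vram_bo *bo)
{
	if (--bo->refcount == 0) {
		free(bo->pages);
		free(bo);
		return 1;
	}
	return 0;
}
```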
>>
>> VRAM BO size should match the migration granularity, 2MB by default.
>> That way the BO can be freed when memory gets migrated out. If migration
>> of some pages fails the BO may not be fully occupied. Also some pages
>> may be released individually on A+A due to COW or other events.
>>
>> Eviction needs to migrate all the pages still using the BO. If the BO
>> struct keeps an array of page pointers, that's basically the migrate.src
>> for the eviction. Migration calls "try_to_unmap", which has the best
>> chance of freeing the BO, even when shared by multiple processes.
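A sketch of building migrate.src from the pfns backing the BO's page array. The MIGRATE_PFN_* values mirror the kernel's include/linux/migrate.h encoding; svm_fill_migrate_src itself is a hypothetical helper:

```c
/* Mirrors the kernel's migrate_vma pfn encoding (include/linux/migrate.h). */
#define MIGRATE_PFN_VALID   (1UL << 0)
#define MIGRATE_PFN_MIGRATE (1UL << 1)
#define MIGRATE_PFN_SHIFT   6

static unsigned long migrate_pfn(unsigned long pfn)
{
	return (pfn << MIGRATE_PFN_SHIFT) | MIGRATE_PFN_VALID;
}

/* Hypothetical helper: turn the BO's per-page pfns into migrate.src
 * entries for an eviction. */
static void svm_fill_migrate_src(unsigned long *src,
				 const unsigned long *pfns,
				 unsigned int npages)
{
	for (unsigned int i = 0; i < npages; i++) {
		/* A zero pfn means that slot was already released
		 * individually (e.g. COW on A+A); leave it empty. */
		src[i] = pfns[i] ?
			migrate_pfn(pfns[i]) | MIGRATE_PFN_MIGRATE : 0;
	}
}
```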
>>
>> If we cannot guarantee eviction of pages, we cannot use TTM for VRAM
>> allocations. We would need to use amdgpu_vram_mgr directly, and we need
>> a way to detect memory pressure so we can start evicting memory.
>>
>> Regards,
>> Felix
>>
>