Need to support mixed memory mappings with HMM

Christian König christian.koenig at amd.com
Thu Mar 25 16:16:07 UTC 2021


On 25.03.21 at 17:14, Felix Kuehling wrote:
> On 2021-03-25 at 12:10 p.m., Christian König wrote:
>>
>> On 25.03.21 at 17:03, Felix Kuehling wrote:
>>> Hi,
>>>
>>> This is a long one with a proposal for a pretty significant redesign of
>>> how we handle migrations and VRAM management with HMM. This is based on
>>> my own debugging and reading of the migrate_vma helpers, as well as
>>> Alex's problems with migrations on A+A. I hope we can discuss this next
>>> Monday after you've had some time to digest it.
>>>
>>> I did some debugging yesterday and found that migrations to VRAM can
>>> fail for some pages. The current migration helpers have many corner
>>> cases where a page cannot be migrated. Some of them may be fixable
>>> (adding support for THP), others are not (locked pages are skipped to
>>> avoid deadlocks). Therefore I think our current code is too inflexible
>>> when it assumes that a range is entirely in one place.
>>>
>>> Alex also ran into some funny issues with COW on A+A where some pages
>>> get faulted back to system memory. I think a lot of the problems here
>>> will get easier once we support mixed mappings.
>>>
>>> Mixed GPU mappings
>>> ===========
>>>
>>> The idea is to remove any assumptions that an entire svm_range is in
>>> one place. Instead hmm_range_fault gives us a list of pages, some of
>>> which are system memory and others are device_private or device_generic.
>>>
>>> We will need an amdgpu_vm interface that lets us map mixed page arrays
>>> where different pages use different PTE flags. We can have at least 3
>>> different types of pages in one mapping:
>>>
>>>    1. System memory (S-bit set)
>>>    2. Local memory (S-bit cleared, MTYPE for local memory)
>>>    3. Remote XGMI memory (S-bit cleared, MTYPE+C for remote memory)
>>>
>>> My idea is to change the amdgpu_vm_update_mapping interface to use some
>>> high-bit in the pages_addr array to indicate the page type. For the
>>> default page type (0) nothing really changes for the callers. The
>>> "flags" parameter needs to become a pointer to an array that gets
>>> indexed by the high bits from the pages_addr array. For existing callers
>>> it's as easy as changing flags to &flags (array of size 1). For HMM we
>>> would pass a pointer to a real array.
>> Yeah, I've thought about stuff like that as well for a while.
>>
>> Problem is that this won't work that easily. At many places we assume
>> that the flags don't change for the range in question.
> I think some lower level functions assume that the flags stay the same
> for physically contiguous ranges. But if you use the high-bits to encode
> the page type, the ranges won't be contiguous any more. So you can
> change page flags for different contiguous ranges.

Yeah, but then you also get absolutely zero THP and fragment flags support.

But I think we could also add those later on.

Regards,
Christian.

>
> Regards,
>    Felix
>
>
>> We would somehow need to change that to get the flags directly from
>> the low bits of the DMA address or something instead.
>>
>> Christian.
>>
>>> Once this is done, it leads to a number of opportunities for
>>> simplification and better efficiency in kfd_svm:
>>>
>>>     * Support migration when cpages != npages
>>>     * Migrate a part of an svm_range without splitting it. No more
>>>       splitting of ranges in CPU page faults
>>>     * Migrate a part of an svm_range in GPU page fault handler. No more
>>>       migrating the whole range for a single page fault
>>>     * Simplified VRAM management (see below)
>>>
>>> With that, svm_range will no longer have an "actual_loc" field. If we're
>>> not sure where the data is, we need to call migrate. If it's already in
>>> the right place, then cpages will be 0 and we can skip all the steps
>>> after migrate_vma_setup.
>>>
>>> Simplified VRAM management
>>> ==============
>>>
>>> VRAM BOs are no longer associated with pranges. Instead they are
>>> "free-floating", allocated during migration to VRAM, with reference
>>> count for each page that uses the BO. Ref is released in page-release
>>> callback. When the ref count drops to 0, free the BO.
>>>
>>> VRAM BO size should match the migration granularity, 2MB by default.
>>> That way the BO can be freed when memory gets migrated out. If migration
>>> of some pages fails the BO may not be fully occupied. Also some pages
>>> may be released individually on A+A due to COW or other events.
>>>
>>> Eviction needs to migrate all the pages still using the BO. If the BO
>>> struct keeps an array of page pointers, that's basically the migrate.src
>>> for the eviction. Migration calls "try_to_unmap", which has the best
>>> chance of freeing the BO, even when shared by multiple processes.
>>>
>>> If we cannot guarantee eviction of pages, we cannot use TTM for VRAM
>>> allocations. Need to use amdgpu_vram_mgr. Need a way to detect memory
>>> pressure so we can start evicting memory.
>>>
>>> Regards,
>>>     Felix
>>>



More information about the amd-gfx mailing list