Need to support mixed memory mappings with HMM

Thu Mar 25 16:20:05 UTC 2021

On 2021-03-25 at 12:16 p.m., Christian König wrote:
> On 25.03.21 at 17:14, Felix Kuehling wrote:
>> On 2021-03-25 at 12:10 p.m., Christian König wrote:
>>>
>>> On 25.03.21 at 17:03, Felix Kuehling wrote:
>>>> Hi,
>>>>
>>>> This is a long one with a proposal for a pretty significant redesign
>>>> of how we handle migrations and VRAM management with HMM. This is
>>>> based on my own debugging and reading of the migrate_vma helpers, as
>>>> well as Alex's problems with migrations on A+A. I hope we can discuss
>>>> this next Monday after you've had some time to digest it.
>>>>
>>>> I did some debugging yesterday and found that migrations to VRAM can
>>>> fail for some pages. The current migration helpers have many corner
>>>> cases where a page cannot be migrated. Some of them may be fixable
>>>> (adding support for THP), others are not (locked pages are skipped to
>>>> avoid deadlocks). Therefore I think our current code is too inflexible
>>>> when it assumes that a range is entirely in one place.
>>>>
>>>> Alex also ran into some funny issues with COW on A+A where some pages
>>>> get faulted back to system memory. I think a lot of the problems here
>>>> will get easier once we support mixed mappings.
>>>>
>>>> Mixed GPU mappings
>>>> ==================
>>>>
>>>> The idea is to remove any assumption that an entire svm_range is in
>>>> one place. Instead, hmm_range_fault gives us a list of pages, some of
>>>> which are system memory and others are device_private or
>>>> device_generic.
>>>>
>>>> We will need an amdgpu_vm interface that lets us map mixed page arrays
>>>> where different pages use different PTE flags. We can have at least 3
>>>> different types of pages in one mapping:
>>>>
>>>>    1. System memory (S-bit set)
>>>>    2. Local memory (S-bit cleared, MTYPE for local memory)
>>>>    3. Remote XGMI memory (S-bit cleared, MTYPE+C for remote memory)
>>>>
>>>> My idea is to change the amdgpu_vm_update_mapping interface to use
>>>> some high bits in the pages_addr array to indicate the page type. For
>>>> the default page type (0) nothing really changes for the callers. The
>>>> "flags" parameter needs to become a pointer to an array that gets
>>>> indexed by the high bits from the pages_addr array. For existing
>>>> callers it's as easy as changing flags to &flags (array of size 1).
>>>> For HMM we would pass a pointer to a real array.
>>> Yeah, I've thought about stuff like that as well for a while.
>>>
>>> The problem is that this won't work that easily. We assume in many
>>> places that the flags don't change for the range in question.
>> I think some lower-level functions assume that the flags stay the same
>> for physically contiguous ranges. But if you use the high bits to encode
>> the page type, ranges with different page types won't be contiguous any
>> more. So you can change page flags for different contiguous ranges.
>
> Yeah, but then you also get absolutely zero THP and fragment flags
> support.

As long as you have a contiguous 2MB page with the same page type, I
think you can still get a THP mapping in the GPU page table. But if one
page in the middle of a 2MB page has a different page type, that will
break the THP mapping, as it should.
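
To illustrate, here is a minimal userspace sketch of the encoding and
the contiguity argument. It is not the amdgpu code: the bit position,
the enum, and all function names are assumptions made up for this
example. The page type lives in the top bits of each pages_addr entry,
so a type change breaks an otherwise contiguous run, and the flags are
looked up per page type.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout: page type in the top 2 bits of a pages_addr entry. */
#define PAGE_TYPE_SHIFT 62
#define PAGE_TYPE_MASK  (3ULL << PAGE_TYPE_SHIFT)
#define PAGE_SIZE_4K    (1ULL << 12)
#define PAGES_PER_2M    512

/* Hypothetical page types matching the three cases in the proposal. */
enum page_type { PT_SYSTEM = 0, PT_LOCAL = 1, PT_REMOTE_XGMI = 2 };

/* Tag an address with its page type. */
static uint64_t encode_page(uint64_t addr, enum page_type type)
{
    assert((addr & PAGE_TYPE_MASK) == 0);
    return addr | ((uint64_t)type << PAGE_TYPE_SHIFT);
}

static enum page_type page_type_of(uint64_t entry)
{
    return (enum page_type)(entry >> PAGE_TYPE_SHIFT);
}

/* Index the per-type flags array with the high bits of the entry. */
static uint64_t flags_for(const uint64_t *flags, uint64_t entry)
{
    return flags[page_type_of(entry)];
}

/*
 * Number of entries, starting at 'start', whose encoded values advance by
 * exactly one 4KB page. Because the type is part of the value, a change of
 * page type ends the run just like a physical discontinuity does.
 */
static size_t contiguous_run(const uint64_t *pages_addr, size_t start, size_t n)
{
    size_t i = start + 1;

    while (i < n && pages_addr[i] == pages_addr[i - 1] + PAGE_SIZE_4K)
        i++;
    return i - start;
}

/* A 2MB mapping is possible only for an aligned, uniform run of 512 pages. */
static bool can_map_2m(const uint64_t *pages_addr, size_t start, size_t n)
{
    if (start + PAGES_PER_2M > n)
        return false;
    if ((pages_addr[start] & ~PAGE_TYPE_MASK) & ((1ULL << 21) - 1))
        return false;
    return contiguous_run(pages_addr, start, start + PAGES_PER_2M) ==
           PAGES_PER_2M;
}
```

With this scheme an existing caller would pass a one-element flags array
and untagged addresses (type 0), so nothing changes for it.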

Regards,
  Felix

>
> But I think we could also add those later on.
>
> Regards,
> Christian.
>
>>
>> Regards,
>>    Felix
>>
>>
>>> We would somehow need to change that to get the flags directly from
>>> the low bits of the DMA address or something instead.
>>>
>>> Christian.
>>>
>>>> Once this is done, it leads to a number of opportunities for
>>>> simplification and better efficiency in kfd_svm:
>>>>
>>>>     * Support migration when cpages != npages
>>>>     * Migrate a part of an svm_range without splitting it. No more
>>>>       splitting of ranges in CPU page faults
>>>>     * Migrate a part of an svm_range in GPU page fault handler. No
>>>>       more migrating the whole range for a single page fault
>>>>     * Simplified VRAM management (see below)
>>>>
>>>> With that, svm_range will no longer have an "actual_loc" field. If
>>>> we're not sure where the data is, we need to call migrate. If it's
>>>> already in the right place, then cpages will be 0 and we can skip all
>>>> the steps after migrate_vma_setup.
>>>>
>>>> Simplified VRAM management
>>>> ==========================
>>>>
>>>> VRAM BOs are no longer associated with pranges. Instead they are
>>>> "free-floating", allocated during migration to VRAM, with a reference
>>>> count taken for each page that uses the BO. The ref is released in
>>>> the page-release callback. When the ref count drops to 0, free the BO.
>>>>
>>>> VRAM BO size should match the migration granularity, 2MB by default.
>>>> That way the BO can be freed when memory gets migrated out. If
>>>> migration of some pages fails, the BO may not be fully occupied. Also,
>>>> some pages may be released individually on A+A due to COW or other
>>>> events.
>>>>
>>>> Eviction needs to migrate all the pages still using the BO. If the BO
>>>> struct keeps an array of page pointers, that's basically the
>>>> migrate.src for the eviction. Migration calls "try_to_unmap", which
>>>> has the best chance of freeing the BO, even when shared by multiple
>>>> processes.
>>>>
>>>> If we cannot guarantee eviction of pages, we cannot use TTM for VRAM
>>>> allocations. Instead we need to use amdgpu_vram_mgr. We also need a
>>>> way to detect memory pressure so we can start evicting memory.
>>>>
>>>> Regards,
>>>>     Felix
>>>>
>
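
For reference, the per-page reference counting described under
"Simplified VRAM management" in the quoted proposal could be sketched
roughly as follows. This is a plain userspace model, not kernel code:
struct vram_bo, its fields, and the three functions are hypothetical
names for this example, there is no locking, and "freeing" the BO is
modelled by setting a flag.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define BO_PAGES 512 /* 2MB BO made of 4KB pages */

/* Hypothetical free-floating VRAM BO, not tied to any svm_range. */
struct vram_bo {
    unsigned int refcount;  /* one reference per page using the BO */
    bool freed;             /* stands in for actually freeing the BO */
    void *pages[BO_PAGES];  /* page pointers; NULL means slot unused */
};

/* Called when a page is migrated into slot 'idx' of the BO. */
static void bo_page_get(struct vram_bo *bo, size_t idx, void *page)
{
    assert(idx < BO_PAGES && !bo->freed && bo->pages[idx] == NULL);
    bo->pages[idx] = page;
    bo->refcount++;
}

/*
 * Models the page-release callback: drop the page's reference and free
 * the BO once the last page is gone. Returns true if the BO was freed.
 */
static bool bo_page_release(struct vram_bo *bo, size_t idx)
{
    assert(idx < BO_PAGES && bo->pages[idx] != NULL);
    bo->pages[idx] = NULL;
    if (--bo->refcount == 0)
        bo->freed = true;
    return bo->freed;
}

/*
 * Collect the pages still using the BO. For eviction, this list plays
 * the role of the migration source array. Returns the number of pages.
 */
static size_t bo_eviction_src(const struct vram_bo *bo, void **src)
{
    size_t n = 0;

    for (size_t i = 0; i < BO_PAGES; i++)
        if (bo->pages[i])
            src[n++] = bo->pages[i];
    return n;
}
```

A partially occupied BO (some pages failed to migrate, or were released
individually) simply has fewer references and a shorter eviction list.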