Need to support mixed memory mappings with HMM

Thu Mar 25 16:22:30 UTC 2021

Am 25.03.21 um 17:20 schrieb Felix Kuehling:
> Am 2021-03-25 um 12:16 p.m. schrieb Christian König:
>> Am 25.03.21 um 17:14 schrieb Felix Kuehling:
>>> Am 2021-03-25 um 12:10 p.m. schrieb Christian König:
>>>> Am 25.03.21 um 17:03 schrieb Felix Kuehling:
>>>>> Hi,
>>>>>
>>>>> This is a long one with a proposal for a pretty significant
>>>>> redesign of
>>>>> how we handle migrations and VRAM management with HMM. This is
>>>>> based on
>>>>> my own debugging and reading of the migrate_vma helpers, as well as
>>>>> Alex's problems with migrations on A+A. I hope we can discuss this
>>>>> next
>>>>> Monday after you've had some time do digest it.
>>>>>
>>>>> I did some debugging yesterday and found that migrations to VRAM can
>>>>> fail for some pages. The current migration helpers have many corner
>>>>> cases where a page cannot be migrated. Some of them may be fixable
>>>>> (adding support for THP), others are not (locked pages are skipped to
>>>>> avoid deadlocks). Therefore I think our current code is too inflexible
>>>>> when it assumes that a range is entirely in one place.
>>>>>
>>>>> Alex also ran into some funny issues with COW on A+A where some pages
>>>>> get faulted back to system memory. I think a lot of the problems here
>>>>> will get easier once we support mixed mappings.
>>>>>
>>>>> Mixed GPU mappings
>>>>> ===========
>>>>>
>>>>> The idea is, to remove any assumptions that an entire svm_range is in
>>>>> one place. Instead hmm_range_fault gives us a list of pages, some of
>>>>> which are system memory and others are device_private or
>>>>> device_generic.
>>>>>
>>>>> We will need an amdgpu_vm interface that lets us map mixed page arrays
>>>>> where different pages use different PTE flags. We can have at least 3
>>>>> different types of pages in one mapping:
>>>>>
>>>>>     1. System memory (S-bit set)
>>>>>     2. Local memory (S-bit cleared, MTYPE for local memory)
>>>>>     3. Remote XGMI memory (S-bit cleared, MTYPE+C for remote memory)
>>>>>
>>>>> My idea is to change the amdgpu_vm_update_mapping interface to use
>>>>> some
>>>>> high-bit in the pages_addr array to indicate the page type. For the
>>>>> default page type (0) nothing really changes for the callers. The
>>>>> "flags" parameter needs to become a pointer to an array that gets
>>>>> indexed by the high bits from the pages_addr array. For existing
>>>>> callers
>>>>> it's as easy as changing flags to &flags (array of size 1). For HMM we
>>>>> would pass a pointer to a real array.
>>>> Yeah, I've thought about stuff like that as well for a while.
>>>>
>>>> Problem is that this won't work that easily. We assume at many places
>>>> that the flags doesn't change for the range in question.
>>> I think some lower level functions assume that the flags stay the same
>>> for physically contiguous ranges. But if you use the high-bits to encode
>>> the page type, the ranges won't be contiguous any more. So you can
>>> change page flags for different contiguous ranges.
>> Yeah, but then you also get absolutely zero THP and fragment flags
>> support.
> As long as you have a contiguous 2MB page with the same page type, I
> think you can still get a THP mapping in the GPU page table. But if one
> page in the middle of a 2MB page has a different page type, that will
> break the THP mapping, as it should.

Yeah, but currently we detect that before we call down into the 
functions to update the tables.

When you give a mixed list the chance for that is just completely gone.

Regards,
Christian.

>
> Regards,
>    Felix
>
>
>> But I think we could also add those later on.
>>
>> Regards,
>> Christian.
>>
>>> Regards,
>>>     Felix
>>>
>>>
>>>> We would somehow need to change that to get the flags directly from
>>>> the low bits of the DMA address or something instead.
>>>>
>>>> Christian.
>>>>
>>>>> Once this is done, it leads to a number of opportunities for
>>>>> simplification and better efficiency in kfd_svm:
>>>>>
>>>>>      * Support migration when cpages != npages
>>>>>      * Migrate a part of an svm_range without splitting it. No more
>>>>>        splitting of ranges in CPU page faults
>>>>>      * Migrate a part of an svm_range in GPU page fault handler. No
>>>>> more
>>>>>        migrating the whole range for a single page fault
>>>>>      * Simplified VRAM management (see below)
>>>>>
>>>>> With that, svm_range will no longer have an "actual_loc" field. If
>>>>> we're
>>>>> not sure where the data is, we need to call migrate. If it's
>>>>> already in
>>>>> the right place, then cpages will be 0 and we can skip all the steps
>>>>> after migrate_vma_setup.
>>>>>
>>>>> Simplified VRAM management
>>>>> ==============
>>>>>
>>>>> VRAM BOs are no longer associated with pranges. Instead they are
>>>>> "free-floating", allocated during migration to VRAM, with reference
>>>>> count for each page that uses the BO. Ref is released in page-release
>>>>> callback. When the ref count drops to 0, free the BO.
>>>>>
>>>>> VRAM BO size should match the migration granularity, 2MB by default.
>>>>> That way the BO can be freed when memory gets migrated out. If
>>>>> migration
>>>>> of some pages fails the BO may not be fully occupied. Also some pages
>>>>> may be released individually on A+A due to COW or other events.
>>>>>
>>>>> Eviction needs to migrate all the pages still using the BO. If the BO
>>>>> struct keeps an array of page pointers, that's basically the
>>>>> migrate.src
>>>>> for the eviction. Migration calls "try_to_unmap", which has the best
>>>>> chance of freeing the BO, even when shared by multiple processes.
>>>>>
>>>>> If we cannot guarantee eviction of pages, we cannot use TTM for VRAM
>>>>> allocations. Need to use amdgpu_vram_mgr. Need a way to detect memory
>>>>> pressure so we can start evicting memory.
>>>>>
>>>>> Regards,
>>>>>      Felix
>>>>>