[PATCH] drm/amdgpu: remove PT BOs when unmapping

Tue Nov 5 18:51:52 UTC 2019

Hi Eric,

Ah! Yeah that is a well known issue.

Basic problem is that for releasing the BOs we need to reserve them to 
check if they are idle or not.

I've got a branch with a TTM change to avoid that, but essentially that 
is a huge problem which needs a rather big change in memory management 
to fix.

Regards,
Christian.

Am 05.11.19 um 17:27 schrieb Huang, JinHuiEric:
> Hi Christian,
>
> I found the reason why page tables are not freed when unmapping. All the
> pts are reserved, then they are not freed until vm fini. So the
> consequences are old pts and new pts for the same VAs will exist till vm
> fini. In KFD big buffer strees test, multiple times of mapping and
> unmapping a big range of system memory causes huge vram pts usage
> accumulation.
>
> I tried to avoid generating duplicated pts during unmapping in
> amdgpu_vm_update_ptes() by skipping amdgpu_vm_free_pts() and not
> reserving the lowest pts, but they didn't work with VM fault. The only
> way working is skipping whole function amdgpu_vm_update_ptes(), but it
> seems wrong, because we have to update GPU VM MMU.
>
> So there is no bug in amdgpu_vm_update_ptes(), but the accumulation of
> pts vram usage is an overhead. Do you think what we can do to get better
> solution?
>
> Regards,
>
> Eric
>
> On 2019-10-31 10:33 a.m., Huang, JinHuiEric wrote:
>> The hardware is vega10 and test is KFDMemoryTest.BigBufferStressTest.
>> More detail is on Jira SWDEV-201443.
>>
>> Regards,
>>
>> Eric
>>
>> On 2019-10-31 10:08 a.m., StDenis, Tom wrote:
>>> I could try it on my carrizo/polaris setup.  Is there a test procedure I
>>> could folllow to trigger the changed code paths?
>>>
>>>
>>> Tom
>>>
>>> On 2019-10-31 6:41 a.m., Koenig, Christian wrote:
>>>> Just tested this and amdgpu_vm_update_ptes() indeed works as expected.
>>>>
>>>> When you free at least a 2MB the lowest level of page tables is freed
>>>> up again.
>>>>
>>>> BTW: What hardware have you tested this on? On gfx8 and older it is
>>>> expected that page tables are never freed.
>>>>
>>>> Regards,
>>>> Christian.
>>>>
>>>> Am 30.10.19 um 19:11 schrieb Christian König:
>>>>> Then I don't see how this patch actually changes anything.
>>>>>
>>>>> Could only be a bug in amdgpu_vm_update_ptes(). Going to investigate
>>>>> this, but I won't have time to look into the ticket in detail.
>>>>>
>>>>> Regards,
>>>>> Christian.
>>>>>
>>>>> Am 30.10.19 um 19:00 schrieb Huang, JinHuiEric:
>>>>>> Actually I do prevent to remove in-use pts by this:
>>>>>>
>>>>>> +               r = amdgpu_vm_remove_ptes(adev, vm,
>>>>>> +                               (mapping->start + 0x1ff) & (~0x1ffll),
>>>>>> +                               (mapping->last + 1) & (~0x1ffll));
>>>>>>
>>>>>> Which is only removing aligned page table for 2M. And I have tested
>>>>>> it at least on KFD tests without anything broken.
>>>>>>
>>>>>> By the way, I am not familiar with memory staff. This patch is the
>>>>>> best I can do for now. Could you take a look at the Jira ticket
>>>>>> SWDEV-201443 ? and find the better solution. Thanks!
>>>>>>
>>>>>> Regards,
>>>>>>
>>>>>> Eric
>>>>>>
>>>>>> On 2019-10-30 1:57 p.m., Christian König wrote:
>>>>>>> One thing I've forgotten:
>>>>>>>
>>>>>>> What you could maybe do to improve the situation is to join
>>>>>>> adjacent ranges in amdgpu_vm_clear_freed(), but I'm not sure how
>>>>>>> the chances are that the ranges are freed all together.
>>>>>>>
>>>>>>> The only other alternative I can see would be to check the mappings
>>>>>>> of a range in amdgpu_update_ptes() and see if you could walk the
>>>>>>> tree up if the valid flag is not set and there are no mappings left
>>>>>>> for a page table.
>>>>>>>
>>>>>>> Regards,
>>>>>>> Christian.
>>>>>>>
>>>>>>> Am 30.10.19 um 18:42 schrieb Koenig, Christian:
>>>>>>>>> The vaild flag doesn't take effect in this function.
>>>>>>>> That's irrelevant.
>>>>>>>>
>>>>>>>> See what amdgpu_vm_update_ptes() does is to first determine the
>>>>>>>> fragment size:
>>>>>>>>> amdgpu_vm_fragment(params, frag_start, end, flags, &frag, &frag_end);
>>>>>>>> Then we walk down the tree:
>>>>>>>>>            amdgpu_vm_pt_start(adev, params->vm, start, &cursor);
>>>>>>>>>            while (cursor.pfn < end) {
>>>>>>>> And make sure that the page tables covering the address range are
>>>>>>>> actually allocated:
>>>>>>>>>                    r = amdgpu_vm_alloc_pts(params->adev, params->vm,
>>>>>>>>> &cursor);
>>>>>>>> Then we update the tables with the flags and addresses and free up
>>>>>>>> subsequent tables in the case of huge pages or freed up areas:
>>>>>>>>>                            /* Free all child entries */
>>>>>>>>>                            while (cursor.pfn < frag_start) {
>>>>>>>>> amdgpu_vm_free_pts(adev, params->vm, &cursor);
>>>>>>>>> amdgpu_vm_pt_next(adev, &cursor);
>>>>>>>>>                            }
>>>>>>>> This is the maximum you can free, cause all other page tables are
>>>>>>>> not completely covered by the range and so potentially still in use.
>>>>>>>>
>>>>>>>> And I have the strong suspicion that this is what your patch is
>>>>>>>> actually doing wrong. In other words you are also freeing page
>>>>>>>> tables which are only partially covered by the range and so
>>>>>>>> potentially still in use.
>>>>>>>>
>>>>>>>> Since we don't have any tracking how many entries in a page table
>>>>>>>> are currently valid and how many are invalid we actually can't
>>>>>>>> implement what you are trying to do here. So the patch is
>>>>>>>> definitely somehow broken.
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>> Christian.
>>>>>>>>
>>>>>>>> Am 30.10.19 um 17:55 schrieb Huang, JinHuiEric:
>>>>>>>>> The vaild flag doesn't take effect in this function.
>>>>>>>>> amdgpu_vm_alloc_pts() is always executed that only depended on
>>>>>>>>> "cursor.pfn < end". The valid flag has only been checked on here
>>>>>>>>> for asic below GMC v9:
>>>>>>>>>
>>>>>>>>> if (adev->asic_type < CHIP_VEGA10 &&
>>>>>>>>>                (flags & AMDGPU_PTE_VALID))...
>>>>>>>>>
>>>>>>>>> Regards,
>>>>>>>>>
>>>>>>>>> Eric
>>>>>>>>>
>>>>>>>>> On 2019-10-30 12:30 p.m., Koenig, Christian wrote:
>>>>>>>>>> Am 30.10.2019 17:19 schrieb "Huang, JinHuiEric"
>>>>>>>>>> <JinHuiEric.Huang at amd.com>:
>>>>>>>>>>
>>>>>>>>>>        I tested it that it saves a lot of vram on KFD big buffer
>>>>>>>>>>        stress test. I think there are two reasons:
>>>>>>>>>>
>>>>>>>>>>        1. Calling amdgpu_vm_update_ptes() during unmapping will
>>>>>>>>>>        allocate unnecessary pts, because there is no flag to
>>>>>>>>>>        determine if the VA is mapping or unmapping in function
>>>>>>>>>>        amdgpu_vm_update_ptes(). It saves the most of memory.
>>>>>>>>>>
>>>>>>>>>> That's not correct. The valid flag is used for this.
>>>>>>>>>>
>>>>>>>>>>        2. Intentionally removing those unmapping pts is logical
>>>>>>>>>>        expectation, although it is not removing so much pts.
>>>>>>>>>>
>>>>>>>>>> Well I actually don't see a change to what update_ptes is doing
>>>>>>>>>> and have the strong suspicion that the patch is simply broken.
>>>>>>>>>>
>>>>>>>>>> You either free page tables which are potentially still in use
>>>>>>>>>> or update_pte doesn't free page tables when the valid but is not
>>>>>>>>>> set.
>>>>>>>>>>
>>>>>>>>>> Regards,
>>>>>>>>>> Christian.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>        Regards,
>>>>>>>>>>
>>>>>>>>>>        Eric
>>>>>>>>>>
>>>>>>>>>>        On 2019-10-30 11:57 a.m., Koenig, Christian wrote:
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>            Am 30.10.2019 16:47 schrieb "Kuehling, Felix"
>>>>>>>>>>            <Felix.Kuehling at amd.com> <mailto:Felix.Kuehling at amd.com>:
>>>>>>>>>>
>>>>>>>>>>                On 2019-10-30 9:52 a.m., Christian König wrote:
>>>>>>>>>>                > Am 29.10.19 um 21:06 schrieb Huang, JinHuiEric:
>>>>>>>>>>                >> The issue is PT BOs are not freed when unmapping VA,
>>>>>>>>>>                >> which causes vram usage accumulated is huge in some
>>>>>>>>>>                >> memory stress test, such as kfd big buffer stress
>>>>>>>>>>                test.
>>>>>>>>>>                >> Function amdgpu_vm_bo_update_mapping() is called
>>>>>>>>>>                by both
>>>>>>>>>>                >> amdgpu_vm_bo_update() and
>>>>>>>>>>                amdgpu_vm_clear_freed(). The
>>>>>>>>>>                >> solution is replacing
>>>>>>>>>>                amdgpu_vm_bo_update_mapping() in
>>>>>>>>>>                >> amdgpu_vm_clear_freed() with removing PT BOs function
>>>>>>>>>>                >> to save vram usage.
>>>>>>>>>>                >
>>>>>>>>>>                > NAK, that is intentional behavior.
>>>>>>>>>>                >
>>>>>>>>>>                > Otherwise we can run into out of memory situations
>>>>>>>>>>                when page tables
>>>>>>>>>>                > need to be allocated again under stress.
>>>>>>>>>>
>>>>>>>>>>                That's a bit arbitrary and inconsistent. We are
>>>>>>>>>>                freeing page tables in
>>>>>>>>>>                other situations, when a mapping uses huge pages in
>>>>>>>>>>                amdgpu_vm_update_ptes. Why not when a mapping is
>>>>>>>>>>                destroyed completely?
>>>>>>>>>>
>>>>>>>>>>                I'm actually a bit surprised that the huge-page
>>>>>>>>>>                handling in
>>>>>>>>>>                amdgpu_vm_update_ptes isn't kicking in to free up
>>>>>>>>>>                lower-level page
>>>>>>>>>>                tables when a BO is unmapped.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>            Well it does free the lower level, and that is already
>>>>>>>>>>            causing problems (that's why I added the reserved space).
>>>>>>>>>>
>>>>>>>>>>            What we don't do is freeing the higher levels.
>>>>>>>>>>
>>>>>>>>>>            E.g. when you free a 2MB BO we free the lowest level, if
>>>>>>>>>>            we free a 1GB BO we free the two lowest levels etc...
>>>>>>>>>>
>>>>>>>>>>            The problem with freeing the higher levels is that you
>>>>>>>>>>            don't know who is also using this. E.g. we would need to
>>>>>>>>>>            check all entries when we unmap one.
>>>>>>>>>>
>>>>>>>>>>            It's simply not worth it for a maximum saving of 2MB per VM.
>>>>>>>>>>
>>>>>>>>>>            Writing this I'm actually wondering how you ended up in
>>>>>>>>>>            this issue? There shouldn't be much savings from this.
>>>>>>>>>>
>>>>>>>>>>            Regards,
>>>>>>>>>>            Christian.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>                Regards,
>>>>>>>>>>                   Felix
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>                >
>>>>>>>>>>                > Regards,
>>>>>>>>>>                > Christian.
>>>>>>>>>>                >
>>>>>>>>>>                >>
>>>>>>>>>>                >> Change-Id: Ic24e35bff8ca85265b418a642373f189d972a924
>>>>>>>>>>                >> Signed-off-by: Eric Huang
>>>>>>>>>>                <JinhuiEric.Huang at amd.com>
>>>>>>>>>>                <mailto:JinhuiEric.Huang at amd.com>
>>>>>>>>>>                >> ---
>>>>>>>>>>                >> drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 56
>>>>>>>>>>                >> +++++++++++++++++++++++++++++-----
>>>>>>>>>>                >>   1 file changed, 48 insertions(+), 8 deletions(-)
>>>>>>>>>>                >>
>>>>>>>>>>                >> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
>>>>>>>>>>                >> b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
>>>>>>>>>>                >> index 0f4c3b2..8a480c7 100644
>>>>>>>>>>                >> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
>>>>>>>>>>                >> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
>>>>>>>>>>                >> @@ -1930,6 +1930,51 @@ static void
>>>>>>>>>>                amdgpu_vm_prt_fini(struct
>>>>>>>>>>                >> amdgpu_device *adev, struct amdgpu_vm *vm)
>>>>>>>>>>                >>   }
>>>>>>>>>>                >>     /**
>>>>>>>>>>                >> + * amdgpu_vm_remove_ptes - free PT BOs
>>>>>>>>>>                >> + *
>>>>>>>>>>                >> + * @adev: amdgpu device structure
>>>>>>>>>>                >> + * @vm: amdgpu vm structure
>>>>>>>>>>                >> + * @start: start of mapped range
>>>>>>>>>>                >> + * @end: end of mapped entry
>>>>>>>>>>                >> + *
>>>>>>>>>>                >> + * Free the page table level.
>>>>>>>>>>                >> + */
>>>>>>>>>>                >> +static int amdgpu_vm_remove_ptes(struct
>>>>>>>>>>                amdgpu_device *adev,
>>>>>>>>>>                >> + struct amdgpu_vm *vm, uint64_t start, uint64_t end)
>>>>>>>>>>                >> +{
>>>>>>>>>>                >> +    struct amdgpu_vm_pt_cursor cursor;
>>>>>>>>>>                >> +    unsigned shift, num_entries;
>>>>>>>>>>                >> +
>>>>>>>>>>                >> + amdgpu_vm_pt_start(adev, vm, start, &cursor);
>>>>>>>>>>                >> +    while (cursor.level < AMDGPU_VM_PTB) {
>>>>>>>>>>                >> +        if (!amdgpu_vm_pt_descendant(adev, &cursor))
>>>>>>>>>>                >> + return -ENOENT;
>>>>>>>>>>                >> +    }
>>>>>>>>>>                >> +
>>>>>>>>>>                >> +    while (cursor.pfn < end) {
>>>>>>>>>>                >> + amdgpu_vm_free_table(cursor.entry);
>>>>>>>>>>                >> + num_entries = amdgpu_vm_num_entries(adev,
>>>>>>>>>>                cursor.level - 1);
>>>>>>>>>>                >> +
>>>>>>>>>>                >> +        if (cursor.entry !=
>>>>>>>>>>                &cursor.parent->entries[num_entries - 1]) {
>>>>>>>>>>                >> + /* Next ptb entry */
>>>>>>>>>>                >> + shift = amdgpu_vm_level_shift(adev,
>>>>>>>>>>                cursor.level - 1);
>>>>>>>>>>                >> + cursor.pfn += 1ULL << shift;
>>>>>>>>>>                >> + cursor.pfn &= ~((1ULL << shift) - 1);
>>>>>>>>>>                >> + cursor.entry++;
>>>>>>>>>>                >> +        } else {
>>>>>>>>>>                >> + /* Next ptb entry in next pd0 entry */
>>>>>>>>>>                >> + amdgpu_vm_pt_ancestor(&cursor);
>>>>>>>>>>                >> + shift = amdgpu_vm_level_shift(adev,
>>>>>>>>>>                cursor.level - 1);
>>>>>>>>>>                >> + cursor.pfn += 1ULL << shift;
>>>>>>>>>>                >> + cursor.pfn &= ~((1ULL << shift) - 1);
>>>>>>>>>>                >> + amdgpu_vm_pt_descendant(adev, &cursor);
>>>>>>>>>>                >> +        }
>>>>>>>>>>                >> +    }
>>>>>>>>>>                >> +
>>>>>>>>>>                >> +    return 0;
>>>>>>>>>>                >> +}
>>>>>>>>>>                >> +
>>>>>>>>>>                >> +/**
>>>>>>>>>>                >>    * amdgpu_vm_clear_freed - clear freed BOs in
>>>>>>>>>>                the PT
>>>>>>>>>>                >>    *
>>>>>>>>>>                >>    * @adev: amdgpu_device pointer
>>>>>>>>>>                >> @@ -1949,7 +1994,6 @@ int
>>>>>>>>>>                amdgpu_vm_clear_freed(struct amdgpu_device
>>>>>>>>>>                >> *adev,
>>>>>>>>>>                >>                 struct dma_fence **fence)
>>>>>>>>>>                >>   {
>>>>>>>>>>                >>       struct amdgpu_bo_va_mapping *mapping;
>>>>>>>>>>                >> -    uint64_t init_pte_value = 0;
>>>>>>>>>>                >>       struct dma_fence *f = NULL;
>>>>>>>>>>                >>       int r;
>>>>>>>>>>                >>   @@ -1958,13 +2002,10 @@ int
>>>>>>>>>>                amdgpu_vm_clear_freed(struct
>>>>>>>>>>                >> amdgpu_device *adev,
>>>>>>>>>>                >> struct amdgpu_bo_va_mapping, list);
>>>>>>>>>>                >> list_del(&mapping->list);
>>>>>>>>>>                >>   -        if (vm->pte_support_ats &&
>>>>>>>>>>                >> - mapping->start < AMDGPU_GMC_HOLE_START)
>>>>>>>>>>                >> - init_pte_value = AMDGPU_PTE_DEFAULT_ATC;
>>>>>>>>>>                >> +        r = amdgpu_vm_remove_ptes(adev, vm,
>>>>>>>>>>                >> + (mapping->start + 0x1ff) & (~0x1ffll),
>>>>>>>>>>                >> + (mapping->last + 1) & (~0x1ffll));
>>>>>>>>>>                >>   -        r = amdgpu_vm_bo_update_mapping(adev,
>>>>>>>>>>                vm, false, NULL,
>>>>>>>>>>                >> - mapping->start, mapping->last,
>>>>>>>>>>                >> - init_pte_value, 0, NULL, &f);
>>>>>>>>>>                >> amdgpu_vm_free_mapping(adev, vm, mapping, f);
>>>>>>>>>>                >>           if (r) {
>>>>>>>>>>                >> dma_fence_put(f);
>>>>>>>>>>                >> @@ -1980,7 +2021,6 @@ int
>>>>>>>>>>                amdgpu_vm_clear_freed(struct amdgpu_device
>>>>>>>>>>                >> *adev,
>>>>>>>>>>                >>       }
>>>>>>>>>>                >> return 0;
>>>>>>>>>>                >> -
>>>>>>>>>>                >>   }
>>>>>>>>>>                >>     /**
>>>>>>>>>>                >
>>>>>>>>>>                > _______________________________________________
>>>>>>>>>>                > amd-gfx mailing list
>>>>>>>>>>                > amd-gfx at lists.freedesktop.org
>>>>>>>>>>                <mailto:amd-gfx at lists.freedesktop.org>
>>>>>>>>>>                > https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>> _______________________________________________
>>>>>>>> amd-gfx mailing list
>>>>>>>> amd-gfx at lists.freedesktop.org
>>>>>>>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>>>> _______________________________________________
>>>> amd-gfx mailing list
>>>> amd-gfx at lists.freedesktop.org
>>>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>>> _______________________________________________
>>> amd-gfx mailing list
>>> amd-gfx at lists.freedesktop.org
>>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>> _______________________________________________
>> amd-gfx mailing list
>> amd-gfx at lists.freedesktop.org
>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx