[PATCH 1/9] drm/amdgpu: generally allow over-commit during BO allocation
Felix Kuehling
felix.kuehling at amd.com
Sun Dec 11 01:13:10 UTC 2022
On 2022-12-10 at 09:12, Christian König wrote:
> On 2022-12-10 at 07:15, Felix Kuehling wrote:
>> On 2022-11-25 05:21, Christian König wrote:
>>> We already fall back to a dummy BO with no backing store when we
>>> allocate GDS, GWS and OA resources, and to GTT when we allocate VRAM.
>>>
>>> Drop all those workarounds and generalize this for GTT as well. This
>>> fixes ENOMEM issues with runaway applications which try to
>>> allocate/free
>>> GTT in a loop and are otherwise only limited by the CPU speed.
>>>
>>> The CS will wait for the cleanup of freed up BOs to satisfy the
>>> various domain specific limits and so effectively throttle those
>>> buggy applications down to a sane allocation behavior again.
>>>
>>> Signed-off-by: Christian König <christian.koenig at amd.com>
>>
>> This patch causes some regressions in KFDTest. KFDMemoryTest.MMBench
>> sees a huge VRAM allocation slow-down. And
>> KFDMemoryTest.LargestVramBufferTest can only allocate half the
>> available memory.
>
> Mhm, I wasn't expecting that we use this for the KFD as well.
Yeah, we use amdgpu_gem_object_create. I guess we could duplicate its
functionality or add a "no_overcommit" or "greedy" parameter for our needs.
>
>>
>> This seems to be caused by initially validating VRAM BOs in the CPU
>> domain, which allocates a ttm_tt. A subsequent validation in the VRAM
>> domain involves a copy from GTT to VRAM.
>
> The idea was to initially create the BOs without any backing store.
I thought about it a bit more. I believe the BO creation without backing
store is working as expected. But amdgpu_bo_move can't move the
uninitialized BO directly from system to VRAM. It returns -EMULTIHOP. So
the BO gets moved to GTT first (allocating system memory) before it can
be migrated to VRAM. That adds a bunch of overhead with unnecessary
system memory allocation and forces all VRAM to be zero-initialized on
the CPU and copied through PCIe. I think your idea would work with
almost no overhead if amdgpu_bo_move could directly move a BO without
backing store to VRAM with ttm_bo_move_null.
Regards,
Felix
>
>>
>> After that, freeing of BOs can get delayed by the ghost object of a
>> previous migration, which delays calling release notifiers and causes
>> problems for KFD's available-memory accounting.
>>
>> I experimented with a workaround that validates BOs immediately after
>> allocation, but that only moves around the delays and doesn't solve
>> the problem. During those experiments I may also have stumbled over a
>> bug in ttm_buffer_object_transfer: It calls ttm_bo_set_bulk_move
>> before initializing and locking fbo->base.base._resv. This results in
>> a flood of warnings because ttm_bo_set_bulk_move expects the
>> reservation to be locked.
>>
>> Right now I'd like to remove the bp.domain = initial_domain |
>> AMDGPU_GEM_DOMAIN_CPU change in amdgpu_gem_object_create to fix this.
>
> Yeah, let's revert and investigate this first.
>
> Thanks,
> Christian.
>
>>
>> Regards,
>> Felix
>>
>>
>>> ---
>>> drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c | 16 +++-------------
>>> drivers/gpu/drm/amd/amdgpu/amdgpu_object.c | 6 +-----
>>> 2 files changed, 4 insertions(+), 18 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
>>> index a0780a4e3e61..62e98f1ad770 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
>>> @@ -113,7 +113,7 @@ int amdgpu_gem_object_create(struct amdgpu_device *adev, unsigned long size,
>>> bp.resv = resv;
>>> bp.preferred_domain = initial_domain;
>>> bp.flags = flags;
>>> - bp.domain = initial_domain;
>>> + bp.domain = initial_domain | AMDGPU_GEM_DOMAIN_CPU;
>>> bp.bo_ptr_size = sizeof(struct amdgpu_bo);
>>> r = amdgpu_bo_create_user(adev, &bp, &ubo);
>>> @@ -332,20 +332,10 @@ int amdgpu_gem_create_ioctl(struct drm_device *dev, void *data,
>>> }
>>> initial_domain = (u32)(0xffffffff & args->in.domains);
>>> -retry:
>>> r = amdgpu_gem_object_create(adev, size, args->in.alignment,
>>> - initial_domain,
>>> - flags, ttm_bo_type_device, resv, &gobj);
>>> + initial_domain, flags, ttm_bo_type_device,
>>> + resv, &gobj);
>>> if (r && r != -ERESTARTSYS) {
>>> - if (flags & AMDGPU_GEM_CREATE_CPU_ACCESS_REQUIRED) {
>>> - flags &= ~AMDGPU_GEM_CREATE_CPU_ACCESS_REQUIRED;
>>> - goto retry;
>>> - }
>>> -
>>> - if (initial_domain == AMDGPU_GEM_DOMAIN_VRAM) {
>>> - initial_domain |= AMDGPU_GEM_DOMAIN_GTT;
>>> - goto retry;
>>> - }
>>> DRM_DEBUG("Failed to allocate GEM object (%llu, %d, %llu, %d)\n",
>>> size, initial_domain, args->in.alignment, r);
>>> }
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
>>> index 974e85d8b6cc..919bbea2e3ac 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
>>> @@ -581,11 +581,7 @@ int amdgpu_bo_create(struct amdgpu_device *adev,
>>> bo->flags |= AMDGPU_GEM_CREATE_VRAM_WIPE_ON_RELEASE;
>>> bo->tbo.bdev = &adev->mman.bdev;
>>> - if (bp->domain & (AMDGPU_GEM_DOMAIN_GWS | AMDGPU_GEM_DOMAIN_OA |
>>> - AMDGPU_GEM_DOMAIN_GDS))
>>> - amdgpu_bo_placement_from_domain(bo, AMDGPU_GEM_DOMAIN_CPU);
>>> - else
>>> - amdgpu_bo_placement_from_domain(bo, bp->domain);
>>> + amdgpu_bo_placement_from_domain(bo, bp->domain);
>>> if (bp->type == ttm_bo_type_kernel)
>>> bo->tbo.priority = 1;
>
More information about the dri-devel mailing list