[RFC 1/5] drm/amdgpu: Fix migration rate limiting accounting

Wed May 15 10:51:49 UTC 2024

On 15/05/2024 08:14, Christian König wrote:
> Am 08.05.24 um 20:09 schrieb Tvrtko Ursulin:
>> From: Tvrtko Ursulin <tvrtko.ursulin at igalia.com>
>>
>> The logic assumed any migration attempt worked and therefore would over-
>> account the amount of data migrated during buffer re-validation. As a
>> consequence client can be unfairly penalised by incorrectly considering
>> its migration budget spent.
>>
>> Fix it by looking at the before and after buffer object backing store and
>> only account if there was a change.
>>
>> FIXME:
>> I think this needs a better solution to account for migrations between
>> VRAM visible and non-visible portions.
>>
>> Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin at igalia.com>
>> Cc: Christian König <christian.koenig at amd.com>
>> Cc: Friedrich Vock <friedrich.vock at gmx.de>
>> ---
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c | 26 +++++++++++++++++++++-----
>>   1 file changed, 21 insertions(+), 5 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c 
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
>> index ec888fc6ead8..22708954ae68 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
>> @@ -784,12 +784,15 @@ static int amdgpu_cs_bo_validate(void *param, 
>> struct amdgpu_bo *bo)
>>           .no_wait_gpu = false,
>>           .resv = bo->tbo.base.resv
>>       };
>> +    struct ttm_resource *old_res;
>>       uint32_t domain;
>>       int r;
>>       if (bo->tbo.pin_count)
>>           return 0;
>> +    old_res = bo->tbo.resource;
>> +
>>       /* Don't move this buffer if we have depleted our allowance
>>        * to move it. Don't move anything if the threshold is zero.
>>        */
>> @@ -817,16 +820,29 @@ static int amdgpu_cs_bo_validate(void *param, 
>> struct amdgpu_bo *bo)
>>       amdgpu_bo_placement_from_domain(bo, domain);
>>       r = ttm_bo_validate(&bo->tbo, &bo->placement, &ctx);
>> -    p->bytes_moved += ctx.bytes_moved;
>> -    if (!amdgpu_gmc_vram_full_visible(&adev->gmc) &&
>> -        amdgpu_res_cpu_visible(adev, bo->tbo.resource))
>> -        p->bytes_moved_vis += ctx.bytes_moved;
>> -
>>       if (unlikely(r == -ENOMEM) && domain != bo->allowed_domains) {
>>           domain = bo->allowed_domains;
>>           goto retry;
>>       }
>> +    if (!r) {
>> +        struct ttm_resource *new_res = bo->tbo.resource;
>> +        bool moved = true;
>> +
>> +        if (old_res == new_res)
>> +            moved = false;
>> +        else if (old_res && new_res &&
>> +             old_res->mem_type == new_res->mem_type)
>> +            moved = false;
> 
> The old resource might already be destroyed after you return from 
> validation. So this here won't work.
> 
> Apart from that even when a migration attempt fails the moved bytes 
> should be accounted.
> 
> When the validation attempt doesn't caused any moves then the bytecount 
> here would be zero.
> 
> So as far as I can see that is as fair as you can get.

Right, I think I suffered a bit of tunnel vision here and completely 
ignore the _ctx_.moved_bytes part. Scratch this one too then.

Regards,

Tvrtko

> 
> Regards,
> Christian.
> 
> PS: Looks like our mail servers are once more not very reliable.
> 
> If you get mails from me multiple times please just ignore it.
> 
>> +
>> +        if (moved) {
>> +            p->bytes_moved += ctx.bytes_moved;
>> +            if (!amdgpu_gmc_vram_full_visible(&adev->gmc) &&
>> +                amdgpu_res_cpu_visible(adev, bo->tbo.resource))
>> +                p->bytes_moved_vis += ctx.bytes_moved;
>> +        }
>> +    }
>> +
>>       return r;
>>   }
>