[RFC 1/5] drm/amdgpu: Fix migration rate limiting accounting

Wed May 15 07:14:44 UTC 2024

Am 08.05.24 um 20:09 schrieb Tvrtko Ursulin:
> From: Tvrtko Ursulin <tvrtko.ursulin at igalia.com>
>
> The logic assumed any migration attempt worked and therefore would over-
> account the amount of data migrated during buffer re-validation. As a
> consequence client can be unfairly penalised by incorrectly considering
> its migration budget spent.
>
> Fix it by looking at the before and after buffer object backing store and
> only account if there was a change.
>
> FIXME:
> I think this needs a better solution to account for migrations between
> VRAM visible and non-visible portions.
>
> Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin at igalia.com>
> Cc: Christian König <christian.koenig at amd.com>
> Cc: Friedrich Vock <friedrich.vock at gmx.de>
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c | 26 +++++++++++++++++++++-----
>   1 file changed, 21 insertions(+), 5 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
> index ec888fc6ead8..22708954ae68 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
> @@ -784,12 +784,15 @@ static int amdgpu_cs_bo_validate(void *param, struct amdgpu_bo *bo)
>   		.no_wait_gpu = false,
>   		.resv = bo->tbo.base.resv
>   	};
> +	struct ttm_resource *old_res;
>   	uint32_t domain;
>   	int r;
>   
>   	if (bo->tbo.pin_count)
>   		return 0;
>   
> +	old_res = bo->tbo.resource;
> +
>   	/* Don't move this buffer if we have depleted our allowance
>   	 * to move it. Don't move anything if the threshold is zero.
>   	 */
> @@ -817,16 +820,29 @@ static int amdgpu_cs_bo_validate(void *param, struct amdgpu_bo *bo)
>   	amdgpu_bo_placement_from_domain(bo, domain);
>   	r = ttm_bo_validate(&bo->tbo, &bo->placement, &ctx);
>   
> -	p->bytes_moved += ctx.bytes_moved;
> -	if (!amdgpu_gmc_vram_full_visible(&adev->gmc) &&
> -	    amdgpu_res_cpu_visible(adev, bo->tbo.resource))
> -		p->bytes_moved_vis += ctx.bytes_moved;
> -
>   	if (unlikely(r == -ENOMEM) && domain != bo->allowed_domains) {
>   		domain = bo->allowed_domains;
>   		goto retry;
>   	}
>   
> +	if (!r) {
> +		struct ttm_resource *new_res = bo->tbo.resource;
> +		bool moved = true;
> +
> +		if (old_res == new_res)
> +			moved = false;
> +		else if (old_res && new_res &&
> +			 old_res->mem_type == new_res->mem_type)
> +			moved = false;

The old resource might already be destroyed after you return from 
validation. So this here won't work.

Apart from that even when a migration attempt fails the moved bytes 
should be accounted.

When the validation attempt doesn't caused any moves then the bytecount 
here would be zero.

So as far as I can see that is as fair as you can get.

Regards,
Christian.

PS: Looks like our mail servers are once more not very reliable.

If you get mails from me multiple times please just ignore it.

> +
> +		if (moved) {
> +			p->bytes_moved += ctx.bytes_moved;
> +			if (!amdgpu_gmc_vram_full_visible(&adev->gmc) &&
> +			    amdgpu_res_cpu_visible(adev, bo->tbo.resource))
> +				p->bytes_moved_vis += ctx.bytes_moved;
> +		}
> +	}
> +
>   	return r;
>   }
>