[PATCH 2/2] drm/amdgpu: Mark ctx as guilty in ring_soft_recovery path

André Almeida andrealmeid at igalia.com
Sat Jan 13 21:35:35 UTC 2024


Hi Joshua,

On 13/01/2024 11:02, Joshua Ashton wrote:
> We need to bump the karma of the drm_sched job in order for the context
> that we just recovered to get correct feedback that it is guilty of
> hanging.
> 
> Without this feedback, the application may keep pushing through the soft
> recoveries, continually hanging the system with jobs that time out.
> 
> There is an accompanying Mesa/RADV patch here
> https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/27050
> to properly handle device loss state when VRAM is not lost.
> 
> With these, I was able to run Counter-Strike 2 and launch an application
> which can fault the GPU in a variety of ways, and still have Steam +
> Counter-Strike 2 + Gamescope (compositor) stay up and continue
> functioning on Steam Deck.
> 

I sent a similar patch in the past; you may find the discussion
interesting:

https://lore.kernel.org/lkml/20230424014324.218531-1-andrealmeid@igalia.com/

> Signed-off-by: Joshua Ashton <joshua at froggi.es>
> 
> Cc: Friedrich Vock <friedrich.vock at gmx.de>
> Cc: Bas Nieuwenhuizen <bas at basnieuwenhuizen.nl>
> Cc: Christian König <christian.koenig at amd.com>
> Cc: André Almeida <andrealmeid at igalia.com>
> Cc: stable at vger.kernel.org
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c | 2 ++
>   1 file changed, 2 insertions(+)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
> index 25209ce54552..e87cafb5b1c3 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
> @@ -448,6 +448,8 @@ bool amdgpu_ring_soft_recovery(struct amdgpu_ring *ring, struct amdgpu_job *job)
>   		dma_fence_set_error(fence, -ENODATA);
>   	spin_unlock_irqrestore(fence->lock, flags);
>   
> +	if (job->vm)
> +		drm_sched_increase_karma(&job->base);
>   	atomic_inc(&ring->adev->gpu_reset_counter);
>   	while (!dma_fence_is_signaled(fence) &&
>   	       ktime_to_ns(ktime_sub(deadline, ktime_get())) > 0)
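
For readers following the thread, the feedback loop described in the commit
message can be sketched in plain userspace C. This is only a toy model, not
kernel code: `toy_ctx`, `toy_job`, `toy_increase_karma`, `toy_soft_recovery`
and `toy_ctx_is_guilty` are hypothetical names invented for illustration, and
the sketch assumes that bumping a job's karma is what flags the owning context
as guilty. Only the shape of the `if (job->vm)` guard is taken from the diff;
here a NULL `ctx` stands in for a job without a VM.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a userspace GPU context. */
struct toy_ctx {
	bool guilty;	/* set once one of its jobs is blamed for a hang */
};

/* Hypothetical stand-in for a scheduler job. */
struct toy_job {
	struct toy_ctx *ctx;	/* NULL models a job with no VM */
	int karma;		/* number of times this job was blamed */
};

/* Assumed behaviour: blaming a job also flags the context that owns it. */
static void toy_increase_karma(struct toy_job *job)
{
	assert(job != NULL);
	job->karma++;
	if (job->ctx)
		job->ctx->guilty = true;
}

/*
 * Toy soft recovery. The fence error and the wave kill done by the real
 * function are omitted. `with_patch` selects between the behaviour before
 * the patch (no blame recorded) and after it (karma bumped for VM jobs).
 */
static bool toy_soft_recovery(struct toy_job *job, bool with_patch)
{
	assert(job != NULL);
	if (with_patch && job->ctx)
		toy_increase_karma(job);
	return true;	/* pretend the recovery itself always succeeds */
}

/* What an application-side query would observe in this model. */
static bool toy_ctx_is_guilty(const struct toy_ctx *ctx)
{
	assert(ctx != NULL);
	return ctx->guilty;
}
```

In this model, calling `toy_soft_recovery(&job, false)` leaves the context
looking innocent, so an application polling `toy_ctx_is_guilty()` would keep
submitting the same hanging work. With `with_patch` set to true, the first
recovery marks the context guilty and the application can react to it.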

