[PATCH 2/2] drm/amdgpu: Mark ctx as guilty in ring_soft_recovery path

Sat Jan 13 22:55:39 UTC 2024

+Marek

On 1/13/24 21:35, André Almeida wrote:
> Hi Joshua,
> 
> Em 13/01/2024 11:02, Joshua Ashton escreveu:
>> We need to bump the karma of the drm_sched job in order for the context
>> that we just recovered to get correct feedback that it is guilty of
>> hanging.
>>
>> Without this feedback, the application may keep pushing through the soft
>> recoveries, continually hanging the system with jobs that timeout.
>>
>> There is an accompanying Mesa/RADV patch here
>> https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/27050
>> to properly handle device loss state when VRAM is not lost.
>>
>> With these, I was able to run Counter-Strike 2 and launch an application
>> which can fault the GPU in a variety of ways, and still have Steam +
>> Counter-Strike 2 + Gamescope (compositor) stay up and continue
>> functioning on Steam Deck.
>>
> 
> I sent a similar patch in the past, maybe you find the discussion 
> interesting:
> 
> https://lore.kernel.org/lkml/20230424014324.218531-1-andrealmeid@igalia.com/

Thanks, I had a peruse through that old thread.

Marek definitely had the right idea here, given he mentions:
"That supposedly depends on the compositor. There may be compositors for
very specific cases (e.g. Steam Deck)"

Given that is what I work on and also wrote this patch for that does 
basically the same thing as was proposed. :-)

For context though, I am less interested in Gamescope (the Steam Deck 
compositor) hanging (we don't have code that hangs, if we go down, it's 
likely Steam/CEF died with us anyway atm, can solve that battle some 
other day) and more about the applications run under it.

Marek is very right when he says applications that fault/hang will 
submit one IB after another that also fault/hang -- especially if they 
write to descriptors from the GPU (descriptor buffers), or use draw 
indirect or anything bindless or...
That's basically functionally equivalent to DOSing a system if it is not 
prevented.

And that's exactly what I see even in a simple test app doing a fault -> 
hang every frame.

Right now, given that soft recovery never marks a context as guilty, it 
means that every app I tested is never stopped from submitting garbage 
That holds true for basically any app that GPU hangs and makes soft 
recovery totally useless in my testing without this.

(That being said, without my patches, RADV treats *any* reset from the 
query as VK_ERROR_DEVICE_LOST, even if there was no VRAM lost and it was 
not guilty, so any faulting/hanging application causes every Vulkan app 
to die e_e. This is fixed in 
https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/27050 )

- Joshie 🐸✨

> 
>> Signed-off-by: Joshua Ashton <joshua at froggi.es>
>>
>> Cc: Friedrich Vock <friedrich.vock at gmx.de>
>> Cc: Bas Nieuwenhuizen <bas at basnieuwenhuizen.nl>
>> Cc: Christian König <christian.koenig at amd.com>
>> Cc: André Almeida <andrealmeid at igalia.com>
>> Cc: stable at vger.kernel.org
>> ---
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c | 2 ++
>>   1 file changed, 2 insertions(+)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c 
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
>> index 25209ce54552..e87cafb5b1c3 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
>> @@ -448,6 +448,8 @@ bool amdgpu_ring_soft_recovery(struct amdgpu_ring 
>> *ring, struct amdgpu_job *job)
>>           dma_fence_set_error(fence, -ENODATA);
>>       spin_unlock_irqrestore(fence->lock, flags);
>> +    if (job->vm)
>> +        drm_sched_increase_karma(&job->base);
>>       atomic_inc(&ring->adev->gpu_reset_counter);
>>       while (!dma_fence_is_signaled(fence) &&
>>              ktime_to_ns(ktime_sub(deadline, ktime_get())) > 0)