[PATCH 2/2] drm/amdgpu: Mark ctx as guilty in ring_soft_recovery path

Joshua Ashton joshua at froggi.es
Mon Jan 15 11:54:52 UTC 2024



On 1/15/24 09:47, Christian König wrote:
> Am 13.01.24 um 23:55 schrieb Joshua Ashton:
>> +Marek
>>
>> On 1/13/24 21:35, André Almeida wrote:
>>> Hi Joshua,
>>>
>>> Em 13/01/2024 11:02, Joshua Ashton escreveu:
>>>> We need to bump the karma of the drm_sched job in order for the context
>>>> that we just recovered to get correct feedback that it is guilty of
>>>> hanging.
>>>>
>>>> Without this feedback, the application may keep pushing through the 
>>>> soft
>>>> recoveries, continually hanging the system with jobs that timeout.
>>>>
>>>> There is an accompanying Mesa/RADV patch here
>>>> https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/27050
>>>> to properly handle device loss state when VRAM is not lost.
>>>>
>>>> With these, I was able to run Counter-Strike 2 and launch an 
>>>> application
>>>> which can fault the GPU in a variety of ways, and still have Steam +
>>>> Counter-Strike 2 + Gamescope (compositor) stay up and continue
>>>> functioning on Steam Deck.
>>>>
>>>
>>> I sent a similar patch in the past, maybe you find the discussion 
>>> interesting:
>>>
>>> https://lore.kernel.org/lkml/20230424014324.218531-1-andrealmeid@igalia.com/
>>
>> Thanks, I had a peruse through that old thread.
>>
>> Marek definitely had the right idea here, given he mentions:
>> "That supposedly depends on the compositor. There may be compositors for
>> very specific cases (e.g. Steam Deck)"
>>
>> Given that is what I work on and also wrote this patch for that does 
>> basically the same thing as was proposed. :-)
>>
>> For context though, I am less interested in Gamescope (the Steam Deck 
>> compositor) hanging (we don't have code that hangs, if we go down, 
>> it's likely Steam/CEF died with us anyway atm, can solve that battle 
>> some other day) and more about the applications run under it.
>>
>> Marek is very right when he says applications that fault/hang will 
>> submit one IB after another that also fault/hang -- especially if they 
>> write to descriptors from the GPU (descriptor buffers), or use draw 
>> indirect or anything bindless or...
>> That's basically functionally equivalent to DOSing a system if it is 
>> not prevented.
>>
>> And that's exactly what I see even in a simple test app doing a fault 
>> -> hang every frame.
>>
>> Right now, given that soft recovery never marks a context as guilty, 
>> it means that every app I tested is never stopped from submitting 
>> garbage That holds true for basically any app that GPU hangs and makes 
>> soft recovery totally useless in my testing without this.
> 
> Yeah, the problem is that your patch wouldn't help with that. A testing 
> app can still re-create the context for each submission and so crash the 
> system over and over again.

It is still definitely possible for an application to do re-create its 
context and hang yet again -- however that is not the problem I am 
trying to solve here.

The problem I am trying to solve is that applications do not even get 
marked guilty when triggering soft recovery right now.

The patch does help with that on SteamOS, as the type of applications we 
deal with that hang, just abort on VK_ERROR_DEVICE_LOST.

If a UI toolkit that handles DEVICE_LOST keeps doing this, then yes it 
would not help, but this patch is also a necessary step towards fixing 
that someday. (Eg. some policy where processes are killed for hanging 
too many times, etc)

> 
> The question here is really if we should handled soft recovered errors 
> as fatal or not. Marek is in pro of that Michel is against it.
> 
> Figure out what you want in userspace and I'm happy to implement it :)
> 
>>
>> (That being said, without my patches, RADV treats *any* reset from the 
>> query as VK_ERROR_DEVICE_LOST, even if there was no VRAM lost and it 
>> was not guilty, so any faulting/hanging application causes every 
>> Vulkan app to die e_e. This is fixed in 
>> https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/27050 )
> 
> That is actually intended behavior. When something disrupted the GPU 
> execution and the application is affected it is mandatory to forward 
> this error to the application.

No. If said context was entirely unaffected, then it should be 
completely transparent to the application.

Consider the following:

  - I have Counter-Strike 2 running
  - I have Gamescope running

I then go ahead and start HangApp that hangs the GPU.

Soft recovery happens and that clears out all the work for the specific 
VMID for HangApp's submissions and signals the submission fence.

In this instance, the Gamescope and Counter-Strike 2 ctxs are completely 
unaffected and don't need to report VK_ERROR_DEVICE_LOST as there was no 
impact to their work.

Even if Gamescope or Counter-Strike 2 were occupying CUs in tandem with 
HangApp, FWIU the way that the clear-out works being vmid specific means 
that they would be unaffected, right?

- Joshie 🐸✨

> 
> Regards,
> Christian.
> 
>>
>> - Joshie 🐸✨
>>
>>>
>>>> Signed-off-by: Joshua Ashton <joshua at froggi.es>
>>>>
>>>> Cc: Friedrich Vock <friedrich.vock at gmx.de>
>>>> Cc: Bas Nieuwenhuizen <bas at basnieuwenhuizen.nl>
>>>> Cc: Christian König <christian.koenig at amd.com>
>>>> Cc: André Almeida <andrealmeid at igalia.com>
>>>> Cc: stable at vger.kernel.org
>>>> ---
>>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c | 2 ++
>>>>   1 file changed, 2 insertions(+)
>>>>
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c 
>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
>>>> index 25209ce54552..e87cafb5b1c3 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
>>>> @@ -448,6 +448,8 @@ bool amdgpu_ring_soft_recovery(struct 
>>>> amdgpu_ring *ring, struct amdgpu_job *job)
>>>>           dma_fence_set_error(fence, -ENODATA);
>>>>       spin_unlock_irqrestore(fence->lock, flags);
>>>> +    if (job->vm)
>>>> +        drm_sched_increase_karma(&job->base);
>>>>       atomic_inc(&ring->adev->gpu_reset_counter);
>>>>       while (!dma_fence_is_signaled(fence) &&
>>>>              ktime_to_ns(ktime_sub(deadline, ktime_get())) > 0)
>>
> 



More information about the amd-gfx mailing list