[PATCH 2/2] drm/amdgpu: Mark ctx as guilty in ring_soft_recovery path

Christian König christian.koenig at amd.com
Mon Jan 15 13:19:21 UTC 2024


On 15.01.24 at 12:54, Joshua Ashton wrote:
> [SNIP]
>>
>> The question here is really whether we should handle soft-recovered 
>> errors as fatal or not. Marek is in favor of that, Michel is against it.
>>
>> Figure out what you want in userspace and I'm happy to implement it :)
>>
>>>
>>> (That being said, without my patches, RADV treats *any* reset from 
>>> the query as VK_ERROR_DEVICE_LOST, even if no VRAM was lost and the 
>>> ctx was not guilty, so any faulting/hanging application causes 
>>> every Vulkan app to die e_e. This is fixed in 
>>> https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/27050 )
>>
>> That is actually intended behavior. When something disrupts GPU 
>> execution and the application is affected, it is mandatory to forward 
>> this error to the application.
>
> No. If said context was entirely unaffected, then it should be 
> completely transparent to the application.
>
> Consider the following:
>
>  - I have Counter-Strike 2 running
>  - I have Gamescope running
>
> I then go ahead and start HangApp that hangs the GPU.
>
> Soft recovery happens, which clears out all the work for HangApp's 
> submissions on its specific VMID and signals the submission fence.
>
> In this instance, the Gamescope and Counter-Strike 2 ctxs are 
> completely unaffected and don't need to report VK_ERROR_DEVICE_LOST as 
> there was no impact to their work.

Ok, that is something I totally agree on. But why would the Gamescope 
and Counter-Strike 2 apps report VK_ERROR_DEVICE_LOST for a soft 
recovery in the first place?

IIRC a soft recovery doesn't increment the reset counter in any way. So 
they should never be affected.
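
From memory, the ctx status query boils down to roughly the following
(paraphrased from amdgpu_ctx.c, so take the exact details with a grain
of salt):

	/* Sketch of the amdgpu_ctx_query2() checks, from memory; the real
	 * code in drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c differs in detail.
	 */
	out->state.flags = 0;

	/* any reset since the ctx was created shows up here */
	if (ctx->reset_counter != atomic_read(&adev->gpu_reset_counter))
		out->state.flags |= AMDGPU_CTX_QUERY2_FLAGS_RESET;

	/* VRAM loss and guilt are tracked separately */
	if (ctx->vram_lost_counter != atomic_read(&adev->vram_lost_counter))
		out->state.flags |= AMDGPU_CTX_QUERY2_FLAGS_VRAMLOST;

	if (atomic_read(&ctx->guilty))
		out->state.flags |= AMDGPU_CTX_QUERY2_FLAGS_GUILTY;

So userspace can already tell "some reset happened" apart from "this
ctx was guilty" or "VRAM was lost", which is what the RADV MR linked
above relies on, if I understand it correctly.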

Regards,
Christian.

>
> Even if Gamescope or Counter-Strike 2 were occupying CUs in tandem 
> with HangApp, FWIU the clear-out being VMID-specific means that they 
> would be unaffected, right?
>
> - Joshie 🐸✨
>
>>
>> Regards,
>> Christian.
>>
>>>
>>> - Joshie 🐸✨
>>>
>>>>
>>>>> Signed-off-by: Joshua Ashton <joshua at froggi.es>
>>>>>
>>>>> Cc: Friedrich Vock <friedrich.vock at gmx.de>
>>>>> Cc: Bas Nieuwenhuizen <bas at basnieuwenhuizen.nl>
>>>>> Cc: Christian König <christian.koenig at amd.com>
>>>>> Cc: André Almeida <andrealmeid at igalia.com>
>>>>> Cc: stable at vger.kernel.org
>>>>> ---
>>>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c | 2 ++
>>>>>   1 file changed, 2 insertions(+)
>>>>>
>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c 
>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
>>>>> index 25209ce54552..e87cafb5b1c3 100644
>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
>>>>> @@ -448,6 +448,8 @@ bool amdgpu_ring_soft_recovery(struct amdgpu_ring *ring, struct amdgpu_job *job)
>>>>>  		dma_fence_set_error(fence, -ENODATA);
>>>>>  	spin_unlock_irqrestore(fence->lock, flags);
>>>>>  
>>>>> +	if (job->vm)
>>>>> +		drm_sched_increase_karma(&job->base);
>>>>>  	atomic_inc(&ring->adev->gpu_reset_counter);
>>>>>  	while (!dma_fence_is_signaled(fence) &&
>>>>>  	       ktime_to_ns(ktime_sub(deadline, ktime_get())) > 0)
>>>
>>
>
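
For reference, the VMID-specific clear-out Joshua describes above is
essentially a per-VMID wave kill issued through the ring's
soft_recovery callback. A sketch of the GFX9 variant, paraphrased from
memory of gfx_v9_0.c (register fields and values are from memory and
differ per generation, so treat them as illustrative):

	static void gfx_v9_0_ring_soft_recovery(struct amdgpu_ring *ring,
						unsigned vmid)
	{
		struct amdgpu_device *adev = ring->adev;
		uint32_t value = 0;

		/* build a SQ_CMD that kills waves, restricted to the
		 * given VMID via CHECK_VMID/VM_ID
		 */
		value = REG_SET_FIELD(value, SQ_CMD, CMD, 0x03);
		value = REG_SET_FIELD(value, SQ_CMD, MODE, 0x01);
		value = REG_SET_FIELD(value, SQ_CMD, CHECK_VMID, 1);
		value = REG_SET_FIELD(value, SQ_CMD, VM_ID, vmid);
		WREG32_SOC15(GC, 0, mmSQ_CMD, value);
	}

Since the kill is qualified by VMID, waves from other processes sharing
the CUs shouldn't be touched.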

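On the userspace side, the decision under discussion is then roughly
the following. This is only a sketch using libdrm's
amdgpu_cs_query_reset_state2() (names from memory), not the actual
Mesa/RADV change:

	#include <stdbool.h>
	#include <stdint.h>
	#include <amdgpu.h>
	#include <amdgpu_drm.h>

	/* Decide whether a ctx should report VK_ERROR_DEVICE_LOST.
	 * Sketch only: report device loss when the ctx was guilty or
	 * actually lost VRAM contents, not for every reset.
	 */
	static bool ctx_should_report_device_lost(amdgpu_context_handle ctx)
	{
		uint64_t flags = 0;

		if (amdgpu_cs_query_reset_state2(ctx, &flags))
			return true; /* query failed, be conservative */

		return (flags & (AMDGPU_CTX_QUERY2_FLAGS_GUILTY |
				 AMDGPU_CTX_QUERY2_FLAGS_VRAMLOST)) != 0;
	}

With something like that, an innocent ctx that only saw the reset
counter move can carry on transparently, which is the behavior Joshua
is after.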

