[PATCH 2/2] drm/amdgpu: Mark ctx as guilty in ring_soft_recovery path

Joshua Ashton joshua at froggi.es
Mon Jan 15 14:00:46 UTC 2024



On 1/15/24 13:19, Christian König wrote:
> Am 15.01.24 um 12:54 schrieb Joshua Ashton:
>> [SNIP]
>>>
>>> The question here is really whether we should handle soft-recovered 
>>> errors as fatal or not. Marek is in favor of that, Michel is against it.
>>>
>>> Figure out what you want in userspace and I'm happy to implement it :)
>>>
>>>>
>>>> (That being said, without my patches, RADV treats *any* reset from 
>>>> the query as VK_ERROR_DEVICE_LOST, even if there was no VRAM lost 
>>>> and it was not guilty, so any faulting/hanging application causes 
>>>> every Vulkan app to die e_e. This is fixed in 
>>>> https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/27050 )
>>>
>>> That is actually intended behavior. When something disrupts GPU 
>>> execution and the application is affected, it is mandatory to forward 
>>> this error to the application.
>>
>> No. If said context was entirely unaffected, then it should be 
>> completely transparent to the application.
>>
>> Consider the following:
>>
>>  - I have Counter-Strike 2 running
>>  - I have Gamescope running
>>
>> I then go ahead and start HangApp that hangs the GPU.
>>
>> Soft recovery happens, which clears out all the work for HangApp's 
>> VMID and signals the submission fence.
>>
>> In this instance, the Gamescope and Counter-Strike 2 ctxs are 
>> completely unaffected and don't need to report VK_ERROR_DEVICE_LOST, as 
>> there was no impact on their work.
> 
> Ok, that is something I totally agree on. But why would the Gamescope 
> and Counter-Strike 2 apps report VK_ERROR_DEVICE_LOST for a soft recovery 
> in the first place?
> 
> IIRC a soft recovery doesn't increment the reset counter in any way. So 
> they should never be affected.

It does: amdgpu_ring_soft_recovery calls 
atomic_inc(&ring->adev->gpu_reset_counter) without assigning any guilt.
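
For context on why that matters to unrelated contexts: the 
AMDGPU_CTX_OP_QUERY_STATE2 path derives its flags by comparing per-context 
counters against the device-wide ones, so a bare counter bump flips the 
RESET flag for every context. Roughly (paraphrased from my reading of 
amdgpu_ctx_query2() in amdgpu_ctx.c, so treat the details as approximate):

    /* sketch, not the verbatim kernel code */
    out->state.flags = 0x0;

    /* any device-wide reset, including a soft recovery that bumped the
     * counter, makes every context report RESET... */
    if (ctx->reset_counter != atomic_read(&adev->gpu_reset_counter))
        out->state.flags |= AMDGPU_CTX_QUERY2_FLAGS_RESET;

    /* ...while VRAMLOST and GUILTY stay per-context */
    if (ctx->vram_lost_counter != atomic_read(&adev->vram_lost_counter))
        out->state.flags |= AMDGPU_CTX_QUERY2_FLAGS_VRAMLOST;

    if (atomic_read(&ctx->guilty))
        out->state.flags |= AMDGPU_CTX_QUERY2_FLAGS_GUILTY;

So today a soft recovery sets RESET for every context but never GUILTY for 
the offender, which is exactly the combination that trips up userspace that 
only looks at the reset flag.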


> 
> Regards,
> Christian.
> 
>>
>> Even if Gamescope or Counter-Strike 2 were occupying CUs in tandem 
>> with HangApp, FWIU the clear-out is VMID-specific, so they would be 
>> unaffected, right?
>>
>> - Joshie 🐸✨
>>
>>>
>>> Regards,
>>> Christian.
>>>
>>>>
>>>> - Joshie 🐸✨
>>>>
>>>>>
>>>>>> Signed-off-by: Joshua Ashton <joshua at froggi.es>
>>>>>>
>>>>>> Cc: Friedrich Vock <friedrich.vock at gmx.de>
>>>>>> Cc: Bas Nieuwenhuizen <bas at basnieuwenhuizen.nl>
>>>>>> Cc: Christian König <christian.koenig at amd.com>
>>>>>> Cc: André Almeida <andrealmeid at igalia.com>
>>>>>> Cc: stable at vger.kernel.org
>>>>>> ---
>>>>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c | 2 ++
>>>>>>   1 file changed, 2 insertions(+)
>>>>>>
>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
>>>>>> index 25209ce54552..e87cafb5b1c3 100644
>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
>>>>>> @@ -448,6 +448,8 @@ bool amdgpu_ring_soft_recovery(struct amdgpu_ring *ring, struct amdgpu_job *job)
>>>>>>           dma_fence_set_error(fence, -ENODATA);
>>>>>>       spin_unlock_irqrestore(fence->lock, flags);
>>>>>> +    if (job->vm)
>>>>>> +        drm_sched_increase_karma(&job->base);
>>>>>>       atomic_inc(&ring->adev->gpu_reset_counter);
>>>>>>       while (!dma_fence_is_signaled(fence) &&
>>>>>>              ktime_to_ns(ktime_sub(deadline, ktime_get())) > 0)
>>>>
>>>
>>
> 

- Joshie 🐸✨
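
P.S. For completeness, the userspace side of this is the libdrm wrapper 
around AMDGPU_CTX_OP_QUERY_STATE2. A minimal sketch of the kind of policy I 
am arguing for (just an illustration, not the actual RADV code from the MR 
above):

    #include <stdbool.h>
    #include <stdint.h>
    #include <amdgpu.h>
    #include <amdgpu_drm.h>

    /* Hypothetical helper: report VK_ERROR_DEVICE_LOST only if *this*
     * context was the guilty one or its VRAM contents are gone. A bare
     * RESET flag caused by someone else's soft recovery is ignored. */
    static bool ctx_is_really_lost(amdgpu_context_handle ctx)
    {
        uint64_t flags = 0;

        if (amdgpu_cs_query_reset_state2(ctx, &flags))
            return true; /* query failed, be conservative */

        return (flags & AMDGPU_CTX_QUERY2_FLAGS_GUILTY) ||
               (flags & AMDGPU_CTX_QUERY2_FLAGS_VRAMLOST);
    }

A check like that only works if the kernel actually marks the offending 
context guilty on soft recovery, which is what this patch is for.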

