[PATCH 2/2] drm/amdgpu: Mark ctx as guilty in ring_soft_recovery path
Joshua Ashton
joshua at froggi.es
Mon Jan 15 14:00:46 UTC 2024
On 1/15/24 13:19, Christian König wrote:
> Am 15.01.24 um 12:54 schrieb Joshua Ashton:
>> [SNIP]
>>>
>>> The question here is really whether we should handle soft-recovered
>>> errors as fatal or not. Marek is in favour of that, Michel is against it.
>>>
>>> Figure out what you want in userspace and I'm happy to implement it :)
>>>
>>>>
>>>> (That being said, without my patches, RADV treats *any* reset from
>>>> the query as VK_ERROR_DEVICE_LOST, even if there was no VRAM lost
>>>> and it was not guilty, so any faulting/hanging application causes
>>>> every Vulkan app to die e_e. This is fixed in
>>>> https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/27050 )
>>>
>>> That is actually intended behavior. When something disrupts GPU
>>> execution and the application is affected, it is mandatory to
>>> forward this error to the application.
>>
>> No. If said context was entirely unaffected, then it should be
>> completely transparent to the application.
>>
>> Consider the following:
>>
>> - I have Counter-Strike 2 running
>> - I have Gamescope running
>>
>> I then go ahead and start HangApp that hangs the GPU.
>>
>> Soft recovery happens and that clears out all the work for the
>> specific VMID for HangApp's submissions and signals the submission fence.
>>
>> In this instance, the Gamescope and Counter-Strike 2 ctxs are
>> completely unaffected and don't need to report VK_ERROR_DEVICE_LOST as
>> there was no impact to their work.
>
> Ok, that is something I totally agree on. But why would the Gamescope
> and Counter-Strike 2 app report VK_ERROR_DEVICE_LOST for a soft recovery
> in the first place?
>
> IIRC a soft recovery doesn't increment the reset counter in any way. So
> they should never be affected.
It does: amdgpu_ring_soft_recovery calls
atomic_inc(&ring->adev->gpu_reset_counter) without assigning any guilt,
so every context observes the bumped reset counter.
>
> Regards,
> Christian.
>
>>
>> Even if Gamescope or Counter-Strike 2 were occupying CUs in tandem
>> with HangApp, FWIU the clear-out being VMID-specific means they would
>> be unaffected, right?
>>
>> - Joshie 🐸✨
>>
>>>
>>> Regards,
>>> Christian.
>>>
>>>>
>>>> - Joshie 🐸✨
>>>>
>>>>>
>>>>>> Signed-off-by: Joshua Ashton <joshua at froggi.es>
>>>>>>
>>>>>> Cc: Friedrich Vock <friedrich.vock at gmx.de>
>>>>>> Cc: Bas Nieuwenhuizen <bas at basnieuwenhuizen.nl>
>>>>>> Cc: Christian König <christian.koenig at amd.com>
>>>>>> Cc: André Almeida <andrealmeid at igalia.com>
>>>>>> Cc: stable at vger.kernel.org
>>>>>> ---
>>>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c | 2 ++
>>>>>> 1 file changed, 2 insertions(+)
>>>>>>
>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
>>>>>> index 25209ce54552..e87cafb5b1c3 100644
>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
>>>>>> @@ -448,6 +448,8 @@ bool amdgpu_ring_soft_recovery(struct amdgpu_ring *ring, struct amdgpu_job *job)
>>>>>> dma_fence_set_error(fence, -ENODATA);
>>>>>> spin_unlock_irqrestore(fence->lock, flags);
>>>>>> + if (job->vm)
>>>>>> + drm_sched_increase_karma(&job->base);
>>>>>> atomic_inc(&ring->adev->gpu_reset_counter);
>>>>>> while (!dma_fence_is_signaled(fence) &&
>>>>>> ktime_to_ns(ktime_sub(deadline, ktime_get())) > 0)
>>>>
>>>
>>
>
- Joshie 🐸✨