[PATCH 2/2] drm/amdgpu: Mark ctx as guilty in ring_soft_recovery path

Joshua Ashton joshua at froggi.es
Mon Jan 15 11:37:51 UTC 2024



On 1/15/24 09:40, Christian König wrote:
> Am 13.01.24 um 15:02 schrieb Joshua Ashton:
>> We need to bump the karma of the drm_sched job in order for the context
>> that we just recovered to get correct feedback that it is guilty of
>> hanging.
> 
> Big NAK to that approach, the karma handling is completely deprecated.
> 
> When you want to signal execution errors please use the fence error code.

The fence error code does not result in ctx's being marked as guilty, 
only the karma path does.

See drm_sched_increase_karma.

Are you proposing that we instead mark contexts as guilty with the fence 
error ourselves here?

> 
>> Without this feedback, the application may keep pushing through the soft
>> recoveries, continually hanging the system with jobs that timeout.
> 
> Well, that is intentional behavior. Marek is voting for making soft 
> recovered errors fatal as well while Michel is voting for better 
> ignoring them.
> 
> I'm not really sure what to do. If you guys think that soft recovered 
> hangs should be fatal as well then we can certainly do this.

They have to be!

As Marek and I have pointed out, applications that hang or fault will 
just hang or fault again, especially when they use things like draw 
indirect, buffer device address, descriptor buffers, etc.

The majority of apps these days have a lot of side effects and 
persistence between frames and submissions.

- Joshie 🐸✨

> 
> Regards,
> Christian.
> 
>>
>> There is an accompanying Mesa/RADV patch here
>> https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/27050
>> to properly handle device loss state when VRAM is not lost.
>>
>> With these, I was able to run Counter-Strike 2 and launch an application
>> which can fault the GPU in a variety of ways, and still have Steam +
>> Counter-Strike 2 + Gamescope (compositor) stay up and continue
>> functioning on Steam Deck.
>>
>> Signed-off-by: Joshua Ashton <joshua at froggi.es>
>>
>> Cc: Friedrich Vock <friedrich.vock at gmx.de>
>> Cc: Bas Nieuwenhuizen <bas at basnieuwenhuizen.nl>
>> Cc: Christian König <christian.koenig at amd.com>
>> Cc: André Almeida <andrealmeid at igalia.com>
>> Cc: stable at vger.kernel.org
>> ---
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c | 2 ++
>>   1 file changed, 2 insertions(+)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c 
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
>> index 25209ce54552..e87cafb5b1c3 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
>> @@ -448,6 +448,8 @@ bool amdgpu_ring_soft_recovery(struct amdgpu_ring 
>> *ring, struct amdgpu_job *job)
>>           dma_fence_set_error(fence, -ENODATA);
>>       spin_unlock_irqrestore(fence->lock, flags);
>> +    if (job->vm)
>> +        drm_sched_increase_karma(&job->base);
>>       atomic_inc(&ring->adev->gpu_reset_counter);
>>       while (!dma_fence_is_signaled(fence) &&
>>              ktime_to_ns(ktime_sub(deadline, ktime_get())) > 0)
> 




More information about the amd-gfx mailing list