[PATCH 2/2] drm/amdgpu: Mark ctx as guilty in ring_soft_recovery path

Mon Jan 15 19:13:11 UTC 2024

On 1/15/24 18:53, Christian König wrote:
> Am 15.01.24 um 19:35 schrieb Joshua Ashton:
>> On 1/15/24 18:30, Bas Nieuwenhuizen wrote:
>>> On Mon, Jan 15, 2024 at 7:14 PM Friedrich Vock <friedrich.vock at gmx.de 
>>> <mailto:friedrich.vock at gmx.de>> wrote:
>>>
>>>     Re-sending as plaintext, sorry about that
>>>
>>>     On 15.01.24 18:54, Michel Dänzer wrote:
>>>      > On 2024-01-15 18:26, Friedrich Vock wrote:
>>>      >> [snip]
>>>      >> The fundamental problem here is that not telling applications 
>>> that
>>>      >> something went wrong when you just canceled their work midway 
>>> is an
>>>      >> out-of-spec hack.
>>>      >> When there is a report of real-world apps breaking because of
>>>     that hack,
>>>      >> reports of different apps working (even if it's convenient 
>>> that they
>>>      >> work) doesn't justify keeping the broken code.
>>>      > If the breaking apps hit multiple soft resets in a row, I've laid
>>>     out a pragmatic solution which covers both cases.
>>>     Hitting soft reset every time is the lucky path. Once GPU work is
>>>     interrupted out of nowhere, all bets are off and it might as well
>>>     trigger a full system hang next time. No hang recovery should be 
>>> able to
>>>     cause that under any circumstance.
>>>
>>>
>>> I think the more insidious situation is no further hangs but wrong 
>>> results because we skipped some work. That we skipped work may e.g. 
>>> result in some texture not being uploaded or some GPGPU work not 
>>> being done and causing further errors downstream (say if a game is 
>>> doing AI/physics on the GPU not to say anything of actual GPGPU work 
>>> one might be doing like AI)
>>
>> Even worse if this is compute on eg. OpenCL for something 
>> science/math/whatever related, or training a model.
>>
>> You could randomly just get invalid/wrong results without even knowing!
> 
> Well on the kernel side we do provide an API to query the result of a 
> submission. That includes canceling submissions with a soft recovery.
> 
> What we just doesn't do is to prevent further submissions from this 
> application. E.g. enforcing that the application is punished for bad 
> behavior.

You do prevent future submissions for regular resets though: Those 
increase karma which sets ctx->guilty, and if ctx->guilty then 
-ECANCELED is returned for a submission.

ctx->guilty is never true for soft recovery though, as it doesn't 
increase karma, which is the problem this patch is trying to solve.

By the submission result query API, I you assume you mean checking the 
submission fence error somehow? That doesn't seem very ergonomic for a 
Vulkan driver compared to the simple solution which is to just mark it 
as guilty with what already exists...

- Joshie 🐸✨

> 
>>
>> Now imagine this is VulkanSC displaying something in the car 
>> dashboard, or some medical device doing some compute work to show 
>> something on a graph...
>>
>> I am not saying you should be doing any of that with RADV + AMDGPU, 
>> but it's just food for thought... :-)
>>
>> As I have been saying, you simply cannot just violate API contracts 
>> like this, it's flatout wrong.
> 
> Yeah, completely agree to that.
> 
> Regards,
> Christian.
> 
>>
>> - Joshie 🐸✨
>>
>>>
>>>      >
>>>      >
>>>      >> If mutter needs to be robust against faults it caused itself, it
>>>     should be robust
>>>      >> against GPU resets.
>>>      > It's unlikely that the hangs I've seen were caused by mutter
>>>     itself, more likely Mesa or amdgpu.
>>>      >
>>>      > Anyway, this will happen at some point, the reality is it hasn't
>>>     yet though.
>>>      >
>>>      >
>>>
> 

- Joshie 🐸✨