[PATCH 2/2] drm/amdgpu: Mark ctx as guilty in ring_soft_recovery path

Mon Jan 15 18:53:43 UTC 2024

Am 15.01.24 um 19:35 schrieb Joshua Ashton:
> On 1/15/24 18:30, Bas Nieuwenhuizen wrote:
>> On Mon, Jan 15, 2024 at 7:14 PM Friedrich Vock <friedrich.vock at gmx.de 
>> <mailto:friedrich.vock at gmx.de>> wrote:
>>
>>     Re-sending as plaintext, sorry about that
>>
>>     On 15.01.24 18:54, Michel Dänzer wrote:
>>      > On 2024-01-15 18:26, Friedrich Vock wrote:
>>      >> [snip]
>>      >> The fundamental problem here is that not telling applications 
>> that
>>      >> something went wrong when you just canceled their work midway 
>> is an
>>      >> out-of-spec hack.
>>      >> When there is a report of real-world apps breaking because of
>>     that hack,
>>      >> reports of different apps working (even if it's convenient 
>> that they
>>      >> work) doesn't justify keeping the broken code.
>>      > If the breaking apps hit multiple soft resets in a row, I've laid
>>     out a pragmatic solution which covers both cases.
>>     Hitting soft reset every time is the lucky path. Once GPU work is
>>     interrupted out of nowhere, all bets are off and it might as well
>>     trigger a full system hang next time. No hang recovery should be 
>> able to
>>     cause that under any circumstance.
>>
>>
>> I think the more insidious situation is no further hangs but wrong 
>> results because we skipped some work. That we skipped work may e.g. 
>> result in some texture not being uploaded or some GPGPU work not 
>> being done and causing further errors downstream (say if a game is 
>> doing AI/physics on the GPU not to say anything of actual GPGPU work 
>> one might be doing like AI)
>
> Even worse if this is compute on eg. OpenCL for something 
> science/math/whatever related, or training a model.
>
> You could randomly just get invalid/wrong results without even knowing!

Well on the kernel side we do provide an API to query the result of a 
submission. That includes canceling submissions with a soft recovery.

What we just doesn't do is to prevent further submissions from this 
application. E.g. enforcing that the application is punished for bad 
behavior.

>
> Now imagine this is VulkanSC displaying something in the car 
> dashboard, or some medical device doing some compute work to show 
> something on a graph...
>
> I am not saying you should be doing any of that with RADV + AMDGPU, 
> but it's just food for thought... :-)
>
> As I have been saying, you simply cannot just violate API contracts 
> like this, it's flatout wrong.

Yeah, completely agree to that.

Regards,
Christian.

>
> - Joshie 🐸✨
>
>>
>>      >
>>      >
>>      >> If mutter needs to be robust against faults it caused itself, it
>>     should be robust
>>      >> against GPU resets.
>>      > It's unlikely that the hangs I've seen were caused by mutter
>>     itself, more likely Mesa or amdgpu.
>>      >
>>      > Anyway, this will happen at some point, the reality is it hasn't
>>     yet though.
>>      >
>>      >
>>