[PATCH 2/2] drm/amdgpu: Mark ctx as guilty in ring_soft_recovery path

Joshua Ashton joshua at froggi.es
Mon Jan 15 18:35:19 UTC 2024



On 1/15/24 18:30, Bas Nieuwenhuizen wrote:
> 
> 
>     On Mon, Jan 15, 2024 at 7:14 PM Friedrich Vock
>     <friedrich.vock at gmx.de> wrote:
> 
>     Re-sending as plaintext, sorry about that
> 
>     On 15.01.24 18:54, Michel Dänzer wrote:
>      > On 2024-01-15 18:26, Friedrich Vock wrote:
>      >> [snip]
>      >> The fundamental problem here is that not telling applications that
>      >> something went wrong when you just canceled their work midway is an
>      >> out-of-spec hack.
>      >> When there is a report of real-world apps breaking because of that
>      >> hack, reports of different apps working (even if it's convenient
>      >> that they work) don't justify keeping the broken code.
>      > If the breaking apps hit multiple soft resets in a row, I've laid
>      > out a pragmatic solution which covers both cases.
>     Hitting a soft reset every time is the lucky path. Once GPU work is
>     interrupted out of nowhere, all bets are off and it might as well
>     trigger a full system hang next time. No hang recovery should be able to
>     cause that under any circumstance.
> 
> 
> I think the more insidious situation is no further hangs but wrong 
> results because we skipped some work. That skipped work may e.g. result 
> in some texture not being uploaded or some GPGPU work not being done, 
> causing further errors downstream (say, if a game is doing AI/physics 
> on the GPU, to say nothing of actual GPGPU workloads like AI).

Even worse if this is compute, e.g. on OpenCL, for something 
science/math/whatever related, or for training a model.

You could randomly just get invalid/wrong results without even knowing!
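
For illustration, here is a minimal sketch (not part of this patch) of 
how userspace can already ask the kernel whether its context was reset 
or marked guilty, via AMDGPU_CTX_OP_QUERY_STATE2 from the amdgpu UAPI; 
the fd/ctx_id plumbing and error handling are assumed:

/* Minimal sketch, assuming an open render node fd and an existing
 * context id. Uses AMDGPU_CTX_OP_QUERY_STATE2 from
 * include/uapi/drm/amdgpu_drm.h; error handling trimmed. */
#include <string.h>
#include <sys/ioctl.h>
#include <libdrm/amdgpu_drm.h>

static int ctx_reset_status(int fd, __u32 ctx_id)
{
	union drm_amdgpu_ctx args;

	memset(&args, 0, sizeof(args));
	args.in.op = AMDGPU_CTX_OP_QUERY_STATE2;
	args.in.ctx_id = ctx_id;

	if (ioctl(fd, DRM_IOCTL_AMDGPU_CTX, &args))
		return -1;

	/* With this patch, soft recovery also sets the GUILTY flag for
	 * the offending context, so this stops reporting "all fine"
	 * after the kernel has cancelled the context's work. */
	return !!(args.out.state.flags &
		  (AMDGPU_CTX_QUERY2_FLAGS_RESET |
		   AMDGPU_CTX_QUERY2_FLAGS_GUILTY));
}

RADV queries exactly this state (through libdrm) to decide when to 
return VK_ERROR_DEVICE_LOST, which is why leaving soft-recovered 
contexts unmarked means the guilty app never finds out.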

Now imagine this is VulkanSC displaying something in the car dashboard, 
or some medical device doing some compute work to show something on a 
graph...

I am not saying you should be doing any of that with RADV + AMDGPU, but 
it's just food for thought... :-)

As I have been saying, you simply cannot just violate API contracts like 
this; it's flat-out wrong.
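
To make the contract concrete, here is a rough sketch of the Vulkan 
side; the names and the recovery strategy are illustrative, not RADV 
internals:

#include <stdint.h>
#include <stdio.h>
#include <vulkan/vulkan.h>

/* Rough sketch: the one signal the spec gives an app that its work was
 * lost is VK_ERROR_DEVICE_LOST from submit/wait calls. Assumes a valid
 * device/queue/fence; a real app would recreate the logical device. */
static void submit_and_check(VkDevice dev, VkQueue queue,
			     const VkSubmitInfo *submit, VkFence fence)
{
	VkResult res = vkQueueSubmit(queue, 1, submit, fence);

	if (res == VK_SUCCESS)
		res = vkWaitForFences(dev, 1, &fence, VK_TRUE, UINT64_MAX);

	if (res == VK_ERROR_DEVICE_LOST) {
		/* The app can tear down, recreate the device and
		 * resubmit; what it cannot do is detect silently
		 * skipped work if the loss is never reported. */
		fprintf(stderr, "VK_ERROR_DEVICE_LOST, recreating\n");
	}
}

If the kernel cancels the work but never marks the context guilty, 
neither call here ever returns VK_ERROR_DEVICE_LOST, and the app 
silently keeps going with wrong results.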

- Joshie 🐸✨

> 
>      >
>      >
>      >> If mutter needs to be robust against faults it caused itself, it
>      >> should be robust against GPU resets.
>      > It's unlikely that the hangs I've seen were caused by mutter
>      > itself; more likely Mesa or amdgpu.
>      >
>      > Anyway, this will happen at some point; the reality is that it
>      > hasn't happened yet, though.

