[PATCH 2/2] drm/amdgpu: Mark ctx as guilty in ring_soft_recovery path
Joshua Ashton
joshua at froggi.es
Mon Jan 15 18:35:19 UTC 2024
On 1/15/24 18:30, Bas Nieuwenhuizen wrote:
>
>
> On Mon, Jan 15, 2024 at 7:14 PM Friedrich Vock <friedrich.vock at gmx.de
> <mailto:friedrich.vock at gmx.de>> wrote:
>
> Re-sending as plaintext, sorry about that
>
> On 15.01.24 18:54, Michel Dänzer wrote:
> > On 2024-01-15 18:26, Friedrich Vock wrote:
> >> [snip]
> >> The fundamental problem here is that not telling applications that
> >> something went wrong when you just canceled their work midway is an
> >> out-of-spec hack.
> >> When there is a report of real-world apps breaking because of
> that hack,
> >> reports of different apps working (even if it's convenient that they
> >> work) doesn't justify keeping the broken code.
> > If the breaking apps hit multiple soft resets in a row, I've laid
> out a pragmatic solution which covers both cases.
> Hitting soft reset every time is the lucky path. Once GPU work is
> interrupted out of nowhere, all bets are off and it might as well
> trigger a full system hang next time. No hang recovery should be able to
> cause that under any circumstance.
>
>
> I think the more insidious situation is no further hangs but wrong
> results because we skipped some work. That we skipped work may e.g.
> result in some texture not being uploaded or some GPGPU work not being
> done and causing further errors downstream (say if a game is doing
> AI/physics on the GPU not to say anything of actual GPGPU work one might
> be doing like AI)
Even worse if this is compute on eg. OpenCL for something
science/math/whatever related, or training a model.
You could randomly just get invalid/wrong results without even knowing!
Now imagine this is VulkanSC displaying something in the car dashboard,
or some medical device doing some compute work to show something on a
graph...
I am not saying you should be doing any of that with RADV + AMDGPU, but
it's just food for thought... :-)
As I have been saying, you simply cannot just violate API contracts like
this, it's flatout wrong.
- Joshie 🐸✨
>
> >
> >
> >> If mutter needs to be robust against faults it caused itself, it
> should be robust
> >> against GPU resets.
> > It's unlikely that the hangs I've seen were caused by mutter
> itself, more likely Mesa or amdgpu.
> >
> > Anyway, this will happen at some point, the reality is it hasn't
> yet though.
> >
> >
>
More information about the amd-gfx
mailing list