[PATCH 2/2] drm/amdgpu: Mark ctx as guilty in ring_soft_recovery path

Bas Nieuwenhuizen bas at basnieuwenhuizen.nl
Mon Jan 15 18:30:21 UTC 2024


On Mon, Jan 15, 2024 at 7:14 PM Friedrich Vock <friedrich.vock at gmx.de>
wrote:

> Re-sending as plaintext, sorry about that
>
> On 15.01.24 18:54, Michel Dänzer wrote:
> > On 2024-01-15 18:26, Friedrich Vock wrote:
> >> [snip]
> >> The fundamental problem here is that not telling applications that
> >> something went wrong when you just canceled their work midway is an
> >> out-of-spec hack.
> >> When there is a report of real-world apps breaking because of that hack,
> >> reports of different apps working (even if it's convenient that they
> >> work) don't justify keeping the broken code.
> > If the breaking apps hit multiple soft resets in a row, I've laid out a
> > pragmatic solution which covers both cases.
> Hitting soft reset every time is the lucky path. Once GPU work is
> interrupted out of nowhere, all bets are off and it might as well
> trigger a full system hang next time. No hang recovery should be able to
> cause that under any circumstance.
>

I think the more insidious situation is no further hangs but wrong results
because we skipped some work. The skipped work may, for example, leave a
texture not uploaded or a compute dispatch not executed, causing further
errors downstream (say, if a game runs AI or physics on the GPU, to say
nothing of actual GPGPU workloads).
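For context, the spec-compliant path is for userspace to query the per-context
reset status (libdrm's amdgpu_cs_query_reset_state2, which backs
GL_KHR_robustness and VK_ERROR_DEVICE_LOST reporting) rather than continuing
silently. Below is a minimal sketch of how a robust app might interpret the
returned flags; the flag values are mirrored from the amdgpu uapi header, the
enum and classify_reset helper are hypothetical, and this is illustrative
rather than the patch itself:

```c
#include <stdint.h>

/* Mirrored from include/uapi/drm/amdgpu_drm.h; illustrative only. */
#define AMDGPU_CTX_QUERY2_FLAGS_RESET    (1u << 0)
#define AMDGPU_CTX_QUERY2_FLAGS_VRAMLOST (1u << 1)
#define AMDGPU_CTX_QUERY2_FLAGS_GUILTY   (1u << 2)

enum ctx_action {
    CTX_OK,            /* no reset observed, keep going             */
    CTX_RECREATE,      /* innocent victim of a reset: rebuild ctx   */
    CTX_REPORT_GUILTY, /* this ctx caused the hang: surface error   */
};

/* Decide what a robust app should do with the flags returned by
 * amdgpu_cs_query_reset_state2(). If soft recovery never sets
 * GUILTY, apps land in CTX_OK here even though their submissions
 * were cancelled -- the gap this patch is about. */
static enum ctx_action classify_reset(uint64_t flags)
{
    if (!(flags & AMDGPU_CTX_QUERY2_FLAGS_RESET))
        return CTX_OK;
    if (flags & AMDGPU_CTX_QUERY2_FLAGS_GUILTY)
        return CTX_REPORT_GUILTY;
    return CTX_RECREATE;
}
```

Without the GUILTY flag set on soft recovery, the innocent/guilty distinction
above never fires, and the app has no signal that its results may be wrong.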


> >
> >
> >> If mutter needs to be robust against faults it caused itself, it should
> >> be robust
> >> against GPU resets.
> > It's unlikely that the hangs I've seen were caused by mutter itself,
> > more likely Mesa or amdgpu.
> >
> > Anyway, this will happen at some point, the reality is it hasn't yet
> > though.
