[PATCH 2/2] drm/amdgpu: Mark ctx as guilty in ring_soft_recovery path
Marek Olšák
maraeo at gmail.com
Mon Jan 15 23:37:02 UTC 2024
On Mon, Jan 15, 2024 at 11:41 AM Michel Dänzer <michel at daenzer.net> wrote:
>
> On 2024-01-15 17:19, Friedrich Vock wrote:
> > On 15.01.24 16:43, Joshua Ashton wrote:
> >> On 1/15/24 15:25, Michel Dänzer wrote:
> >>> On 2024-01-15 14:17, Christian König wrote:
> >>>> Am 15.01.24 um 12:37 schrieb Joshua Ashton:
> >>>>> On 1/15/24 09:40, Christian König wrote:
> >>>>>> Am 13.01.24 um 15:02 schrieb Joshua Ashton:
> >>>>>>
> >>>>>>> Without this feedback, the application may keep pushing through
> >>>>>>> the soft
> >>>>>>> recoveries, continually hanging the system with jobs that time out.
> >>>>>>
> >>>>>> Well, that is intentional behavior. Marek is voting for making
> >>>>>> soft recovered errors fatal as well while Michel is voting for
> >>>>>> better ignoring them.
> >>>>>>
> >>>>>> I'm not really sure what to do. If you guys think that soft
> >>>>>> recovered hangs should be fatal as well then we can certainly do
> >>>>>> this.
> >>>
> >>> A possible compromise might be making soft resets fatal if they
> >>> happen repeatedly (within a certain period of time?).
> >>
> >> No, no and no. Aside from introducing issues by side effects not
> >> surfacing and all of the stuff I mentioned about descriptor buffers,
> >> bda, draw indirect and stuff just resulting in more faults and hangs...
> >>
> >> You are proposing we throw out every promise we made to an application
> >> on the API contract level because it "might work". That's just wrong!
> >>
> >> Let me put this in explicit terms: What you are proposing is in direct
> >> violation of the GL and Vulkan specification.
> >>
> >> You can't just choose to break these contracts because you think it
> >> 'might' be a better user experience.
> >
> > Is the original issue that motivated soft resets to be non-fatal even an
> > issue anymore?
> >
> > If I read that old thread correctly, the rationale for that was that
> > assigning guilt to a context was more broken than not doing it, because
> > the compositor/Xwayland process would also crash despite being unrelated
> > to the hang.
> > With Joshua's Mesa fixes, this is not the case anymore, so I don't think
> > keeping soft resets non-fatal provides any benefit to the user experience.
> > The potential detriments to user experience have been outlined multiple
> > times in this thread already.
> >
> > (I suppose if the compositor itself faults it might still bring down a
> > session, but I've literally never seen that, and it's not like a
> > compositor triggering segfaults on CPU stays alive either.)
>
> That's indeed what happened for me, multiple times. And each time the session continued running fine for days after the soft reset.
>
> But apparently my experience isn't valid somehow, and I should have been forced to log in again to please the GL gods...
>
>
> Conversely, I can't remember hitting a case where an app kept running into soft resets. It's almost as if different people may have different experiences! ;)
>
> Note that I'm not saying that case can't happen. Making soft resets fatal only if they happen repeatedly could address both issues, rather than only one or the other. Seems like a win-win.

This is exactly the comment that shouldn't have been sent, and you are
not the only one.

Nobody should ever care about subjective experiences. We can only do
this properly by looking at the whole system and its rules and trying to
find a solution that works for everything on paper first. DrawIndirect
is one case where the current system fails. "Works for me because I
don't use DrawIndirect" is a horrible way to do this.

Marek