[Intel-gfx] [PATCH v4 3/3] drm/i915/gt: Timeout when waiting for idle in suspending
Teres Alexis, Alan Previn
alan.previn.teres.alexis at intel.com
Tue Nov 14 19:48:03 UTC 2023
On Tue, 2023-11-14 at 17:52 +0000, Tvrtko Ursulin wrote:
> On 14/11/2023 17:37, Teres Alexis, Alan Previn wrote:
> > On Tue, 2023-11-14 at 17:27 +0000, Tvrtko Ursulin wrote:
> > > On 13/11/2023 17:57, Teres Alexis, Alan Previn wrote:
> > > > On Wed, 2023-10-25 at 13:58 +0100, Tvrtko Ursulin wrote:
> > > > > On 04/10/2023 18:59, Teres Alexis, Alan Previn wrote:
> > > > > > On Thu, 2023-09-28 at 13:46 +0100, Tvrtko Ursulin wrote:
> > > > > > > On 27/09/2023 17:36, Teres Alexis, Alan Previn wrote:
>
alan:snip
> > alan: So i did trace back the gt->wakeref before i posted these patches and
> > see that within these runtime get/put calls, i believe the first 'get' leads
> > to __intel_wakeref_get_first which calls intel_runtime_pm_get via rpm_get
> > helper and eventually executes a pm_runtime_get_sync(rpm->kdev); (hanging off
> > i915_device). (naturally there is a corresponding mirros for the '_put_last').
> > So this means the first-get and last-put lets the kernel know. Thats why when
> > i tested this patch, it did actually cause the suspend to abort from kernel side
> > and the kernel would print a message indicating i915 was the one that didnt
> > release all refs.
>
> Ah that would be much better then.
>
> Do you know if everything gets resumed/restored correctly in that case
> or we would need some additional work to maybe early exit from callers
> of wait_for_suspend()?
alan: So assuming we are still discussing about a "potentially new
future leaked-wakeref bug" (i.e. putting aside the fact that
Patch #1 + #2 resolves this specific series' bug), based on the
previous testing we did, after this timeout-bail trigger,
the suspend flow bails and gt/guc operation does actually continue
as normal. However, its been a long time since we tested this so
i am not sure of how accidental-new-future bugs might play to this
assumption especially if some other subsystem that leaked the rpm
wakref but that subsystem did NOT get reset like how GuC is reset
at the end of suspend.
>
> What I would also ask is to see if something like injecting a probing
> failure is feasible, so we can have this new timeout exit path
> constantly/regularly tested in CI.
alan: Thats a good idea. In line with this, i would like to point out that
rev6 of this series has been posted but i removed this Patch #3. However i did
post this Patch #3 as a standalone patch here: https://patchwork.freedesktop.org/series/126414/
as i anticipate this patch will truly help with future issue debuggability.
That said, i shall post a review on that patch with your suggestion to add
an injected probe error for the suspend-resume flow and follow up on that one
separately.
>
> Regards,
>
> Tvrtko
>
> > alan: Anyways, i have pulled this patch out of rev6 of this series and created a
> > separate standalone patch for this patch #3 that we review independently.
> >
More information about the Intel-gfx
mailing list