[Intel-gfx] [PATCH v4 3/3] drm/i915/gt: Timeout when waiting for idle in suspending

Teres Alexis, Alan Previn alan.previn.teres.alexis at intel.com
Wed Oct 4 17:59:26 UTC 2023


On Thu, 2023-09-28 at 13:46 +0100, Tvrtko Ursulin wrote:
> On 27/09/2023 17:36, Teres Alexis, Alan Previn wrote:
> > Thanks for taking the time to review this Tvrtko, replies inline below.
alan:snip

> > > 
> > > Main concern is that we need to be sure there are no possible
> > > ill-effects, like letting the GPU/GuC scribble on some memory we
> > > unmapped (or will unmap), having let the suspend continue after timing
> > > out, and not perhaps doing the forced wedge like wait_for_suspend() does
> > > on the existing timeout path.
> > alan: this will not happen because the held wakeref is never force-released
> > after the timeout, so the kernel would bail the suspend.
> 
> How does it know to fail the suspend when there is no error code 
> returned with this timeout? Maybe a stupid question.. my knowledge of 
> suspend-resume paths was not great even before I forgot it all.
alan: Tvrtko, you and I both, sir. (Apologies for the tardy response, yet again a
busy week.) I did trace back the gt->wakeref before I posted these patches and
followed the runtime get/put calls. The first 'get' leads to
__intel_wakeref_get_first, which calls intel_runtime_pm_get via the rpm_get
helper and eventually executes pm_runtime_get_sync(rpm->kdev) (hanging off
i915); of course there is a corresponding path for '_put_last'. Any non-first
get or non-last put only adjusts the gt-level counter. So if that last put is
missed, the runtime-PM core still sees an outstanding usage count on the i915
device, i.e. the kernel knows i915 hasn't 'put' everything and won't complete
the suspend.
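To make that get-first/put-last flow concrete, here is a simplified user-space
model of the counting behaviour (a sketch only, not the actual i915
intel_wakeref code; the comments merely echo the real symbols mentioned above):

/*
 * Model of the gt wakeref pattern: only the FIRST get and the LAST
 * put cross over into device runtime PM (pm_runtime_get_sync /
 * pm_runtime_put in i915).  A leaked last 'put' means the device PM
 * count never returns to zero, so the PM core knows i915 hasn't
 * released everything and refuses the suspend.
 */
#include <stdio.h>

static int wakeref_count;   /* models the gt-level wakeref count */
static int device_pm_count; /* models the runtime-PM usage count on i915 */

static void wakeref_get(void)
{
	if (wakeref_count++ == 0)
		device_pm_count++; /* models _get_first -> pm_runtime_get_sync() */
}

static void wakeref_put(void)
{
	if (--wakeref_count == 0)
		device_pm_count--; /* models _put_last -> pm_runtime_put() */
}

int main(void)
{
	wakeref_get(); /* first get: takes the device PM reference */
	wakeref_get(); /* nested get: only bumps the gt counter */
	wakeref_put(); /* non-last put: only drops the gt counter */
	/* the final wakeref_put() is "leaked" here */

	printf("gt wakerefs: %d, device PM refs: %d\n",
	       wakeref_count, device_pm_count);
	/* device PM refs != 0 => suspend cannot complete */
	return 0;
}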

alan:snip
> > 
> > Recap: so in both cases (original vs this patch), if we had a buggy gt-wakeref leak,
> > we don't get invalid guc-accesses, but without this patch, we wait forever,
> > and with this patch, we get some messages and eventually bail the suspend.
> 
> It is not possible to wait for lost G2H in something like 
> intel_uc_suspend() and simply declare "bad things happened" if it times 
> out there, and forcibly clean it all up? (Which would include releasing 
> all the abandoned pm refs, so this patch wouldn't be needed.)
> 
alan: I'm not sure intel_uc_suspend should be held up by a gt-level wakeref
check, unless huc/guc/gsc-uc are the only components that ever take a gt wakeref.

From a uc perspective, what we do know:
- Outstanding GuC-related workers are now flushed, which we didn't do before
(addressed by patch #1).
- A late H2G-SchedDisable no longer leaks a wakeref when the H2G call fails
without us realizing it (addressed by patch #2).
- We already "forcibly clean it all up" at the end of intel_uc_suspend, when
we do the GuC reset and clean up all the guc-ids (pre-existing upstream code).
- We previously didn't have a coherent guarantee that "this is the end", i.e.
no more new requests after intel_uc_suspend. By code logic we thought we did
(that's why intel_uc_suspend ends with a GuC reset), but we now know
otherwise. We fix that by adding the additional rcu_barrier (also part of
patch #2); the ordering is sketched right after this list.
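Putting those pieces together, the intended ordering at suspend is roughly the
following (an illustrative kernel-C sketch only, not the literal patches;
flush_work() and rcu_barrier() are the real primitives, the rest is
placeholder naming):

static void uc_suspend_ordering_sketch(struct work_struct *guc_worker)
{
	/* 1) patch #1: flush any outstanding GuC-related worker */
	flush_work(guc_worker);

	/* 2) patch #2: barrier so no new requests sneak in afterwards */
	rcu_barrier();

	/*
	 * 3) pre-existing upstream code: intel_uc_suspend() ends with
	 *    the GuC reset and cleanup of all guc-ids, i.e. the
	 *    "forcibly clean it all up" step.
	 */
}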

That said, patch #3 is NOT fixing a bug in GuC. It's about: if we ever have a
future racy gt-wakeref late-leak somewhere, no matter which subsystem took it
(GuC is not the only subsystem taking gt wakerefs), we at least don't hang
forever in this code and instead print some warnings and let the suspend
bail; see the sketch after this paragraph. Of course, based on that, even
without patch #3 I am confident the issue is resolved anyway, so we could
just drop patch #3 if you prefer.
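The shape of what patch #3 does is roughly the sketch below (hedged, not the
literal diff): bound the wait on the gt wakeref, warn, and hand back an error
so the suspend path can bail instead of blocking forever.
wait_var_event_timeout() and intel_wakeref_is_active() are real kernel/i915
symbols; the wrapper name and message here are simplified:

static int wakeref_wait_for_idle_timeout(struct intel_wakeref *wf,
					 long timeout_jiffies)
{
	long remaining;

	/* wait_var_event_timeout() returns 0 if the timeout expired */
	remaining = wait_var_event_timeout(&wf->wakeref,
					   !intel_wakeref_is_active(wf),
					   timeout_jiffies);
	if (!remaining) {
		pr_warn("gt wakeref still held after timeout, bailing suspend\n");
		return -ETIMEDOUT;
	}
	return 0;
}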

...alan

