[Intel-gfx] [PATCH v4 3/3] drm/i915/gt: Timeout when waiting for idle in suspending

Tvrtko Ursulin tvrtko.ursulin at linux.intel.com
Tue Nov 14 17:27:18 UTC 2023


On 13/11/2023 17:57, Teres Alexis, Alan Previn wrote:
> On Wed, 2023-10-25 at 13:58 +0100, Tvrtko Ursulin wrote:
>> On 04/10/2023 18:59, Teres Alexis, Alan Previn wrote:
>>> On Thu, 2023-09-28 at 13:46 +0100, Tvrtko Ursulin wrote:
>>>> On 27/09/2023 17:36, Teres Alexis, Alan Previn wrote:
> alan:snip
>>>> Is it not possible to wait for the lost G2H in something like
>>>> intel_uc_suspend(), simply declare "bad things happened" if it times
>>>> out there, and forcibly clean it all up? (That would include releasing
>>>> all the abandoned pm refs, so this patch wouldn't be needed.)
>>>>
>>> alan: I'm not sure intel_uc_suspend should be held up by a gt-level wakeref
>>> check, unless huc/guc/gsc-uc are the only ones ever taking a gt wakeref.
>>>
>>> What we do know from a uc perspective:
>>> - we now ensure the outstanding guc related workers are flushed, which we
>>> didn't do before (addressed by patch #1).
>>> - any further late H2G-SchedDisable no longer leaks wakerefs when the H2G
>>> call fails without us realizing it (addressed by patch #2).
>>> - we already "forcibly clean it all up" at the end of intel_uc_suspend
>>> when we do the guc reset and clean up all guc-ids (pre-existing upstream code).
>>> - we previously had no coherent guarantee of "this is the end", i.e. no
>>> more new requests after intel_uc_suspend. By code logic we thought we did
>>> (that's why intel_uc_suspend ends with a guc reset), but we now know otherwise.
>>> We fix that by adding the additional rcu_barrier (also part of patch #2).
>>
>> It is not clear to me from the above whether that includes cleaning up the
>> outstanding CT replies or not. But anyway..
> alan: Apologies, I should have made it clearer by citing the actual function
> names and calling out the steps of interest. If you look at the function
> "intel_gt_suspend_late", it calls "intel_uc_suspend", which in turn calls
> "intel_guc_suspend", which does a soft reset of the guc firmware - so after
> that we can assume all outstanding G2Hs are done. Going back up the stack,
> intel_gt_suspend_late finally calls gt_sanitize, which calls "intel_uc_reset_prepare",
> which calls "intel_guc_submission_reset_prepare", which calls
> "scrub_guc_desc_for_outstanding_g2h" - and that does what we are looking for,
> for all types of outstanding G2H. As per prior review comments, besides closing
> the race condition we were also missing an rcu_barrier (which allowed new
> requests to come in very late). We have resolved both in patch #2.

Cool, I read this as all known bugs are fixed by the first two patches.
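
For the record, this is roughly how I picture the flow you describe. A
condensed sketch only - the wrapper name and the exact placement of the
rcu_barrier() are my approximation, the callees are the ones you cite:

static void intel_gt_suspend_late_sketch(struct intel_gt *gt)
{
	/*
	 * Patch #2: a final barrier so no new requests can sneak in
	 * after this point (placement here is my guess).
	 */
	rcu_barrier();

	/*
	 * intel_uc_suspend() -> intel_guc_suspend() soft resets the
	 * GuC firmware, after which no further outstanding G2H can
	 * arrive.
	 */
	intel_uc_suspend(&gt->uc);

	/*
	 * gt_sanitize() -> intel_uc_reset_prepare() ->
	 * intel_guc_submission_reset_prepare() ->
	 * scrub_guc_desc_for_outstanding_g2h() then scrubs whatever
	 * outstanding G2H state remains, for all types of G2H.
	 */
	gt_sanitize(gt, false);
}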

>>> That said, patch #3 is NOT fixing a bug in guc - it is about: if we ever
>>> have a future racy gt-wakeref late-leak somewhere, no matter which subsystem
>>> took it (guc is not the only subsystem taking gt wakerefs), we at
>>> least don't hang forever in this code. Of course, based on that, even
>>> without patch #3 I am confident the issue is resolved anyway.
>>> So we could just drop patch #3 if you prefer?
>>
>> .. given this, it does sound to me that if you are confident patch 3
>> isn't fixing anything today, then it should be dropped.
> alan: I won't say it's NOT fixing anything - I am saying it's not fixing
> this specific bug, where we have an outstanding G2H from a context destruction
> operation that got dropped. I am okay with dropping this patch in the next rev,
> but I shall post a separate standalone version of patch #3, because I believe
> all other i915 subsystems that take runtime-pm refs DO NOT wait forever when
> entering suspend - but GT does. This means that if such a bug were ever
> introduced, it would require serial port or ramoops collection to debug. So I
> think such a patch, despite not fixing this specific bug, will be very helpful
> for the debuggability of future issues. After all, it is better to fail our
> suspend when we have a bug than to hang the kernel forever.

The issue I have is that I am not seeing how it fails the suspend when
nothing is passed out from *void* wait_suspend(gt). To me it looks like the
suspend just carries on. And if so, it sounds dangerous to allow it to do
that with any future/unknown bugs in the suspend sequence. The existing
timeout wedges the GPU (and all that entails); the new code just says "meh,
I'll carry on regardless".
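
To make my point concrete, a hypothetical sketch - function names and the
timeout value are approximate, not the exact patch:

static void wait_for_suspend_sketch(struct intel_gt *gt)
{
	/*
	 * Existing behaviour: a stuck GT gets wedged, with everything
	 * that entails.
	 */
	if (intel_gt_wait_for_idle(gt, I915_GEM_IDLE_TIMEOUT) == -ETIME) {
		intel_gt_set_wedged(gt);
		intel_gt_retire_requests(gt);
	}

	/*
	 * Patch 3 as I read it: the wakeref wait can now time out, but
	 * because this function returns void the only outcome is a log
	 * line - the caller cannot fail the suspend, and we carry on
	 * with wakerefs still held.
	 */
	if (intel_gt_pm_wait_timeout_for_idle(gt, 500 /* ms, say */) == -ETIMEDOUT)
		gt_warn(gt, "timed out waiting for gt wakerefs on suspend\n");
}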

Regards,

Tvrtko


