[Intel-gfx] [PATCH v1 1/3] drm/i915/guc: Flush context destruction worker at suspend
Teres Alexis, Alan Previn
alan.previn.teres.alexis at intel.com
Wed Aug 9 21:09:01 UTC 2023
Thanks Rodrigo for reviewing this.
On Mon, 2023-08-07 at 13:52 -0400, Vivi, Rodrigo wrote:
> On Wed, Aug 02, 2023 at 04:34:59PM -0700, Alan Previn wrote:
> > Suspend is not like reset, it can unroll, so we have to properly
> > flush pending context-guc-id deregistrations to complete before
> > we return from suspend calls.
>
> But if is 'unrolls' the execution should just continue, no?!
> In other words, why is this flush needed? What happens if we
> don't flush, but resume doesn't proceed? in in which case
> of resume you are thinking that it returns and not having flushed?
alan: I apologize, i realize I should have explained it better.
The flush is needed for when we DON'T unroll. I wanted to point
out that at in intel_gt_suspend_foo we dont actually know
if the suspend is going to unroll or not so we should always flush
properly in the case. I will re-rev with better comment on this patch.
alan:snip
>
> >
> > static void wait_for_suspend(struct intel_gt *gt)
> > {
> > - if (!intel_gt_pm_is_awake(gt))
> > + if (!intel_gt_pm_is_awake(gt)) {
> > + intel_uc_suspend_prepare(>->uc);
>
> why only on idle?
alan: actually no - i am flushing for both idle and non-idle cases (see farther
below) but for the non-idle case, i am skipping the flush if we decide to wedge
(since the wedge will clear up everything -> all guc-contexts that are inflight
are nuked and freed).
>
> Well, I know, if we are in idle it is because all the requests had
> already ended and gt will be wedged, but why do we need to do anything
> if we are in idle?
Because the issue that is seen as a very late context-deregister that
is triggered by a, orthogonal thread via the following callstack:
drm-fence signal triggers a FENCE_FREE via__i915_sw_fence_notify that
connects to engines_notify -> free_engines_rcu -> (thread) ->
intel_context_put -> kref_put(&ce->ref..) that queues the
context-destruction worker. That said, eventhough the gt is idle because
the last workload has been retired a context-destruction worker
may have has just gotten queued.
>
> And why here and not some upper layer? like in prepare....
alan: wait_for_suspend does both the checking for idle as well as the potential
wedging if guc or hw has hung at this late state. if i call from the upper
layer before this wait_for_suspend, it might be too early because the
wait-for-idle could experience workloads completing and new contexts-derigtrations
being queued. If i call it from upper layer after wait_for_suspend, then it would
be unnecessary if wait_for_suspend decided to nuke everything. Hmmm.. but i guess
the latter could work too - since the nuke case would have emptied out the link-list
of that worker and so it would either run and do nothing or would not even be queued.
Would you rather i go that way? (i'll recheck with my team mates for i-dotting/t-crossing
discussion.
>
> > return;
> > + }
> >
> > if (intel_gt_wait_for_idle(gt, I915_GT_SUSPEND_IDLE_TIMEOUT) == -ETIME) {
> > /*
> > @@ -299,6 +301,8 @@ static void wait_for_suspend(struct intel_gt *gt)
> > */
> > intel_gt_set_wedged(gt);
> > intel_gt_retire_requests(gt);
> > + } else {
> > + intel_uc_suspend_prepare(>->uc);
> > }
> >
> > intel_gt_pm_wait_for_idle(gt);
alan:snip
More information about the Intel-gfx
mailing list