[Intel-gfx] [PATCH v1 1/3] drm/i915/guc: Flush context destruction worker at suspend

Wed Aug 9 21:09:01 UTC 2023

Thanks Rodrigo for reviewing this.

On Mon, 2023-08-07 at 13:52 -0400, Vivi, Rodrigo wrote:
> On Wed, Aug 02, 2023 at 04:34:59PM -0700, Alan Previn wrote:
> > Suspend is not like reset, it can unroll, so we have to properly
> > flush pending context-guc-id deregistrations to complete before
> > we return from suspend calls.
> 
> But if is 'unrolls' the execution should just continue, no?!
> In other words, why is this flush needed? What happens if we
> don't flush, but resume doesn't proceed? in in which case
> of resume you are thinking that it returns and not having flushed?

alan: I apologize, i realize I should have explained it better.
The flush is needed for when we DON'T unroll. I wanted to point
out that at in intel_gt_suspend_foo we dont actually know
if the suspend is going to unroll or not so we should always flush
properly in the case. I will re-rev with better comment on this patch.

alan:snip
> 
> >  
> >  static void wait_for_suspend(struct intel_gt *gt)
> >  {
> > -	if (!intel_gt_pm_is_awake(gt))
> > +	if (!intel_gt_pm_is_awake(gt)) {
> > +		intel_uc_suspend_prepare(&gt->uc);
> 
> why only on idle?

alan: actually no - i am flushing for both idle and non-idle cases (see farther
below) but for the non-idle case, i am skipping the flush if we decide to wedge
(since the wedge will clear up everything -> all guc-contexts that are inflight
are nuked and freed).
> 
> Well, I know, if we are in idle it is because all the requests had
> already ended and gt will be wedged, but why do we need to do anything
> if we are in idle?

Because the issue that is seen as a very late context-deregister that
is triggered by a, orthogonal thread via the following callstack:
drm-fence signal triggers a FENCE_FREE via__i915_sw_fence_notify that
connects to engines_notify -> free_engines_rcu -> (thread) ->
intel_context_put -> kref_put(&ce->ref..) that queues the
context-destruction worker. That said, eventhough the gt is idle because
the last workload has been retired a context-destruction worker
may have has just gotten queued. 

> 
> And why here and not some upper layer? like in prepare....
alan: wait_for_suspend does both the checking for idle as well as the potential
wedging if guc or hw has hung at this late state. if i call from the upper
layer before this wait_for_suspend, it might be too early because the
wait-for-idle could experience workloads completing and new contexts-derigtrations
being queued. If i call it from upper layer after wait_for_suspend, then it would
be unnecessary if wait_for_suspend decided to nuke everything. Hmmm.. but i guess
the latter could work too - since the nuke case would have emptied out the link-list
of that worker and so it would either run and do nothing or would not even be queued.
Would you rather i go that way? (i'll recheck with my team mates for i-dotting/t-crossing
discussion.
> 
> >  		return;
> > +	}
> >  
> >  	if (intel_gt_wait_for_idle(gt, I915_GT_SUSPEND_IDLE_TIMEOUT) == -ETIME) {
> >  		/*
> > @@ -299,6 +301,8 @@ static void wait_for_suspend(struct intel_gt *gt)
> >  		 */
> >  		intel_gt_set_wedged(gt);
> >  		intel_gt_retire_requests(gt);
> > +	} else {
> > +		intel_uc_suspend_prepare(&gt->uc);
> >  	}
> >  
> >  	intel_gt_pm_wait_for_idle(gt);
alan:snip