[Intel-gfx] [PATCH v3 2/3] drm/i915/guc: Close deregister-context race against CT-loss

Teres Alexis, Alan Previn alan.previn.teres.alexis at intel.com
Fri Sep 22 18:02:12 UTC 2023


(cc Anshuman who is working directly with the taskforce debugging this)

Thanks again for taking the time to review this patch.
Apologies for the tardiness, rest assured debug is still ongoing.

As mentioned in prior comments, the failure signatures and their frequency are
now different compared to runs without the patches of this series.
We are still gathering data, as we suspect a different wakeref is
still being held by the same trigger event despite these patches.

That said, we will continue to rebase / update this series, but will hold off on the
actual merge until we can be sure all the issues are resolved.

On Thu, 2023-09-14 at 11:34 -0400, Vivi, Rodrigo wrote:
> On Sat, Sep 09, 2023 at 08:58:45PM -0700, Alan Previn wrote:
> > If we are at the end of suspend or very early in resume
> > its possible an async fence signal could lead us to the
alan:snip


> > @@ -3188,19 +3202,33 @@ static inline void guc_lrc_desc_unpin(struct intel_context *ce)
> >  	/* Seal race with Reset */
> >  	spin_lock_irqsave(&ce->guc_state.lock, flags);
> >  	disabled = submission_disabled(guc);
> > 
alan:snip
> > +	/* Change context state to destroyed and get gt-pm */
> > +	__intel_gt_pm_get(gt);
> > +	set_context_destroyed(ce);
> > +	clr_context_registered(ce);
> > +
> > +	ret = deregister_context(ce, ce->guc_id.id);
> > +	if (ret) {
> > +		/* Undo the state change and put gt-pm if that failed */
> > +		set_context_registered(ce);
> > +		clr_context_destroyed(ce);
> > +		intel_gt_pm_put(gt);
> 
> This is a might_sleep inside a spin_lock! Are you 100% confident no WARN
> was seeing during the tests indicated in the commit msg?
alan: Good catch - I don't think we saw a WARN, but I'll go back and check
with the task force. I shall rework this function to move that call outside the lock;
a rough sketch of what I have in mind follows.
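
For reference, here is a rough sketch of the rework I have in mind (helper names mirror
the quoted diff; the submission_disabled() early-exit path and the final error plumbing
are elided, so treat this as an assumption about the next revision rather than final code):

static void guc_lrc_desc_unpin(struct intel_context *ce)
{
	struct intel_guc *guc = ce_to_guc(ce);
	struct intel_gt *gt = guc_to_gt(guc);
	unsigned long flags;

	/* Seal race with Reset: flip the context state under the lock only */
	spin_lock_irqsave(&ce->guc_state.lock, flags);
	/* (submission_disabled() early-exit path elided for brevity) */
	__intel_gt_pm_get(gt);
	set_context_destroyed(ce);
	clr_context_registered(ce);
	spin_unlock_irqrestore(&ce->guc_state.lock, flags);

	/* the H2G send may sleep, so issue it with no spinlock held */
	if (deregister_context(ce, ce->guc_id.id)) {
		/* undo the state change under the lock again... */
		spin_lock_irqsave(&ce->guc_state.lock, flags);
		set_context_registered(ce);
		clr_context_destroyed(ce);
		spin_unlock_irqrestore(&ce->guc_state.lock, flags);
		/* ...and drop gt-pm outside it, since the put may also sleep */
		intel_gt_pm_put(gt);
	}
}

Keeping the function void in that shape also lines up with your next comment below.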

> 
> > +	}
> > +	spin_unlock_irqrestore(&ce->guc_state.lock, flags);
> > +
> > +	return 0;
> 
> If you are always returning 0, there's no pointing in s/void/int...
Alan: agreed - will change to void.
> > 
> > 

alan:snip
> > @@ -3279,6 +3322,17 @@ static void destroyed_worker_func(struct work_struct *w)
> >  	struct intel_gt *gt = guc_to_gt(guc);
> >  	int tmp;
> >  
> > +	/*
> > +	 * In rare cases we can get here via async context-free fence-signals that
> > +	 * come very late in suspend flow or very early in resume flows. In these
> > +	 * cases, GuC won't be ready but just skipping it here is fine as these
> > +	 * pending-destroy-contexts get destroyed totally at GuC reset time at the
> > +	 * end of suspend.. OR.. this worker can be picked up later on the next
> > +	 * context destruction trigger after resume-completes
> 
> who is triggering the work queue again?

alan: Short answer: we don't know yet - we are still hunting this down
(getting closer now, using task tgid string-name lookups).
In the few times I've hit it, the callstack looked like this:

[33763.582036] Call Trace:
[33763.582038]  <TASK>
[33763.582040]  dump_stack_lvl+0x69/0x97
[33763.582054]  guc_context_destroy+0x1b5/0x1ec
[33763.582067]  free_engines+0x52/0x70
[33763.582072]  rcu_do_batch+0x161/0x438
[33763.582084]  rcu_nocb_cb_kthread+0xda/0x2d0
[33763.582093]  kthread+0x13a/0x152
[33763.582102]  ? rcu_nocb_gp_kthread+0x6a7/0x6a7
[33763.582107]  ? css_get+0x38/0x38
[33763.582118]  ret_from_fork+0x1f/0x30
[33763.582128]  </TASK>

I did add additional debug messages for tracking, and I recall seeing this sequence via independent callstacks in the big picture:
	i915_sw_fence_complete -> __i915_sw_fence_complete -> __i915_sw_fence_notify(fence, FENCE_FREE) -> <..delayed?..>
	[ drm fence sync func ] <...> engines_notify -> call_rcu(&engines->rcu, free_engines_rcu) -> <..delayed?..>
	free_engines -> intel_context_put -> ... [kref-dec] -> guc_context_destroy
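
For readers without the source handy, this is roughly the shape of that path (a simplified,
from-memory paraphrase of the i915 helpers named above - the real functions handle
FENCE_COMPLETE and additional cleanup as well). It shows why the final context put is
deferred to the RCU callback kthread seen in the trace:

static void free_engines_rcu(struct rcu_head *rcu)
{
	struct i915_gem_engines *engines =
		container_of(rcu, struct i915_gem_engines, rcu);

	/* drops the intel_context references; the last put ends in guc_context_destroy() */
	free_engines(engines);
}

static int engines_notify(struct i915_sw_fence *fence,
			  enum i915_sw_fence_notify state)
{
	struct i915_gem_engines *engines =
		container_of(fence, typeof(*engines), fence);

	if (state == FENCE_FREE)
		/* defer the free to the RCU callback kthread (rcu_do_batch in the trace) */
		call_rcu(&engines->rcu, free_engines_rcu);

	return NOTIFY_DONE;
}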

Unfortunately, we still don't know why this initial "i915_sw_fence_complete" is coming in during suspend-late.
NOTE1: in the cover letter or a prior comment, I hope I mentioned the reproduction steps:
it only occurs with a workload doing a network download that begins just before suspend
is started but completes before suspend-late. We are getting close to pinning this
down - it is taking time because of the reproduction steps.

Anshuman can chime in if he is seeing new signatures with different callstacks or
events after this patch is applied.


> I mean, I'm wondering if it is necessary to re-schedule it from inside...
> but also wonder if this is safe anyway...

alan: So my thought process, also after consulting with John and Daniele, was:
since we have a linked list to collect the contexts that need to be deregistered,
and since we have up to 64k (32k excluding multi-lrc) guc-ids, there really is no
risk in just keeping them on that list until we reach one of the following
(see the sketch after the two scenarios below):

1. A full suspend happens and we actually reset the GuC - in which case we will remove
   all contexts on this list, completely destroy them and release all references
   immediately, without sending any H2G. (That cleanup occurs near the end of
   gem's suspend-late, after which this worker will flush and, if it had pending
   contexts, would have bailed on !intel_guc_is_ready.)

2. Suspend is aborted, so we come back through the resume steps and eventually, at some
   point, get another request completion that signals a context free, which re-triggers
   this worker - at which point it will find both contexts ready to be deregistered.
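
To make scenario 2 concrete, here is roughly how the re-trigger works (a simplified,
from-memory paraphrase of the existing list/worker plumbing, not the exact source): every
context release parks the context on the destroyed-contexts list and kicks the worker
again, so anything skipped while GuC was not ready is picked up by the next kick after
resume, or scrubbed wholesale when the GuC is reset during a full suspend.

static void guc_context_destroy(struct kref *kref)
{
	struct intel_context *ce = container_of(kref, typeof(*ce), ref);
	struct intel_guc *guc = ce_to_guc(ce);
	unsigned long flags;

	/* park the context on the pending-deregister list... */
	spin_lock_irqsave(&guc->submission_state.lock, flags);
	list_add_tail(&ce->destroyed_link,
		      &guc->submission_state.destroyed_contexts);
	spin_unlock_irqrestore(&guc->submission_state.lock, flags);

	/* ...and kick destroyed_worker_func(), which bails early if !intel_guc_is_ready() */
	queue_work(system_unbound_wq, &guc->submission_state.destroyed_worker);
}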


> 
> > +	 */
> > +	if (!intel_guc_is_ready(guc))
> > +		return;
> > +
> >  	with_intel_gt_pm(gt, tmp)
> >  		deregister_destroyed_contexts(guc);
> >  }
> > -- 
> > 2.39.0
> > 


