[Intel-gfx] [PATCH v3 2/3] drm/i915/guc: Close deregister-context race against CT-loss

Teres Alexis, Alan Previn alan.previn.teres.alexis at intel.com
Tue Sep 26 17:19:06 UTC 2023


> > alan:snip
> > > > @@ -3279,6 +3322,17 @@ static void destroyed_worker_func(struct work_struct *w)
> > > >  	struct intel_gt *gt = guc_to_gt(guc);
> > > >  	int tmp;
> > > > 
> > > > +	/*
> > > > +	 * In rare cases we can get here via async context-free fence-signals
> > > > +	 * that come very late in the suspend flow or very early in the resume
> > > > +	 * flow. In these cases, GuC won't be ready, but just skipping it here
> > > > +	 * is fine: these pending-destroy contexts get destroyed totally at
> > > > +	 * GuC reset time at the end of suspend, or this worker can be picked
> > > > +	 * up later on the next context-destruction trigger after resume
> > > > +	 * completes.
> > > who is triggering the work queue again?
> > 
> > alan: Short answer: we don't know - we're still hunting this (getting
> > closer now, using task-tgid string-name lookups).
> > In the few times I've seen it, the call stack looked like this:
> > 
> > [33763.582036] Call Trace:
> > [33763.582038]  <TASK>
> > [33763.582040]  dump_stack_lvl+0x69/0x97
> > [33763.582054]  guc_context_destroy+0x1b5/0x1ec
> > [33763.582067]  free_engines+0x52/0x70
> > [33763.582072]  rcu_do_batch+0x161/0x438
> > [33763.582084]  rcu_nocb_cb_kthread+0xda/0x2d0
> > [33763.582093]  kthread+0x13a/0x152
> > [33763.582102]  ? rcu_nocb_gp_kthread+0x6a7/0x6a7
> > [33763.582107]  ? css_get+0x38/0x38
> > [33763.582118]  ret_from_fork+0x1f/0x30
> > [33763.582128]  </TASK>

> Alan, the trace above is not due to a missing GT wakeref; it is due to an
> intel_context_put() that is called asynchronously via call_rcu(__free_engines).
> We need to insert an rcu_barrier() to flush all pending RCU callbacks in
> late suspend.
> 
> Thanks,
> Anshuman.
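
The fix Anshuman suggests above amounts to draining the asynchronous __free_engines RCU callbacks before GuC teardown. A rough sketch of what that could look like follows; this is only an illustration, not the final patch - the exact placement (shown here inside a hypothetical late-suspend function) is an assumption:

    /*
     * Sketch only: placement in i915's late-suspend path is assumed;
     * the actual re-rev may hook this in elsewhere.
     */
    static void i915_suspend_late_flush(struct drm_i915_private *i915)
    {
    	/*
    	 * Engines are freed via call_rcu(__free_engines), so a
    	 * guc_context_destroy() can still fire after suspend has
    	 * started. rcu_barrier() waits for all RCU callbacks queued
    	 * so far to complete, closing the race against CT loss.
    	 */
    	rcu_barrier();
    }

Note that rcu_barrier() waits for already-queued callbacks to finish (unlike synchronize_rcu(), which only waits for a grace period), which is exactly what is needed to flush pending __free_engines work.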
Thanks, Anshuman, for following up on the ongoing debug. I shall re-rev accordingly.
...alan
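
For reference, the guard described in the quoted hunk boils down to an early-exit check at the top of the worker. A sketch follows; the helper intel_guc_is_ready() and the field layout are assumptions for illustration, not necessarily what the patch uses:

    static void destroyed_worker_func(struct work_struct *w)
    {
    	struct intel_guc *guc = container_of(w, struct intel_guc,
    					     submission_state.destroyed_worker);
    
    	/*
    	 * Sketch: if GuC is not ready (very late in suspend or very
    	 * early in resume), just bail. Pending-destroy contexts are
    	 * cleaned up wholesale at GuC reset at the end of suspend, or
    	 * this worker runs again on the next destroy trigger.
    	 */
    	if (!intel_guc_is_ready(guc))
    		return;
    
    	/* ... normal context-deregistration loop ... */
    }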

