[Intel-gfx] [PATCH v3 2/3] drm/i915/guc: Close deregister-context race against CT-loss
Teres Alexis, Alan Previn
alan.previn.teres.alexis at intel.com
Tue Sep 26 17:19:06 UTC 2023
>
> > alan:snip
> > > > @@ -3279,6 +3322,17 @@ static void destroyed_worker_func(struct work_struct *w)
> > > >  	struct intel_gt *gt = guc_to_gt(guc);
> > > >  	int tmp;
> > > >
> > > > + /*
> > > > +  * In rare cases we can get here via async context-free fence-signals that
> > > > +  * come very late in suspend flow or very early in resume flows. In these
> > > > +  * cases, GuC won't be ready but just skipping it here is fine as these
> > > > +  * pending-destroy-contexts get destroyed totally at GuC reset time at the
> > > > +  * end of suspend.. OR.. this worker can be picked up later on the next
> > > > +  * context destruction trigger after resume-completes
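alan: side note for clarity - the "skipping" described above is just an early-exit at
the top of the worker. A minimal sketch (the exact readiness predicate used in the
final re-rev may differ):

	/* Bail if GuC is down (late suspend / early resume); pending
	 * contexts are reaped at GuC reset at the end of suspend, or
	 * when this worker is re-queued by the next destroy trigger.
	 */
	if (!intel_guc_is_ready(guc))
		return;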
> > >
> > > who is triggering the work queue again?
> >
> > alan: short answer: we don't know - still hunting this (getting closer now..
> > using task tgid str-name lookups).
> > in the few times I've seen it, the call stack looked like this:
> >
> > [33763.582036] Call Trace:
> > [33763.582038]  <TASK>
> > [33763.582040]  dump_stack_lvl+0x69/0x97
> > [33763.582054]  guc_context_destroy+0x1b5/0x1ec
> > [33763.582067]  free_engines+0x52/0x70
> > [33763.582072]  rcu_do_batch+0x161/0x438
> > [33763.582084]  rcu_nocb_cb_kthread+0xda/0x2d0
> > [33763.582093]  kthread+0x13a/0x152
> > [33763.582102]  ? rcu_nocb_gp_kthread+0x6a7/0x6a7
> > [33763.582107]  ? css_get+0x38/0x38
> > [33763.582118]  ret_from_fork+0x1f/0x30
> > [33763.582128]  </TASK>
> Alan, the above trace is not due to a missing GT wakeref; it is due to an intel_context_put()
> which is called asynchronously via call_rcu(__free_engines). We need to insert an rcu_barrier()
> to flush all pending RCU callbacks in late suspend.
>
> Thanks,
> Anshuman.
> >
Thanks, Anshuman, for following up on the ongoing debug. I shall re-rev accordingly.
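For the re-rev, the plan is roughly the sketch below (placement is tentative - the
exact spot in i915's suspend sequence may change; rcu_barrier() is the stock kernel
primitive that blocks until all in-flight call_rcu() callbacks, including the async
__free_engines ones seen in the trace above, have completed):

	/* somewhere in the late-suspend path, e.g. i915_gem_suspend_late() */
	void i915_gem_suspend_late(struct drm_i915_private *i915)
	{
		...
		/*
		 * Flush outstanding call_rcu() callbacks (the async
		 * __free_engines -> intel_context_put chain) so no
		 * context-destroy work can race with CT going down.
		 */
		rcu_barrier();
		...
	}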
...alan