[PATCH 2/2] drm/xe/guc: Explicitly exit CT safe mode on unwind
Matthew Brost
matthew.brost at intel.com
Tue Jun 24 15:32:29 UTC 2025
On Mon, Jun 23, 2025 at 05:29:21PM -0400, Rodrigo Vivi wrote:
> On Fri, Jun 13, 2025 at 12:09:37AM +0200, Michal Wajdeczko wrote:
> > During driver probe we might be briefly using CT safe mode, which
> > is based on a delayed work, but usually we are able to stop this
> > once we have IRQ fully operational. However, if we abort the probe
> > quite early then during unwind we might try to destroy the workqueue
> > while there is still a pending delayed work that attempts to restart
> > itself which triggers a WARN.
> >
> > This was recently observed during unsuccessful VF initialization:
> >
> > [ ] xe 0000:00:02.1: probe with driver xe failed with error -62
> > [ ] ------------[ cut here ]------------
> > [ ] workqueue: cannot queue safe_mode_worker_func [xe] on wq xe-g2h-wq
> > [ ] WARNING: CPU: 9 PID: 0 at kernel/workqueue.c:2257 __queue_work+0x287/0x710
> > [ ] RIP: 0010:__queue_work+0x287/0x710
> > [ ] Call Trace:
> > [ ] delayed_work_timer_fn+0x19/0x30
> > [ ] call_timer_fn+0xa1/0x2a0
> >
> > Exit the CT safe mode on unwind to avoid that warning.
> >
> > Fixes: 09b286950f29 ("drm/xe/guc: Allow CTB G2H processing without G2H IRQ")
> > Signed-off-by: Michal Wajdeczko <michal.wajdeczko at intel.com>
> > Cc: Matthew Brost <matthew.brost at intel.com>
> > ---
> > drivers/gpu/drm/xe/xe_guc_ct.c | 10 ++++++----
> > 1 file changed, 6 insertions(+), 4 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/xe/xe_guc_ct.c b/drivers/gpu/drm/xe/xe_guc_ct.c
> > index 822f4c33f730..6e353757e204 100644
> > --- a/drivers/gpu/drm/xe/xe_guc_ct.c
> > +++ b/drivers/gpu/drm/xe/xe_guc_ct.c
> > @@ -35,6 +35,11 @@
> > #include "xe_pm.h"
> > #include "xe_trace_guc.h"
> >
> > +static void receive_g2h(struct xe_guc_ct *ct);
> > +static void g2h_worker_func(struct work_struct *w);
> > +static void safe_mode_worker_func(struct work_struct *w);
> > +static void ct_exit_safe_mode(struct xe_guc_ct *ct);
> > +
> > #if IS_ENABLED(CONFIG_DRM_XE_DEBUG)
> > enum {
> > /* Internal states, not error conditions */
> > @@ -189,14 +194,11 @@ static void guc_ct_fini(struct drm_device *drm, void *arg)
> > {
> > struct xe_guc_ct *ct = arg;
> >
> > + ct_exit_safe_mode(ct);
>
> I was going to ask if we also don't need to disable more of the ct
> at this point, but I believe it is indeed not relevant at this point
> of the code.
>
> But is there any possibility of a double call of this function in some of
> the regular removal paths?
>
ct_exit_safe_mode() is safe to call multiple times; it just cancels a
delayed worker.

The code looks correct to me.
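
For illustration only, here is a toy userspace model of the ordering
issue and of why a repeated cancel is harmless. This is not the kernel
workqueue API: every model_* name and simulate_unwind() are made up for
this sketch, and the "warning" counter merely stands in for the WARN
from __queue_work() shown in the commit message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-ins for the kernel objects; hypothetical, not the real API. */
struct model_wq {
	bool destroyed;
	int warnings;		/* counts "cannot queue ... on wq" events */
};

struct model_delayed_work {
	struct model_wq *wq;
	bool timer_pending;	/* timer armed, work not yet queued */
};

/* Arm the delay timer for the work on the given queue. */
static void model_queue_delayed_work(struct model_wq *wq,
				     struct model_delayed_work *dw)
{
	dw->wq = wq;
	dw->timer_pending = true;
}

/* Cancel the work; returns true only if something was actually pending. */
static bool model_cancel_delayed_work_sync(struct model_delayed_work *dw)
{
	bool was_pending = dw->timer_pending;

	dw->timer_pending = false;
	return was_pending;
}

static void model_destroy_workqueue(struct model_wq *wq)
{
	wq->destroyed = true;
}

/* Timer expiry: try to queue the work, which then re-arms itself. */
static void model_timer_fires(struct model_delayed_work *dw)
{
	if (!dw->timer_pending)
		return;
	dw->timer_pending = false;

	if (dw->wq->destroyed) {
		dw->wq->warnings++;	/* models the WARN on a dead queue */
		return;
	}
	/* Like the safe-mode worker, the work restarts itself. */
	model_queue_delayed_work(dw->wq, dw);
}

/* Returns the number of modelled warnings seen during unwind. */
int simulate_unwind(bool exit_safe_mode_first)
{
	struct model_wq wq = { false, 0 };
	struct model_delayed_work dw = { NULL, false };

	model_queue_delayed_work(&wq, &dw);	/* safe mode entered in probe */
	model_timer_fires(&dw);			/* one cycle, work re-arms */

	if (exit_safe_mode_first) {
		model_cancel_delayed_work_sync(&dw);
		/* A second cancel finds nothing pending and does nothing. */
		assert(!model_cancel_delayed_work_sync(&dw));
	}

	model_destroy_workqueue(&wq);
	model_timer_fires(&dw);			/* late expiry after teardown */
	return wq.warnings;
}
```

In this model, simulate_unwind(false) records one warning because the
timer is still armed when the queue goes away, while
simulate_unwind(true) records none, even with the cancel called twice.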
Reviewed-by: Matthew Brost <matthew.brost at intel.com>
> > destroy_workqueue(ct->g2h_wq);
> > xa_destroy(&ct->fence_lookup);
> > }
> >
> > -static void receive_g2h(struct xe_guc_ct *ct);
> > -static void g2h_worker_func(struct work_struct *w);
> > -static void safe_mode_worker_func(struct work_struct *w);
> > -
> > static void primelockdep(struct xe_guc_ct *ct)
> > {
> > if (!IS_ENABLED(CONFIG_LOCKDEP))
> > --
> > 2.47.1
> >