[PATCH 2/2] drm/xe/guc: Explicitly exit CT safe mode on unwind

Matthew Brost matthew.brost at intel.com
Tue Jun 24 15:32:29 UTC 2025


On Mon, Jun 23, 2025 at 05:29:21PM -0400, Rodrigo Vivi wrote:
> On Fri, Jun 13, 2025 at 12:09:37AM +0200, Michal Wajdeczko wrote:
> > During driver probe we might be briefly using CT safe mode, which
> > is based on a delayed work, but usually we are able to stop this
> > once we have IRQ fully operational.  However, if we abort the probe
> > quite early then during unwind we might try to destroy the workqueue
> > while there is still a pending delayed work that attempts to restart
> > itself which triggers a WARN.
> > 
> > This was recently observed during unsuccessful VF initialization:
> > 
> >  [ ] xe 0000:00:02.1: probe with driver xe failed with error -62
> >  [ ] ------------[ cut here ]------------
> >  [ ] workqueue: cannot queue safe_mode_worker_func [xe] on wq xe-g2h-wq
> >  [ ] WARNING: CPU: 9 PID: 0 at kernel/workqueue.c:2257 __queue_work+0x287/0x710
> >  [ ] RIP: 0010:__queue_work+0x287/0x710
> >  [ ] Call Trace:
> >  [ ]  delayed_work_timer_fn+0x19/0x30
> >  [ ]  call_timer_fn+0xa1/0x2a0
> > 
> > Exit the CT safe mode on unwind to avoid that warning.
> > 
> > Fixes: 09b286950f29 ("drm/xe/guc: Allow CTB G2H processing without G2H IRQ")
> > Signed-off-by: Michal Wajdeczko <michal.wajdeczko at intel.com>
> > Cc: Matthew Brost <matthew.brost at intel.com>
> > ---
> >  drivers/gpu/drm/xe/xe_guc_ct.c | 10 ++++++----
> >  1 file changed, 6 insertions(+), 4 deletions(-)
> > 
> > diff --git a/drivers/gpu/drm/xe/xe_guc_ct.c b/drivers/gpu/drm/xe/xe_guc_ct.c
> > index 822f4c33f730..6e353757e204 100644
> > --- a/drivers/gpu/drm/xe/xe_guc_ct.c
> > +++ b/drivers/gpu/drm/xe/xe_guc_ct.c
> > @@ -35,6 +35,11 @@
> >  #include "xe_pm.h"
> >  #include "xe_trace_guc.h"
> >  
> > +static void receive_g2h(struct xe_guc_ct *ct);
> > +static void g2h_worker_func(struct work_struct *w);
> > +static void safe_mode_worker_func(struct work_struct *w);
> > +static void ct_exit_safe_mode(struct xe_guc_ct *ct);
> > +
> >  #if IS_ENABLED(CONFIG_DRM_XE_DEBUG)
> >  enum {
> >  	/* Internal states, not error conditions */
> > @@ -189,14 +194,11 @@ static void guc_ct_fini(struct drm_device *drm, void *arg)
> >  {
> >  	struct xe_guc_ct *ct = arg;
> >  
> > +	ct_exit_safe_mode(ct);
> 
> I was going to ask if we also don't need to disable more of the ct
> at this point, but I believe it is indeed not relevant at this point
> of the code.
> 
> But is there any possibility of a double call of this function in
> some of the regular removal paths?
> 

ct_exit_safe_mode() is safe to call multiple times - it just cancels a
delayed work item, so a second call finds nothing pending and is a no-op.
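
To illustrate the point, here is a rough userspace model of the
behaviour, not the driver code. All toy_* names are made up for this
sketch, and the cancel helper only mimics the general contract of
helpers such as cancel_delayed_work_sync() (report whether anything was
pending, then leave nothing pending). It assumes the worker re-arms
itself each time it runs, as the commit message describes.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for a self-rearming delayed work item. */
struct toy_delayed_work {
	bool pending;	/* timer armed, work gets queued when it fires */
};

/* Hypothetical stand-in for the CT state relevant to this patch. */
struct toy_ct {
	struct toy_delayed_work safe_mode_worker;
	bool wq_alive;	/* false once the workqueue is destroyed */
	int warnings;	/* counts "cannot queue work on wq" events */
};

/* Model of the delayed-work timer firing. */
static void toy_timer_fires(struct toy_ct *ct)
{
	if (!ct->safe_mode_worker.pending)
		return;

	ct->safe_mode_worker.pending = false;

	/* Queueing onto a destroyed workqueue is what triggers the WARN. */
	if (!ct->wq_alive) {
		ct->warnings++;
		return;
	}

	/* The worker ran and re-armed itself. */
	ct->safe_mode_worker.pending = true;
}

/* Model of the cancel: true only if something was actually pending. */
static bool toy_exit_safe_mode(struct toy_ct *ct)
{
	bool was_pending = ct->safe_mode_worker.pending;

	ct->safe_mode_worker.pending = false;
	return was_pending;
}

/* Model of the unwind order in the patch: cancel first, then destroy. */
static void toy_fini(struct toy_ct *ct)
{
	toy_exit_safe_mode(ct);
	assert(!ct->safe_mode_worker.pending);
	ct->wq_alive = false;
}
```

In this model a second toy_exit_safe_mode() call simply returns false,
and a timer firing after toy_fini() never reaches the dead workqueue.
Skipping the cancel before clearing wq_alive is the case that bumps the
warnings counter.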

The code looks correct to me.
Reviewed-by: Matthew Brost <matthew.brost at intel.com>

> >  	destroy_workqueue(ct->g2h_wq);
> >  	xa_destroy(&ct->fence_lookup);
> >  }
> >  
> > -static void receive_g2h(struct xe_guc_ct *ct);
> > -static void g2h_worker_func(struct work_struct *w);
> > -static void safe_mode_worker_func(struct work_struct *w);
> > -
> >  static void primelockdep(struct xe_guc_ct *ct)
> >  {
> >  	if (!IS_ENABLED(CONFIG_LOCKDEP))
> > -- 
> > 2.47.1
> > 

