[PATCH v5] drm/xe/guc: Add devm release action to safely tear down CT

Sat Aug 16 13:52:13 UTC 2025

On Sat, Aug 16, 2025 at 08:52:23AM +0200, Michal Wajdeczko wrote:
> 

I missed Michal's points in review, I agree with his statements.

Matt

> 
> On 8/13/2025 12:12 PM, Satyanarayana K V P wrote:
> > When a buffer object (BO) is allocated with the XE_BO_FLAG_GGTT_INVALIDATE
> > flag, the driver initiates TLB invalidation requests via the CTB mechanism
> > while releasing the BO. However a premature release of the CTB BO can lead
> > to system crashes, as observed in:
> > 
> > Oops: Oops: 0000 [#1] SMP NOPTI
> > RIP: 0010:h2g_write+0x2f3/0x7c0 [xe]
> > Call Trace:
> >  guc_ct_send_locked+0x8b/0x670 [xe]
> >  xe_guc_ct_send_locked+0x19/0x60 [xe]
> >  send_tlb_invalidation+0xb4/0x460 [xe]
> >  xe_gt_tlb_invalidation_ggtt+0x15e/0x2e0 [xe]
> >  ggtt_invalidate_gt_tlb.part.0+0x16/0x90 [xe]
> >  ggtt_node_remove+0x110/0x140 [xe]
> >  xe_ggtt_node_remove+0x40/0xa0 [xe]
> >  xe_ggtt_remove_bo+0x87/0x250 [xe]
> > 
> > Introduce a devm-managed release action during xe_guc_ct_init() to ensure
> > proper CTB disablement before resource deallocation, preventing the
> > use-after-free scenario.
> > 
> > Signed-off-by: Satyanarayana K V P <satyanarayana.k.v.p at intel.com>
> > Cc: Michal Wajdeczko <michal.wajdeczko at intel.com>
> > Cc: Matthew Brost <matthew.brost at intel.com>
> > Cc: Matthew Auld <matthew.auld at intel.com>
> > Cc: Summers Stuart <stuart.summers at intel.com>
> > 
> > ---
> > V4 -> V5:
> > - Updated commit message.
> > - Fixed similar crash while releasing resources on DGPU after reallocating
> > CTB to VRAM from SMEM.
> > 
> > V3 -> V4:
> > - Added new devm release function action_disable_ct()
> > 
> > V2 -> V3:
> > - Moved GGTT validity check to xe_ggtt_invalidate(). (Michal Wajdeczko)
> > 
> > V1 -> V2:
> > - An additional check for validity of GGTT is added in
> > xe_gt_tlb_invalidation_ggtt() instead of removing XE_BO_FLAG_GGTT_INVALIDATE
> > from memirq_alloc_pages(), __xe_sa_bo_manager_init() and xe_guc_ct_init().
> > (Summers Stuart)
> > ---
> >  drivers/gpu/drm/xe/xe_guc.c    |  5 ++++-
> >  drivers/gpu/drm/xe/xe_guc_ct.c | 14 +++++++++++++-
> >  drivers/gpu/drm/xe/xe_guc_ct.h |  2 ++
> >  3 files changed, 19 insertions(+), 2 deletions(-)
> > 
> > diff --git a/drivers/gpu/drm/xe/xe_guc.c b/drivers/gpu/drm/xe/xe_guc.c
> > index 433abc787f7b..a6e46272c0ef 100644
> > --- a/drivers/gpu/drm/xe/xe_guc.c
> > +++ b/drivers/gpu/drm/xe/xe_guc.c
> > @@ -705,7 +705,10 @@ static int xe_guc_realloc_post_hwconfig(struct xe_guc *guc)
> >  	if (ret)
> >  		return ret;
> >  
> > -	return 0;
> > +	ret = devm_add_action_or_reset(xe->drm.dev, xe_guc_action_disable_ct,
> > +				       &guc->ct);
> 
> hmm, while adding CT disable (aka fini) action from the CT init() function
> makes sense, doing that here (due to reinit of the invisible here CT.bo)
> looks suspicious and clearly violates the layering ... if this is the only way,
> can you move CT related code to xe_guc_ct_init_post_hwconfig() or something?
> 
> > +
> > +	return ret;
> >  }
> >  
> >  static int vf_guc_init_noalloc(struct xe_guc *guc)
> > diff --git a/drivers/gpu/drm/xe/xe_guc_ct.c b/drivers/gpu/drm/xe/xe_guc_ct.c
> > index 3f4e6a46ff16..fe2410091bfc 100644
> > --- a/drivers/gpu/drm/xe/xe_guc_ct.c
> > +++ b/drivers/gpu/drm/xe/xe_guc_ct.c
> > @@ -39,6 +39,8 @@ static void receive_g2h(struct xe_guc_ct *ct);
> >  static void g2h_worker_func(struct work_struct *w);
> >  static void safe_mode_worker_func(struct work_struct *w);
> >  static void ct_exit_safe_mode(struct xe_guc_ct *ct);
> > +static void guc_ct_change_state(struct xe_guc_ct *ct,
> > +				enum xe_guc_ct_state state);
> >  
> >  #if IS_ENABLED(CONFIG_DRM_XE_DEBUG)
> >  enum {
> > @@ -252,12 +254,20 @@ int xe_guc_ct_init_noalloc(struct xe_guc_ct *ct)
> >  }
> >  ALLOW_ERROR_INJECTION(xe_guc_ct_init_noalloc, ERRNO); /* See xe_pci_probe() */
> >  
> > +void xe_guc_action_disable_ct(void *arg)
> 
> this should be a static function since it shall be used only
> by the init()
> 
> > +{
> > +	struct xe_guc_ct *ct = arg;
> > +
> > +	guc_ct_change_state(ct, XE_GUC_CT_STATE_DISABLED);
> > +}
> > +
> >  int xe_guc_ct_init(struct xe_guc_ct *ct)
> >  {
> >  	struct xe_device *xe = ct_to_xe(ct);
> >  	struct xe_gt *gt = ct_to_gt(ct);
> >  	struct xe_tile *tile = gt_to_tile(gt);
> >  	struct xe_bo *bo;
> > +	int err = 0;
> >  
> >  	bo = xe_managed_bo_create_pin_map(xe, tile, guc_ct_size(),
> >  					  XE_BO_FLAG_SYSTEM |
> > @@ -268,7 +278,9 @@ int xe_guc_ct_init(struct xe_guc_ct *ct)
> >  		return PTR_ERR(bo);
> >  
> >  	ct->bo = bo;
> > -	return 0;
> > +
> > +	err = devm_add_action_or_reset(xe->drm.dev, xe_guc_action_disable_ct, ct);
> > +	return err;
> 
> since this should be always the last step of the init() there is no
> need to split this into assign err variable and then return err
> 
> >  }
> >  ALLOW_ERROR_INJECTION(xe_guc_ct_init, ERRNO); /* See xe_pci_probe() */
> >  
> > diff --git a/drivers/gpu/drm/xe/xe_guc_ct.h b/drivers/gpu/drm/xe/xe_guc_ct.h
> > index 18d4225e6502..b9f06ec0db52 100644
> > --- a/drivers/gpu/drm/xe/xe_guc_ct.h
> > +++ b/drivers/gpu/drm/xe/xe_guc_ct.h
> > @@ -73,4 +73,6 @@ xe_guc_ct_send_block_no_fail(struct xe_guc_ct *ct, const u32 *action, u32 len)
> >  
> >  long xe_guc_ct_queue_proc_time_jiffies(struct xe_guc_ct *ct);
> >  
> > +void xe_guc_action_disable_ct(void *arg);
> 
> this shouldn't be a public functions as it takes random argument
> 
> > +
> >  #endif
>