[PATCH v5] drm/xe/guc: Add devm release action to safely tear down CT

Sat Aug 16 06:52:23 UTC 2025

On 8/13/2025 12:12 PM, Satyanarayana K V P wrote:
> When a buffer object (BO) is allocated with the XE_BO_FLAG_GGTT_INVALIDATE
> flag, the driver initiates TLB invalidation requests via the CTB mechanism
> while releasing the BO. However a premature release of the CTB BO can lead
> to system crashes, as observed in:
> 
> Oops: Oops: 0000 [#1] SMP NOPTI
> RIP: 0010:h2g_write+0x2f3/0x7c0 [xe]
> Call Trace:
>  guc_ct_send_locked+0x8b/0x670 [xe]
>  xe_guc_ct_send_locked+0x19/0x60 [xe]
>  send_tlb_invalidation+0xb4/0x460 [xe]
>  xe_gt_tlb_invalidation_ggtt+0x15e/0x2e0 [xe]
>  ggtt_invalidate_gt_tlb.part.0+0x16/0x90 [xe]
>  ggtt_node_remove+0x110/0x140 [xe]
>  xe_ggtt_node_remove+0x40/0xa0 [xe]
>  xe_ggtt_remove_bo+0x87/0x250 [xe]
> 
> Introduce a devm-managed release action during xe_guc_ct_init() to ensure
> proper CTB disablement before resource deallocation, preventing the
> use-after-free scenario.
> 
> Signed-off-by: Satyanarayana K V P <satyanarayana.k.v.p at intel.com>
> Cc: Michal Wajdeczko <michal.wajdeczko at intel.com>
> Cc: Matthew Brost <matthew.brost at intel.com>
> Cc: Matthew Auld <matthew.auld at intel.com>
> Cc: Summers Stuart <stuart.summers at intel.com>
> 
> ---
> V4 -> V5:
> - Updated commit message.
> - Fixed similar crash while releasing resources on DGPU after reallocating
> CTB to VRAM from SMEM.
> 
> V3 -> V4:
> - Added new devm release function action_disable_ct()
> 
> V2 -> V3:
> - Moved GGTT validity check to xe_ggtt_invalidate(). (Michal Wajdeczko)
> 
> V1 -> V2:
> - An additional check for validity of GGTT is added in
> xe_gt_tlb_invalidation_ggtt() instead of removing XE_BO_FLAG_GGTT_INVALIDATE
> from memirq_alloc_pages(), __xe_sa_bo_manager_init() and xe_guc_ct_init().
> (Summers Stuart)
> ---
>  drivers/gpu/drm/xe/xe_guc.c    |  5 ++++-
>  drivers/gpu/drm/xe/xe_guc_ct.c | 14 +++++++++++++-
>  drivers/gpu/drm/xe/xe_guc_ct.h |  2 ++
>  3 files changed, 19 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/gpu/drm/xe/xe_guc.c b/drivers/gpu/drm/xe/xe_guc.c
> index 433abc787f7b..a6e46272c0ef 100644
> --- a/drivers/gpu/drm/xe/xe_guc.c
> +++ b/drivers/gpu/drm/xe/xe_guc.c
> @@ -705,7 +705,10 @@ static int xe_guc_realloc_post_hwconfig(struct xe_guc *guc)
>  	if (ret)
>  		return ret;
>  
> -	return 0;
> +	ret = devm_add_action_or_reset(xe->drm.dev, xe_guc_action_disable_ct,
> +				       &guc->ct);

hmm, while adding CT disable (aka fini) action from the CT init() function
makes sense, doing that here (due to reinit of the invisible here CT.bo)
looks suspicious and clearly violates the layering ... if this is the only way,
can you move CT related code to xe_guc_ct_init_post_hwconfig() or something?

> +
> +	return ret;
>  }
>  
>  static int vf_guc_init_noalloc(struct xe_guc *guc)
> diff --git a/drivers/gpu/drm/xe/xe_guc_ct.c b/drivers/gpu/drm/xe/xe_guc_ct.c
> index 3f4e6a46ff16..fe2410091bfc 100644
> --- a/drivers/gpu/drm/xe/xe_guc_ct.c
> +++ b/drivers/gpu/drm/xe/xe_guc_ct.c
> @@ -39,6 +39,8 @@ static void receive_g2h(struct xe_guc_ct *ct);
>  static void g2h_worker_func(struct work_struct *w);
>  static void safe_mode_worker_func(struct work_struct *w);
>  static void ct_exit_safe_mode(struct xe_guc_ct *ct);
> +static void guc_ct_change_state(struct xe_guc_ct *ct,
> +				enum xe_guc_ct_state state);
>  
>  #if IS_ENABLED(CONFIG_DRM_XE_DEBUG)
>  enum {
> @@ -252,12 +254,20 @@ int xe_guc_ct_init_noalloc(struct xe_guc_ct *ct)
>  }
>  ALLOW_ERROR_INJECTION(xe_guc_ct_init_noalloc, ERRNO); /* See xe_pci_probe() */
>  
> +void xe_guc_action_disable_ct(void *arg)

this should be a static function since it shall be used only
by the init()

> +{
> +	struct xe_guc_ct *ct = arg;
> +
> +	guc_ct_change_state(ct, XE_GUC_CT_STATE_DISABLED);
> +}
> +
>  int xe_guc_ct_init(struct xe_guc_ct *ct)
>  {
>  	struct xe_device *xe = ct_to_xe(ct);
>  	struct xe_gt *gt = ct_to_gt(ct);
>  	struct xe_tile *tile = gt_to_tile(gt);
>  	struct xe_bo *bo;
> +	int err = 0;
>  
>  	bo = xe_managed_bo_create_pin_map(xe, tile, guc_ct_size(),
>  					  XE_BO_FLAG_SYSTEM |
> @@ -268,7 +278,9 @@ int xe_guc_ct_init(struct xe_guc_ct *ct)
>  		return PTR_ERR(bo);
>  
>  	ct->bo = bo;
> -	return 0;
> +
> +	err = devm_add_action_or_reset(xe->drm.dev, xe_guc_action_disable_ct, ct);
> +	return err;

since this should be always the last step of the init() there is no
need to split this into assign err variable and then return err

>  }
>  ALLOW_ERROR_INJECTION(xe_guc_ct_init, ERRNO); /* See xe_pci_probe() */
>  
> diff --git a/drivers/gpu/drm/xe/xe_guc_ct.h b/drivers/gpu/drm/xe/xe_guc_ct.h
> index 18d4225e6502..b9f06ec0db52 100644
> --- a/drivers/gpu/drm/xe/xe_guc_ct.h
> +++ b/drivers/gpu/drm/xe/xe_guc_ct.h
> @@ -73,4 +73,6 @@ xe_guc_ct_send_block_no_fail(struct xe_guc_ct *ct, const u32 *action, u32 len)
>  
>  long xe_guc_ct_queue_proc_time_jiffies(struct xe_guc_ct *ct);
>  
> +void xe_guc_action_disable_ct(void *arg);

this shouldn't be a public functions as it takes random argument

> +
>  #endif