[PATCH v3] drm/xe: Improve VF provision stability with fault injection

Matthew Brost matthew.brost at intel.com
Wed Jul 23 16:24:55 UTC 2025


On Wed, Jul 23, 2025 at 06:40:08PM +0530, Satyanarayana K V P wrote:
> In the unlikely event of a PF malfunction or misconfiguration, a VF may
> receive an incomplete or invalid configuration, and it must be prepared
> to handle such cases without crashing.
> 
> When simulating errors with the kernel's fault injection framework, crashes
> were observed during device unbind. These crashes were primarily due to the
> use of the XE_BO_FLAG_GGTT_INVALIDATE flag when creating buffer objects.
> 
> The GGTT is invalidated using CTB, which is allocated with the
> XE_BO_FLAG_GGTT_INVALIDATE flag. However, the buffer object for CTB is
> freed before GGTT invalidation completes, leading to crashes.
> 
> Similarly, for buffer objects allocated in memirq_alloc_pages() and
> __xe_sa_bo_manager_init(), the CTB is already freed by the time GGTT
> invalidation occurs, resulting in system crashes.
> 
> To prevent these issues, an additional GGTT validity check is added to
> the GGTT invalidation path, which avoids sending CTB messages to the GuC.
> 
> Signed-off-by: Satyanarayana K V P <satyanarayana.k.v.p at intel.com>
> Cc: Michal Wajdeczko <michal.wajdeczko at intel.com>
> Cc: Matthew Brost <matthew.brost at intel.com>
> Cc: Matthew Auld <matthew.auld at intel.com>
> Cc: Summers Stuart <stuart.summers at intel.com>
> 
> ---
> V2 -> V3:
> - Moved GGTT validity check to xe_ggtt_invalidate(). (Michal Wajdeczko)
> 
> V1 -> V2:
> - An additional check for validity of GGTT is added in
> xe_gt_tlb_invalidation_ggtt() instead of removing XE_BO_FLAG_GGTT_INVALIDATE
> from memirq_alloc_pages(), __xe_sa_bo_manager_init() and xe_guc_ct_init().
> (Summers Stuart)
> ---
>  drivers/gpu/drm/xe/xe_ggtt.c | 10 ++++++++++
>  1 file changed, 10 insertions(+)
> 
> diff --git a/drivers/gpu/drm/xe/xe_ggtt.c b/drivers/gpu/drm/xe/xe_ggtt.c
> index 29d4d3f51da1..b3005a462186 100644
> --- a/drivers/gpu/drm/xe/xe_ggtt.c
> +++ b/drivers/gpu/drm/xe/xe_ggtt.c
> @@ -447,6 +447,16 @@ static void xe_ggtt_invalidate(struct xe_ggtt *ggtt)
>  {
>  	struct xe_device *xe = tile_to_xe(ggtt->tile);
>  
> +	/*
> +	 * To avoid crashes during BO unbind operations, always verify the
> +	 * validity of the GGTT before attempting invalidation. This check is
> +	 * particularly crucial when handling BOs created with the
> +	 * XE_BO_FLAG_GGTT_INVALIDATE flag, as improper invalidation of an
> +	 * invalid GGTT can cause crash.
> +	 */
> +	if (!ggtt->scratch)
> +		return;

Isn't this too late? I.e., are we sure no BOs are created with
XE_BO_FLAG_GGTT_INVALIDATE set prior to xe_ggtt_init? This seems like a
hard rule to track / maintain over an extended period of time.

What about adding an action in xe_uc_init_post_hwconfig, which runs very
late in driver load (so the action runs early in driver unload), that
turns off the GuC CTs? Any TLB invalidation client would need a similar
action too.
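
To illustrate the ordering argument, here is a minimal userspace sketch,
not xe driver code. It assumes cleanup actions run in reverse order of
registration, as devm/drmm-style managed actions do. Every name in it
(model_dev, model_add_action, model_ct_off, ...) is hypothetical and
only models the idea: an action registered late in load runs first on
unload, so BO teardown that happens afterwards sees the CT already off
and skips the invalidation.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define MAX_ACTIONS 8

typedef void (*action_fn)(void *arg);

struct action {
	action_fn fn;
	void *arg;
};

/* Hypothetical stand-in for the device; none of these fields exist in xe. */
struct model_dev {
	struct action actions[MAX_ACTIONS];
	size_t nr_actions;
	bool ct_enabled;	/* models "the CT can still carry messages" */
	int ct_sends;		/* invalidations that went out over the CT */
	int ct_skips;		/* invalidations skipped because the CT was off */
};

/* Register a cleanup action; returns 0 on success, -1 if the table is full. */
static int model_add_action(struct model_dev *dev, action_fn fn, void *arg)
{
	if (dev->nr_actions >= MAX_ACTIONS)
		return -1;
	dev->actions[dev->nr_actions].fn = fn;
	dev->actions[dev->nr_actions].arg = arg;
	dev->nr_actions++;
	return 0;
}

/* Unload: run actions in reverse registration order (last added, first run). */
static void model_release_all(struct model_dev *dev)
{
	while (dev->nr_actions > 0) {
		struct action *a = &dev->actions[--dev->nr_actions];

		a->fn(a->arg);
	}
}

/* Models the suggested late-registered action that turns the CT off. */
static void model_ct_off(void *arg)
{
	struct model_dev *dev = arg;

	dev->ct_enabled = false;
}

/* Models freeing a BO whose teardown wants a GGTT invalidation over the CT. */
static void model_bo_free(void *arg)
{
	struct model_dev *dev = arg;

	if (!dev->ct_enabled) {
		dev->ct_skips++;	/* CT already off: nothing is sent */
		return;
	}
	dev->ct_sends++;
}

/* Simulate load followed by unload, with or without the late CT-off action. */
static void model_load_unload(struct model_dev *dev, bool late_ct_off)
{
	memset(dev, 0, sizeof(*dev));
	dev->ct_enabled = true;

	/* Early in load: two BOs that invalidate the GGTT when freed. */
	model_add_action(dev, model_bo_free, dev);
	model_add_action(dev, model_bo_free, dev);

	/* Very late in load: registered last, so it runs first on unload. */
	if (late_ct_off)
		model_add_action(dev, model_ct_off, dev);

	model_release_all(dev);
}
```

With late_ct_off set, both modeled BO frees find the CT disabled and
skip the send; without it, both still try to send during teardown. The
model does not need to know which BOs were created before any given
init step, which is the maintenance concern raised above.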

Matt

> +
>  	/*
>  	 * XXX: Barrier for GGTT pages. Unsure exactly why this required but
>  	 * without this LNL is having issues with the GuC reading scratch page
> -- 
> 2.43.0
> 

