[PATCH v2] drm/xe: Improve VF provision stability with fault injection

Wed Jul 23 11:25:25 UTC 2025

On 7/23/2025 1:13 PM, Satyanarayana K V P wrote:
> In the unlikely event of a PF malfunction or misconfiguration, a VF may
> receive an incomplete or invalid configuration, and it must be prepared
> to handle such cases without crashing.
> 
> When simulating errors with the kernel's fault injection framework, crashes
> were observed during device unbind. These crashes were primarily due to the
> use of the XE_BO_FLAG_GGTT_INVALIDATE flag when creating buffer objects.
> 
> The GGTT is invalidated using CTB, which is allocated with the
> XE_BO_FLAG_GGTT_INVALIDATE flag. However, the buffer object for CTB is
> freed before GGTT invalidation completes, leading to crashes.
> 
> Similarly, for buffer objects allocated in memirq_alloc_pages() and
> __xe_sa_bo_manager_init(), the CTB is already freed by the time GGTT
> invalidation occurs, resulting in system crashes.
> 
> To prevent these issues, while invalidating ggtt, an additional check for
> validity of ggtt is added which avoids sending CTB to GUC.

s/ggtt/GGTT

> 
> Signed-off-by: Satyanarayana K V P <satyanarayana.k.v.p at intel.com>
> Cc: Michal Wajdeczko <michal.wajdeczko at intel.com>
> Cc: Matthew Brost <matthew.brost at intel.com>
> Cc: Matthew Auld <matthew.auld at intel.com>
> Cc: Summers Stuart <stuart.summers at intel.com>
> 
> ---
> V1 -> V2:
> - An additional check for validity of ggtt is added in
> xe_gt_tlb_invalidation_ggtt() instead of removing XE_BO_FLAG_GGTT_INVALIDATE
> from memirq_alloc_pages(), __xe_sa_bo_manager_init() and xe_guc_ct_init().
> (Summers Stuart)
> ---
>  drivers/gpu/drm/xe/xe_gt_tlb_invalidation.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/gpu/drm/xe/xe_gt_tlb_invalidation.c b/drivers/gpu/drm/xe/xe_gt_tlb_invalidation.c
> index 086c12ee3d9d..145b5f90e347 100644
> --- a/drivers/gpu/drm/xe/xe_gt_tlb_invalidation.c
> +++ b/drivers/gpu/drm/xe/xe_gt_tlb_invalidation.c
> @@ -294,10 +294,12 @@ static int xe_gt_tlb_invalidation_guc(struct xe_gt *gt,
>   */
>  int xe_gt_tlb_invalidation_ggtt(struct xe_gt *gt)
>  {
> +	struct xe_tile *tile = gt_to_tile(gt);
>  	struct xe_device *xe = gt_to_xe(gt);
>  	unsigned int fw_ref;
>  
> -	if (xe_guc_ct_enabled(&gt->uc.guc.ct) &&
> +	if (tile->mem.ggtt->scratch &&

If ggtt->scratch is the only good way to check whether the GGTT is
still valid, then maybe instead of checking it here, at the GT layer,
test it in the only caller, xe_ggtt_invalidate(), at the GGTT layer,
where this is internal data. Here it looks like we violate layering
by reaching back up to the tile and the GGTT.
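
To illustrate the layering, here is a rough standalone mock. It is not
actual xe code: every type and function name below is a hypothetical
stand-in, and it assumes that a NULL scratch pointer means the GGTT is
no longer valid, as the check in this patch implies.

```c
/*
 * Standalone mock, not actual xe driver code: every type and function
 * below is a hypothetical stand-in used only to illustrate the layering.
 */
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct mock_gt {
	bool ct_enabled;          /* stands in for xe_guc_ct_enabled() */
	bool submission_enabled;  /* stands in for submission_state.enabled */
	int invalidations_sent;   /* counts requests that would go over CTB */
};

struct mock_ggtt {
	void *scratch;            /* assumed NULL once the GGTT is torn down */
	struct mock_gt *gt;
};

/* GT layer: knows nothing about GGTT internals. */
static int mock_gt_tlb_invalidation_ggtt(struct mock_gt *gt)
{
	assert(gt != NULL);
	if (gt->ct_enabled && gt->submission_enabled)
		gt->invalidations_sent++;
	return 0;
}

/* GGTT layer: the only caller, so the validity check lives here. */
static int mock_ggtt_invalidate(struct mock_ggtt *ggtt)
{
	if (!ggtt->scratch)
		return 0;  /* GGTT already gone, nothing to invalidate */
	return mock_gt_tlb_invalidation_ggtt(ggtt->gt);
}
```

With the check kept at the GGTT layer, the GT-level function does not
need to look up the tile or the GGTT at all.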

> +	    xe_guc_ct_enabled(&gt->uc.guc.ct) &&
>  	    gt->uc.guc.submission_state.enabled) {
>  		struct xe_gt_tlb_invalidation_fence fence;
>  		int ret;