[PATCH] drm/xe: Improve VF provision stability with fault injection

Summers, Stuart stuart.summers at intel.com
Fri Jul 11 15:02:36 UTC 2025


On Fri, 2025-07-11 at 11:51 +0000, Satyanarayana K V P wrote:
> In the unlikely event of a PF malfunction or misconfiguration, a VF may
> receive an incomplete or invalid configuration, and it must be prepared
> to handle such cases without crashing.
> 
> When simulating errors with the kernel's fault injection framework,
> crashes were observed during device unbind. These crashes were primarily
> due to the use of the XE_BO_FLAG_GGTT_INVALIDATE flag when creating
> buffer objects.
> 
> The GGTT is invalidated using CTB, which is allocated with the
> XE_BO_FLAG_GGTT_INVALIDATE flag. However, the buffer object for CTB is
> freed before GGTT invalidation completes, leading to crashes.
> 
> Similarly, for buffer objects allocated in memirq_alloc_pages() and
> __xe_sa_bo_manager_init(), the CTB is already freed by the time GGTT
> invalidation occurs, resulting in system crashes.
> 
> To prevent these issues, the XE_BO_FLAG_GGTT_INVALIDATE flag is no longer
> used when creating buffer objects in memirq_alloc_pages(),
> __xe_sa_bo_manager_init() and xe_guc_ct_init().

OK, so the idea here is that these allocations happen once during driver
load (or after reset) and not again until the BO is completely
deallocated and reallocated. So we don't actually need to issue an
invalidation, since we always allocate the buffer from scratch, which
would presumably flush the TLBs at that time. Is that right?

Honestly this feels risky to me. Why can't we just check that the
buffer is still alive when doing the invalidation and if not, skip it?
That way we aren't breaking existing functionality and still handling
the error case.
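
To make the suggestion concrete, here is a minimal userspace sketch of
the "skip if the buffer is gone" idea. Every name in it (demo_ct,
demo_ggtt, demo_ggtt_invalidate, demo_ct_fini) is hypothetical and is
not an existing xe symbol. It also ignores locking and in-flight
requests, which a real change would have to handle.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the CTB-backed channel; not a real xe type. */
struct demo_ct {
	bool alive;           /* cleared before the backing buffer is freed */
	unsigned int sent;    /* invalidation requests actually issued */
};

/* Hypothetical stand-in for a GGTT that invalidates through the channel. */
struct demo_ggtt {
	struct demo_ct *ct;
	unsigned int skipped; /* invalidations skipped because the CT was gone */
};

/*
 * Issue an invalidation only while the channel is still alive.
 * Returns true if a request was sent, false if it was skipped.
 */
static bool demo_ggtt_invalidate(struct demo_ggtt *ggtt)
{
	assert(ggtt != NULL);

	if (!ggtt->ct || !ggtt->ct->alive) {
		ggtt->skipped++;
		return false;
	}

	ggtt->ct->sent++;
	return true;
}

/* Teardown marks the channel dead before its buffer would be released. */
static void demo_ct_fini(struct demo_ct *ct)
{
	assert(ct != NULL);
	ct->alive = false;
}
```

With something along these lines, the flag could stay on the three
allocations, and an invalidation requested after teardown would simply
be skipped.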

Thanks,
Stuart

> 
> Signed-off-by: Satyanarayana K V P <satyanarayana.k.v.p at intel.com>
> Cc: Michal Wajdeczko <michal.wajdeczko at intel.com>
> Cc: Matthew Brost <matthew.brost at intel.com>
> Cc: Matthew Auld <matthew.auld at intel.com>
> ---
>  drivers/gpu/drm/xe/xe_guc_ct.c | 1 -
>  drivers/gpu/drm/xe/xe_memirq.c | 1 -
>  drivers/gpu/drm/xe/xe_sa.c     | 1 -
>  3 files changed, 3 deletions(-)
> 
> diff --git a/drivers/gpu/drm/xe/xe_guc_ct.c b/drivers/gpu/drm/xe/xe_guc_ct.c
> index 23e8c155025e..f32103811d00 100644
> --- a/drivers/gpu/drm/xe/xe_guc_ct.c
> +++ b/drivers/gpu/drm/xe/xe_guc_ct.c
> @@ -258,7 +258,6 @@ int xe_guc_ct_init(struct xe_guc_ct *ct)
>         bo = xe_managed_bo_create_pin_map(xe, tile, guc_ct_size(),
>                                           XE_BO_FLAG_SYSTEM |
>                                           XE_BO_FLAG_GGTT |
> -                                         XE_BO_FLAG_GGTT_INVALIDATE |
>                                           XE_BO_FLAG_PINNED_NORESTORE);
>         if (IS_ERR(bo))
>                 return PTR_ERR(bo);
> diff --git a/drivers/gpu/drm/xe/xe_memirq.c b/drivers/gpu/drm/xe/xe_memirq.c
> index 49c45ec3e83c..678b78f42132 100644
> --- a/drivers/gpu/drm/xe/xe_memirq.c
> +++ b/drivers/gpu/drm/xe/xe_memirq.c
> @@ -180,7 +180,6 @@ static int memirq_alloc_pages(struct xe_memirq *memirq)
>         bo = xe_managed_bo_create_pin_map(xe, tile, bo_size,
>                                           XE_BO_FLAG_SYSTEM |
>                                           XE_BO_FLAG_GGTT |
> -                                         XE_BO_FLAG_GGTT_INVALIDATE |
>                                           XE_BO_FLAG_NEEDS_UC |
>                                           XE_BO_FLAG_NEEDS_CPU_ACCESS);
>         if (IS_ERR(bo)) {
> diff --git a/drivers/gpu/drm/xe/xe_sa.c b/drivers/gpu/drm/xe/xe_sa.c
> index 1d43e183ca21..e11ed0a1ed13 100644
> --- a/drivers/gpu/drm/xe/xe_sa.c
> +++ b/drivers/gpu/drm/xe/xe_sa.c
> @@ -60,7 +60,6 @@ struct xe_sa_manager *__xe_sa_bo_manager_init(struct xe_tile *tile, u32 size, u3
>         bo = xe_managed_bo_create_pin_map(xe, tile, size,
>                                           XE_BO_FLAG_VRAM_IF_DGFX(tile) |
>                                           XE_BO_FLAG_GGTT |
> -                                         XE_BO_FLAG_GGTT_INVALIDATE |
>                                           XE_BO_FLAG_PINNED_NORESTORE);
>         if (IS_ERR(bo)) {
>                 drm_err(&xe->drm, "Failed to prepare %uKiB BO for SA manager (%pe)\n",


