[PATCH v3] drm/xe: Improve VF provision stability with fault injection
Satyanarayana K V P
satyanarayana.k.v.p at intel.com
Wed Jul 23 13:10:08 UTC 2025
In the unlikely event of a PF malfunction or misconfiguration, a VF may
receive an incomplete or invalid configuration, and it must be prepared
to handle such cases without crashing.

When simulating errors with the kernel's fault injection framework,
crashes were observed during device unbind. These crashes were primarily
due to the use of the XE_BO_FLAG_GGTT_INVALIDATE flag when creating
buffer objects.

The GGTT is invalidated using the CTB, which is itself allocated with the
XE_BO_FLAG_GGTT_INVALIDATE flag. However, the buffer object for the CTB
is freed before GGTT invalidation completes, leading to crashes.

Similarly, for buffer objects allocated in memirq_alloc_pages() and
__xe_sa_bo_manager_init(), the CTB is already freed by the time GGTT
invalidation occurs, resulting in system crashes.

To prevent these issues, add a check for GGTT validity to the GGTT
invalidation path, so that no CTB message is sent to the GuC when the
GGTT is not valid.
Signed-off-by: Satyanarayana K V P <satyanarayana.k.v.p at intel.com>
Cc: Michal Wajdeczko <michal.wajdeczko at intel.com>
Cc: Matthew Brost <matthew.brost at intel.com>
Cc: Matthew Auld <matthew.auld at intel.com>
Cc: Summers Stuart <stuart.summers at intel.com>
---
V2 -> V3:
- Moved GGTT validity check to xe_ggtt_invalidate(). (Michal Wajdeczko)
V1 -> V2:
- An additional check for validity of GGTT is added in
xe_gt_tlb_invalidation_ggtt() instead of removing XE_BO_FLAG_GGTT_INVALIDATE
from memirq_alloc_pages(), __xe_sa_bo_manager_init() and xe_guc_ct_init().
(Summers Stuart)
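For illustration only, the guard described in the commit message can be
pictured with a small userspace model. Everything named mock_* below is
hypothetical and is not part of the xe driver. The sketch assumes, as the
check in the patch implies, that a NULL scratch pointer marks a GGTT that
is not usable, and it models the freed CTB as a NULL pointer.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the CTB; not the real xe structure. */
struct mock_ctb {
	bool alive;	/* false once the CTB buffer object has been freed */
	int sent;	/* number of invalidation requests sent so far */
};

/* Hypothetical stand-in for struct xe_ggtt. */
struct mock_ggtt {
	void *scratch;		/* NULL models a GGTT that is not valid */
	struct mock_ctb *ctb;	/* NULL once the CTB has been freed */
};

/* Returns true if an invalidation request was sent, false if skipped. */
static bool mock_ggtt_invalidate(struct mock_ggtt *ggtt)
{
	/* Same shape as the check added by the patch. */
	if (!ggtt->scratch)
		return false;

	/* Without the guard above, a freed CTB would be touched here. */
	assert(ggtt->ctb && ggtt->ctb->alive);
	ggtt->ctb->sent++;
	return true;
}

/* Models teardown: the CTB is freed and the GGTT is no longer valid. */
static void mock_ggtt_teardown(struct mock_ggtt *ggtt)
{
	if (ggtt->ctb)
		ggtt->ctb->alive = false;
	ggtt->ctb = NULL;
	ggtt->scratch = NULL;
}
```

Calling mock_ggtt_invalidate() on a live object sends one request. After
mock_ggtt_teardown(), the same call returns false and never dereferences
the freed CTB, which is the behaviour the added check is meant to provide.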
---
drivers/gpu/drm/xe/xe_ggtt.c | 10 ++++++++++
1 file changed, 10 insertions(+)
diff --git a/drivers/gpu/drm/xe/xe_ggtt.c b/drivers/gpu/drm/xe/xe_ggtt.c
index 29d4d3f51da1..b3005a462186 100644
--- a/drivers/gpu/drm/xe/xe_ggtt.c
+++ b/drivers/gpu/drm/xe/xe_ggtt.c
@@ -447,6 +447,16 @@ static void xe_ggtt_invalidate(struct xe_ggtt *ggtt)
 {
 	struct xe_device *xe = tile_to_xe(ggtt->tile);
 
+	/*
+	 * To avoid crashes during BO unbind operations, always verify the
+	 * validity of the GGTT before attempting invalidation. This check is
+	 * particularly crucial when handling BOs created with the
+	 * XE_BO_FLAG_GGTT_INVALIDATE flag, as improper invalidation of an
+	 * invalid GGTT can cause a crash.
+	 */
+	if (!ggtt->scratch)
+		return;
+
 	/*
 	 * XXX: Barrier for GGTT pages. Unsure exactly why this required but
 	 * without this LNL is having issues with the GuC reading scratch page
--
2.43.0