[PATCH] drm/xe: Flush delayed frees and retry on user object allocation failure
Tvrtko Ursulin
tvrtko.ursulin at igalia.com
Fri Mar 14 14:57:50 UTC 2025
If a userspace workload is operating near the limit of available memory and
it moves from one large working set to another, in other words frees some
large buffers and immediately allocates some new ones, the TTM eviction
attempt during new resource allocation may not be effective due to the
combination of relying on trylock and the fact that the released buffers
might have ended up on the delayed release path.
From a userspace point of view this shows up as sporadic
VK_ERROR_OUT_OF_DEVICE_MEMORY, i.e. sometimes the application will work and
sometimes it will fail, even if it does exactly the same thing on an
otherwise idle system.
Good examples are two tests from the VK CTS suite:
- dEQP-VK.pipeline.monolithic.render_to_image.core.*
- dEQP-VK.memory.allocation.random*
To improve this we can flush the TTM delayed object free workqueue and
retry when encountering ENOMEM, which so far looks like a significant
improvement in mean-time-to-failure.
Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin at igalia.com>
Cc: Lucas De Marchi <lucas.demarchi at intel.com>
Cc: Rodrigo Vivi <rodrigo.vivi at intel.com>
Cc: Thomas Hellström <thomas.hellstrom at linux.intel.com>
---
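For illustration only, the failure mode and the workaround can be modelled
in a small userspace sketch. Everything here (struct toy_pool and the toy_*
helpers) is hypothetical and is not part of the xe or TTM API. It assumes a
pool where freed memory stays accounted until a deferred release step runs,
which stands in for the delayed ttm_bo release path.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical toy pool, standing in for device memory. */
struct toy_pool {
	size_t capacity;	/* total bytes in the pool */
	size_t used;		/* bytes still accounted as in use */
	size_t pending;		/* bytes freed but not yet released */
};

/* Freeing is deferred: the memory stays accounted until the flush. */
static void toy_free_delayed(struct toy_pool *p, size_t size)
{
	p->pending += size;
}

/* Stand-in for flushing the delayed release workqueue. */
static void toy_flush_delayed(struct toy_pool *p)
{
	assert(p->pending <= p->used);
	p->used -= p->pending;
	p->pending = 0;
}

/* Single allocation attempt, fails if the accounted usage is too high. */
static int toy_alloc(struct toy_pool *p, size_t size)
{
	if (size > p->capacity - p->used)
		return -ENOMEM;

	p->used += size;
	return 0;
}

/* Retry on -ENOMEM after processing pending delayed releases. */
static int toy_alloc_retry(struct toy_pool *p, size_t size, unsigned int tries)
{
	int err = -ENOMEM;

	while (tries--) {
		err = toy_alloc(p, size);
		if (err != -ENOMEM)
			break;

		toy_flush_delayed(p);
	}

	return err;
}
```

In this model a plain toy_alloc() right after toy_free_delayed() fails even
though the memory is logically free, while toy_alloc_retry() succeeds on the
second attempt. A request that genuinely does not fit still returns -ENOMEM
once the tries are exhausted.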
drivers/gpu/drm/xe/xe_bo.c | 22 +++++++++++++++++++---
1 file changed, 19 insertions(+), 3 deletions(-)
diff --git a/drivers/gpu/drm/xe/xe_bo.c b/drivers/gpu/drm/xe/xe_bo.c
index 64f9c936eea0..53f45c766e59 100644
--- a/drivers/gpu/drm/xe/xe_bo.c
+++ b/drivers/gpu/drm/xe/xe_bo.c
@@ -1900,9 +1900,25 @@ struct xe_bo *xe_bo_create_user(struct xe_device *xe, struct xe_tile *tile,
u16 cpu_caching,
u32 flags)
{
- struct xe_bo *bo = __xe_bo_create_locked(xe, tile, vm, size, 0, ~0ULL,
- cpu_caching, ttm_bo_type_device,
- flags | XE_BO_FLAG_USER, 0);
+ unsigned int retry = 3;
+ struct xe_bo *bo;
+
+ while (retry--) {
+ bo = __xe_bo_create_locked(xe, tile, vm, size, 0, ~0ULL,
+ cpu_caching, ttm_bo_type_device,
+ flags | XE_BO_FLAG_USER, 0);
+ if (!IS_ERR(bo) || PTR_ERR(bo) != -ENOMEM)
+ break;
+
+ /*
+ * TTM eviction may sporadically fail due to reliance on trylock
+ * and delayed ttm_bo deletion causing trylock to fail. Work
+ * around it by retrying after currently pending delayed
+ * releases have been processed.
+ */
+ flush_workqueue(xe->ttm.wq);
+ }
+
if (!IS_ERR(bo))
xe_bo_unlock_vm_held(bo);
--
2.48.0