[PATCH] drm/xe: Flush delayed frees and retry on user object allocation failure
Tvrtko Ursulin
tvrtko.ursulin at igalia.com
Mon Mar 17 16:25:10 UTC 2025
On 17/03/2025 15:25, Tvrtko Ursulin wrote:
>
> On 17/03/2025 13:02, Thomas Hellström wrote:
>> Hi, Tvrtko,
>>
>> On Fri, 2025-03-14 at 14:57 +0000, Tvrtko Ursulin wrote:
>>> If a userspace workload is operating near the limit of available
>>> memory and moves from one large working set to another, in other
>>> words frees some large buffers and immediately allocates new ones,
>>> the TTM eviction attempt during the new resource allocation may not
>>> be effective due to the combination of relying on trylock and the
>>> fact that the released buffers might have ended up on the delayed
>>> release path.
>>>
>>> From the userspace point of view this manifests as sporadic
>>> VK_ERROR_OUT_OF_DEVICE_MEMORY, i.e. sometimes the application will
>>> work and sometimes it will fail, even though it does exactly the same
>>> thing on an otherwise idle system.
>>>
>>> Good examples are two tests from the VK CTS suite:
>>>
>>> - dEQP-VK.pipeline.monolithic.render_to_image.core.*
>>> - dEQP-VK.memory.allocation.random*
>>>
>>> To improve this we can flush the TTM delayed object free workqueue
>>> and retry when encountering ENOMEM, which so far looks like a
>>> significant improvement in mean-time-to-failure.
>>>
>>> Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin at igalia.com>
>>> Cc: Lucas De Marchi <lucas.demarchi at intel.com>
>>> Cc: Rodrigo Vivi <rodrigo.vivi at intel.com>
>>> Cc: Thomas Hellström <thomas.hellstrom at linux.intel.com>
>>> ---
>>> drivers/gpu/drm/xe/xe_bo.c | 22 +++++++++++++++++++---
>>> 1 file changed, 19 insertions(+), 3 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/xe/xe_bo.c b/drivers/gpu/drm/xe/xe_bo.c
>>> index 64f9c936eea0..53f45c766e59 100644
>>> --- a/drivers/gpu/drm/xe/xe_bo.c
>>> +++ b/drivers/gpu/drm/xe/xe_bo.c
>>> @@ -1900,9 +1900,25 @@ struct xe_bo *xe_bo_create_user(struct xe_device *xe, struct xe_tile *tile,
>>>  				u16 cpu_caching,
>>>  				u32 flags)
>>>  {
>>> -	struct xe_bo *bo = __xe_bo_create_locked(xe, tile, vm, size, 0, ~0ULL,
>>> -						 cpu_caching, ttm_bo_type_device,
>>> -						 flags | XE_BO_FLAG_USER, 0);
>>> +	unsigned int retry = 3;
>>> +	struct xe_bo *bo;
>>> +
>>> +	while (retry--) {
>>> +		bo = __xe_bo_create_locked(xe, tile, vm, size, 0, ~0ULL,
>>> +					   cpu_caching, ttm_bo_type_device,
>>> +					   flags | XE_BO_FLAG_USER, 0);
>>> +		if (!IS_ERR(bo) || PTR_ERR(bo) != -ENOMEM)
>>> +			break;
>>> +
>>> +		/*
>>> +		 * TTM eviction may sporadically fail due to reliance on
>>> +		 * trylock and delayed ttm_bo deletion causing trylock to
>>> +		 * fail. Work around it by retrying after currently pending
>>> +		 * delayed releases have been processed.
>>> +		 */
>>> +		flush_workqueue(xe->ttm.wq);
>>> +	}
>>
>> We should use xe_vm_validate_should_retry() for this. (I think actually
>> I saw a patch for that floating around somewhere). IIRC the bos are
>> lockable also when in the delayed object free workqueue?
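>
> If I understand the suggestion, the caller side would then look roughly
> like the below. This is only a sketch: it assumes the helper's current
> (exec, err, ktime_t *end) style of interface, that passing no drm_exec
> context would be acceptable here, and that the helper would be made
> reachable from xe_bo.c:
>
>   ktime_t end = 0;
>   struct xe_bo *bo;
>
> retry:
>   bo = __xe_bo_create_locked(xe, tile, vm, size, 0, ~0ULL,
>                              cpu_caching, ttm_bo_type_device,
>                              flags | XE_BO_FLAG_USER, 0);
>   if (IS_ERR(bo) && xe_vm_validate_should_retry(NULL, PTR_ERR(bo), &end))
>       goto retry;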
>
> They appear mostly lockable while delayed delete is pending, apart from
> during the window in which the bo is being put on the queue, and while
> delayed delete is actually executing ttm_bo_cleanup_memtype_use() but
> before it has managed to report the freed up space. I think the sporadic
> -ENOMEM happens when bo creation races with one of those two windows.
>
> But I see now this landed very recently:
>
> commit 1d724a2f1b2c3f0cba4975784a808482e0631adf
> Author: Matthew Brost <matthew.brost at intel.com>
> Date: Wed Mar 5 17:26:26 2025 -0800
>
> drm/xe: Retry BO allocation
>
> And I was testing before that commit.
>
>> So could you try with that instead? If you think that doesn't help, we
>> should perhaps consider also flushing the ttm wq from within that
>> function instead.
>
> So I am pretty sure 1d724a2f1b2c3f0cba4975784a808482e0631adf will work
> around the issue but will double check just in case and report back.
FWIW VK CTS looks fine with only 1d724a2f1b2c3f0cba4975784a808482e0631adf.
Just as a note in case someone tries to repro, here is something I forgot
to mention in the commit message. It is not enough to just run the test
cases, since on Linux the CTS platform backend currently does not query
the available memory but has it hardcoded (as opposed to, for example, on
Android). To repro on Linux one needs to patch the test suite with
something like the below (the exact value also depends on the amount of
RAM on the test machine):
diff --git a/framework/common/tcuPlatform.cpp b/framework/common/tcuPlatform.cpp
index 1d0f5b06e5b8..f1245b460834 100644
--- a/framework/common/tcuPlatform.cpp
+++ b/framework/common/tcuPlatform.cpp
@@ -57,7 +57,7 @@ const vk::Platform &Platform::getVulkanPlatform(void) const
 void Platform::getMemoryLimits(PlatformMemoryLimits &limits) const
 {
     // Default values. Platforms can override these values.
-    limits.totalSystemMemory = 256 * 1024 * 1024;
+    limits.totalSystemMemory = 10UL * 256 * 1024 * 1024;
     limits.totalDeviceLocalMemory = 128 * 1024 * 1024;
     limits.deviceMemoryAllocationGranularity = 64 * 1024;
     limits.devicePageSize = 4096;
Regards,
Tvrtko
>
> Whether or not to put a flush inside xe_vm_validate_should_retry() feels
> a bit moot at the moment, given how weak the current implementation of
> that helper is. It would only matter if replacing the mdelay(20) with
> the flush were enough for all use cases. If it were not, then having
> both the mdelay(20), in combination with its up to one second retry
> limit, and a flush does not appear to gain much, neither in latency nor
> in robustness. Unless one would entertain the idea of a single flush
> taking more than a second in some ultra pessimistic scenario.
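>
> For illustration only, below is a rough sketch of the variant being
> discussed here, that is a retry helper which flushes the delayed-destroy
> workqueue instead of sleeping for a fixed 20ms. The helper name and the
> attempts counter are made up for the sketch; only the
> flush_workqueue(xe->ttm.wq) call comes from the patch above:
>
>   /* Hypothetical sketch, not existing xe code. */
>   static bool xe_bo_alloc_should_retry(struct xe_device *xe, int err,
>                                        unsigned int *attempts)
>   {
>       if (err != -ENOMEM || !*attempts)
>           return false;
>
>       --*attempts;
>
>       /*
>        * Process the currently pending delayed bo releases so that the
>        * next eviction attempt can see the freed memory, instead of
>        * waiting for an arbitrary fixed delay.
>        */
>       flush_workqueue(xe->ttm.wq);
>
>       return true;
>   }
>
> Callers would then loop on the allocation while the helper returns true,
> in the same way the open coded loop in the patch does.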
>
> Regards,
>
> Tvrtko
>
>>
>>> +
>>>  	if (!IS_ERR(bo))
>>>  		xe_bo_unlock_vm_held(bo);
>>
>