[PATCH] drm/xe: Flush delayed frees and retry on user object allocation failure
Tvrtko Ursulin
tvrtko.ursulin at igalia.com
Mon Mar 17 15:25:31 UTC 2025
On 17/03/2025 13:02, Thomas Hellström wrote:
> Hi, Tvrtko,
>
> On Fri, 2025-03-14 at 14:57 +0000, Tvrtko Ursulin wrote:
>> If a userspace workload is operating near the limit of available
>> memory and it moves from one large working set to another, in other
>> words frees some large buffers and immediately allocates some new
>> ones, the TTM eviction attempt during new resource allocation may not
>> be effective due to the combination of relying on trylock and the
>> fact that released buffers might have ended up on the delayed release
>> path.
>>
>> From the userspace point of view this manifests as sporadic
>> VK_ERROR_OUT_OF_DEVICE_MEMORY, i.e. sometimes the application will
>> work and sometimes it will fail, even if it does exactly the same
>> thing on an otherwise idle system.
>>
>> Good examples are two tests from the VK CTS suite:
>>
>> - dEQP-VK.pipeline.monolithic.render_to_image.core.*
>> - dEQP-VK.memory.allocation.random*
>>
>> To improve this we can flush the TTM delayed object free workqueue
>> and retry when encountering ENOMEM, which so far looks like a
>> significant improvement in mean-time-to-failure.
>>
>> Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin at igalia.com>
>> Cc: Lucas De Marchi <lucas.demarchi at intel.com>
>> Cc: Rodrigo Vivi <rodrigo.vivi at intel.com>
>> Cc: Thomas Hellström <thomas.hellstrom at linux.intel.com>
>> ---
>> drivers/gpu/drm/xe/xe_bo.c | 22 +++++++++++++++++++---
>> 1 file changed, 19 insertions(+), 3 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/xe/xe_bo.c b/drivers/gpu/drm/xe/xe_bo.c
>> index 64f9c936eea0..53f45c766e59 100644
>> --- a/drivers/gpu/drm/xe/xe_bo.c
>> +++ b/drivers/gpu/drm/xe/xe_bo.c
>> @@ -1900,9 +1900,25 @@ struct xe_bo *xe_bo_create_user(struct xe_device *xe, struct xe_tile *tile,
>>  				u16 cpu_caching,
>>  				u32 flags)
>>  {
>> -	struct xe_bo *bo = __xe_bo_create_locked(xe, tile, vm, size, 0, ~0ULL,
>> -						 cpu_caching, ttm_bo_type_device,
>> -						 flags | XE_BO_FLAG_USER, 0);
>> +	unsigned int retry = 3;
>> +	struct xe_bo *bo;
>> +
>> +	while (retry--) {
>> +		bo = __xe_bo_create_locked(xe, tile, vm, size, 0, ~0ULL,
>> +					   cpu_caching, ttm_bo_type_device,
>> +					   flags | XE_BO_FLAG_USER, 0);
>> +		if (!IS_ERR(bo) || PTR_ERR(bo) != -ENOMEM)
>> +			break;
>> +
>> +		/*
>> +		 * TTM eviction may sporadically fail due reliance on trylock
>> +		 * and delayed ttm_bo deletion causing trylock to fail. Work
>> +		 * around it by retrying after currently pending delayed
>> +		 * releases have been processed.
>> +		 */
>> +		flush_workqueue(xe->ttm.wq);
>> +	}
>
> We should use xe_vm_validate_should_retry() for this. (I think actually
> I saw a patch for that floating around somewhere). IIRC the bos are
> lockable also when in the delayed object free workqueue?
They appear mostly lockable while delayed delete is pending, apart from
during the window when they are put on the queue, and while delayed
delete is actually executing ttm_bo_cleanup_memtype_use() but before it
has managed to report the freed-up space. I think the sporadic -ENOMEM
happens when bo creation races with one of those two windows.
But I see now this landed very recently:
commit 1d724a2f1b2c3f0cba4975784a808482e0631adf
Author: Matthew Brost <matthew.brost at intel.com>
Date: Wed Mar 5 17:26:26 2025 -0800
drm/xe: Retry BO allocation
And I was testing before that commit.
> So could you try with that instead? If you think that doesn't help, we
> should perhaps consider also flushing the ttm wq from within that
> function instead.
So I am pretty sure 1d724a2f1b2c3f0cba4975784a808482e0631adf will work
around the issue but will double check just in case and report back.
Whether or not to put a flush inside xe_vm_validate_should_retry() feels
a bit moot at the moment, given how weak its current implementation is.
It would only matter if replacing the mdelay(20) with the flush were
enough for all use cases. If it were not, then having both the
mdelay(20), in combination with its up-to-one-second retry limit, and a
flush does not appear to gain much in either latency or robustness.
Unless, that is, one entertains the idea of a single flush taking more
than a second in some ultra-pessimistic scenario.
Regards,
Tvrtko
>
>> +
>>  	if (!IS_ERR(bo))
>>  		xe_bo_unlock_vm_held(bo);
>>
>