[PATCH 1/2] drm/amdgpu: increase hmm range get pages timeout

Wed Dec 13 16:23:30 UTC 2023

On 2023-12-13 10:24, James Zhu wrote:
> Ping ...
>
> On 2023-12-08 18:01, James Zhu wrote:
>> When an application tries to allocate all system memory and causes
>> memory to swap out, hmm_range_fault needs more time to validate the
>> remaining pages for the allocation. To be safe, increase the timeout
>> to 1 second per 64MB of range.
>>
>> Signed-off-by: James Zhu <James.Zhu at amd.com>

This is not the first time we're increasing this timeout. Eventually
we should get rid of it and find a way to make this work reliably
without a timeout. There can always be situations where faults take
longer, and we should not fail randomly in those cases.

There are also some FIXMEs in this code that should be addressed at the 
same time.

That said, as a short-term fix, this patch is

Acked-by: Felix Kuehling <Felix.Kuehling at amd.com>

>> ---
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_hmm.c | 4 ++--
>>   1 file changed, 2 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_hmm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_hmm.c
>> index 081267161d40..b24eb5821fd1 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_hmm.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_hmm.c
>> @@ -190,8 +190,8 @@ int amdgpu_hmm_range_get_pages(struct mmu_interval_notifier *notifier,
>>           pr_debug("hmm range: start = 0x%lx, end = 0x%lx",
>>               hmm_range->start, hmm_range->end);
>> -        /* Assuming 128MB takes maximum 1 second to fault page address */
>> -        timeout = max((hmm_range->end - hmm_range->start) >> 27, 1UL);
>> +        /* Assuming 64MB takes maximum 1 second to fault page address */
>> +        timeout = max((hmm_range->end - hmm_range->start) >> 26, 1UL);
>>           timeout *= HMM_RANGE_DEFAULT_TIMEOUT;
>>           timeout = jiffies + msecs_to_jiffies(timeout);
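
For reference, the arithmetic in the hunk above can be tried out in user
space. This is only a rough sketch, not driver code: hmm_timeout_ms() is a
hypothetical helper, and HMM_RANGE_DEFAULT_TIMEOUT_MS is assumed to be 1000
(milliseconds), which is what the "1 second" wording in the comment and the
msecs_to_jiffies() call suggest. The shift is a parameter so the old (27)
and new (26) values can be compared.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: one timeout unit is 1000 ms, matching the "1 second" comment. */
#define HMM_RANGE_DEFAULT_TIMEOUT_MS 1000ULL

/*
 * Hypothetical helper mirroring the patched expression:
 *   timeout = max((end - start) >> shift, 1UL) * HMM_RANGE_DEFAULT_TIMEOUT;
 * shift = 27 is the old 128MB granularity, shift = 26 the new 64MB one.
 * Returns the relative timeout in milliseconds (no jiffies conversion here).
 */
static uint64_t hmm_timeout_ms(uint64_t start, uint64_t end, unsigned int shift)
{
    assert(end >= start);

    /* The shift truncates, so a partial chunk does not add time. */
    uint64_t units = (end - start) >> shift;

    /* Small ranges still get one full unit, like max(..., 1UL). */
    if (units < 1)
        units = 1;

    return units * HMM_RANGE_DEFAULT_TIMEOUT_MS;
}
```

With these assumptions a 1GB range gets 8 seconds with the old shift and 16
seconds with the new one. Any range below 128MB still gets a single second
after the patch, because the shift rounds down and the max() clamps to one
unit.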