[PATCH v5 5/6] drm/amdkfd: Increase KFD bo restore wait time

Felix Kuehling felix.kuehling at amd.com
Tue Apr 23 22:06:31 UTC 2024


On 2024-04-23 11:28, Philip Yang wrote:
> When TTM allocates contiguous VRAM for a large RDMA buffer, it may take
> more than 1 second to evict BOs. Because the KFD restore BO worker
> reserves all KFD BOs, TTM cannot take the locks of the remaining KFD BOs
> to evict them, which causes TTM to fail to allocate contiguous VRAM.
>
> Increase the KFD restore BO wait time to 2 seconds, long enough for the
> RDMA BO pin to allocate the contiguous VRAM.

Two seconds is a very long time for the GPU to sit idle whenever
memory gets evicted. Maybe we need to look for a solution where the
restore gets scheduled in response to a fence that signals when the
migration completes.

With the most recent changes I made to the eviction fence handling, I
think we can decouple the scheduling of the restore work from the evict
work. So we could schedule the delayed restore worker in a fence
callback set up in amdgpu_bo_move or somewhere around there, and keep a
short delay that starts counting at the end of the eviction move blit.
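
To illustrate the idea, here is a rough userspace model in plain C, not
kernel code. Every name in it (model_fence, model_process,
schedule_restore_cb, RESTORE_DELAY_MS) is hypothetical and only stands in
for the real fence and delayed-work machinery, and the 100 ms value is an
assumption. It only shows that the restore deadline is computed from the
time the move fence signals, not from the time of the eviction.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model; none of these names are real kernel or KFD APIs. */

/* Assumed short delay, counted from the end of the eviction move. */
#define RESTORE_DELAY_MS 100

struct model_fence;
typedef void (*model_fence_cb)(struct model_fence *f, void *data,
			       uint64_t now_ms);

/* Stand-in for the fence attached to the eviction move blit. */
struct model_fence {
	bool signaled;
	model_fence_cb cb;
	void *cb_data;
};

/* Stand-in for per-process restore state. */
struct model_process {
	bool restore_pending;
	uint64_t restore_due_ms;
};

/* Fence callback: arm the delayed restore relative to the signal time. */
static void schedule_restore_cb(struct model_fence *f, void *data,
				uint64_t now_ms)
{
	struct model_process *p = data;

	(void)f;
	p->restore_pending = true;
	p->restore_due_ms = now_ms + RESTORE_DELAY_MS;
}

/* Register a single callback on a fence that has not signaled yet. */
static void model_fence_add_callback(struct model_fence *f,
				     model_fence_cb cb, void *data)
{
	assert(f != NULL && !f->signaled);
	f->cb = cb;
	f->cb_data = data;
}

/* Signal the fence once (the move finished) and run its callback. */
static void model_fence_signal(struct model_fence *f, uint64_t now_ms)
{
	if (f->signaled)
		return;
	f->signaled = true;
	if (f->cb)
		f->cb(f, f->cb_data, now_ms);
}

/* True once the restore is armed and its delay has elapsed. */
static bool model_restore_ready(const struct model_process *p,
				uint64_t now_ms)
{
	return p->restore_pending && now_ms >= p->restore_due_ms;
}
```

In this model, no matter how long the move takes, the restore never
becomes ready before the fence signals, and after that it only waits for
the short delay.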

Regards,
   Felix


>
> Signed-off-by: Philip Yang <Philip.Yang at amd.com>
> ---
>   drivers/gpu/drm/amd/amdkfd/kfd_priv.h | 2 +-
>   1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_priv.h b/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
> index a81ef232fdef..c205e2d3acf9 100644
> --- a/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
> @@ -698,7 +698,7 @@ struct qcm_process_device {
>   /* KFD Memory Eviction */
>   
>   /* Approx. wait time before attempting to restore evicted BOs */
> -#define PROCESS_RESTORE_TIME_MS 100
> +#define PROCESS_RESTORE_TIME_MS 2000
>   /* Approx. back off time if restore fails due to lack of memory */
>   #define PROCESS_BACK_OFF_TIME_MS 100
>   /* Approx. time before evicting the process again */

