[PATCH v5 5/6] drm/amdkfd: Increase KFD bo restore wait time
Felix Kuehling
felix.kuehling at amd.com
Tue Apr 23 22:06:31 UTC 2024
On 2024-04-23 11:28, Philip Yang wrote:
> Allocating contiguous VRAM through TTM may take more than 1 second to evict
> BOs for a larger RDMA buffer. Because the KFD restore BO worker reserves all
> KFD BOs, TTM cannot take the locks of the remaining KFD BOs to evict them,
> which causes TTM to fail to allocate contiguous VRAM.
>
> Increase the KFD restore BO wait time to 2 seconds, long enough for the
> RDMA BO pin to allocate the contiguous VRAM.

Two seconds is a very long time for the GPU to sit idle whenever
memory gets evicted. Maybe we need to look for a solution where the
restore gets scheduled in response to a fence when the migration completes.

With the most recent changes I made to the eviction fence handling, I
think we can decouple the scheduling of the restore work from the evict
work. So we could schedule the delayed restore worker in a fence
callback set up in amdgpu_bo_move or somewhere around there, and keep a
short delay that starts counting at the end of the eviction move blit.
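
To make the trade-off concrete, here is a rough userspace C model of the
two timing policies. It is only a sketch, not kernel code and not the
driver's implementation: struct restore_plan, plan_fixed_delay and
plan_fence_delay are hypothetical names, and the move durations are made-up
inputs. Only the 100 ms and 2000 ms delays come from the patch.

```c
/* Hypothetical model: times are in ms, measured from the start of eviction. */
struct restore_plan {
    unsigned int restore_at_ms; /* when the restore worker would run */
    int restore_too_early;      /* 1 if it runs before the move has finished */
};

/*
 * Fixed delay: the timer starts when the eviction starts, so it has to be
 * longer than the slowest expected move, and every eviction pays for it.
 */
struct restore_plan plan_fixed_delay(unsigned int move_ms, unsigned int delay_ms)
{
    struct restore_plan p = { delay_ms, delay_ms < move_ms };
    return p;
}

/*
 * Fence-driven: the timer starts when the move completes (assumed here to
 * be signalled by a fence), so a short delay is enough for any move length.
 */
struct restore_plan plan_fence_delay(unsigned int move_ms, unsigned int delay_ms)
{
    struct restore_plan p = { move_ms + delay_ms, 0 };
    return p;
}
```

In this model, a 50 ms move with the fixed 2000 ms delay restores at 2000 ms,
while the fence-driven policy with a 100 ms delay restores at 150 ms. A
1200 ms move with the old fixed 100 ms delay restores before the move is
done, whereas the fence-driven policy restores at 1300 ms.
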
Regards,
Felix
>
> Signed-off-by: Philip Yang <Philip.Yang at amd.com>
> ---
> drivers/gpu/drm/amd/amdkfd/kfd_priv.h | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_priv.h b/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
> index a81ef232fdef..c205e2d3acf9 100644
> --- a/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
> @@ -698,7 +698,7 @@ struct qcm_process_device {
> /* KFD Memory Eviction */
>
> /* Approx. wait time before attempting to restore evicted BOs */
> -#define PROCESS_RESTORE_TIME_MS 100
> +#define PROCESS_RESTORE_TIME_MS 2000
> /* Approx. back off time if restore fails due to lack of memory */
> #define PROCESS_BACK_OFF_TIME_MS 100
> /* Approx. time before evicting the process again */