[PATCH 1/2] amd/amdkfd: sync all devices to wait all processes being evicted

Chen, Xiaogang xiaogang.chen at amd.com
Wed Apr 3 01:31:14 UTC 2024


On 4/1/2024 4:53 PM, Zhigang Luo wrote:
> If more than one device is resetting in parallel, the first device
> calls kfd_suspend_all_processes() to evict all processes on all
> devices, and this call takes time to finish. The other devices start
> reset and recovery without waiting. If a process has not been evicted
> before recovery starts, it is restored, which then causes a page
> fault.
>
> Signed-off-by: Zhigang Luo <Zhigang.Luo at amd.com>
> Change-Id: Ib1eddb56b69ecd41fe703abd169944154f48b0cd
> ---
>   drivers/gpu/drm/amd/amdkfd/kfd_device.c | 2 +-
>   1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device.c b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
> index 041ec3de55e7..55f89c858c7a 100644
> --- a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
> @@ -969,11 +969,11 @@ void kgd2kfd_suspend(struct kfd_dev *kfd, bool run_pm)
>          if (!run_pm) {
>                  mutex_lock(&kfd_processes_mutex);
>                  count = ++kfd_locked;
> -               mutex_unlock(&kfd_processes_mutex);
>
>                  /* For first KFD device suspend all the KFD processes */
>                  if (count == 1)
>                          kfd_suspend_all_processes();
> +               mutex_unlock(&kfd_processes_mutex);
>          }

I do not understand why kfd_locked is used here. You want to evict all
processes when the first device gets suspended. kfd_locked indicates
whether all kfd driver functions are locked, which is not the same
thing as device suspend. That is not an issue with your patch, but I
think using a different flag to record device suspend would be better.
For example, if kfd_locked were already set for some other reason, we
would skip evicting processes here.
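
To illustrate, here is a rough userspace sketch with pthreads, not
driver code. The names suspend_count, lock_for_other_reason(),
device_suspend() and run_demo() are hypothetical and do not exist in
KFD; locked only stands in for kfd_locked. It assumes eviction is
keyed off a separate suspend counter and that the mutex is held
across the eviction, as in your patch, so a second device cannot
continue before eviction has finished.

```c
#include <assert.h>
#include <pthread.h>

/* All names below are hypothetical stand-ins, not real KFD symbols. */
static pthread_mutex_t processes_mutex = PTHREAD_MUTEX_INITIALIZER;
static int locked;        /* stands in for kfd_locked: general lock count */
static int suspend_count; /* hypothetical: devices currently suspended */
static int evictions;     /* how many times eviction actually ran */

/* Stand-in for kfd_suspend_all_processes(); caller holds the mutex. */
static void suspend_all_processes(void)
{
	evictions++;
}

/* Some unrelated path that takes the general lock (assumed to exist). */
static void lock_for_other_reason(void)
{
	pthread_mutex_lock(&processes_mutex);
	locked++;
	pthread_mutex_unlock(&processes_mutex);
}

/* Suspend path: evict on the first *suspend*, not the first lock. */
static void device_suspend(void)
{
	pthread_mutex_lock(&processes_mutex);
	locked++;
	if (++suspend_count == 1)
		suspend_all_processes();
	/* Unlock only after eviction, so other devices wait for it. */
	pthread_mutex_unlock(&processes_mutex);
}

static void *reset_thread(void *arg)
{
	int done;

	(void)arg;
	device_suspend();

	/* Whichever device gets here, eviction has already completed. */
	pthread_mutex_lock(&processes_mutex);
	done = evictions;
	pthread_mutex_unlock(&processes_mutex);
	assert(done == 1);
	return NULL;
}

/* Two devices reset in parallel while the general lock is already held. */
static int run_demo(void)
{
	pthread_t t[2];
	int i;

	lock_for_other_reason();
	for (i = 0; i < 2; i++)
		if (pthread_create(&t[i], NULL, reset_thread, NULL) != 0)
			return -1;
	for (i = 0; i < 2; i++)
		pthread_join(t[i], NULL);
	return evictions;
}
```

Calling run_demo() once from main() (built with -pthread) should
report a single eviction even though locked was already nonzero when
the first device suspended. With the check written as "locked == 1"
instead, the eviction would be skipped in that situation.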

Regards

Xiaogang

>          for (i = 0; i < kfd->num_nodes; i++) {
> --
> 2.25.1
>

