[PATCH 1/2] amd/amdkfd: sync all devices to wait all processes being evicted

Felix Kuehling felix.kuehling at amd.com
Tue Apr 2 22:32:56 UTC 2024


On 2024-04-01 17:53, Zhigang Luo wrote:
> If more than one device is doing reset in parallel, the first
> device calls kfd_suspend_all_processes() to evict all processes
> on all devices, and this call takes time to finish. The other devices
> start reset and recovery without waiting. If a process has not been
> evicted before recovery starts, it will be restored, which then causes
> a page fault.
>
> Signed-off-by: Zhigang Luo <Zhigang.Luo at amd.com>
> Change-Id: Ib1eddb56b69ecd41fe703abd169944154f48b0cd

Please remove the Change-Id: before you push. Other than that, this patch is


> ---
>   drivers/gpu/drm/amd/amdkfd/kfd_device.c | 2 +-
>   1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device.c b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
> index 041ec3de55e7..55f89c858c7a 100644
> --- a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
> @@ -969,11 +969,11 @@ void kgd2kfd_suspend(struct kfd_dev *kfd, bool run_pm)
>   	if (!run_pm) {
>   		mutex_lock(&kfd_processes_mutex);
>   		count = ++kfd_locked;
> -		mutex_unlock(&kfd_processes_mutex);
>   
>   		/* For first KFD device suspend all the KFD processes */
>   		if (count == 1)
>   			kfd_suspend_all_processes();

This could be simplified now. The variable "count" was only needed for 
the broken attempt to call suspend outside the lock. Now you can just do:

	mutex_lock(&kfd_processes_mutex);
	if (++kfd_locked == 1)
		kfd_suspend_all_processes();
	mutex_unlock(&kfd_processes_mutex);

To be consistent, we probably need to make a similar change in 
kgd2kfd_resume and run kfd_resume_all_processes under the lock as well. 
Otherwise there could be a race condition between suspend and resume.
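
To illustrate, here is a rough userspace model of the symmetric locking,
with a pthread mutex standing in for the kernel mutex. The names
model_suspend(), model_resume() and processes_evicted, as well as the
void stub bodies, are made up for this sketch. The real kgd2kfd_resume
has error handling and per-node work that is not shown, so treat this as
an assumption about the shape of the change, not a proposed patch:

```c
#include <pthread.h>

/* Userspace stand-ins for the KFD globals. The names mirror the driver,
 * but this is an illustrative model only, not kernel code. */
static pthread_mutex_t kfd_processes_mutex = PTHREAD_MUTEX_INITIALIZER;
static int kfd_locked;        /* number of devices currently suspended */
static int processes_evicted; /* hypothetical flag: 1 while evicted */

/* Stubs: the real functions evict/restore every KFD process. */
static void kfd_suspend_all_processes(void) { processes_evicted = 1; }
static void kfd_resume_all_processes(void)  { processes_evicted = 0; }

/* Suspend side: the first device evicts everything while holding the
 * lock, so a second device blocks here until eviction has finished. */
static void model_suspend(void)
{
	pthread_mutex_lock(&kfd_processes_mutex);
	if (++kfd_locked == 1)
		kfd_suspend_all_processes();
	pthread_mutex_unlock(&kfd_processes_mutex);
}

/* Resume side: the last device restores everything, also under the
 * lock, so it cannot interleave with a concurrent suspend. */
static void model_resume(void)
{
	pthread_mutex_lock(&kfd_processes_mutex);
	if (--kfd_locked == 0)
		kfd_resume_all_processes();
	pthread_mutex_unlock(&kfd_processes_mutex);
}
```

With both paths taking the same mutex around the counter update and the
suspend/resume call, a resume from one device can no longer run in the
middle of an eviction started by another.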

Regards,
   Felix


> +		mutex_unlock(&kfd_processes_mutex);
>   	}
>   
>   	for (i = 0; i < kfd->num_nodes; i++) {