[PATCH] drm/amdkfd: fix KFDSVMRangeTest.PartialUnmapSysMemTest fails

Zhang, Yifan Yifan1.Zhang at amd.com
Tue Aug 17 09:13:21 UTC 2021


[AMD Official Use Only]

Hi Felix, 

Thanks for comments. I will send Patch V2.

BRs,
Yifan

-----Original Message-----
From: Kuehling, Felix <Felix.Kuehling at amd.com> 
Sent: Tuesday, August 17, 2021 9:16 AM
To: Zhang, Yifan <Yifan1.Zhang at amd.com>; amd-gfx at lists.freedesktop.org
Subject: Re: [PATCH] drm/amdkfd: fix KFDSVMRangeTest.PartialUnmapSysMemTest fails


Am 2021-08-14 um 6:12 a.m. schrieb Yifan Zhang:
> [ RUN      ] KFDSVMRangeTest.PartialUnmapSysMemTest
> /home/yifan/brahma/libhsakmt/tests/kfdtest/src/KFDTestUtil.cpp:245: 
> Failure Value of: (hsaKmtAllocMemory(m_Node, m_Size, m_Flags, &m_pBuf))
>   Actual: 1
> Expected: HSAKMT_STATUS_SUCCESS
> Which is: 0
> /home/yifan/brahma/libhsakmt/tests/kfdtest/src/KFDTestUtil.cpp:248: 
> Failure Value of: (hsaKmtMapMemoryToGPUNodes(m_pBuf, m_Size, __null, mapFlags, 1, &m_Node))
>   Actual: 1
> Expected: HSAKMT_STATUS_SUCCESS
> Which is: 0
> /home/yifan/brahma/libhsakmt/tests/kfdtest/src/KFDTestUtil.cpp:306: 
> Failure
> Expected: ((void *)__null) != (ptr), actual: NULL vs NULL Segmentation 
> fault (core dumped)
> [          ] Profile: Full Test
> [          ] HW capabilities: 0x9
>
> kernel log:
>
> [  102.029150]  ret_from_fork+0x22/0x30 [  102.029158] ---[ end trace 
> 15c34e782714f9a3 ]--- [ 3613.603598] amdgpu: Address: 0x7f7149ccc000 
> already allocated by SVM [ 3613.610620] show_signal_msg: 27 callbacks 
> suppressed
>
> These is race with deferred actions from previous memory map changes 
> (e.g. munmap).Flush pending deffered work to avoid such case.
>
> Signed-off-by: Yifan Zhang <yifan1.zhang at amd.com>
> ---
>  drivers/gpu/drm/amd/amdkfd/kfd_chardev.c | 7 +++++++
>  1 file changed, 7 insertions(+)
>
> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c 
> b/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c
> index 3177c4a0e753..982bf538dc3d 100644
> --- a/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c
> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c
> @@ -1261,6 +1261,13 @@ static int kfd_ioctl_alloc_memory_of_gpu(struct file *filep,
>  		return -EINVAL;
>  
>  #if IS_ENABLED(CONFIG_HSA_AMD_SVM)
> +	/* Flush pending deferred work to avoid racing with deferred actions from
> +	 * previous memory map changes (e.g. munmap). Concurrent memory map changes
> +	 * can still race with get_attr because we don't hold the mmap lock. 
> +But that

This comment would need to be updated. This is not get_attr. Whether or not undefined behaviour is acceptable in this case is a different question from get_attr. In the get_attr case, a race is caused by a broken application and causes potentially incorrect results being reported to user mode. Garbage-in ==> garbage-out. No problem.

In the case of this race here, the cause is still a broken application.
But this check is here to catch broken applications and prevent them from mapping the same virtual address range with two different allocators. So I'd argue that the race condition is not acceptable here because it has consequences for VM mappings managed by the kernel mode driver.


> +	 * would be a race condition in the application anyway, and undefined
> +	 * behaviour is acceptable in that case.
> +	 */
> +	flush_work(&svms->deferred_list_work);
>  	mutex_lock(&svms->lock);

This still leaves a brief race. There is an easy way to fix that: Use svm_range_list_lock_and_flush_work like this:

	svm_range_list_lock_and_flush_work(svms, current->mm);
	mutex_lock(&svms->lock);
	mmap_write_unlock(current->mm);
	...

Regards,
  Felix


>  	if (interval_tree_iter_first(&svms->objects,
>  				     args->va_addr >> PAGE_SHIFT,


More information about the amd-gfx mailing list