[PATCH] drm/amdkfd: fix KFDSVMRangeTest.PartialUnmapSysMemTest fails
Felix Kuehling
felix.kuehling at amd.com
Tue Aug 17 01:15:46 UTC 2021
Am 2021-08-14 um 6:12 a.m. schrieb Yifan Zhang:
> [ RUN ] KFDSVMRangeTest.PartialUnmapSysMemTest
> /home/yifan/brahma/libhsakmt/tests/kfdtest/src/KFDTestUtil.cpp:245: Failure
> Value of: (hsaKmtAllocMemory(m_Node, m_Size, m_Flags, &m_pBuf))
> Actual: 1
> Expected: HSAKMT_STATUS_SUCCESS
> Which is: 0
> /home/yifan/brahma/libhsakmt/tests/kfdtest/src/KFDTestUtil.cpp:248: Failure
> Value of: (hsaKmtMapMemoryToGPUNodes(m_pBuf, m_Size, __null, mapFlags, 1, &m_Node))
> Actual: 1
> Expected: HSAKMT_STATUS_SUCCESS
> Which is: 0
> /home/yifan/brahma/libhsakmt/tests/kfdtest/src/KFDTestUtil.cpp:306: Failure
> Expected: ((void *)__null) != (ptr), actual: NULL vs NULL
> Segmentation fault (core dumped)
> [ ] Profile: Full Test
> [ ] HW capabilities: 0x9
>
> kernel log:
>
> [ 102.029150] ret_from_fork+0x22/0x30
> [ 102.029158] ---[ end trace 15c34e782714f9a3 ]---
> [ 3613.603598] amdgpu: Address: 0x7f7149ccc000 already allocated by SVM
> [ 3613.610620] show_signal_msg: 27 callbacks suppressed
>
> These is race with deferred actions from previous memory map
> changes (e.g. munmap).Flush pending deffered work to avoid such case.
>
> Signed-off-by: Yifan Zhang <yifan1.zhang at amd.com>
> ---
> drivers/gpu/drm/amd/amdkfd/kfd_chardev.c | 7 +++++++
> 1 file changed, 7 insertions(+)
>
> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c b/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c
> index 3177c4a0e753..982bf538dc3d 100644
> --- a/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c
> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c
> @@ -1261,6 +1261,13 @@ static int kfd_ioctl_alloc_memory_of_gpu(struct file *filep,
> return -EINVAL;
>
> #if IS_ENABLED(CONFIG_HSA_AMD_SVM)
> + /* Flush pending deferred work to avoid racing with deferred actions from
> + * previous memory map changes (e.g. munmap). Concurrent memory map changes
> + * can still race with get_attr because we don't hold the mmap lock. But that
This comment would need to be updated. This is not get_attr. Whether or
not undefined behaviour is acceptable in this case is a different
question from get_attr. In the get_attr case, a race is caused by a
broken application and causes potentially incorrect results being
reported to user mode. Garbage-in ==> garbage-out. No problem.
In the case of this race here, the cause is still a broken application.
But this check is here to catch broken applications and prevent them
from mapping the same virtual address range with two different
allocators. So I'd argue that the race condition is not acceptable here
because it has consequences for VM mappings managed by the kernel mode
driver.
> + * would be a race condition in the application anyway, and undefined
> + * behaviour is acceptable in that case.
> + */
> + flush_work(&svms->deferred_list_work);
> mutex_lock(&svms->lock);
This still leaves a brief race. There is an easy way to fix that: Use
svm_range_list_lock_and_flush_work like this:
svm_range_list_lock_and_flush_work(svms, current->mm);
mutex_lock(&svms->lock);
mmap_write_unlock(current->mm);
...
Regards,
Felix
> if (interval_tree_iter_first(&svms->objects,
> args->va_addr >> PAGE_SHIFT,
More information about the amd-gfx
mailing list