[PATCH] drm/amdkfd: fix random KFDSVMRangeTest.SetGetAttributesTest test failure

Felix Kuehling felix.kuehling at amd.com
Tue Aug 10 15:56:46 UTC 2021


On 2021-08-10 at 12:43 a.m., Yifan Zhang wrote:
> KFDSVMRangeTest.SetGetAttributesTest randomly fails in stress test.
>
> Note: Google Test filter = KFDSVMRangeTest.*
> [==========] Running 18 tests from 1 test case.
> [----------] Global test environment set-up.
> [----------] 18 tests from KFDSVMRangeTest
> [ RUN      ] KFDSVMRangeTest.BasicSystemMemTest
> [       OK ] KFDSVMRangeTest.BasicSystemMemTest (30 ms)
> [ RUN      ] KFDSVMRangeTest.SetGetAttributesTest
> [          ] Get default atrributes
> /home/yifan/brahma/libhsakmt/tests/kfdtest/src/KFDSVMRangeTest.cpp:154: Failure
> Value of: expectedDefaultResults[i]
>   Actual: 4294967295
> Expected: outputAttributes[i].value
> Which is: 0
> /home/yifan/brahma/libhsakmt/tests/kfdtest/src/KFDSVMRangeTest.cpp:154: Failure
> Value of: expectedDefaultResults[i]
>   Actual: 4294967295
> Expected: outputAttributes[i].value
> Which is: 0
> /home/yifan/brahma/libhsakmt/tests/kfdtest/src/KFDSVMRangeTest.cpp:152: Failure
> Value of: expectedDefaultResults[i]
>   Actual: 4
> Expected: outputAttributes[i].type
> Which is: 2
> [          ] Setting/Getting atrributes
> [  FAILED  ]
>
> The root cause is that the svm work queue has not finished when svm_range_get_attr is called, so
> some garbage svm interval tree data makes svm_range_get_attr return wrong results. Flush the work queue
> before iterating over the svm interval tree.
>
> Signed-off-by: Yifan Zhang <yifan1.zhang at amd.com>
> ---
>  drivers/gpu/drm/amd/amdkfd/kfd_svm.c | 3 +++
>  1 file changed, 3 insertions(+)
>
> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
> index f811a3a24cd2..192e9401bed5 100644
> --- a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
> @@ -3072,6 +3072,9 @@ svm_range_get_attr(struct kfd_process *p, uint64_t start, uint64_t size,
>  	pr_debug("svms 0x%p [0x%llx 0x%llx] nattr 0x%x\n", &p->svms, start,
>  		 start + size - 1, nattr);
>  
> +	/* flush pending deferred work */
> +	flush_work(&p->svms.deferred_list_work);
> +

There is still a race condition here. More work can be added to the
deferred_list_work after the flush call.
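
To make the window concrete, here is a minimal single-threaded userspace
model of it. This is only a sketch, not kernel code: struct model,
queue_deferred, flush_deferred, get_after_flush and late_notifier are all
hypothetical names, and the async_event hook merely stands in for
whatever asynchronous source queues work between the flush and the read.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_PENDING 8

/* Hypothetical stand-in for the range state plus its deferred list. */
struct model {
	int committed;            /* state a get-attribute style reader sees */
	int pending[MAX_PENDING]; /* updates queued but not yet applied */
	size_t npending;
};

/* Models an asynchronous source (e.g. an MMU notifier) queueing work. */
static void queue_deferred(struct model *m, int value)
{
	assert(m->npending < MAX_PENDING);
	m->pending[m->npending++] = value;
}

/* Models the flush: applies what is queued so far, nothing queued later. */
static void flush_deferred(struct model *m)
{
	for (size_t i = 0; i < m->npending; i++)
		m->committed = m->pending[i];
	m->npending = 0;
}

/* Reader that flushes first, then reads, as the patch does. */
static int get_after_flush(struct model *m, void (*async_event)(struct model *))
{
	flush_deferred(m);
	if (async_event)
		async_event(m); /* window between the flush and the read */
	return m->committed;
}

/* Hypothetical late event: queues one more update after the flush. */
static void late_notifier(struct model *m)
{
	queue_deferred(m, 42);
}
```

With late_notifier passed as the hook, the reader returns the old value
and one update is left pending, even though it flushed first.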

Work gets added to the deferred_list asynchronously, for example in MMU
notifiers. Trying to synchronize with asynchronous events is inherently
problematic. It appears that the test is making some assumptions about
things happening asynchronously (page faults or MMU notifiers) and
that's probably a problem with the test, not with the driver.

Alternatively, there may be a problem with a set-attribute call that
leaves some operations on the deferred list, which then results in
unexpected get-attribute results. If that's the problem, we may need to
add a flush call to the end of the set-attributes function.
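
As a rough illustration of that second possibility, the following
userspace sketch models a set call that defers part of its work, with
and without a flush before it returns. It is not the driver code:
struct attr_model, model_flush, model_set and model_get are hypothetical
names, and the real set-attributes path may behave differently.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model: one attribute with an applied value and one queued update. */
struct attr_model {
	int applied;     /* what a get call reports */
	int queued;      /* update sitting on the deferred list */
	bool has_queued;
};

/* Applies the queued update, if there is one. */
static void model_flush(struct attr_model *a)
{
	if (a->has_queued) {
		a->applied = a->queued;
		a->has_queued = false;
	}
}

/* A set call that defers its update and optionally flushes before returning. */
static void model_set(struct attr_model *a, int value, bool flush_at_end)
{
	assert(a != NULL);
	a->queued = value;
	a->has_queued = true;
	if (flush_at_end)
		model_flush(a);
}

/* A get call that only looks at the applied state. */
static int model_get(const struct attr_model *a)
{
	return a->applied;
}
```

Without the flush, a get that immediately follows the set still reports
the old value. With the flush at the end of the set, the caller's own
update is applied by the time the set returns.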

Can you provide more details about the exact sequence of set-attribute
and get-attribute calls that is causing the problem?

Regards,
  Felix


>  	mmap_read_lock(mm);
>  	r = svm_range_is_valid(p, start, size);
>  	mmap_read_unlock(mm);

