<!DOCTYPE html><html><head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
</head>
<body>
Am 20.12.23 um 17:58 schrieb Felix Kuehling:<br>
<blockquote type="cite" cite="mid:ef8c4273-d54d-42b6-ac60-fed5f8ed3848@amd.com">
<div class="moz-cite-prefix">On 2023-12-11 10:56, Felix Kuehling
wrote:<br>
</div>
<blockquote type="cite" cite="mid:7b67d4f1-cfcd-47a9-a80e-f4c1eee235a1@amd.com"> <br>
On 2023-12-08 05:11, Christian König wrote: <br>
<blockquote type="cite">Am 07.12.23 um 20:14 schrieb Felix
Kuehling: <br>
<blockquote type="cite"> <br>
On 2023-12-05 17:20, Felix Kuehling wrote: <br>
<blockquote type="cite">Properly mark kfd_process->ef as
__rcu and consistently access it with <br>
rcu_dereference_protected. <br>
<br>
Reported-by: kernel test robot <a class="moz-txt-link-rfc2396E" href="mailto:lkp@intel.com" moz-do-not-send="true"><lkp@intel.com></a>
<br>
Closes:
<a class="moz-txt-link-freetext" href="https://lore.kernel.org/oe-kbuild-all/202312052245.yFpBSgNH-lkp@intel.com/" moz-do-not-send="true">https://lore.kernel.org/oe-kbuild-all/202312052245.yFpBSgNH-lkp@intel.com/</a><br>
Signed-off-by: Felix Kuehling <a class="moz-txt-link-rfc2396E" href="mailto:Felix.Kuehling@amd.com" moz-do-not-send="true"><Felix.Kuehling@amd.com></a>
<br>
</blockquote>
<br>
ping. <br>
<br>
Christian, would you review this patch, please? <br>
</blockquote>
<br>
Looks a bit suspicious, especially the
rcu_dereference_protected() use. <br>
<br>
What is the static checker complaining about in the first
place? <br>
</blockquote>
From
<a class="moz-txt-link-freetext" href="https://lore.kernel.org/oe-kbuild-all/202312052245.yFpBSgNH-lkp@intel.com/" moz-do-not-send="true">https://lore.kernel.org/oe-kbuild-all/202312052245.yFpBSgNH-lkp@intel.com/</a>:<br>
<br>
<blockquote type="cite">
<blockquote type="cite">drivers/gpu/drm/amd/amdgpu/../amdkfd/kfd_process.c:1671:9:
sparse: sparse: incompatible types in comparison expression
(different address spaces): >>
drivers/gpu/drm/amd/amdgpu/../amdkfd/kfd_process.c:1671:9:
sparse: </blockquote>
</blockquote>
struct dma_fence [noderef] __rcu * >>
drivers/gpu/drm/amd/amdgpu/../amdkfd/kfd_process.c:1671:9:
sparse: struct dma_fence * ... >>
drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c:2765:36:
sparse: sparse: incompatible types in comparison expression
(different address spaces): >>
drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c:2765:36:
sparse: struct dma_fence [noderef] __rcu * >>
drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c:2765:36:
sparse: struct dma_fence * >>
drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c:2765:36:
sparse: sparse: incompatible types in comparison expression
(different address spaces): >>
drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c:2765:36:
sparse: struct dma_fence [noderef] __rcu * >>
drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c:2765:36:
sparse: struct dma_fence * <br>
<br>
As far as I can tell, the reason is, that I'm using
dma_fence_get_rcu_safe and rcu_replace_pointer to get and update
kfd_process->ef, without annotating the fence pointers with
__rcu. This patch fixes that by marking kfd_process->ef as an
__rcu pointer. The only place that was dereferencing it directly
was kfd_process_wq_release, where I added
rcu_dereference_protected. The condition I'm using here is, that
the process ref count is 0 at that point, which means nobody
else is referencing the process or this fence pointer at the
time. <br>
</blockquote>
<p>Hi Christian,</p>
<p>We discussed offline that you think rcu_dereference_protected
is not needed in the teardown function. After reading over
rcupdate.h, I think a simpler alternative would be to use
rcu_access_pointer, after a grace period to ensure there can be
no more readers. Based on this comment in rcupdate.h:</p>
<pre> * It is also permissible to use rcu_access_pointer() when read-side
* access to the pointer was removed at least one grace period ago, as is
* the case in the context of the RCU callback that is freeing up the data,
* or after a synchronize_rcu() returns. This can be useful when tearing
* down multi-linked structures after a grace period has elapsed. However,
* rcu_dereference_protected() is normally preferred for this use case.
</pre>
<p>The last sentence sounds like rcu_dereference_protected should
also be OK, though. Either way, it sounds like I need to add a
synchronize_rcu call in any case, before freeing the fence. Do
you agree with this proposal?</p>
</blockquote>
<br>
Yeah, completely agree. I think the reference to using
rcu_dereference_protected() is when you protect the pointer with
some lock, which isn't the case here.<br>
<br>
And the question is who is accessing this pointer? If you can
guarantee that there are no more readers you don't need an
synchronize_rcu(), if you can't then yes a grace period is necessary
here.<br>
<br>
Regards,<br>
Christian.<br>
<br>
<blockquote type="cite" cite="mid:ef8c4273-d54d-42b6-ac60-fed5f8ed3848@amd.com">
<p>Regards,<br>
Felix<br>
</p>
<p><br>
</p>
<blockquote type="cite" cite="mid:7b67d4f1-cfcd-47a9-a80e-f4c1eee235a1@amd.com"> <br>
Regards, <br>
Felix <br>
<br>
<br>
<blockquote type="cite"> <br>
Regards, <br>
Christian. <br>
<br>
<blockquote type="cite"> <br>
Thanks, <br>
Felix <br>
<br>
<br>
<br>
<blockquote type="cite">--- <br>
drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h | 2 +-
<br>
drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c | 4
++-- <br>
drivers/gpu/drm/amd/amdkfd/kfd_priv.h | 2 +-
<br>
drivers/gpu/drm/amd/amdkfd/kfd_process.c | 6
++++-- <br>
4 files changed, 8 insertions(+), 6 deletions(-) <br>
<br>
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h <br>
index f2e920734c98..20cb266dcedd 100644 <br>
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h <br>
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h <br>
@@ -314,7 +314,7 @@ void
amdgpu_amdkfd_gpuvm_unmap_gtt_bo_from_kernel(struct
kgd_mem *mem); <br>
int amdgpu_amdkfd_map_gtt_bo_to_gart(struct
amdgpu_device *adev, struct amdgpu_bo *bo); <br>
int amdgpu_amdkfd_gpuvm_restore_process_bos(void
*process_info, <br>
- struct dma_fence **ef); <br>
+ struct dma_fence __rcu **ef); <br>
int amdgpu_amdkfd_gpuvm_get_vm_fault_info(struct
amdgpu_device *adev, <br>
struct kfd_vm_fault_info
*info); <br>
int amdgpu_amdkfd_gpuvm_import_dmabuf_fd(struct
amdgpu_device *adev, int fd, <br>
diff --git
a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c <br>
index 7d91f99acb59..8ba6f6c8363d 100644 <br>
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c <br>
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c <br>
@@ -2806,7 +2806,7 @@ static void
amdgpu_amdkfd_restore_userptr_worker(struct work_struct
*work) <br>
put_task_struct(usertask); <br>
} <br>
-static void replace_eviction_fence(struct dma_fence
**ef, <br>
+static void replace_eviction_fence(struct dma_fence __rcu
**ef, <br>
struct dma_fence *new_ef) <br>
{ <br>
struct dma_fence *old_ef = rcu_replace_pointer(*ef,
new_ef, true <br>
@@ -2841,7 +2841,7 @@ static void
replace_eviction_fence(struct dma_fence **ef, <br>
* 7. Add fence to all PD and PT BOs. <br>
* 8. Unreserve all BOs <br>
*/ <br>
-int amdgpu_amdkfd_gpuvm_restore_process_bos(void *info,
struct dma_fence **ef) <br>
+int amdgpu_amdkfd_gpuvm_restore_process_bos(void *info,
struct dma_fence __rcu **ef) <br>
{ <br>
struct amdkfd_process_info *process_info = info; <br>
struct amdgpu_vm *peer_vm; <br>
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
b/drivers/gpu/drm/amd/amdkfd/kfd_priv.h <br>
index 45366b4ca976..5a24097a9f28 100644 <br>
--- a/drivers/gpu/drm/amd/amdkfd/kfd_priv.h <br>
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_priv.h <br>
@@ -917,7 +917,7 @@ struct kfd_process { <br>
* fence will be triggered during eviction and new
one will be created <br>
* during restore <br>
*/ <br>
- struct dma_fence *ef; <br>
+ struct dma_fence __rcu *ef; <br>
/* Work items for evicting and restoring BOs */ <br>
struct delayed_work eviction_work; <br>
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_process.c
b/drivers/gpu/drm/amd/amdkfd/kfd_process.c <br>
index 71df51fcc1b0..14b11d61f8dd 100644 <br>
--- a/drivers/gpu/drm/amd/amdkfd/kfd_process.c <br>
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_process.c <br>
@@ -1110,6 +1110,8 @@ static void
kfd_process_wq_release(struct work_struct *work) <br>
{ <br>
struct kfd_process *p = container_of(work, struct
kfd_process, <br>
release_work); <br>
+ struct dma_fence *ef =
rcu_dereference_protected(p->ef, <br>
+ kref_read(&p->ref) == 0); <br>
kfd_process_dequeue_from_all_devices(p); <br>
pqm_uninit(&p->pqm); <br>
@@ -1118,7 +1120,7 @@ static void
kfd_process_wq_release(struct work_struct *work) <br>
* destroyed. This allows any BOs to be freed
without <br>
* triggering pointless evictions or waiting for
fences. <br>
*/ <br>
- dma_fence_signal(p->ef); <br>
+ dma_fence_signal(ef); <br>
kfd_process_remove_sysfs(p); <br>
@@ -1127,7 +1129,7 @@ static void
kfd_process_wq_release(struct work_struct *work) <br>
svm_range_list_fini(p); <br>
kfd_process_destroy_pdds(p); <br>
- dma_fence_put(p->ef); <br>
+ dma_fence_put(ef); <br>
kfd_event_free_process(p); <br>
</blockquote>
</blockquote>
<br>
</blockquote>
<br>
</blockquote>
</blockquote>
<br>
</body>
</html>