<!DOCTYPE html><html><head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
  </head>
  <body>
    Am 20.12.23 um 17:58 schrieb Felix Kuehling:<br>
    <blockquote type="cite" cite="mid:ef8c4273-d54d-42b6-ac60-fed5f8ed3848@amd.com">
      
      <div class="moz-cite-prefix">On 2023-12-11 10:56, Felix Kuehling
        wrote:<br>
      </div>
      <blockquote type="cite" cite="mid:7b67d4f1-cfcd-47a9-a80e-f4c1eee235a1@amd.com"> <br>
        On 2023-12-08 05:11, Christian König wrote: <br>
        <blockquote type="cite">Am 07.12.23 um 20:14 schrieb Felix
          Kuehling: <br>
          <blockquote type="cite"> <br>
            On 2023-12-05 17:20, Felix Kuehling wrote: <br>
            <blockquote type="cite">Properly mark kfd_process->ef as
              __rcu and consistently access it with <br>
              rcu_dereference_protected. <br>
              <br>
              Reported-by: kernel test robot <a class="moz-txt-link-rfc2396E" href="mailto:lkp@intel.com" moz-do-not-send="true"><lkp@intel.com></a>
              <br>
              Closes:
              <a class="moz-txt-link-freetext" href="https://lore.kernel.org/oe-kbuild-all/202312052245.yFpBSgNH-lkp@intel.com/" moz-do-not-send="true">https://lore.kernel.org/oe-kbuild-all/202312052245.yFpBSgNH-lkp@intel.com/</a><br>
              Signed-off-by: Felix Kuehling <a class="moz-txt-link-rfc2396E" href="mailto:Felix.Kuehling@amd.com" moz-do-not-send="true"><Felix.Kuehling@amd.com></a>
              <br>
            </blockquote>
            <br>
            ping. <br>
            <br>
            Christian, would you review this patch, please? <br>
          </blockquote>
          <br>
          Looks a bit suspicious, especially the
          rcu_dereference_protected() use. <br>
          <br>
          What is the static checker complaining about in the first
          place? <br>
        </blockquote>
        From
        <a class="moz-txt-link-freetext" href="https://lore.kernel.org/oe-kbuild-all/202312052245.yFpBSgNH-lkp@intel.com/" moz-do-not-send="true">https://lore.kernel.org/oe-kbuild-all/202312052245.yFpBSgNH-lkp@intel.com/</a>:<br>
        <br>
        <blockquote type="cite">
          <blockquote type="cite">drivers/gpu/drm/amd/amdgpu/../amdkfd/kfd_process.c:1671:9:
            sparse: sparse: incompatible types in comparison expression
            (different address spaces):  >>
            drivers/gpu/drm/amd/amdgpu/../amdkfd/kfd_process.c:1671:9:
            sparse: </blockquote>
        </blockquote>
        struct dma_fence [noderef] __rcu * >>
        drivers/gpu/drm/amd/amdgpu/../amdkfd/kfd_process.c:1671:9:
        sparse: struct dma_fence * ... >>
        drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c:2765:36:
        sparse: sparse: incompatible types in comparison expression
        (different address spaces): >>
        drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c:2765:36:
        sparse: struct dma_fence [noderef] __rcu * >>
        drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c:2765:36:
        sparse: struct dma_fence * >>
        drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c:2765:36:
        sparse: sparse: incompatible types in comparison expression
        (different address spaces): >>
        drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c:2765:36:
        sparse: struct dma_fence [noderef] __rcu * >>
        drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c:2765:36:
        sparse: struct dma_fence * <br>
        <br>
        As far as I can tell, the reason is, that I'm using
        dma_fence_get_rcu_safe and rcu_replace_pointer to get and update
        kfd_process->ef, without annotating the fence pointers with
        __rcu. This patch fixes that by marking kfd_process->ef as an
        __rcu pointer. The only place that was dereferencing it directly
        was kfd_process_wq_release, where I added
        rcu_dereference_protected. The condition I'm using here is, that
        the process ref count is 0 at that point, which means nobody
        else is referencing the process or this fence pointer at the
        time. <br>
      </blockquote>
      <p>Hi Christian,</p>
      <p>We discussed offline that you think rcu_dereference_protected
        is not needed in the teardown function. After reading over
        rcupdate.h, I think a simpler alternative would be to use
        rcu_access_pointer, after a grace period to ensure there can be
        no more readers. Based on this comment in rcupdate.h:</p>
      <pre> * It is also permissible to use rcu_access_pointer() when read-side
 * access to the pointer was removed at least one grace period ago, as is
 * the case in the context of the RCU callback that is freeing up the data,
 * or after a synchronize_rcu() returns.  This can be useful when tearing
 * down multi-linked structures after a grace period has elapsed.  However,
 * rcu_dereference_protected() is normally preferred for this use case.
</pre>
      <p>The last sentence sounds like rcu_dereference_protected should
        also be OK, though. Either way, it sounds like I need to add a
        synchronize_rcu call in any case, before freeing the fence. Do
        you agree with this proposal?</p>
    </blockquote>
    <br>
    Yeah, completely agree. I think the reference to using
    rcu_dereference_protected() is when you protect the pointer with
    some lock, which isn't the case here.<br>
    <br>
    And the question is who is accessing this pointer? If you can
    guarantee that there are no more readers you don't need an
    synchronize_rcu(), if you can't then yes a grace period is necessary
    here.<br>
    <br>
    Regards,<br>
    Christian.<br>
    <br>
    <blockquote type="cite" cite="mid:ef8c4273-d54d-42b6-ac60-fed5f8ed3848@amd.com">
      <p>Regards,<br>
          Felix<br>
      </p>
      <p><br>
      </p>
      <blockquote type="cite" cite="mid:7b67d4f1-cfcd-47a9-a80e-f4c1eee235a1@amd.com"> <br>
        Regards, <br>
          Felix <br>
        <br>
        <br>
        <blockquote type="cite"> <br>
          Regards, <br>
          Christian. <br>
          <br>
          <blockquote type="cite"> <br>
            Thanks, <br>
              Felix <br>
            <br>
            <br>
            <br>
            <blockquote type="cite">--- <br>
                drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h       | 2 +-
              <br>
                drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c | 4
              ++-- <br>
                drivers/gpu/drm/amd/amdkfd/kfd_priv.h            | 2 +-
              <br>
                drivers/gpu/drm/amd/amdkfd/kfd_process.c         | 6
              ++++-- <br>
                4 files changed, 8 insertions(+), 6 deletions(-) <br>
              <br>
              diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
              b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h <br>
              index f2e920734c98..20cb266dcedd 100644 <br>
              --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h <br>
              +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h <br>
              @@ -314,7 +314,7 @@ void
              amdgpu_amdkfd_gpuvm_unmap_gtt_bo_from_kernel(struct
              kgd_mem *mem); <br>
                int amdgpu_amdkfd_map_gtt_bo_to_gart(struct
              amdgpu_device *adev, struct amdgpu_bo *bo); <br>
                  int amdgpu_amdkfd_gpuvm_restore_process_bos(void
              *process_info, <br>
              -                        struct dma_fence **ef); <br>
              +                        struct dma_fence __rcu **ef); <br>
                int amdgpu_amdkfd_gpuvm_get_vm_fault_info(struct
              amdgpu_device *adev, <br>
                                          struct kfd_vm_fault_info
              *info); <br>
                int amdgpu_amdkfd_gpuvm_import_dmabuf_fd(struct
              amdgpu_device *adev, int fd, <br>
              diff --git
              a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
              b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c <br>
              index 7d91f99acb59..8ba6f6c8363d 100644 <br>
              --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c <br>
              +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c <br>
              @@ -2806,7 +2806,7 @@ static void
              amdgpu_amdkfd_restore_userptr_worker(struct work_struct
              *work) <br>
                    put_task_struct(usertask); <br>
                } <br>
                -static void replace_eviction_fence(struct dma_fence
              **ef, <br>
              +static void replace_eviction_fence(struct dma_fence __rcu
              **ef, <br>
                                   struct dma_fence *new_ef) <br>
                { <br>
                    struct dma_fence *old_ef = rcu_replace_pointer(*ef,
              new_ef, true <br>
              @@ -2841,7 +2841,7 @@ static void
              replace_eviction_fence(struct dma_fence **ef, <br>
                 * 7.  Add fence to all PD and PT BOs. <br>
                 * 8.  Unreserve all BOs <br>
                 */ <br>
              -int amdgpu_amdkfd_gpuvm_restore_process_bos(void *info,
              struct dma_fence **ef) <br>
              +int amdgpu_amdkfd_gpuvm_restore_process_bos(void *info,
              struct dma_fence __rcu **ef) <br>
                { <br>
                    struct amdkfd_process_info *process_info = info; <br>
                    struct amdgpu_vm *peer_vm; <br>
              diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
              b/drivers/gpu/drm/amd/amdkfd/kfd_priv.h <br>
              index 45366b4ca976..5a24097a9f28 100644 <br>
              --- a/drivers/gpu/drm/amd/amdkfd/kfd_priv.h <br>
              +++ b/drivers/gpu/drm/amd/amdkfd/kfd_priv.h <br>
              @@ -917,7 +917,7 @@ struct kfd_process { <br>
                     * fence will be triggered during eviction and new
              one will be created <br>
                     * during restore <br>
                     */ <br>
              -    struct dma_fence *ef; <br>
              +    struct dma_fence __rcu *ef; <br>
                      /* Work items for evicting and restoring BOs */ <br>
                    struct delayed_work eviction_work; <br>
              diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_process.c
              b/drivers/gpu/drm/amd/amdkfd/kfd_process.c <br>
              index 71df51fcc1b0..14b11d61f8dd 100644 <br>
              --- a/drivers/gpu/drm/amd/amdkfd/kfd_process.c <br>
              +++ b/drivers/gpu/drm/amd/amdkfd/kfd_process.c <br>
              @@ -1110,6 +1110,8 @@ static void
              kfd_process_wq_release(struct work_struct *work) <br>
                { <br>
                    struct kfd_process *p = container_of(work, struct
              kfd_process, <br>
                                         release_work); <br>
              +    struct dma_fence *ef =
              rcu_dereference_protected(p->ef, <br>
              +        kref_read(&p->ref) == 0); <br>
                      kfd_process_dequeue_from_all_devices(p); <br>
                    pqm_uninit(&p->pqm); <br>
              @@ -1118,7 +1120,7 @@ static void
              kfd_process_wq_release(struct work_struct *work) <br>
                     * destroyed. This allows any BOs to be freed
              without <br>
                     * triggering pointless evictions or waiting for
              fences. <br>
                     */ <br>
              -    dma_fence_signal(p->ef); <br>
              +    dma_fence_signal(ef); <br>
                      kfd_process_remove_sysfs(p); <br>
                @@ -1127,7 +1129,7 @@ static void
              kfd_process_wq_release(struct work_struct *work) <br>
                    svm_range_list_fini(p); <br>
                      kfd_process_destroy_pdds(p); <br>
              -    dma_fence_put(p->ef); <br>
              +    dma_fence_put(ef); <br>
                      kfd_event_free_process(p); <br>
            </blockquote>
          </blockquote>
          <br>
        </blockquote>
        <br>
      </blockquote>
    </blockquote>
    <br>
  </body>
</html>