<!DOCTYPE html><html><head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
  </head>
  <body>
    <div class="moz-cite-prefix">On 2023-12-11 10:56, Felix Kuehling
      wrote:<br>
    </div>
    <blockquote type="cite" cite="mid:7b67d4f1-cfcd-47a9-a80e-f4c1eee235a1@amd.com">
      <br>
      On 2023-12-08 05:11, Christian König wrote:
      <br>
      <blockquote type="cite">Am 07.12.23 um 20:14 schrieb Felix
        Kuehling:
        <br>
        <blockquote type="cite">
          <br>
          On 2023-12-05 17:20, Felix Kuehling wrote:
          <br>
          <blockquote type="cite">Properly mark kfd_process->ef as
            __rcu and consistently access it with
            <br>
            rcu_dereference_protected.
            <br>
            <br>
            Reported-by: kernel test robot <a class="moz-txt-link-rfc2396E" href="mailto:lkp@intel.com"><lkp@intel.com></a>
            <br>
            Closes:
<a class="moz-txt-link-freetext" href="https://lore.kernel.org/oe-kbuild-all/202312052245.yFpBSgNH-lkp@intel.com/">https://lore.kernel.org/oe-kbuild-all/202312052245.yFpBSgNH-lkp@intel.com/</a><br>
            Signed-off-by: Felix Kuehling <a class="moz-txt-link-rfc2396E" href="mailto:Felix.Kuehling@amd.com"><Felix.Kuehling@amd.com></a>
            <br>
          </blockquote>
          <br>
          ping.
          <br>
          <br>
          Christian, would you review this patch, please?
          <br>
        </blockquote>
        <br>
        Looks a bit suspicious, especially the
        rcu_dereference_protected() use.
        <br>
        <br>
        What is the static checker complaining about in the first place?
        <br>
      </blockquote>
      From
<a class="moz-txt-link-freetext" href="https://lore.kernel.org/oe-kbuild-all/202312052245.yFpBSgNH-lkp@intel.com/">https://lore.kernel.org/oe-kbuild-all/202312052245.yFpBSgNH-lkp@intel.com/</a>:<br>
      <br>
      <blockquote type="cite">
        <blockquote type="cite">drivers/gpu/drm/amd/amdgpu/../amdkfd/kfd_process.c:1671:9:
          sparse: sparse: incompatible types in comparison expression
          (different address spaces):  >>
          drivers/gpu/drm/amd/amdgpu/../amdkfd/kfd_process.c:1671:9:
          sparse: </blockquote>
      </blockquote>
      struct dma_fence [noderef] __rcu * >>
      drivers/gpu/drm/amd/amdgpu/../amdkfd/kfd_process.c:1671:9: sparse:
      struct dma_fence * ... >>
      drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c:2765:36: sparse:
      sparse: incompatible types in comparison expression (different
      address spaces): >>
      drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c:2765:36: sparse:
      struct dma_fence [noderef] __rcu * >>
      drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c:2765:36: sparse:
      struct dma_fence * >>
      drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c:2765:36: sparse:
      sparse: incompatible types in comparison expression (different
      address spaces): >>
      drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c:2765:36: sparse:
      struct dma_fence [noderef] __rcu * >>
      drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c:2765:36: sparse:
      struct dma_fence *
      <br>
      <br>
      As far as I can tell, the reason is, that I'm using
      dma_fence_get_rcu_safe and rcu_replace_pointer to get and update
      kfd_process->ef, without annotating the fence pointers with
      __rcu. This patch fixes that by marking kfd_process->ef as an
      __rcu pointer. The only place that was dereferencing it directly
      was kfd_process_wq_release, where I added
      rcu_dereference_protected. The condition I'm using here is, that
      the process ref count is 0 at that point, which means nobody else
      is referencing the process or this fence pointer at the time.
      <br>
    </blockquote>
    <p>Hi Christian,</p>
    <p>We discussed offline that you think rcu_dereference_protected is
      not needed in the teardown function. After reading over
      rcupdate.h, I think a simpler alternative would be to use
      rcu_access_pointer, after a grace period to ensure there can be no
      more readers. Based on this comment in rcupdate.h:</p>
    <pre> * It is also permissible to use rcu_access_pointer() when read-side
 * access to the pointer was removed at least one grace period ago, as is
 * the case in the context of the RCU callback that is freeing up the data,
 * or after a synchronize_rcu() returns.  This can be useful when tearing
 * down multi-linked structures after a grace period has elapsed.  However,
 * rcu_dereference_protected() is normally preferred for this use case.
</pre>
    <p>The last sentence sounds like rcu_dereference_protected should
      also be OK, though. Either way, it sounds like I need to add a
      synchronize_rcu call in any case, before freeing the fence. Do you
      agree with this proposal?</p>
    <p>Regards,<br>
        Felix<br>
    </p>
    <p><br>
    </p>
    <blockquote type="cite" cite="mid:7b67d4f1-cfcd-47a9-a80e-f4c1eee235a1@amd.com">
      <br>
      Regards,
      <br>
        Felix
      <br>
      <br>
      <br>
      <blockquote type="cite">
        <br>
        Regards,
        <br>
        Christian.
        <br>
        <br>
        <blockquote type="cite">
          <br>
          Thanks,
          <br>
            Felix
          <br>
          <br>
          <br>
          <br>
          <blockquote type="cite">---
            <br>
              drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h       | 2 +-
            <br>
              drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c | 4 ++--
            <br>
              drivers/gpu/drm/amd/amdkfd/kfd_priv.h            | 2 +-
            <br>
              drivers/gpu/drm/amd/amdkfd/kfd_process.c         | 6
            ++++--
            <br>
              4 files changed, 8 insertions(+), 6 deletions(-)
            <br>
            <br>
            diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
            b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
            <br>
            index f2e920734c98..20cb266dcedd 100644
            <br>
            --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
            <br>
            +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
            <br>
            @@ -314,7 +314,7 @@ void
            amdgpu_amdkfd_gpuvm_unmap_gtt_bo_from_kernel(struct kgd_mem
            *mem);
            <br>
              int amdgpu_amdkfd_map_gtt_bo_to_gart(struct amdgpu_device
            *adev, struct amdgpu_bo *bo);
            <br>
                int amdgpu_amdkfd_gpuvm_restore_process_bos(void
            *process_info,
            <br>
            -                        struct dma_fence **ef);
            <br>
            +                        struct dma_fence __rcu **ef);
            <br>
              int amdgpu_amdkfd_gpuvm_get_vm_fault_info(struct
            amdgpu_device *adev,
            <br>
                                        struct kfd_vm_fault_info *info);
            <br>
              int amdgpu_amdkfd_gpuvm_import_dmabuf_fd(struct
            amdgpu_device *adev, int fd,
            <br>
            diff --git
            a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
            b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
            <br>
            index 7d91f99acb59..8ba6f6c8363d 100644
            <br>
            --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
            <br>
            +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
            <br>
            @@ -2806,7 +2806,7 @@ static void
            amdgpu_amdkfd_restore_userptr_worker(struct work_struct
            *work)
            <br>
                  put_task_struct(usertask);
            <br>
              }
            <br>
              -static void replace_eviction_fence(struct dma_fence **ef,
            <br>
            +static void replace_eviction_fence(struct dma_fence __rcu
            **ef,
            <br>
                                 struct dma_fence *new_ef)
            <br>
              {
            <br>
                  struct dma_fence *old_ef = rcu_replace_pointer(*ef,
            new_ef, true
            <br>
            @@ -2841,7 +2841,7 @@ static void
            replace_eviction_fence(struct dma_fence **ef,
            <br>
               * 7.  Add fence to all PD and PT BOs.
            <br>
               * 8.  Unreserve all BOs
            <br>
               */
            <br>
            -int amdgpu_amdkfd_gpuvm_restore_process_bos(void *info,
            struct dma_fence **ef)
            <br>
            +int amdgpu_amdkfd_gpuvm_restore_process_bos(void *info,
            struct dma_fence __rcu **ef)
            <br>
              {
            <br>
                  struct amdkfd_process_info *process_info = info;
            <br>
                  struct amdgpu_vm *peer_vm;
            <br>
            diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
            b/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
            <br>
            index 45366b4ca976..5a24097a9f28 100644
            <br>
            --- a/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
            <br>
            +++ b/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
            <br>
            @@ -917,7 +917,7 @@ struct kfd_process {
            <br>
                   * fence will be triggered during eviction and new one
            will be created
            <br>
                   * during restore
            <br>
                   */
            <br>
            -    struct dma_fence *ef;
            <br>
            +    struct dma_fence __rcu *ef;
            <br>
                    /* Work items for evicting and restoring BOs */
            <br>
                  struct delayed_work eviction_work;
            <br>
            diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_process.c
            b/drivers/gpu/drm/amd/amdkfd/kfd_process.c
            <br>
            index 71df51fcc1b0..14b11d61f8dd 100644
            <br>
            --- a/drivers/gpu/drm/amd/amdkfd/kfd_process.c
            <br>
            +++ b/drivers/gpu/drm/amd/amdkfd/kfd_process.c
            <br>
            @@ -1110,6 +1110,8 @@ static void
            kfd_process_wq_release(struct work_struct *work)
            <br>
              {
            <br>
                  struct kfd_process *p = container_of(work, struct
            kfd_process,
            <br>
                                       release_work);
            <br>
            +    struct dma_fence *ef =
            rcu_dereference_protected(p->ef,
            <br>
            +        kref_read(&p->ref) == 0);
            <br>
                    kfd_process_dequeue_from_all_devices(p);
            <br>
                  pqm_uninit(&p->pqm);
            <br>
            @@ -1118,7 +1120,7 @@ static void
            kfd_process_wq_release(struct work_struct *work)
            <br>
                   * destroyed. This allows any BOs to be freed without
            <br>
                   * triggering pointless evictions or waiting for
            fences.
            <br>
                   */
            <br>
            -    dma_fence_signal(p->ef);
            <br>
            +    dma_fence_signal(ef);
            <br>
                    kfd_process_remove_sysfs(p);
            <br>
              @@ -1127,7 +1129,7 @@ static void
            kfd_process_wq_release(struct work_struct *work)
            <br>
                  svm_range_list_fini(p);
            <br>
                    kfd_process_destroy_pdds(p);
            <br>
            -    dma_fence_put(p->ef);
            <br>
            +    dma_fence_put(ef);
            <br>
                    kfd_event_free_process(p);
            <br>
          </blockquote>
        </blockquote>
        <br>
      </blockquote>
      <br>
    </blockquote>
  </body>
</html>