    <div class="moz-cite-prefix">On 2024-01-26 10:35, Christian König
      wrote:<br>
    </div>
    <blockquote type="cite" cite="mid:5ad7382e-5b3f-49b1-9aba-59a316feb7ab@amd.com">Am
      26.01.24 um 16:17 schrieb Philip Yang:
      <br>
      <blockquote type="cite">On 2024-01-26 09:59, Christian König
        wrote:
        <br>
        <blockquote type="cite">Am 26.01.24 um 15:38 schrieb Philip
          Yang:
          <br>
          <blockquote type="cite">svm range support partial migration
            and mapping update, for size 4MB
            <br>
            virtual address 4MB alignment and physical address
            continuous range, if
            <br>
            mapping to GPU with fs=10, after updating mapping of the
            first 2MB,
            <br>
            if the second 2MB mapping fs=10 in cache TLB, this causes
            the first 2MB
            <br>
            access to the stale mapping.
            <br>
          </blockquote>
>>>
>>> Well that sounds fishy. When that happens with (for example) 4MiB
>>> and 2MiB, why doesn't it happen with 8KiB and 4KiB as well?
>>
>> Unmapping an SVM range is aligned to the granularity size. If the
>> range size is 8KB (entirely within one 2MB granularity range), it is
>> mapped/unmapped as 8KB even if only 4KB is migrated. This is handled
>> in another patch series, "amd/amdkfd: Unmap range from GPU based on
>> granularity".
>>
>
> Ok, that makes a bit more sense.
>
> But when you have a linear 4MiB mapping and unmap the first 2MiB of
> it, you need to flush the TLB anyway.
>
> So why would that cause a stale access?

Yes, unmap does flush the TLB. The issue happens if the GPU accesses
the second 2MB first, loading an fs=10 entry into the TLB, and then
accesses the first 2MB.
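
To make the fragment arithmetic concrete, here is a minimal userspace
sketch (illustrative only, not driver code; the base address is made
up) of the encoding described in the amdgpu_vm_pte_fragment() comment:
a fragment value f in a PTE advertises 2^f contiguous 4KB pages, so one
fs=10 entry loaded through the second 2MB translates the entire 4MB
range, including the first 2MB:

#include <stdio.h>

int main(void)
{
	unsigned long long base = 0x7f0000400000ULL; /* hypothetical 4MB-aligned VA */
	unsigned int f = 10;                         /* fs=10 */
	unsigned long long frag_bytes = 4096ULL << f; /* 4KB << 10 = 4MB */

	/* a TLB entry cached from an access 3MB into the range ... */
	unsigned long long hit = base + (3ULL << 20);
	unsigned long long covered = hit & ~(frag_bytes - 1);

	/* ... covers the whole fs=10 fragment, first 2MB included */
	printf("fs=%u fragment: %llu MB, covers 0x%llx-0x%llx\n",
	       f, frag_bytes >> 20, covered, covered + frag_bytes - 1);
	return 0;
}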

Originally I thought this could be fixed by mapping/unmapping to the
GPU with a granularity-aligned address and size, but after debugging I
realized we still need to limit the maximum fragment size. We could do
that in the SVM map function, but it is more efficient to pass the
maximum fragment size down to the GPU page table update level.
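
As a rough sketch of the effect (assuming the default
prange->granularity of 9, i.e. 2^9 * 4KB = 2MB; pick_frag() is a
hypothetical helper that mirrors the ffs()/fls64() logic of
amdgpu_vm_pte_fragment() using GCC builtins, not the driver function
itself):

#include <stdio.h>

/* largest power-of-two run allowed by alignment, length and the
 * frag_size cap; frag_size == 0 keeps the old unlimited behaviour */
static unsigned int pick_frag(unsigned long long start_pfn,
			      unsigned long long num_pfns,
			      unsigned int frag_size)
{
	unsigned int max_frag = frag_size ? frag_size : 31;
	unsigned int frag = __builtin_ctzll(start_pfn | (1ULL << 63));
	unsigned int len = 63 - __builtin_clzll(num_pfns);

	if (len < frag)
		frag = len;
	if (frag > max_frag)
		frag = max_frag;
	return frag;
}

int main(void)
{
	/* 4MB-aligned, physically contiguous range of 1024 4KB pages */
	unsigned long long pfn = 0x400, n = 1024;

	while (n) {
		unsigned int f = pick_frag(pfn, n, 9); /* granularity = 9 */

		printf("map %llu pages at pfn 0x%llx with fs=%u\n",
		       1ULL << f, pfn, f);
		pfn += 1ULL << f;
		n -= 1ULL << f;
	}
	return 0;
}

With frag_size = 9 the 4MB range is written as two fs=9 (2MB) fragments
instead of one fs=10 fragment, so no TLB entry can span the boundary
between the two halves; passing 0 keeps the existing gfx and legacy
behaviour.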

Regards,

Philip
    <blockquote type="cite" cite="mid:5ad7382e-5b3f-49b1-9aba-59a316feb7ab@amd.com">
      <br>
      Regards,
      <br>
      Christian.
      <br>
      <br>
      <blockquote type="cite">Regards,
        <br>
        <br>
        Philip
        <br>
        <br>
        <blockquote type="cite">
          <br>
          Christian.
          <br>
          <br>
          <blockquote type="cite">
            <br>
            Limit the maximum fragment size to granularity size, 2MB by
            default,
            <br>
            with the mapping and unmapping based on gramularity size, to
            solve this
            <br>
            issue.
            <br>
            <br>
            The change is only for SVM map/unmap range, no change for
            gfx and legacy
            <br>
            API path.
            <br>
            <br>
            Signed-off-by: Philip Yang <a class="moz-txt-link-rfc2396E" href="mailto:Philip.Yang@amd.com"><Philip.Yang@amd.com></a>
            <br>
            ---
>>>>  drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c    | 12 +++++++-----
>>>>  drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h    |  4 ++--
>>>>  drivers/gpu/drm/amd/amdgpu/amdgpu_vm_pt.c | 22 ++++++++++++----------
>>>>  drivers/gpu/drm/amd/amdkfd/kfd_svm.c      |  9 +++++----
>>>>  4 files changed, 26 insertions(+), 21 deletions(-)
>>>>
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
>>>> index ed4a8c5d26d7..a2bef94cb959 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
>>>> @@ -897,6 +897,7 @@ static void amdgpu_vm_tlb_seq_cb(struct dma_fence *fence,
>>>>   * @res: ttm_resource to map
>>>>   * @pages_addr: DMA addresses to use for mapping
>>>>   * @fence: optional resulting fence
>>>> + * @frag_size: max map fragment size
>>>>   *
>>>>   * Fill in the page table entries between @start and @last.
>>>>   *
>>>> @@ -908,7 +909,7 @@ int amdgpu_vm_update_range(struct amdgpu_device *adev, struct amdgpu_vm *vm,
>>>>  			   struct dma_resv *resv, uint64_t start, uint64_t last,
>>>>  			   uint64_t flags, uint64_t offset, uint64_t vram_base,
>>>>  			   struct ttm_resource *res, dma_addr_t *pages_addr,
>>>> -			   struct dma_fence **fence)
>>>> +			   struct dma_fence **fence, unsigned int frag_size)
>>>>  {
>>>>  	struct amdgpu_vm_update_params params;
>>>>  	struct amdgpu_vm_tlb_seq_struct *tlb_cb;
>>>> @@ -1016,7 +1017,7 @@ int amdgpu_vm_update_range(struct amdgpu_device *adev, struct amdgpu_vm *vm,
>>>>  		}
>>>>
>>>>  		tmp = start + num_entries;
>>>> -		r = amdgpu_vm_ptes_update(&params, start, tmp, addr, flags);
>>>> +		r = amdgpu_vm_ptes_update(&params, start, tmp, addr, flags, frag_size);
>>>>  		if (r)
>>>>  			goto error_free;
>>>> @@ -1197,7 +1198,7 @@ int amdgpu_vm_bo_update(struct amdgpu_device *adev, struct amdgpu_bo_va *bo_va,
>>>>  				   !uncached, resv, mapping->start, mapping->last,
>>>>  				   update_flags, mapping->offset,
>>>>  				   vram_base, mem, pages_addr,
>>>> -				   last_update);
>>>> +				   last_update, 0);
>>>>  		if (r)
>>>>  			return r;
>>>>  	}
>>>> @@ -1392,7 +1393,7 @@ int amdgpu_vm_clear_freed(struct amdgpu_device *adev,
>>>>  		r = amdgpu_vm_update_range(adev, vm, false, false, true, false,
>>>>  					   resv, mapping->start, mapping->last,
>>>>  					   init_pte_value, 0, 0, NULL, NULL,
>>>> -					   &f);
>>>> +					   &f, 0);
>>>>  		amdgpu_vm_free_mapping(adev, vm, mapping, f);
>>>>  		if (r) {
>>>>  			dma_fence_put(f);
>>>> @@ -2733,7 +2734,8 @@ bool amdgpu_vm_handle_fault(struct amdgpu_device *adev, u32 pasid,
>>>>  	}
>>>>
>>>>  	r = amdgpu_vm_update_range(adev, vm, true, false, false, false,
>>>> -				   NULL, addr, addr, flags, value, 0, NULL, NULL, NULL);
>>>> +				   NULL, addr, addr, flags, value, 0, NULL, NULL,
>>>> +				   NULL, 0);
>>>>  	if (r)
>>>>  		goto error_unlock;
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h
>>>> index 666698a57192..b34466b5086f 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h
>>>> @@ -465,7 +465,7 @@ int amdgpu_vm_update_range(struct amdgpu_device *adev, struct amdgpu_vm *vm,
>>>>  			   struct dma_resv *resv, uint64_t start, uint64_t last,
>>>>  			   uint64_t flags, uint64_t offset, uint64_t vram_base,
>>>>  			   struct ttm_resource *res, dma_addr_t *pages_addr,
>>>> -			   struct dma_fence **fence);
>>>> +			   struct dma_fence **fence, unsigned int frag_size);
>>>>  int amdgpu_vm_bo_update(struct amdgpu_device *adev,
>>>>  			struct amdgpu_bo_va *bo_va,
>>>>  			bool clear);
>>>> @@ -531,7 +531,7 @@ int amdgpu_vm_pde_update(struct amdgpu_vm_update_params *params,
>>>>  			 struct amdgpu_vm_bo_base *entry);
>>>>  int amdgpu_vm_ptes_update(struct amdgpu_vm_update_params *params,
>>>>  			  uint64_t start, uint64_t end,
>>>> -			  uint64_t dst, uint64_t flags);
>>>> +			  uint64_t dst, uint64_t flags, unsigned int frag_size);
>>>>  void amdgpu_vm_pt_free_work(struct work_struct *work);
>>>>
>>>>  #if defined(CONFIG_DEBUG_FS)
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_pt.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_pt.c
>>>> index a160265ddc07..4647f700f1c6 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_pt.c
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_pt.c
>>>> @@ -860,12 +860,14 @@ static void amdgpu_vm_pte_update_flags(struct amdgpu_vm_update_params *params,
>>>>   * @flags: hw mapping flags
>>>>   * @frag: resulting fragment size
>>>>   * @frag_end: end of this fragment
>>>> + * @frag_size: max map fragment size
>>>>   *
>>>>   * Returns the first possible fragment for the start and end address.
>>>>   */
>>>>  static void amdgpu_vm_pte_fragment(struct amdgpu_vm_update_params *params,
>>>>  				   uint64_t start, uint64_t end, uint64_t flags,
>>>> -				   unsigned int *frag, uint64_t *frag_end)
>>>> +				   unsigned int *frag, uint64_t *frag_end,
>>>> +				   unsigned int frag_size)
>>>>  {
>>>>  	/**
>>>>  	 * The MC L1 TLB supports variable sized pages, based on a fragment
>>>> @@ -893,7 +895,7 @@ static void amdgpu_vm_pte_fragment(struct amdgpu_vm_update_params *params,
>>>>  	if (params->adev->asic_type < CHIP_VEGA10)
>>>>  		max_frag = params->adev->vm_manager.fragment_size;
>>>>  	else
>>>> -		max_frag = 31;
>>>> +		max_frag = frag_size ? frag_size : 31;
>>>>
>>>>  	/* system pages are non continuously */
>>>>  	if (params->pages_addr) {
>>>> @@ -904,12 +906,10 @@ static void amdgpu_vm_pte_fragment(struct amdgpu_vm_update_params *params,
>>>>
>>>>  	/* This intentionally wraps around if no bit is set */
>>>>  	*frag = min_t(unsigned int, ffs(start) - 1, fls64(end - start) - 1);
>>>> -	if (*frag >= max_frag) {
>>>> +	if (*frag >= max_frag)
>>>>  		*frag = max_frag;
>>>> -		*frag_end = end & ~((1ULL << max_frag) - 1);
>>>> -	} else {
>>>> -		*frag_end = start + (1 << *frag);
>>>> -	}
>>>> +
>>>> +	*frag_end = start + (1 << *frag);
>>>>  }
>>>>
>>>>  /**
>>>> @@ -920,6 +920,7 @@ static void amdgpu_vm_pte_fragment(struct amdgpu_vm_update_params *params,
>>>>   * @end: end of GPU address range
>>>>   * @dst: destination address to map to, the next dst inside the function
>>>>   * @flags: mapping flags
>>>> + * @frag_size: max map fragment size
>>>>   *
>>>>   * Update the page tables in the range @start - @end.
>>>>   *
>>>> @@ -928,7 +929,7 @@ static void amdgpu_vm_pte_fragment(struct amdgpu_vm_update_params *params,
>>>>   */
>>>>  int amdgpu_vm_ptes_update(struct amdgpu_vm_update_params *params,
>>>>  			  uint64_t start, uint64_t end,
>>>> -			  uint64_t dst, uint64_t flags)
>>>> +			  uint64_t dst, uint64_t flags, unsigned int frag_size)
>>>>  {
>>>>  	struct amdgpu_device *adev = params->adev;
>>>>  	struct amdgpu_vm_pt_cursor cursor;
>>>> @@ -938,7 +939,7 @@ int amdgpu_vm_ptes_update(struct amdgpu_vm_update_params *params,
>>>>
>>>>  	/* figure out the initial fragment */
>>>>  	amdgpu_vm_pte_fragment(params, frag_start, end, flags, &frag,
>>>> -			       &frag_end);
>>>> +			       &frag_end, frag_size);
>>>>
>>>>  	/* walk over the address space and update the PTs */
>>>>  	amdgpu_vm_pt_start(adev, params->vm, start, &cursor);
>>>> @@ -1040,7 +1041,8 @@ int amdgpu_vm_ptes_update(struct amdgpu_vm_update_params *params,
>>>>  			if (frag_start >= frag_end) {
>>>>  				/* figure out the next fragment */
>>>>  				amdgpu_vm_pte_fragment(params, frag_start, end,
>>>> -						       flags, &frag, &frag_end);
>>>> +						       flags, &frag, &frag_end,
>>>> +						       frag_size);
>>>>  				if (frag < shift)
>>>>  					break;
>>>>  			}
>>>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
>>>> index ed35a490fd9e..d71b2c8bf51a 100644
>>>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
>>>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
>>>> @@ -1488,7 +1488,7 @@ svm_range_get_pte_flags(struct kfd_node *node,
>>>>  static int
>>>>  svm_range_unmap_from_gpu(struct amdgpu_device *adev, struct amdgpu_vm *vm,
>>>>  			 uint64_t start, uint64_t last,
>>>> -			 struct dma_fence **fence)
>>>> +			 struct dma_fence **fence, unsigned int frag_size)
>>>>  {
>>>>  	uint64_t init_pte_value = 0;
>>>> @@ -1496,7 +1496,7 @@ svm_range_unmap_from_gpu(struct amdgpu_device *adev, struct amdgpu_vm *vm,
>>>>
>>>>  	return amdgpu_vm_update_range(adev, vm, false, true, true, false, NULL, start,
>>>>  				      last, init_pte_value, 0, 0, NULL, NULL,
>>>> -				      fence);
>>>> +				      fence, frag_size);
>>>>  }
>>>>
>>>>  /**
>>>> @@ -1579,7 +1579,7 @@ svm_range_unmap_from_gpus(struct svm_range *prange, unsigned long start,
>>>>  		r = svm_range_unmap_from_gpu(pdd->dev->adev,
>>>>  					     drm_priv_to_vm(pdd->drm_priv),
>>>> -					     start, last, &fence);
>>>> +					     start, last, &fence, prange->granularity);
>>>>  		if (r)
>>>>  			break;
>>>> @@ -1647,7 +1647,8 @@ svm_range_map_to_gpu(struct kfd_process_device *pdd, struct svm_range *prange,
>>>>  				   pte_flags,
>>>>  				   (last_start - prange->start) << PAGE_SHIFT,
>>>>  				   bo_adev ? bo_adev->vm_manager.vram_base_offset : 0,
>>>> -				   NULL, dma_addr, &vm->last_update);
>>>> +				   NULL, dma_addr, &vm->last_update,
>>>> +				   prange->granularity);
>>>>
>>>>  		for (j = last_start - prange->start; j <= i; j++)
>>>>  			dma_addr[j] |= last_domain;