    <div class="moz-cite-prefix">On 2025-01-15 06:01, Christian König
      wrote:<br>
    </div>
    <blockquote type="cite" cite="mid:f5f41f09-b62d-495f-9339-0c9dc6535bf8@amd.com">Am
      14.01.25 um 15:53 schrieb Philip Yang:
      <br>
      <blockquote type="cite">SVM migration unmap pages from GPU and
        then update mapping to GPU to
        <br>
        recover page fault. Currently unmap clears the PDE entry for
        range
        <br>
        length >= huge page and free PTB bo, update mapping to alloc
        new PT bo.
        <br>
        There is race bug that the freed entry bo maybe still on the
        pt_free
        <br>
        list, reused when updating mapping and then freed, leave invalid
        PDE
        <br>
        entry and cause GPU page fault.
        <br>
        <br>
        By setting the update to clear only one PDE entry or clear PTB,
        to
        <br>
        avoid unmap to free PTE bo. This fixes the race bug and improve
        the
        <br>
        unmap and map to GPU performance. Update mapping to huge page
        will
        <br>
        still free the PTB bo.
        <br>
        <br>
>> With this change, the vm->pt_freed list and work are no longer
>> needed. Add WARN_ON(unlocked) in amdgpu_vm_pt_free_list to catch an
>> unmap freeing the PTB.
>>
>> Signed-off-by: Philip Yang <Philip.Yang@amd.com>
>> ---
>>  drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c    |  4 ---
>>  drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h    |  4 ---
>>  drivers/gpu/drm/amd/amdgpu/amdgpu_vm_pt.c | 43 +++++++----------------
>>  3 files changed, 13 insertions(+), 38 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
>> index c9c48b782ec1..48b2c0b3b315 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
>> @@ -2440,8 +2440,6 @@ int amdgpu_vm_init(struct amdgpu_device *adev, struct amdgpu_vm *vm,
>>      spin_lock_init(&vm->status_lock);
>>      INIT_LIST_HEAD(&vm->freed);
>>      INIT_LIST_HEAD(&vm->done);
>> -    INIT_LIST_HEAD(&vm->pt_freed);
>> -    INIT_WORK(&vm->pt_free_work, amdgpu_vm_pt_free_work);
>>      INIT_KFIFO(vm->faults);
>>
>>      r = amdgpu_vm_init_entities(adev, vm);
>> @@ -2613,8 +2611,6 @@ void amdgpu_vm_fini(struct amdgpu_device *adev, struct amdgpu_vm *vm)
>>
>>      amdgpu_amdkfd_gpuvm_destroy_cb(adev, vm);
>>
>> -    flush_work(&vm->pt_free_work);
>> -
>>      root = amdgpu_bo_ref(vm->root.bo);
>>      amdgpu_bo_reserve(root, true);
>>      amdgpu_vm_put_task_info(vm->task_info);
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h
>> index 5d119ac26c4f..160889e5e64d 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h
>> @@ -369,10 +369,6 @@ struct amdgpu_vm {
>>      /* BOs which are invalidated, has been updated in the PTs */
>>      struct list_head        done;
>>
>> -    /* PT BOs scheduled to free and fill with zero if vm_resv is not hold */
>> -    struct list_head    pt_freed;
>> -    struct work_struct    pt_free_work;
>> -
>>      /* contains the page directory */
>>      struct amdgpu_vm_bo_base     root;
>>      struct dma_fence    *last_update;
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_pt.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_pt.c
>> index f78a0434a48f..54ae0e9bc6d7 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_pt.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_pt.c
>> @@ -546,27 +546,6 @@ static void amdgpu_vm_pt_free(struct amdgpu_vm_bo_base *entry)
>>      amdgpu_bo_unref(&entry->bo);
>>  }
>>
>> -void amdgpu_vm_pt_free_work(struct work_struct *work)
>> -{
>> -    struct amdgpu_vm_bo_base *entry, *next;
>> -    struct amdgpu_vm *vm;
>> -    LIST_HEAD(pt_freed);
>> -
>> -    vm = container_of(work, struct amdgpu_vm, pt_free_work);
>> -
>> -    spin_lock(&vm->status_lock);
>> -    list_splice_init(&vm->pt_freed, &pt_freed);
>> -    spin_unlock(&vm->status_lock);
>> -
>> -    /* flush_work in amdgpu_vm_fini ensure vm->root.bo is valid. */
>> -    amdgpu_bo_reserve(vm->root.bo, true);
>> -
>> -    list_for_each_entry_safe(entry, next, &pt_freed, vm_status)
>> -        amdgpu_vm_pt_free(entry);
>> -
>> -    amdgpu_bo_unreserve(vm->root.bo);
>> -}
>> -
>>  /**
>>   * amdgpu_vm_pt_free_list - free PD/PT levels
>>   *
>> @@ -579,19 +558,15 @@ void amdgpu_vm_pt_free_list(struct amdgpu_device *adev,
>>                  struct amdgpu_vm_update_params *params)
>>  {
>>      struct amdgpu_vm_bo_base *entry, *next;
>> -    struct amdgpu_vm *vm = params->vm;
>>      bool unlocked = params->unlocked;
>>
>>      if (list_empty(&params->tlb_flush_waitlist))
>>          return;
>>
>> -    if (unlocked) {
>> -        spin_lock(&vm->status_lock);
>> -        list_splice_init(&params->tlb_flush_waitlist, &vm->pt_freed);
>> -        spin_unlock(&vm->status_lock);
>> -        schedule_work(&vm->pt_free_work);
>> -        return;
>> -    }
>> +    /*
>> +     * An unlocked unmap only clears page table leaves; warn if it would free a PTB.
>> +     */
>> +    WARN_ON(unlocked);
>>
>>      list_for_each_entry_safe(entry, next, &params->tlb_flush_waitlist, vm_status)
>>          amdgpu_vm_pt_free(entry);
>> @@ -899,7 +874,15 @@ int amdgpu_vm_ptes_update(struct amdgpu_vm_update_params *params,
>>          incr = (uint64_t)AMDGPU_GPU_PAGE_SIZE << shift;
>>          mask = amdgpu_vm_pt_entries_mask(adev, cursor.level);
>>          pe_start = ((cursor.pfn >> shift) & mask) * 8;
>> -        entry_end = ((uint64_t)mask + 1) << shift;
>> +
>> +        if (cursor.level < AMDGPU_VM_PTB && params->unlocked)
>> +            /*
>> +             * MMU notifier callback unlocked unmap of a huge page: the leaf is a
>> +             * PDE entry, so only clear one entry, then search again for the next leaf.
>> +             */
>> +            entry_end = 1ULL << shift;
>> +        else
>> +            entry_end = ((uint64_t)mask + 1) << shift;

> That here looks too much like a hack for my taste. entry_end
> basically denotes the end of the page tables and not the updated
> region.

Yes, agree.
    <blockquote type="cite" cite="mid:f5f41f09-b62d-495f-9339-0c9dc6535bf8@amd.com">
      <br>
      We already set frag_end to this here:
      <br>
      <br>
      frag_end = max(frag_end, ALIGN(frag_start + 1, 1ULL <<
      shift));
      <br>
    </blockquote>

Here seems to be the root cause. For example, when unmapping,
frag_start is an 8MB-aligned address and frag_end is frag_start + 8MB,
but for an unlocked update we want to walk the page tables again after
unmapping the first 2MB.

Also the max(...) seems meaningless, as frag_end is always >=
frag_start + 1, so this could be changed to:

    if (params->unlocked)
        frag_end = frag_start + 1;
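
As a rough, untested sketch of how that would read inside
amdgpu_vm_ptes_update, right where the fragment is computed (the
surrounding call is paraphrased from the code quoted above, not a
verbatim patch):

    /* figure out the next fragment */
    amdgpu_vm_pte_fragment(params, frag_start, end, flags,
                           &frag, &frag_end);

    /*
     * For unlocked (MMU notifier) updates, shrink the fragment to a
     * single page so at most one entry at the current level is
     * updated before the walker searches for the next leaf again.
     */
    if (params->unlocked)
        frag_end = frag_start + 1;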
    <blockquote type="cite" cite="mid:f5f41f09-b62d-495f-9339-0c9dc6535bf8@amd.com">
      <br>
      Which basically means that we should update just one entry at the
      time and then walk the page tables again.
      <br>
      <br>
      The issue is just that we immediately calculate a new fragment
      after each update:
      <br>
      <br>
                              if (frag_start >= frag_end) {
      <br>
                                      /* figure out the next fragment */
      <br>
                                      amdgpu_vm_pte_fragment(params,
      frag_start, end,
      <br>
                                                             flags,
      &frag, &frag_end);
      <br>
                                      if (frag < shift)
      <br>
                                              break;
      <br>
                              }
      <br>
      And that looks like the place we need to adjust something to allow
      updates of the leave nodes.
      <br>
    </blockquote>

Yes, and we should also always break here to restart the page walk for
an unlocked unmapping:

    if (frag < shift || params->unlocked)
        break;
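
In the context of the loop quoted above (paraphrased, not a verbatim
patch), that would look roughly like:

    if (frag_start >= frag_end) {
        /* figure out the next fragment */
        amdgpu_vm_pte_fragment(params, frag_start, end,
                               flags, &frag, &frag_end);
        /*
         * Unlocked unmaps stop after every fragment and restart the
         * page walk, so the walker re-evaluates whether the next
         * leaf is a PDE or a PTE.
         */
        if (frag < shift || params->unlocked)
            break;
    }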
    <blockquote type="cite" cite="mid:f5f41f09-b62d-495f-9339-0c9dc6535bf8@amd.com">
      <br>
      Alternatively I wouldn't mind having a complete separate function
      for unlocked invalidations.
      <br>
    </blockquote>

A completely separate function would duplicate a lot of the page
walker code. Alternatively we could change amdgpu_vm_pte_fragment so
that, for unlocked unmapping, it always limits frag_end to a 2MB (or
1GB) fragment, never 4MB, 8MB, ...
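
A rough sketch of that variant (untested, and assuming the upstream
parameter names start/frag/frag_end; the shift of 9, i.e. a 2MB
fragment of 512 GPU pages, is hard-coded here only for illustration):

    /* at the end of amdgpu_vm_pte_fragment() */
    if (params->unlocked) {
        /*
         * Cap the fragment of an unlocked unmap at one 2MB leaf so
         * it can never span more than one PDE entry.
         */
        *frag = min_t(unsigned int, *frag, 9);
        *frag_end = min(*frag_end, ALIGN(start + 1, 1ULL << 9));
    }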

Regards,
Philip
    <blockquote type="cite" cite="mid:f5f41f09-b62d-495f-9339-0c9dc6535bf8@amd.com">
      <br>
      Regards,
      <br>
      Christian.
      <br>
      <br>
      <br>
      <blockquote type="cite">          entry_end += cursor.pfn &
        ~(entry_end - 1);
        <br>
                  entry_end = min(entry_end, end);
        <br>
          </blockquote>
      <br>
    </blockquote>