<!DOCTYPE html><html><head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
  </head>
  <body>
    <div class="moz-cite-prefix">On 2023-10-30 4:23, Christian König
      wrote:<br>
    </div>
    <blockquote type="cite" cite="mid:8e81a28a-5d50-44df-b441-6ceb2133c5e9@amd.com">Am
      28.10.23 um 00:39 schrieb Felix Kuehling:
      <br>
      <blockquote type="cite">Make restore workers freezable so we don't
        have to explicitly flush them
        <br>
        in suspend and GPU reset code paths, and we don't accidentally
        try to
        <br>
        restore BOs while the GPU is suspended. Not having to flush
        restore_work
        <br>
        also helps avoid lock/fence dependencies in the GPU reset case
        where we're
        <br>
        not allowed to wait for fences.
        <br>
        <br>
        This is an RFC and request for testing. I have not tested this
        myself yet.
        <br>
        My notes below:
        <br>
        <br>
        Restore work won't be frozen during GPU reset. Does it matter?
        Queues will
        <br>
        stay evicted until resume in any case. But restore_work may be
        in trouble
        <br>
        if it gets run in the middle of a GPU reset. In that case, if
        anything
        <br>
        fails, it will just reschedule itself, so should be fine as long
        as it
        <br>
        doesn't interfere with the reset itself (respects any mechanisms
        in place
        <br>
        to prevent HW access during the reset).
        <br>
        <br>
        What HW access does restore_work perform:
        <br>
        - migrating buffers: uses GPU scheduler, should be suspended
        during reset
        <br>
        - TLB flushes in userptr restore worker: not called directly,
        relies on
        <br>
           HWS to flush TLBs on VMID attachment
        <br>
        - TLB flushes in SVM restore worker: Does TLB flush in the
        mapping code
        <br>
        - Resuming user mode queues: should not happen while GPU reset
        keeps queue
        <br>
           eviction counter elevated
        <br>
        Ergo: Except for the SVM case, it's OK to not flush restore work
        before
        <br>
        GPU resets. I'll need to rethink the interaction of SVM
        restore_work with
        <br>
        GPU resets.
        <br>
      </blockquote>
      <br>
      It also sounds like the restore work is some high level work and
      shouldn't interact with the lower level GPU reset.
      <br>
    </blockquote>
    <p>That was my hope as well. I'm just trying to think it through to
      make sure I'm not making any incorrect assumptions.</p>
    <p>Do you know if there is anything preventing a TLB flush using
      MMIO from messing up a GPU reset in progress? That's the only
      thing in the SVM restore work that tries to touch HW directly.<br>
    </p>
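    <p>For illustration, this is roughly the kind of guard I have in mind
      for the SVM path; it is only a sketch to frame the question.
      amdgpu_in_reset() and kfd_flush_tlb() exist today, but the helper
      name and the idea of skipping the flush and relying on the post-reset
      restore to rebuild the mappings are assumptions on my part, not
      current behavior:</p>
    <pre>
/* Sketch only: skip the MMIO-based TLB flush in the SVM restore path while
 * a GPU reset is in progress. Hypothetical helper, not part of this patch.
 */
static void svm_flush_tlb_unless_resetting(struct kfd_process_device *pdd)
{
        struct amdgpu_device *adev = pdd->dev->adev;

        /* If a reset is in flight, the post-reset restore will re-validate
         * mappings anyway; touching HW via MMIO here could race with it.
         */
        if (amdgpu_in_reset(adev))
                return;

        kfd_flush_tlb(pdd, TLB_FLUSH_LEGACY);
}
</pre>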
    <p><br>
    </p>
    <blockquote type="cite" cite="mid:8e81a28a-5d50-44df-b441-6ceb2133c5e9@amd.com">
      <br>
      <blockquote type="cite">
        <br>
        What about cancelling p->eviction_work? Eviction work would
        normally be
        <br>
        needed to signal eviction fences, but we're doing that
        explicitly in
        <br>
        suspend_all_processes. Does eviction_work wait for fences
        anywhere? Yes,
        <br>
        indirectly by flushing restore_work. So we should not try to
        cancel it
        <br>
        during a GPU reset.
        <br>
        <br>
        Problem: accessing p->ef concurrently in evict_process_worker
        and
        <br>
        suspend_all_processes. Need a spinlock to use and update it
        safely.
        <br>
        Problem: What if evict_process_worker gets stuck in flushing
        restore_work?
        <br>
        We can skip all of that if p->ef is NULL (which it is during
        the reset).
        <br>
        Even if it gets stuck, it's no problem if the reset doesn't
        depend on it.
        <br>
        It should get unstuck after the reset.
        <br>
        <br>
        Signed-off-by: Felix Kuehling <a class="moz-txt-link-rfc2396E" href="mailto:Felix.Kuehling@amd.com"><Felix.Kuehling@amd.com></a>
        <br>
        ---
        <br>
          .../gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c  |  9 ++--
        <br>
          drivers/gpu/drm/amd/amdkfd/kfd_priv.h         |  1 +
        <br>
          drivers/gpu/drm/amd/amdkfd/kfd_process.c      | 49
        +++++++++++++------
        <br>
          drivers/gpu/drm/amd/amdkfd/kfd_svm.c          |  4 +-
        <br>
          4 files changed, 44 insertions(+), 19 deletions(-)
        <br>
        <br>
        diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
        b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
        <br>
        index 54f31a420229..89e632257663 100644
        <br>
        --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
        <br>
        +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
        <br>
        @@ -1644,7 +1644,8 @@ int amdgpu_amdkfd_criu_resume(void *p)
        <br>
                  goto out_unlock;
        <br>
              }
        <br>
              WRITE_ONCE(pinfo->block_mmu_notifications, false);
        <br>
        -    schedule_delayed_work(&pinfo->restore_userptr_work,
        0);
        <br>
        +    queue_delayed_work(system_freezable_wq,
        <br>
        +               &pinfo->restore_userptr_work, 0);
        <br>
            out_unlock:
        <br>
              mutex_unlock(&pinfo->lock);
        <br>
        @@ -2458,7 +2459,8 @@ int amdgpu_amdkfd_evict_userptr(struct
        mmu_interval_notifier *mni,
        <br>
                                 KFD_QUEUE_EVICTION_TRIGGER_USERPTR);
        <br>
                  if (r)
        <br>
                      pr_err("Failed to quiesce KFD\n");
        <br>
        -       
        schedule_delayed_work(&process_info->restore_userptr_work,
        <br>
        +        queue_delayed_work(system_freezable_wq,
        <br>
        +            &process_info->restore_userptr_work,
        <br>
                     
        msecs_to_jiffies(AMDGPU_USERPTR_RESTORE_DELAY_MS));
        <br>
              }
        <br>
              mutex_unlock(&process_info->notifier_lock);
        <br>
        @@ -2793,7 +2795,8 @@ static void
        amdgpu_amdkfd_restore_userptr_worker(struct work_struct *work)
        <br>
                /* If validation failed, reschedule another attempt */
        <br>
              if (evicted_bos) {
        <br>
        -       
        schedule_delayed_work(&process_info->restore_userptr_work,
        <br>
        +        queue_delayed_work(system_freezable_wq,
        <br>
        +            &process_info->restore_userptr_work,
        <br>
                     
        msecs_to_jiffies(AMDGPU_USERPTR_RESTORE_DELAY_MS));
        <br>
                    kfd_smi_event_queue_restore_rescheduled(mm);
        <br>
        diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
        b/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
        <br>
        index 9cc32f577e38..cf017d027fee 100644
        <br>
        --- a/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
        <br>
        +++ b/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
        <br>
        @@ -919,6 +919,7 @@ struct kfd_process {
        <br>
               * during restore
        <br>
               */
        <br>
              struct dma_fence *ef;
        <br>
        +    spinlock_t ef_lock;
        <br>
                /* Work items for evicting and restoring BOs */
        <br>
              struct delayed_work eviction_work;
        <br>
        diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_process.c
        b/drivers/gpu/drm/amd/amdkfd/kfd_process.c
        <br>
        index fbf053001af9..a07cba58ec5e 100644
        <br>
        --- a/drivers/gpu/drm/amd/amdkfd/kfd_process.c
        <br>
        +++ b/drivers/gpu/drm/amd/amdkfd/kfd_process.c
        <br>
        @@ -664,7 +664,8 @@ int kfd_process_create_wq(void)
        <br>
              if (!kfd_process_wq)
        <br>
                  kfd_process_wq = alloc_workqueue("kfd_process_wq", 0,
        0);
        <br>
              if (!kfd_restore_wq)
        <br>
        -        kfd_restore_wq =
        alloc_ordered_workqueue("kfd_restore_wq", 0);
        <br>
        +        kfd_restore_wq =
        alloc_ordered_workqueue("kfd_restore_wq",
        <br>
        +                             WQ_FREEZABLE);
        <br>
                if (!kfd_process_wq || !kfd_restore_wq) {
        <br>
                  kfd_process_destroy_wq();
        <br>
        @@ -1460,6 +1461,7 @@ static struct kfd_process
        *create_process(const struct task_struct *thread)
        <br>
                kref_init(&process->ref);
        <br>
              mutex_init(&process->mutex);
        <br>
        +    spin_lock_init(&process->ef_lock);
        <br>
              process->mm = thread->mm;
        <br>
              process->lead_thread = thread->group_leader;
        <br>
              process->n_pdds = 0;
        <br>
        @@ -1904,6 +1906,19 @@ kfd_process_gpuid_from_node(struct
        kfd_process *p, struct kfd_node *node,
        <br>
              return -EINVAL;
        <br>
          }
        <br>
          +static void signal_eviction_fence(struct kfd_process *p)
        <br>
        +{
        <br>
        +    spin_lock(&p->ef_lock);
        <br>
        +    if (!p->ef)
        <br>
        +        goto unlock_out;
        <br>
        +    dma_fence_signal(p->ef);
        <br>
      </blockquote>
      <br>
      This needs to grab the internal lock of the eviction fence; I'm not
      sure that has correct ordering with the newly added ef_lock.
      <br>
    </blockquote>
    <p>Hmm, the only way we could get a circular lock dependency would be
      if we took the p->ef_lock in a fence callback. I don't think that
      ever happens, because even the eviction work runs on a worker thread
      (exactly to avoid such lock dependency issues).</p>
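    <p>To spell out the ordering concern: dma_fence_signal() runs the
      registered callbacks while holding the fence's internal lock, so the
      inversion could only happen with a callback like the purely
      hypothetical one below (the ef_cb member doesn't exist, and nothing
      in KFD registers such a callback):</p>
    <pre>
/* Hypothetical only: this would nest p->ef_lock inside fence->lock, while
 * signal_eviction_fence() nests fence->lock (via dma_fence_signal())
 * inside p->ef_lock -- an AB-BA inversion lockdep would complain about.
 */
static void hypothetical_ef_cb(struct dma_fence *fence, struct dma_fence_cb *cb)
{
        struct kfd_process *p = container_of(cb, struct kfd_process, ef_cb);

        spin_lock(&p->ef_lock);         /* taken with fence->lock already held */
        WRITE_ONCE(p->ef, NULL);
        spin_unlock(&p->ef_lock);
}
</pre>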
    <p>Anyway, I could try to move the dma_fence_signal out of the critical
      section. The lock is only there to ensure that exactly one caller
      signals and frees the fence.</p>
    <pre>
        spin_lock(&p->ef_lock);
        ef = p->ef;
        WRITE_ONCE(p->ef, NULL);
        spin_unlock(&p->ef_lock);
        if (ef) {
                dma_fence_signal(ef);
                dma_fence_put(ef);
        }
</pre>
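    <p>Folded into the helper from the patch, that would look roughly like
      this (just a sketch of the intent, not tested):</p>
    <pre>
static void signal_eviction_fence(struct kfd_process *p)
{
        struct dma_fence *ef;

        spin_lock(&p->ef_lock);
        ef = p->ef;
        WRITE_ONCE(p->ef, NULL);
        spin_unlock(&p->ef_lock);

        if (ef) {
                /* Signal and drop the reference outside the lock, so the
                 * fence's internal lock is never taken under ef_lock.
                 */
                dma_fence_signal(ef);
                dma_fence_put(ef);
        }
}
</pre>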
    <p><br>
    </p>
    <blockquote type="cite" cite="mid:8e81a28a-5d50-44df-b441-6ceb2133c5e9@amd.com">
      <br>
      <blockquote type="cite">+    dma_fence_put(p->ef);
        <br>
        +    WRITE_ONCE(p->ef, NULL);
        <br>
        +
        <br>
        +unlock_out:
        <br>
        +    spin_unlock(&p->ef_lock);
        <br>
        +}
        <br>
        +
        <br>
          static void evict_process_worker(struct work_struct *work)
        <br>
          {
        <br>
              int ret;
        <br>
        @@ -1916,8 +1931,11 @@ static void evict_process_worker(struct
        work_struct *work)
        <br>
               * lifetime of this thread, kfd_process p will be valid
        <br>
               */
        <br>
              p = container_of(dwork, struct kfd_process,
        eviction_work);
        <br>
        -    WARN_ONCE(p->last_eviction_seqno != p->ef->seqno,
        <br>
        -          "Eviction fence mismatch\n");
        <br>
        +    /* If the eviction fence is not valid, we're probably in a
        suspend
        <br>
        +     * or GPU reset cycle. There is nothing to do in this case.
        <br>
        +     */
        <br>
        +    if (!READ_ONCE(p->ef))
        <br>
        +        return;
        <br>
      </blockquote>
      <br>
      This evict process work is high level; I don't think it should have
      any dependency on the GPU reset.
      <br>
    </blockquote>
    <p>Right. This check is not there to avoid issues; it's just a shortcut
      to avoid unnecessary work.</p>
    <p>Regards,<br>
        Felix<br>
    </p>
    <p><br>
    </p>
    <blockquote type="cite" cite="mid:8e81a28a-5d50-44df-b441-6ceb2133c5e9@amd.com">
      <br>
      Regards,
      <br>
      Christian.
      <br>
      <br>
      <blockquote type="cite">        /* Narrow window of overlap
        between restore and evict work
        <br>
               * item is possible. Once
        amdgpu_amdkfd_gpuvm_restore_process_bos
        <br>
        @@ -1930,9 +1948,7 @@ static void evict_process_worker(struct
        work_struct *work)
        <br>
              pr_debug("Started evicting pasid 0x%x\n", p->pasid);
        <br>
              ret = kfd_process_evict_queues(p,
        KFD_QUEUE_EVICTION_TRIGGER_TTM);
        <br>
              if (!ret) {
        <br>
        -        dma_fence_signal(p->ef);
        <br>
        -        dma_fence_put(p->ef);
        <br>
        -        p->ef = NULL;
        <br>
        +        signal_eviction_fence(p);
        <br>
                  queue_delayed_work(kfd_restore_wq,
        &p->restore_work,
        <br>
                          msecs_to_jiffies(PROCESS_RESTORE_TIME_MS));
        <br>
          @@ -1967,9 +1983,19 @@ static void
        restore_process_worker(struct work_struct *work)
        <br>
                p->last_restore_timestamp = get_jiffies_64();
        <br>
              /* VMs may not have been acquired yet during debugging. */
        <br>
        -    if (p->kgd_process_info)
        <br>
        +    if (p->kgd_process_info) {
        <br>
        +        struct dma_fence *ef = NULL;
        <br>
        +
        <br>
                  ret =
        amdgpu_amdkfd_gpuvm_restore_process_bos(p->kgd_process_info,
        <br>
        -                                 &p->ef);
        <br>
        +                                 &ef);
        <br>
        +        if (!ret) {
        <br>
        +            spin_lock(&p->ef_lock);
        <br>
        +            WARN_ONCE(p->ef, "Eviction fence is not NULL");
        <br>
        +            WRITE_ONCE(p->ef, ef);
        <br>
        +            spin_unlock(&p->ef_lock);
        <br>
        +        }
        <br>
        +    }
        <br>
        +
        <br>
              if (ret) {
        <br>
                  pr_debug("Failed to restore BOs of pasid 0x%x, retry
        after %d ms\n",
        <br>
                       p->pasid, PROCESS_BACK_OFF_TIME_MS);
        <br>
        @@ -1994,14 +2020,9 @@ void kfd_suspend_all_processes(void)
        <br>
                WARN(debug_evictions, "Evicting all processes");
        <br>
              hash_for_each_rcu(kfd_processes_table, temp, p,
        kfd_processes) {
        <br>
        -        cancel_delayed_work_sync(&p->eviction_work);
        <br>
        -        flush_delayed_work(&p->restore_work);
        <br>
        -
        <br>
                  if (kfd_process_evict_queues(p,
        KFD_QUEUE_EVICTION_TRIGGER_SUSPEND))
        <br>
                      pr_err("Failed to suspend process 0x%x\n",
        p->pasid);
        <br>
        -        dma_fence_signal(p->ef);
        <br>
        -        dma_fence_put(p->ef);
        <br>
        -        p->ef = NULL;
        <br>
        +        signal_eviction_fence(p);
        <br>
              }
        <br>
              srcu_read_unlock(&kfd_processes_srcu, idx);
        <br>
          }
        <br>
        diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
        b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
        <br>
        index fe937670c927..aa6c34127be9 100644
        <br>
        --- a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
        <br>
        +++ b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
        <br>
        @@ -1869,7 +1869,7 @@ static void svm_range_restore_work(struct
        work_struct *work)
        <br>
              /* If validation failed, reschedule another attempt */
        <br>
              if (evicted_ranges) {
        <br>
                  pr_debug("reschedule to restore svm range\n");
        <br>
        -        schedule_delayed_work(&svms->restore_work,
        <br>
        +        queue_delayed_work(system_freezable_wq,
        &svms->restore_work,
        <br>
                     
        msecs_to_jiffies(AMDGPU_SVM_RANGE_RESTORE_DELAY_MS));
        <br>
                    kfd_smi_event_queue_restore_rescheduled(mm);
        <br>
        @@ -1945,7 +1945,7 @@ svm_range_evict(struct svm_range *prange,
        struct mm_struct *mm,
        <br>
                      pr_debug("failed to quiesce KFD\n");
        <br>
                    pr_debug("schedule to restore svm %p ranges\n",
        svms);
        <br>
        -        schedule_delayed_work(&svms->restore_work,
        <br>
        +        queue_delayed_work(system_freezable_wq,
        &svms->restore_work,
        <br>
                     
        msecs_to_jiffies(AMDGPU_SVM_RANGE_RESTORE_DELAY_MS));
        <br>
              } else {
        <br>
                  unsigned long s, l;
        <br>
      </blockquote>
      <br>
    </blockquote>
  </body>
</html>