<html><head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
  </head>
  <body>
    <p><br>
    </p>
    <div class="moz-cite-prefix">On 2022-07-05 15:12, Felix Kuehling
      wrote:<br>
    </div>
    <blockquote type="cite" cite="mid:f330f4bc-427d-c2de-fe14-745569e9e4de@amd.com">On
      2022-07-05 15:02, philip yang wrote:
      <br>
      <blockquote type="cite">
        <br>
        <br>
        On 2022-07-05 14:48, Felix Kuehling wrote:
        <br>
        <blockquote type="cite">On 2022-06-10 21:03, Philip Yang wrote:
          <br>
          <blockquote type="cite">If destroying/unmapping queue failed,
            application may destroy queue
            <br>
            again, cause below kernel warning backtrace.
            <br>
            <br>
            For outstanding queues, either applications forget to
            destroy or failed
            <br>
            to destroy, kfd_process_notifier_release will remove queue
            sysfs
            <br>
            objects, kfd_process_wq_release will free queue doorbell.
            <br>
            <br>
              refcount_t: underflow; use-after-free.
            <br>
              WARNING: CPU: 7 PID: 3053 at lib/refcount.c:28
            <br>
               Call Trace:
            <br>
                kobject_put+0xd6/0x1a0
            <br>
                kfd_procfs_del_queue+0x27/0x30 [amdgpu]
            <br>
                pqm_destroy_queue+0xeb/0x240 [amdgpu]
            <br>
                kfd_ioctl_destroy_queue+0x32/0x70 [amdgpu]
            <br>
                kfd_ioctl+0x27d/0x500 [amdgpu]
            <br>
                do_syscall_64+0x35/0x80
            <br>
            <br>
              WARNING: CPU: 2 PID: 3053 at
            drivers/gpu/drm/amd/amdgpu/../amdkfd/kfd_device_queue_manager.c:400
            <br>
               Call Trace:
            <br>
                deallocate_doorbell.isra.0+0x39/0x40 [amdgpu]
            <br>
                destroy_queue_cpsch+0xb3/0x270 [amdgpu]
            <br>
                pqm_destroy_queue+0x108/0x240 [amdgpu]
            <br>
                kfd_ioctl_destroy_queue+0x32/0x70 [amdgpu]
            <br>
                kfd_ioctl+0x27d/0x500 [amdgpu]
            <br>
            <br>
            Signed-off-by: Philip Yang <a class="moz-txt-link-rfc2396E" href="mailto:Philip.Yang@amd.com"><Philip.Yang@amd.com></a>
            <br>
            ---
            <br>
              drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c  | 4
            ++--
            <br>
              drivers/gpu/drm/amd/amdkfd/kfd_process_queue_manager.c | 2
            +-
            <br>
              2 files changed, 3 insertions(+), 3 deletions(-)
            <br>
            <br>
            diff --git
            a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
            b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
            <br>
            index e1797657b04c..1c519514ca1a 100644
            <br>
            --- a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
            <br>
            +++ b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
            <br>
            @@ -1876,8 +1876,6 @@ static int destroy_queue_cpsch(struct
            device_queue_manager *dqm,
            <br>
                  mqd_mgr =
            dqm->mqd_mgrs[get_mqd_type_from_queue_type(
            <br>
                          q->properties.type)];
            <br>
              -    deallocate_doorbell(qpd, q);
            <br>
            -
            <br>
                  if ((q->properties.type == KFD_QUEUE_TYPE_SDMA) ||
            <br>
                      (q->properties.type ==
            KFD_QUEUE_TYPE_SDMA_XGMI)) {
            <br>
                      deallocate_sdma_queue(dqm, q);
            <br>
            @@ -1898,6 +1896,8 @@ static int destroy_queue_cpsch(struct
            device_queue_manager *dqm,
            <br>
                      }
            <br>
                  }
            <br>
              +    deallocate_doorbell(qpd, q);
            <br>
            +
            <br>
          </blockquote>
          <br>
          I'm not sure what this change is meant to do. I don't see an
          early return in this function, so deallocate_doorbell will
          always be executed either way.
          <br>
        </blockquote>
        <br>
        If app destroy queue deallocate_doorbell, but then unmap queue
        failed, app will destroy queue again when app exit,
        deallocate_doorbell second time will trigger the WARN backtrace.
        <br>
        <br>
      </blockquote>
      I get that. But even with your change, deallocate_doorbell will
      still be executed if the queue unmap fails because there is no
      early return or "goto error".
      <br>
    </blockquote>
    Yes, that was fixed in patch v2. [PATCH v2 2/2] drm/amdkfd: Free
    queue after unmap queue success<br>
    <blockquote type="cite" cite="mid:f330f4bc-427d-c2de-fe14-745569e9e4de@amd.com">
      <br>
      <br>
      <blockquote type="cite">As queue_count and q->list is used to
        alloc ring buf in execute_queues_cpsch, so this change causes
        regression on gfx9, I will submit new patch to handle unmap
        failed case with MES and fix those issues.
        <br>
        <br>
      </blockquote>
      I think the intention of the code was to treat HWS in a way that
      does not prevent queue destruction. Basically, there is not point
      reporting HWS errors to the application because the application
      cannot do anything about them anyway. Eventually it will cause a
      GPU reset and the application will be killed. If you look at how
      -ETIME is handled in pqm_destroy_queue, you see that it finishes
      the job regardless.
      <br>
      <br>
      However, this has always been somewhat inconsistent. With MES
      maybe it's getting worse because there may be other error
      conditions we didn't hit before. Are you seeing those backtraces
      on a GPU with MES by any chance?
      <br>
      <br>
    </blockquote>
    <p>Thanks for the info, MES returns other error code than -ETIME, so
      pqm_destroy_queue don't finish the destroy procedure and return
      error code to user space. Seems we just need change MES error
      handle the same way as old HWS, to finish the job regardless.<br>
    </p>
    <p>Thanks,</p>
    <p>Philip<br>
    </p>
    <blockquote type="cite" cite="mid:f330f4bc-427d-c2de-fe14-745569e9e4de@amd.com">Regards,
      <br>
        Felix
      <br>
      <br>
      <br>
      <blockquote type="cite">Regards,
        <br>
        <br>
        Philip
        <br>
        <br>
        <blockquote type="cite">
          <br>
          <blockquote type="cite">      /*
            <br>
                   * Unconditionally decrement this counter, regardless
            of the queue's
            <br>
                   * type
            <br>
            diff --git
            a/drivers/gpu/drm/amd/amdkfd/kfd_process_queue_manager.c
            b/drivers/gpu/drm/amd/amdkfd/kfd_process_queue_manager.c
            <br>
            index dc00484ff484..99f2a6412201 100644
            <br>
            --- a/drivers/gpu/drm/amd/amdkfd/kfd_process_queue_manager.c
            <br>
            +++ b/drivers/gpu/drm/amd/amdkfd/kfd_process_queue_manager.c
            <br>
            @@ -419,7 +419,6 @@ int pqm_destroy_queue(struct
            process_queue_manager *pqm, unsigned int qid)
            <br>
                  }
            <br>
                    if (pqn->q) {
            <br>
            -        kfd_procfs_del_queue(pqn->q);
            <br>
                      dqm = pqn->q->device->dqm;
            <br>
                      retval = dqm->ops.destroy_queue(dqm,
            &pdd->qpd, pqn->q);
            <br>
                      if (retval) {
            <br>
            @@ -439,6 +438,7 @@ int pqm_destroy_queue(struct
            process_queue_manager *pqm, unsigned int qid)
            <br>
                      if (dev->shared_resources.enable_mes)
            <br>
                          amdgpu_amdkfd_free_gtt_mem(dev->adev,
            <br>
                                         pqn->q->gang_ctx_bo);
            <br>
            +        kfd_procfs_del_queue(pqn->q);
            <br>
          </blockquote>
          <br>
          This part makes sense.
          <br>
          <br>
          Regards,
          <br>
            Felix
          <br>
          <br>
          <br>
          <blockquote type="cite">          uninit_queue(pqn->q);
            <br>
                  }
            <br>
          </blockquote>
        </blockquote>
      </blockquote>
    </blockquote>
  </body>
</html>