<html><head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
  </head>
  <body>
    <p><br>
    </p>
    <div class="moz-cite-prefix">On 2022-07-05 14:48, Felix Kuehling
      wrote:<br>
    </div>
    <blockquote type="cite" cite="mid:2f71d9b5-59aa-45cb-0245-7dc139f108b1@amd.com">On
      2022-06-10 21:03, Philip Yang wrote:
      <br>
      <blockquote type="cite">If destroying/unmapping queue failed,
        application may destroy queue
        <br>
        again, cause below kernel warning backtrace.
        <br>
        <br>
        For outstanding queues, either applications forget to destroy or
        failed
        <br>
        to destroy, kfd_process_notifier_release will remove queue sysfs
        <br>
        objects, kfd_process_wq_release will free queue doorbell.
        <br>
        <br>
          refcount_t: underflow; use-after-free.
        <br>
          WARNING: CPU: 7 PID: 3053 at lib/refcount.c:28
        <br>
           Call Trace:
        <br>
            kobject_put+0xd6/0x1a0
        <br>
            kfd_procfs_del_queue+0x27/0x30 [amdgpu]
        <br>
            pqm_destroy_queue+0xeb/0x240 [amdgpu]
        <br>
            kfd_ioctl_destroy_queue+0x32/0x70 [amdgpu]
        <br>
            kfd_ioctl+0x27d/0x500 [amdgpu]
        <br>
            do_syscall_64+0x35/0x80
        <br>
        <br>
          WARNING: CPU: 2 PID: 3053 at
        drivers/gpu/drm/amd/amdgpu/../amdkfd/kfd_device_queue_manager.c:400
        <br>
           Call Trace:
        <br>
            deallocate_doorbell.isra.0+0x39/0x40 [amdgpu]
        <br>
            destroy_queue_cpsch+0xb3/0x270 [amdgpu]
        <br>
            pqm_destroy_queue+0x108/0x240 [amdgpu]
        <br>
            kfd_ioctl_destroy_queue+0x32/0x70 [amdgpu]
        <br>
            kfd_ioctl+0x27d/0x500 [amdgpu]
        <br>
        <br>
        Signed-off-by: Philip Yang <a class="moz-txt-link-rfc2396E" href="mailto:Philip.Yang@amd.com"><Philip.Yang@amd.com></a>
        <br>
        ---
        <br>
          drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c  | 4
        ++--
        <br>
          drivers/gpu/drm/amd/amdkfd/kfd_process_queue_manager.c | 2 +-
        <br>
          2 files changed, 3 insertions(+), 3 deletions(-)
        <br>
        <br>
        diff --git
        a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
        b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
        <br>
        index e1797657b04c..1c519514ca1a 100644
        <br>
        --- a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
        <br>
        +++ b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
        <br>
        @@ -1876,8 +1876,6 @@ static int destroy_queue_cpsch(struct
        device_queue_manager *dqm,
        <br>
              mqd_mgr = dqm->mqd_mgrs[get_mqd_type_from_queue_type(
        <br>
                      q->properties.type)];
        <br>
          -    deallocate_doorbell(qpd, q);
        <br>
        -
        <br>
              if ((q->properties.type == KFD_QUEUE_TYPE_SDMA) ||
        <br>
                  (q->properties.type == KFD_QUEUE_TYPE_SDMA_XGMI)) {
        <br>
                  deallocate_sdma_queue(dqm, q);
        <br>
        @@ -1898,6 +1896,8 @@ static int destroy_queue_cpsch(struct
        device_queue_manager *dqm,
        <br>
                  }
        <br>
              }
        <br>
          +    deallocate_doorbell(qpd, q);
        <br>
        +
        <br>
      </blockquote>
      <br>
      I'm not sure what this change is meant to do. I don't see an early
      return in this function, so deallocate_doorbell will always be
      executed either way.
      <br>
    </blockquote>
    <p>If app destroy queue deallocate_doorbell, but then unmap queue
      failed, app will destroy queue again when app exit,
      deallocate_doorbell second time will trigger the WARN backtrace.</p>
    <p>As queue_count and q->list is used to alloc ring buf in
      execute_queues_cpsch, so this change causes regression on gfx9, I
      will submit new patch to handle unmap failed case with MES and fix
      those issues.</p>
    <p>Regards,</p>
    <p>Philip<br>
    </p>
    <blockquote type="cite" cite="mid:2f71d9b5-59aa-45cb-0245-7dc139f108b1@amd.com">
      <br>
      <blockquote type="cite">      /*
        <br>
               * Unconditionally decrement this counter, regardless of
        the queue's
        <br>
               * type
        <br>
        diff --git
        a/drivers/gpu/drm/amd/amdkfd/kfd_process_queue_manager.c
        b/drivers/gpu/drm/amd/amdkfd/kfd_process_queue_manager.c
        <br>
        index dc00484ff484..99f2a6412201 100644
        <br>
        --- a/drivers/gpu/drm/amd/amdkfd/kfd_process_queue_manager.c
        <br>
        +++ b/drivers/gpu/drm/amd/amdkfd/kfd_process_queue_manager.c
        <br>
        @@ -419,7 +419,6 @@ int pqm_destroy_queue(struct
        process_queue_manager *pqm, unsigned int qid)
        <br>
              }
        <br>
                if (pqn->q) {
        <br>
        -        kfd_procfs_del_queue(pqn->q);
        <br>
                  dqm = pqn->q->device->dqm;
        <br>
                  retval = dqm->ops.destroy_queue(dqm,
        &pdd->qpd, pqn->q);
        <br>
                  if (retval) {
        <br>
        @@ -439,6 +438,7 @@ int pqm_destroy_queue(struct
        process_queue_manager *pqm, unsigned int qid)
        <br>
                  if (dev->shared_resources.enable_mes)
        <br>
                      amdgpu_amdkfd_free_gtt_mem(dev->adev,
        <br>
                                     pqn->q->gang_ctx_bo);
        <br>
        +        kfd_procfs_del_queue(pqn->q);
        <br>
      </blockquote>
      <br>
      This part makes sense.
      <br>
      <br>
      Regards,
      <br>
        Felix
      <br>
      <br>
      <br>
      <blockquote type="cite">          uninit_queue(pqn->q);
        <br>
              }
        <br>
          </blockquote>
    </blockquote>
  </body>
</html>