<html><head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
  </head>
  <body>
    <p><br>
    </p>
    <div class="moz-cite-prefix">On 2024-10-21 04:12, Christian König
      wrote:<br>
    </div>
    <blockquote type="cite" cite="mid:efb4b082-d518-46b6-adda-776458772e1c@gmail.com">Am
      18.10.24 um 23:59 schrieb Philip Yang:
      <br>
      <blockquote type="cite">On 2024-10-18 14:28, Felix Kuehling wrote:
        <br>
        <blockquote type="cite">
          <br>
          On 2024-10-17 04:34, Victor Zhao wrote:
          <br>
          <blockquote type="cite">make sure KFD_FENCE_INIT write to
            fence_addr before pm_send_query_status
            <br>
            called, to avoid qcm fence timeout caused by incorrect
            ordering.
            <br>
            <br>
            Signed-off-by: Victor Zhao <a class="moz-txt-link-rfc2396E" href="mailto:Victor.Zhao@amd.com"><Victor.Zhao@amd.com></a>
            <br>
            ---
            <br>
              drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c | 1
            +
            <br>
              drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.h | 2
            +-
            <br>
              2 files changed, 2 insertions(+), 1 deletion(-)
            <br>
            <br>
            diff --git
            a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
            b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
            <br>
            index b2b16a812e73..d9264a353775 100644
            <br>
            --- a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
            <br>
            +++ b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
            <br>
            @@ -2254,6 +2254,7 @@ static int unmap_queues_cpsch(struct
            device_queue_manager *dqm,
            <br>
                      goto out;
            <br>
                    *dqm->fence_addr = KFD_FENCE_INIT;
            <br>
            +    mb();
            <br>
                  pm_send_query_status(&dqm->packet_mgr,
            dqm->fence_gpu_addr,
            <br>
                              KFD_FENCE_COMPLETED);
            <br>
                  /* should be timed out */
            <br>
            diff --git
            a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.h
            b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.h
            <br>
            index 09ab36f8e8c6..bddb169bb301 100644
            <br>
            --- a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.h
            <br>
            +++ b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.h
            <br>
            @@ -260,7 +260,7 @@ struct device_queue_manager {
            <br>
                  uint16_t        vmid_pasid[VMID_NUM];
            <br>
                  uint64_t        pipelines_addr;
            <br>
                  uint64_t        fence_gpu_addr;
            <br>
            -    uint64_t        *fence_addr;
            <br>
            +    volatile uint64_t    *fence_addr;
            <br>
          </blockquote>
          <br>
          [+Christian]
          <br>
          <br>
          Is the volatile keyword really needed here? I just saw other
          patches removing volatile in some places because it's not
          sufficient, and not needed if you use memory barriers
          correctly.
          <br>
        </blockquote>
        <br>
        After reading kernel memory barrier document and below link, I
        think we need both volatile type and memory barrier, to
        guarantee F/W get the updated fence value. This fixes an CP hang
        issue on SRIOV.
        <br>
        <br>
<a class="moz-txt-link-freetext" href="https://stackoverflow.com/questions/75750110/volatile-vs-memory-barriers#:~:text=volatile%20will%20make%20sure%20that,not%20reorder%20writes%20or%20reads">https://stackoverflow.com/questions/75750110/volatile-vs-memory-barriers#:~:text=volatile%20will%20make%20sure%20that,not%20reorder%20writes%20or%20reads</a>.
        <br>
        <br>
      </blockquote>
      <br>
      No, that isn't correct. Using volatile is considered harmful and
      almost never correct, see here
      <a class="moz-txt-link-freetext" href="https://docs.kernel.org/process/volatile-considered-harmful.html">https://docs.kernel.org/process/volatile-considered-harmful.html</a>
      <br>
      <br>
      Placing appropriate memory barriers must be sufficient or
      otherwise there is a rather bad platform or compiler bug lurking
      around.
      <br>
    </blockquote>
    <p>Yes, Victor confirmed that memory barrier fixes the issue, will
      send new patch to remove the volatile type.</p>
    <p>Regards,</p>
    <p>Philip<br>
    </p>
    <blockquote type="cite" cite="mid:efb4b082-d518-46b6-adda-776458772e1c@gmail.com">
      <br>
      Regards,
      <br>
      Christian.
      <br>
      <br>
      <blockquote type="cite">Regards,
        <br>
        <br>
        Philip
        <br>
        <br>
        <blockquote type="cite">
          <br>
          Regards,
          <br>
            Felix
          <br>
          <br>
          <br>
          <blockquote type="cite">      struct kfd_mem_obj   
            *fence_mem;
            <br>
                  bool            active_runlist;
            <br>
                  int            sched_policy;
            <br>
          </blockquote>
        </blockquote>
      </blockquote>
      <br>
    </blockquote>
  </body>
</html>