<html><head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
  </head>
  <body>
    <p><br>
    </p>
    <div class="moz-cite-prefix">On 2022-01-10 7:10 p.m., Felix Kuehling
      wrote:<br>
    </div>
    <blockquote type="cite" cite="mid:116c8cf4-57c2-f3a1-f4b9-5f0ef4526967@amd.com">On
      2022-01-05 10:22 a.m., philip yang wrote:
      <br>
      <blockquote type="cite">
        <br>
        <br>
        On 2021-12-22 7:37 p.m., Rajneesh Bhardwaj wrote:
        <br>
        <blockquote type="cite">Recoverable page faults are represented
          by the xnack mode setting inside
          <br>
          a kfd process and are used to represent the device page
          faults. For CR,
          <br>
          we don't consider negative values which are typically used for
          querying
          <br>
          the current xnack mode without modifying it.
          <br>
          <br>
          Signed-off-by: Rajneesh
          Bhardwaj<a class="moz-txt-link-rfc2396E" href="mailto:rajneesh.bhardwaj@amd.com"><rajneesh.bhardwaj@amd.com></a>
          <br>
          ---
          <br>
            drivers/gpu/drm/amd/amdkfd/kfd_chardev.c | 15
          +++++++++++++++
          <br>
            drivers/gpu/drm/amd/amdkfd/kfd_priv.h    |  1 +
          <br>
            2 files changed, 16 insertions(+)
          <br>
          <br>
          diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c
          b/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c
          <br>
          index 178b0ccfb286..446eb9310915 100644
          <br>
          --- a/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c
          <br>
          +++ b/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c
          <br>
          @@ -1845,6 +1845,11 @@ static int
          criu_checkpoint_process(struct kfd_process *p,
          <br>
                memset(&process_priv, 0, sizeof(process_priv));
          <br>
                  process_priv.version = KFD_CRIU_PRIV_VERSION;
          <br>
          +    /* For CR, we don't consider negative xnack mode which is
          used for
          <br>
          +     * querying without changing it, here 0 simply means
          disabled and 1
          <br>
          +     * means enabled so retry for finding a valid PTE.
          <br>
          +     */
          <br>
        </blockquote>
        Negative value to query xnack mode is for
        kfd_ioctl_set_xnack_mode user space ioctl interface, which is
        not used by CRIU, I think this comment is misleading,
        <br>
        <blockquote type="cite">+    process_priv.xnack_mode =
          p->xnack_enabled ? 1 : 0;
          <br>
        </blockquote>
        change to process_priv.xnack_enabled
        <br>
        <blockquote type="cite">        ret =
          copy_to_user(user_priv_data + *priv_offset,
          <br>
                            &process_priv, sizeof(process_priv));
          <br>
          @@ -2231,6 +2236,16 @@ static int criu_restore_process(struct
          kfd_process *p,
          <br>
                    return -EINVAL;
          <br>
                }
          <br>
            +    pr_debug("Setting XNACK mode\n");
          <br>
          +    if (process_priv.xnack_mode &&
          !kfd_process_xnack_mode(p, true)) {
          <br>
          +        pr_err("xnack mode cannot be set\n");
          <br>
          +        ret = -EPERM;
          <br>
          +        goto exit;
          <br>
          +    } else {
          <br>
        </blockquote>
        <br>
        On GFXv9 GPUs except Aldebaran, this means the process
        checkpointed is xnack off, it can restore and resume on GPU with
        xnack on, then shader will continue running successfully, but
        driver is not guaranteed to map svm ranges on GPU all the time,
        if retry fault happens, the shader will not recover. Maybe
        change to:
        <br>
        <br>
        If (KFD_GC_VERSION(dev) != IP_VERSION(9, 4, 2) {
        <br>
        <br>
      </blockquote>
      The code here was correct. The xnack mode applies to the whole
      process, not just one GPU. The logic for checking the capabilities
      of all GPUs is already in kfd_process_xnack_mode. If XNACK cannot
      be supported by all GPUs, restoring a non-0 XNACK mode will fail.
      <br>
      <br>
      Any GPU can run in XNACK-disabled mode. So we don't need any
      limitations for process_priv.xnack_enabled == 0.
      <br>
    </blockquote>
    <p>Yes, the code was correct, for case all GPUs dev->noretry=0
      (xnack on), process->xnack_enabled=0, we unmap the queues while
      migrating, guarantee to map svm ranges on GPUs then resume queues.
      If retry fault happens, we don't recover the fault, report the
      fault to user space. That is all correct.</p>
    <p>Regards,</p>
    <p>Philip<br>
    </p>
    <blockquote type="cite" cite="mid:116c8cf4-57c2-f3a1-f4b9-5f0ef4526967@amd.com">
      <br>
      Regards,
      <br>
        Felix
      <br>
      <br>
      <br>
      <blockquote type="cite">    if (process_priv.xnack_enabled !=
        kfd_process_xnack_mode(p, true)) {
        <br>
        <br>
                     pr_err("xnack mode cannot be set\n");
        <br>
        <br>
                     ret = -EPERM;
        <br>
        <br>
                     goto exit;
        <br>
        <br>
            }
        <br>
        <br>
        }
        <br>
        <br>
        pr_debug("set xnack mode: %d\n", process_priv.xnack_enabled);
        <br>
        <br>
        p->xnack_enabled = process_priv.xnack_enabled;
        <br>
        <br>
        <br>
        <blockquote type="cite">+        pr_debug("set xnack mode:
          %d\n", process_priv.xnack_mode);
          <br>
          +        p->xnack_enabled = process_priv.xnack_mode;
          <br>
          +    }
          <br>
          +
          <br>
            exit:
          <br>
                return ret;
          <br>
            }
          <br>
          diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
          b/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
          <br>
          index 855c162b85ea..d72dda84c18c 100644
          <br>
          --- a/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
          <br>
          +++ b/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
          <br>
          @@ -1057,6 +1057,7 @@ void kfd_process_set_trap_handler(struct
          qcm_process_device *qpd,
          <br>
              struct kfd_criu_process_priv_data {
          <br>
                uint32_t version;
          <br>
          +    uint32_t xnack_mode;
          <br>
        </blockquote>
        <br>
        bool xnack_enabled;
        <br>
        <br>
        Regards,
        <br>
        <br>
        Philip
        <br>
        <br>
        <blockquote type="cite">  };
          <br>
              struct kfd_criu_device_priv_data {
          <br>
        </blockquote>
      </blockquote>
    </blockquote>
  </body>
</html>