<html>
  <head>
    <meta content="text/html; charset=utf-8" http-equiv="Content-Type">
  </head>
  <body text="#000000" bgcolor="#FFFFFF">
    <br>
    <br>
    <div class="moz-cite-prefix">On 2017年05月17日 16:40, Christian König
      wrote:<br>
    </div>
    <blockquote
      cite="mid:7a302ebe-1de1-734f-fb21-aadcc7904d37@vodafone.de"
      type="cite">Am 17.05.2017 um 10:01 schrieb Michel Dänzer:
      <br>
      <blockquote type="cite">On 17/05/17 04:13 PM, zhoucm1 wrote:
        <br>
        <blockquote type="cite">On 2017年05月17日 14:57, Michel Dänzer
          wrote:
          <br>
          <blockquote type="cite">On 17/05/17 01:28 PM, zhoucm1 wrote:
            <br>
            <blockquote type="cite">On 2017年05月17日 11:15, Michel Dänzer
              wrote:
              <br>
              <blockquote type="cite">On 17/05/17 12:04 PM, zhoucm1
                wrote:
                <br>
                <blockquote type="cite">On 2017年05月17日 09:18, Michel
                  Dänzer wrote:
                  <br>
                  <blockquote type="cite">On 16/05/17 06:25 PM, Chunming
                    Zhou wrote:
                    <br>
                    <blockquote type="cite">Change-Id:
                      I8eb6d7f558da05510e429d3bf1d48c8cec6c1977
                      <br>
                      Signed-off-by: Chunming Zhou
                      <a class="moz-txt-link-rfc2396E" href="mailto:David1.Zhou@amd.com"><David1.Zhou@amd.com></a>
                      <br>
                      <br>
                      diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
                      <br>
                      index bca1fb5..f3e7525 100644
                      <br>
                      --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
                      <br>
                      +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
                      <br>
                      @@ -2547,6 +2547,9 @@ int amdgpu_vm_ioctl(struct drm_device *dev, void *data, struct drm_file *filp)
                      <br>
                           case AMDGPU_VM_OP_UNRESERVE_VMID:
                      <br>
                               amdgpu_vm_free_reserved_vmid(adev, &fpriv->vm, AMDGPU_GFXHUB);
                      <br>
                               break;
                      <br>
                      +    case AMDGPU_VM_OP_RESET:
                      <br>
                      +        fpriv->vram_lost_counter = atomic_read(&adev->vram_lost_counter);
                      <br>
                      +        break;
                      <br>
                    </blockquote>
                    How do you envision the UMDs using this? I can
                    mostly think of them
                    <br>
                    calling this ioctl when a context is created or
                    destroyed. But that
                    <br>
                    would also allow any other remaining contexts using
                    the same DRM file
                    <br>
                    descriptor to use all ioctls again. So, I think
                    there needs to be a
                    <br>
                    vram_lost_counter in struct amdgpu_ctx instead of in
                    struct
                    <br>
                    amdgpu_fpriv.
                    <br>
                  </blockquote>
                  struct amdgpu_fpriv is the proper place for
                  vram_lost_counter, especially
                  <br>
                  for the ioctl return value.
                  <br>
                  If you need to reset contexts one by one, we can mark all
                  contexts of that
                  <br>
                  VM, and then userspace can reset them.
                  <br>
                </blockquote>
                I'm not following. With vram_lost_counter in
                amdgpu_fpriv, if any
                <br>
                context calls this ioctl, all other contexts using the
                same file
                <br>
                descriptor will also be considered safe again, right?
                <br>
              </blockquote>
              Yes, but it really depends on the userspace requirement. If
              you need to
              <br>
              reset contexts one by one, we can mark all contexts of that
              VM as guilty,
              <br>
              and then userspace can reset one context at a time.
              <br>
            </blockquote>
            Still not sure what you mean by that.
            <br>
            <br>
            E.g. what do you mean by "guilty"? I thought that refers to
            the context
            <br>
            which caused a hang. But it seems like you're using it to
            refer to any
            <br>
            context which hasn't reacted yet to VRAM contents being
            lost.
            <br>
          </blockquote>
          When VRAM is lost, we treat all contexts as needing a reset.
          <br>
        </blockquote>
        Essentially, your patches only track VRAM contents being lost
        per file
        <br>
        descriptor, not per context. I'm not sure (rather skeptical)
        that this
        <br>
        is suitable for OpenGL UMDs, since state is usually tracked per
        context.
        <br>
        Marek / Nicolai?
        <br>
      </blockquote>
      <br>
      Oh, yeah that's a good point.
      <br>
      <br>
      The problem with tracking it per context is that Vulkan also wants
      the ENODEV on amdgpu_gem_va_ioctl() and amdgpu_info_ioctl(), which
      are context-less.
      <br>
      <br>
      But thinking more about this, blocking those two doesn't make much
      sense. The VM content can be restored, and why should we disallow
      reading GPU info?
      <br>
    </blockquote>
    I can re-paste the Vulkan APIs requiring ENODEV:<br>
    "
    <p class="MsoNormal">The Vulkan APIs listed below could return
      VK_ERROR_DEVICE_LOST according to the spec. <o:p></o:p></p>
    <p class="MsoNormal">I tries to provide a list of u/k interfaces
      that could be called for each vk API.<o:p></o:p></p>
    <p class="MsoNormal"><o:p> </o:p></p>
    <p class="MsoNormal">vkCreateDevice<o:p></o:p></p>
    <p class="MsoListParagraph" style="text-indent:-.25in;mso-list:l1
      level1 lfo2"><span style="mso-list:Ignore">-<span
          style="font:7.0pt "Times New Roman"">          </span></span><span
        style="color:#1F497D">amdgpu_device_initialize.<o:p></o:p></span></p>
    <p class="MsoListParagraph" style="text-indent:-.25in;mso-list:l1
      level1 lfo2"><span style="mso-list:Ignore">-<span
          style="font:7.0pt "Times New Roman"">          </span></span><span
        style="color:#1F497D">amdgpu_query_gpu_info<o:p></o:p></span></p>
    <p class="MsoNormal"><o:p> </o:p></p>
    <p class="MsoNormal">vkQueueSubmit<o:p></o:p></p>
    <p class="MsoListParagraph" style="text-indent:-.25in;mso-list:l1
      level1 lfo2"><span style="mso-list:Ignore">-<span
          style="font:7.0pt "Times New Roman"">          </span></span>amdgpu_cs_submit<o:p></o:p></p>
    <p class="MsoNormal"><o:p> </o:p></p>
    <p class="MsoNormal">vkWaitForFences<o:p></o:p></p>
    <p class="MsoNormal">                amdgpu_cs_wait_fences<span
        style="color:#1F497D"><o:p></o:p></span></p>
    <p class="MsoNormal"><o:p> </o:p></p>
    <p class="MsoNormal">vkGetEventStatus<o:p></o:p></p>
    <p class="MsoNormal">vkQueueWaitIdle<o:p></o:p></p>
    <p class="MsoNormal">vkDeviceWaitIdle<o:p></o:p></p>
    <p class="MsoNormal">vkGetQueryPoolResults<b><o:p></o:p></b></p>
    <p class="MsoNormal">                amdgpu_cs_query_Fence_status<o:p></o:p></p>
    <p class="MsoNormal"><o:p> </o:p></p>
    <p class="MsoNormal">vkQueueBindSparse<b><o:p></o:p></b></p>
    <p class="MsoNormal">                amdgpu_bo_va_op<o:p></o:p></p>
    <p class="MsoNormal">                amdgpu_bo_va_op_raw<o:p></o:p></p>
    <p class="MsoNormal"><o:p> </o:p></p>
    <p class="MsoNormal">vkCreateSwapchainKHR<b><o:p></o:p></b></p>
    <p class="MsoNormal">vkAcquireNextImageKHR<b><o:p></o:p></b></p>
    <p class="MsoNormal">vkQueuePresentKHR<o:p></o:p></p>
    <p class="MsoNormal">                Not related with u/k interface.<b><o:p></o:p></b></p>
    <p class="MsoNormal"><b><o:p> </o:p></b></p>
    Besides those listed above, I think
    amdgpu_cs_signal_sem/amdgpu_cs_wait_sem should respond to GPU reset
    as well."<br>
    <blockquote
      cite="mid:7a302ebe-1de1-734f-fb21-aadcc7904d37@vodafone.de"
      type="cite">
      <br>
      Christian.
      <br>
      <br>
    </blockquote>
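    <br>
    And on the UMD side, a rough sketch of how a driver could acknowledge
    the VRAM loss once it has re-created its state; this assumes the
    AMDGPU_VM_OP_RESET define from this series, and umd_ack_vram_lost() is
    just an illustrative name, not an existing libdrm_amdgpu entry point:<br>
    <pre>
#include <string.h>
#include <xf86drm.h>
#include <amdgpu_drm.h>

/* After VK_ERROR_DEVICE_LOST handling: tell the kernel this fd has dealt
 * with the lost VRAM contents, so further ioctls stop returning -ENODEV.
 * Illustrative helper only. */
static int umd_ack_vram_lost(int fd)
{
        union drm_amdgpu_vm vm;

        memset(&vm, 0, sizeof(vm));
        vm.in.op = AMDGPU_VM_OP_RESET; /* op added by this patch series */

        return drmCommandWriteRead(fd, DRM_AMDGPU_VM, &vm, sizeof(vm));
}
    </pre>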
    <br>
  </body>
</html>