<html><head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
  </head>
  <body text="#000000" bgcolor="#FFFFFF">
    <div class="moz-cite-prefix">Hi Jon,<br>
      <br>
      <blockquote type="cite">Also cwsr tests fail on Vega20 with or
        without the revert with the same RAS error.</blockquote>
      <br>
      That sounds like the system/setup has a more general problem.<br>
      <br>
      Could it be that we are seeing RAS errors because there really is
      some hardware failure, but with the MM path we don't trigger a RAS
      interrupt?<br>
      <br>
      Thanks,<br>
      Christian.<br>
      <br>
      On 14.04.20 at 22:30, Kim, Jonathan wrote:<br>
    </div>
    <blockquote type="cite" cite="mid:MN2PR12MB4518A3D9746674DA688AD34885DA0@MN2PR12MB4518.namprd12.prod.outlook.com">
      
      <style><!--
/* Font Definitions */
@font-face
        {font-family:"Cambria Math";
        panose-1:2 4 5 3 5 4 6 3 2 4;}
@font-face
        {font-family:Calibri;
        panose-1:2 15 5 2 2 2 4 3 2 4;}
/* Style Definitions */
p.MsoNormal, li.MsoNormal, div.MsoNormal
        {margin:0in;
        margin-bottom:.0001pt;
        font-size:11.0pt;
        font-family:"Calibri",sans-serif;}
a:link, span.MsoHyperlink
        {mso-style-priority:99;
        color:blue;
        text-decoration:underline;}
p.msipheader4d0fcdd7, li.msipheader4d0fcdd7, div.msipheader4d0fcdd7
        {mso-style-name:msipheader4d0fcdd7;
        mso-margin-top-alt:auto;
        margin-right:0in;
        mso-margin-bottom-alt:auto;
        margin-left:0in;
        font-size:11.0pt;
        font-family:"Calibri",sans-serif;}
span.EmailStyle20
        {mso-style-type:personal-compose;
        font-family:"Arial",sans-serif;
        color:#0078D7;}
.MsoChpDefault
        {mso-style-type:export-only;
        font-size:10.0pt;}
@page WordSection1
        {size:8.5in 11.0in;
        margin:1.0in 1.0in 1.0in 1.0in;}
div.WordSection1
        {page:WordSection1;}
--></style>
      <div class="WordSection1">
        <p class="msipheader4d0fcdd7" style="margin:0in;margin-bottom:.0001pt"><span style="font-size:10.0pt;font-family:Arial,sans-serif;color:#0078D7">[AMD
            Official Use Only - Internal Distribution Only]</span><o:p></o:p></p>
        <p class="MsoNormal"><o:p> </o:p></p>
        <p class="MsoNormal">If we’re passing the test with the revert,
          then the only difference is that we no longer invalidate HDP
          and copy to the host in amdgpu_device_vram_access, since the
          function is still called from the TTM access_memory path with
          the BAR.<o:p></o:p></p>
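        <p class="MsoNormal">The split between the two paths can be sketched in userspace C. This is an illustration only, not the driver code: fake_dev, fake_vram_read, and the two counters are hypothetical names, VRAM is a plain array, and the HDP invalidate is reduced to a comment. It assumes dword-aligned offsets and sizes. The request is clamped to the CPU-visible window and bulk-copied (the BAR path), and whatever lies beyond the window is read one dword at a time (the MM path).</p>

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for the device: VRAM is a plain array and only the
 * first visible_size bytes are reachable through the simulated BAR. */
#define FAKE_VRAM_BYTES 64u

struct fake_dev {
	uint32_t vram[FAKE_VRAM_BYTES / 4];
	uint64_t visible_size;   /* bytes of VRAM reachable through the BAR */
	unsigned int bar_bytes;  /* bytes served by the bulk BAR copy */
	unsigned int mm_words;   /* dwords served by the indexed MM path */
};

/* Read size bytes starting at byte offset pos into buf.
 * pos and size must be dword aligned and stay inside the fake VRAM. */
void fake_vram_read(struct fake_dev *dev, uint64_t pos,
		    uint32_t *buf, size_t size)
{
	uint64_t end = pos + size;
	uint64_t last;

	assert(pos % 4 == 0 && size % 4 == 0);
	assert(end <= FAKE_VRAM_BYTES);

	/* BAR path: clamp the request to the CPU-visible window. The reverted
	 * driver code invalidated HDP before this copy; that step is only a
	 * comment here because there is no hardware behind the array. */
	last = end < dev->visible_size ? end : dev->visible_size;
	if (last > pos) {
		size_t count = (size_t)(last - pos);

		memcpy(buf, (const uint8_t *)dev->vram + pos, count);
		dev->bar_bytes += (unsigned int)count;
		if (count == size)
			return;
		pos += count;
		buf += count / 4;   /* buf is a dword pointer */
	}

	/* MM path: one dword per iteration, standing in for the indexed
	 * MMIO register access used for VRAM outside the visible window. */
	for (; pos < end; pos += 4) {
		*buf++ = dev->vram[pos / 4];
		dev->mm_words++;
	}
}
```

        <p class="MsoNormal">With a 32-byte visible window, a 16-byte read at offset 24 takes 8 bytes from the bulk copy and 2 dwords from the indexed loop. Setting visible_size to 0 models the revert: all 4 dwords then go through the loop.</p>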
        <p class="MsoNormal"><o:p> </o:p></p>
        <p class="MsoNormal">Also cwsr tests fail on Vega20 with or
          without the revert with the same RAS error.<o:p></o:p></p>
        <p class="MsoNormal"><o:p> </o:p></p>
        <p class="MsoNormal">Thanks,<o:p></o:p></p>
        <p class="MsoNormal"><o:p> </o:p></p>
        <p class="MsoNormal">Jon<o:p></o:p></p>
        <p class="MsoNormal"><o:p> </o:p></p>
        <div>
          <div style="border:none;border-top:solid #E1E1E1
            1.0pt;padding:3.0pt 0in 0in 0in">
            <p class="MsoNormal"><b>From:</b> Kuehling, Felix
              <a class="moz-txt-link-rfc2396E" href="mailto:Felix.Kuehling@amd.com">&lt;Felix.Kuehling@amd.com&gt;</a> <br>
              <b>Sent:</b> Tuesday, April 14, 2020 2:32 PM<br>
              <b>To:</b> Kim, Jonathan <a class="moz-txt-link-rfc2396E" href="mailto:Jonathan.Kim@amd.com">&lt;Jonathan.Kim@amd.com&gt;</a>;
              Koenig, Christian <a class="moz-txt-link-rfc2396E" href="mailto:Christian.Koenig@amd.com">&lt;Christian.Koenig@amd.com&gt;</a>;
              Deucher, Alexander <a class="moz-txt-link-rfc2396E" href="mailto:Alexander.Deucher@amd.com">&lt;Alexander.Deucher@amd.com&gt;</a><br>
              <b>Cc:</b> Russell, Kent <a class="moz-txt-link-rfc2396E" href="mailto:Kent.Russell@amd.com">&lt;Kent.Russell@amd.com&gt;</a>;
              <a class="moz-txt-link-abbreviated" href="mailto:amd-gfx@lists.freedesktop.org">amd-gfx@lists.freedesktop.org</a><br>
              <b>Subject:</b> Re: [PATCH] Revert "drm/amdgpu: use the
              BAR if possible in amdgpu_device_vram_access v2"<o:p></o:p></p>
          </div>
        </div>
        <p class="MsoNormal"><o:p> </o:p></p>
        <p>I wouldn't call it premature. Reverting is the usual practice
          when there is a serious regression that isn't fully understood
          or root-caused. As far as I can tell, the problem has been
          reproduced on multiple systems and different GPUs, and it
          clearly regressed with Christian's commit. I think that
          justifies reverting it for now.<o:p></o:p></p>
        <p>I agree with Christian that a general HDP memory access
          problem causing RAS errors would potentially cause problems in
          other tests as well. For example, common operations such as
          GART table updates, GPUVM page table updates, and PCIe
          peer-to-peer accesses in ROCm applications use HDP. But we're
          not seeing obvious problems from those. So we need to
          understand what's special about this test. I asked questions
          to that effect on our other email thread.<o:p></o:p></p>
        <p>Regards,<br>
            Felix<o:p></o:p></p>
        <div>
          <p class="MsoNormal">On 2020-04-14 at 10:51 a.m., Kim,
            Jonathan wrote:<o:p></o:p></p>
        </div>
        <blockquote style="margin-top:5.0pt;margin-bottom:5.0pt">
          <p class="msipheader4d0fcdd7" style="margin:0in;margin-bottom:.0001pt"><span style="font-size:10.0pt;font-family:Arial,sans-serif;color:#0078D7">[AMD
              Official Use Only - Internal Distribution Only]</span><o:p></o:p></p>
          <p class="MsoNormal"> <o:p></o:p></p>
          <p class="MsoNormal">I think it’s premature to push this
            revert.<o:p></o:p></p>
          <p class="MsoNormal"> <o:p></o:p></p>
          <p class="MsoNormal">With more testing, I’m getting failures
            from different tests or sometimes none at all on my machine.<o:p></o:p></p>
          <p class="MsoNormal"> <o:p></o:p></p>
          <p class="MsoNormal">Kent, let’s continue the discussion on
            the original thread.<o:p></o:p></p>
          <p class="MsoNormal"> <o:p></o:p></p>
          <p class="MsoNormal">Thanks,<o:p></o:p></p>
          <p class="MsoNormal"> <o:p></o:p></p>
          <p class="MsoNormal">Jon<o:p></o:p></p>
          <p class="MsoNormal"> <o:p></o:p></p>
          <div>
            <div style="border:none;border-top:solid #E1E1E1
              1.0pt;padding:3.0pt 0in 0in 0in">
              <p class="MsoNormal"><b>From:</b> Koenig, Christian <a href="mailto:Christian.Koenig@amd.com" moz-do-not-send="true">
                  &lt;Christian.Koenig@amd.com&gt;</a> <br>
                <b>Sent:</b> Tuesday, April 14, 2020 10:47 AM<br>
                <b>To:</b> Deucher, Alexander <a href="mailto:Alexander.Deucher@amd.com" moz-do-not-send="true">&lt;Alexander.Deucher@amd.com&gt;</a><br>
                <b>Cc:</b> Russell, Kent <a href="mailto:Kent.Russell@amd.com" moz-do-not-send="true">&lt;Kent.Russell@amd.com&gt;</a>;
                <a href="mailto:amd-gfx@lists.freedesktop.org" moz-do-not-send="true">amd-gfx@lists.freedesktop.org</a>;
                Kuehling, Felix
                <a href="mailto:Felix.Kuehling@amd.com" moz-do-not-send="true">&lt;Felix.Kuehling@amd.com&gt;</a>;
                Kim, Jonathan
                <a href="mailto:Jonathan.Kim@amd.com" moz-do-not-send="true">&lt;Jonathan.Kim@amd.com&gt;</a><br>
                <b>Subject:</b> Re: [PATCH] Revert "drm/amdgpu: use the
                BAR if possible in amdgpu_device_vram_access v2"<o:p></o:p></p>
            </div>
          </div>
          <p class="MsoNormal"> <o:p></o:p></p>
          <div>
            <div>
              <div>
                <div>
                  <div>
                    <p class="MsoNormal">That's exactly my concern as
                      well. <o:p></o:p></p>
                    <div>
                      <p class="MsoNormal"> <o:p></o:p></p>
                    </div>
                    <div>
                      <p class="MsoNormal">This looks a bit like the
                        test creates erroneous data somehow, but there
                        doesn't seem to be a RAS check in the MM data
                        path.<o:p></o:p></p>
                    </div>
                    <div>
                      <p class="MsoNormal"> <o:p></o:p></p>
                    </div>
                    <div>
                      <p class="MsoNormal">And now that we use the BAR
                        path it goes up in flames.<o:p></o:p></p>
                    </div>
                    <div>
                      <p class="MsoNormal"> <o:p></o:p></p>
                    </div>
                    <div>
                      <p class="MsoNormal">I just don't see how we can
                        create erroneous data in a test case?<o:p></o:p></p>
                    </div>
                    <div>
                      <p class="MsoNormal"> <o:p></o:p></p>
                    </div>
                    <div>
                      <p class="MsoNormal">Christian.<o:p></o:p></p>
                    </div>
                  </div>
                  <div>
                    <p class="MsoNormal"> <o:p></o:p></p>
                    <div>
                      <p class="MsoNormal">On 14.04.2020 at 16:35,
                        "Deucher, Alexander" &lt;<a href="mailto:Alexander.Deucher@amd.com" moz-do-not-send="true">Alexander.Deucher@amd.com</a>&gt; wrote:<o:p></o:p></p>
                      <blockquote style="border:none;border-left:solid
                        #CCCCCC 1.0pt;padding:0in 0in 0in
6.0pt;margin-left:4.8pt;margin-top:5.0pt;margin-right:0in;margin-bottom:5.0pt">
                        <div>
                          <p style="margin:15.0pt"><span style="font-size:10.0pt;font-family:Arial,sans-serif;color:#317100">[AMD
                              Public Use]</span><o:p></o:p></p>
                          <p class="MsoNormal"> <o:p></o:p></p>
                          <div>
                            <div>
                              <p class="MsoNormal"><span style="font-size:12.0pt;color:black">If
                                  this causes an issue, any access to
                                  vram via the BAR could cause an issue.</span><o:p></o:p></p>
                            </div>
                            <div>
                              <p class="MsoNormal"><span style="font-size:12.0pt;color:black"> </span><o:p></o:p></p>
                            </div>
                            <div>
                              <p class="MsoNormal"><span style="font-size:12.0pt;color:black">Alex</span><o:p></o:p></p>
                            </div>
                            <div class="MsoNormal" style="text-align:center" align="center">
                              <hr width="98%" size="2" align="center">
                            </div>
                            <div>
                              <p class="MsoNormal"><b><span style="color:black">From:</span></b><span style="color:black"> amd-gfx <<a href="mailto:amd-gfx-bounces@lists.freedesktop.org" moz-do-not-send="true">amd-gfx-bounces@lists.freedesktop.org</a>>
                                  on behalf of Russell, Kent <<a href="mailto:Kent.Russell@amd.com" moz-do-not-send="true">Kent.Russell@amd.com</a>><br>
                                  <b>Sent:</b> Tuesday, April 14, 2020
                                  10:19 AM<br>
                                  <b>To:</b> Koenig, Christian <<a href="mailto:Christian.Koenig@amd.com" moz-do-not-send="true">Christian.Koenig@amd.com</a>>;
                                  <a href="mailto:amd-gfx@lists.freedesktop.org" moz-do-not-send="true">amd-gfx@lists.freedesktop.org</a>
                                  <<a href="mailto:amd-gfx@lists.freedesktop.org" moz-do-not-send="true">amd-gfx@lists.freedesktop.org</a>><br>
                                  <b>Cc:</b> Kuehling, Felix <<a href="mailto:Felix.Kuehling@amd.com" moz-do-not-send="true">Felix.Kuehling@amd.com</a>>;
                                  Kim, Jonathan <<a href="mailto:Jonathan.Kim@amd.com" moz-do-not-send="true">Jonathan.Kim@amd.com</a>><br>
                                  <b>Subject:</b> RE: [PATCH] Revert
                                  "drm/amdgpu: use the BAR if possible
                                  in amdgpu_device_vram_access v2"</span>
                                <o:p></o:p></p>
                              <div>
                                <p class="MsoNormal"> <o:p></o:p></p>
                              </div>
                            </div>
                            <div>
                              <div>
                                <p class="MsoNormal">[AMD Official Use
                                  Only - Internal Distribution Only]<br>
                                  <br>
                                  On VG20 or MI100, as soon as we run
                                  the subtest, we get the dmesg output
                                  below, and then the kernel ends up
                                  hanging. I don't know enough about the
                                  test itself to know why this is
                                  occurring, but Jon Kim and Felix were
                                  discussing it on a separate thread
                                  when the issue was first reported, so
                                  they can hopefully provide some
                                  additional information.<br>
                                  <br>
                                   Kent<br>
                                  <br>
                                  > -----Original Message-----<br>
                                  > From: Christian König <<a href="mailto:ckoenig.leichtzumerken@gmail.com" moz-do-not-send="true">ckoenig.leichtzumerken@gmail.com</a>><br>
                                  > Sent: Tuesday, April 14, 2020
                                  9:52 AM<br>
                                  > To: Russell, Kent <<a href="mailto:Kent.Russell@amd.com" moz-do-not-send="true">Kent.Russell@amd.com</a>>;
                                  <a href="mailto:amd-gfx@lists.freedesktop.org" moz-do-not-send="true">amd-gfx@lists.freedesktop.org</a><br>
                                  > Subject: Re: [PATCH] Revert
                                  "drm/amdgpu: use the BAR if possible
                                  in<br>
                                  > amdgpu_device_vram_access v2"<br>
                                  > <br>
                                  > On 13.04.20 at 20:20, Kent Russell
                                  wrote:<br>
                                  > > This reverts commit
                                  c12b84d6e0d70f1185e6daddfd12afb671791b6e.<br>
                                  > > The original patch causes a
                                  RAS event and subsequent kernel
                                  hard-hang<br>
                                  > > when running the
                                  KFDMemoryTest.PtraceAccessInvisibleVram
                                  on VG20 and<br>
                                  > > Arcturus<br>
                                  > ><br>
                                  > > dmesg output at hang time:<br>
                                  > > [drm] RAS event of type
                                  ERREVENT_ATHUB_INTERRUPT detected!<br>
                                  > > amdgpu 0000:67:00.0: GPU
                                  reset begin!<br>
                                  > > Evicting PASID 0x8000 queues<br>
                                  > > Started evicting pasid
                                  0x8000<br>
                                  > > qcm fence wait loop timeout
                                  expired<br>
                                  > > The cp might be in an
                                  unrecoverable state due to an
                                  unsuccessful<br>
                                  > > queues preemption Failed to
                                  evict process queues Failed to suspend<br>
                                  > > process 0x8000 Finished
                                  evicting pasid 0x8000 Started
                                  restoring pasid<br>
                                  > > 0x8000 Finished restoring
                                  pasid 0x8000 [drm] UVD VCPU state may
                                  lost<br>
                                  > > due to RAS
                                  ERREVENT_ATHUB_INTERRUPT<br>
                                  > > amdgpu: [powerplay] Failed
                                  to send message 0x26, response 0x0<br>
                                  > > amdgpu: [powerplay] Failed
                                  to set soft min gfxclk !<br>
                                  > > amdgpu: [powerplay] Failed
                                  to upload DPM Bootup Levels!<br>
                                  > > amdgpu: [powerplay] Failed
                                  to send message 0x7, response 0x0<br>
                                  > > amdgpu: [powerplay]
                                  [DisableAllSMUFeatures] Failed to
                                  disable all smu<br>
                                  > features!<br>
                                  > > amdgpu: [powerplay]
                                  [DisableDpmTasks] Failed to disable
                                  all smu features!<br>
                                  > > amdgpu: [powerplay]
                                  [PowerOffAsic] Failed to disable DPM!<br>
                                  > >
                                  [drm:amdgpu_device_ip_suspend_phase2
                                  [amdgpu]] *ERROR* suspend of IP<br>
                                  > > block &lt;powerplay&gt;
                                  failed -5<br>
                                  > <br>
                                  > Do you have more information on
                                  what's going wrong here? This is
                                  a really<br>
                                  > important patch for KFD
                                  debugging.<br>
                                  > <br>
                                  > ><br>
                                  > > Signed-off-by: Kent Russell
                                  <<a href="mailto:kent.russell@amd.com" moz-do-not-send="true">kent.russell@amd.com</a>><br>
                                  > <br>
                                  > Reviewed-by: Christian König <<a href="mailto:christian.koenig@amd.com" moz-do-not-send="true">christian.koenig@amd.com</a>><br>
                                  > <br>
                                  > > ---<br>
                                  > >  
                                  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
                                  | 26 ----------------------<br>
                                  > >   1 file changed, 26
                                  deletions(-)<br>
                                  > ><br>
                                  > > diff --git
                                  a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c<br>
                                  > >
                                  b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c<br>
                                  > > index
                                  cf5d6e585634..a3f997f84020 100644<br>
                                  > > ---
                                  a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c<br>
                                  > > +++
                                  b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c<br>
                                  > > @@ -254,32 +254,6 @@ void
                                  amdgpu_device_vram_access(struct<br>
                                  > amdgpu_device *adev, loff_t pos,<br>
                                  > >      uint32_t hi = ~0;<br>
                                  > >      uint64_t last;<br>
                                  > ><br>
                                  > > -<br>
                                  > > -#ifdef CONFIG_64BIT<br>
                                  > > -   last = min(pos + size,
                                  adev->gmc.visible_vram_size);<br>
                                  > > -   if (last > pos) {<br>
                                  > > -           void __iomem
                                  *addr = adev->mman.aper_base_kaddr
                                  + pos;<br>
                                  > > -           size_t count =
                                  last - pos;<br>
                                  > > -<br>
                                  > > -           if (write) {<br>
                                  > > -                  
                                  memcpy_toio(addr, buf, count);<br>
                                  > > -                   mb();<br>
                                  > > -                  
                                  amdgpu_asic_flush_hdp(adev, NULL);<br>
                                  > > -           } else {<br>
                                  > > -                  
                                  amdgpu_asic_invalidate_hdp(adev,
                                  NULL);<br>
                                  > > -                   mb();<br>
                                  > > -                  
                                  memcpy_fromio(buf, addr, count);<br>
                                  > > -           }<br>
                                  > > -<br>
                                  > > -           if (count ==
                                  size)<br>
                                  > > -                   return;<br>
                                  > > -<br>
                                  > > -           pos += count;<br>
                                  > > -           buf += count /
                                  4;<br>
                                  > > -           size -= count;<br>
                                  > > -   }<br>
                                  > > -#endif<br>
                                  > > -<br>
                                  > >     
                                  spin_lock_irqsave(&adev->mmio_idx_lock,
                                  flags);<br>
                                  > >      for (last = pos + size;
                                  pos < last; pos += 4) {<br>
                                  > >              uint32_t tmp =
                                  pos >> 31;<br>
_______________________________________________<br>
                                  amd-gfx mailing list<br>
                                  <a href="mailto:amd-gfx@lists.freedesktop.org" moz-do-not-send="true">amd-gfx@lists.freedesktop.org</a><br>
                                  <a href="https://lists.freedesktop.org/mailman/listinfo/amd-gfx" moz-do-not-send="true">https://lists.freedesktop.org/mailman/listinfo/amd-gfx</a><o:p></o:p></p>
                              </div>
                            </div>
                          </div>
                        </div>
                      </blockquote>
                    </div>
                    <p class="MsoNormal"> <o:p></o:p></p>
                  </div>
                </div>
                <div>
                  <p class="MsoNormal"> <o:p></o:p></p>
                  <div>
                    <p class="MsoNormal">On 14.04.2020 at 16:35,
                      "Deucher, Alexander" &lt;<a href="mailto:Alexander.Deucher@amd.com" moz-do-not-send="true">Alexander.Deucher@amd.com</a>&gt; wrote:<o:p></o:p></p>
                    <blockquote style="border:none;border-left:solid
                      #CCCCCC 1.0pt;padding:0in 0in 0in
6.0pt;margin-left:4.8pt;margin-top:5.0pt;margin-right:0in;margin-bottom:5.0pt">
                      <div>
                        <p style="margin:15.0pt"><span style="font-size:10.0pt;font-family:Arial,sans-serif;color:#317100">[AMD
                            Public Use]</span><o:p></o:p></p>
                        <p class="MsoNormal"> <o:p></o:p></p>
                        <div>
                          <div>
                            <p class="MsoNormal"><span style="font-size:12.0pt;color:black">If
                                this causes an issue, any access to vram
                                via the BAR could cause an issue.</span><o:p></o:p></p>
                          </div>
                          <div>
                            <p class="MsoNormal"><span style="font-size:12.0pt;color:black"> </span><o:p></o:p></p>
                          </div>
                          <div>
                            <p class="MsoNormal"><span style="font-size:12.0pt;color:black">Alex</span><o:p></o:p></p>
                          </div>
                          <div class="MsoNormal" style="text-align:center" align="center">
                            <hr width="98%" size="2" align="center">
                          </div>
                          <div>
                            <p class="MsoNormal"><b><span style="color:black">From:</span></b><span style="color:black"> amd-gfx <<a href="mailto:amd-gfx-bounces@lists.freedesktop.org" moz-do-not-send="true">amd-gfx-bounces@lists.freedesktop.org</a>>
                                on behalf of Russell, Kent <<a href="mailto:Kent.Russell@amd.com" moz-do-not-send="true">Kent.Russell@amd.com</a>><br>
                                <b>Sent:</b> Tuesday, April 14, 2020
                                10:19 AM<br>
                                <b>To:</b> Koenig, Christian <<a href="mailto:Christian.Koenig@amd.com" moz-do-not-send="true">Christian.Koenig@amd.com</a>>;
                                <a href="mailto:amd-gfx@lists.freedesktop.org" moz-do-not-send="true">amd-gfx@lists.freedesktop.org</a>
                                <<a href="mailto:amd-gfx@lists.freedesktop.org" moz-do-not-send="true">amd-gfx@lists.freedesktop.org</a>><br>
                                <b>Cc:</b> Kuehling, Felix <<a href="mailto:Felix.Kuehling@amd.com" moz-do-not-send="true">Felix.Kuehling@amd.com</a>>;
                                Kim, Jonathan <<a href="mailto:Jonathan.Kim@amd.com" moz-do-not-send="true">Jonathan.Kim@amd.com</a>><br>
                                <b>Subject:</b> RE: [PATCH] Revert
                                "drm/amdgpu: use the BAR if possible in
                                amdgpu_device_vram_access v2"</span>
                              <o:p></o:p></p>
                            <div>
                              <p class="MsoNormal"> <o:p></o:p></p>
                            </div>
                          </div>
                          <div>
                            <div>
                              <p class="MsoNormal">[AMD Official Use
                                Only - Internal Distribution Only]<br>
                                <br>
                                On VG20 or MI100, as soon as we run the
                                subtest, we get the dmesg output below,
                                and then the kernel ends up hanging. I
                                don't know enough about the test itself
                                to know why this is occurring, but Jon
                                Kim and Felix were discussing it on a
                                separate thread when the issue was first
                                reported, so they can hopefully provide
                                some additional information.<br>
                                <br>
                                 Kent<br>
                                <br>
                                > -----Original Message-----<br>
                                > From: Christian König <<a href="mailto:ckoenig.leichtzumerken@gmail.com" moz-do-not-send="true">ckoenig.leichtzumerken@gmail.com</a>><br>
                                > Sent: Tuesday, April 14, 2020 9:52
                                AM<br>
                                > To: Russell, Kent <<a href="mailto:Kent.Russell@amd.com" moz-do-not-send="true">Kent.Russell@amd.com</a>>;
                                <a href="mailto:amd-gfx@lists.freedesktop.org" moz-do-not-send="true">amd-gfx@lists.freedesktop.org</a><br>
                                > Subject: Re: [PATCH] Revert
                                "drm/amdgpu: use the BAR if possible in<br>
                                > amdgpu_device_vram_access v2"<br>
                                > <br>
                                > On 13.04.20 at 20:20, Kent Russell
                                wrote:<br>
                                > > This reverts commit
                                c12b84d6e0d70f1185e6daddfd12afb671791b6e.<br>
                                > > The original patch causes a
                                RAS event and subsequent kernel
                                hard-hang<br>
                                > > when running the
                                KFDMemoryTest.PtraceAccessInvisibleVram
                                on VG20 and<br>
                                > > Arcturus<br>
                                > ><br>
                                > > dmesg output at hang time:<br>
                                > > [drm] RAS event of type
                                ERREVENT_ATHUB_INTERRUPT detected!<br>
                                > > amdgpu 0000:67:00.0: GPU reset
                                begin!<br>
                                > > Evicting PASID 0x8000 queues<br>
                                > > Started evicting pasid 0x8000<br>
                                > > qcm fence wait loop timeout
                                expired<br>
                                > > The cp might be in an
                                unrecoverable state due to an
                                unsuccessful queues preemption<br>
                                > > Failed to evict process queues<br>
                                > > Failed to suspend process 0x8000<br>
                                > > Finished evicting pasid 0x8000<br>
                                > > Started restoring pasid 0x8000<br>
                                > > Finished restoring pasid 0x8000<br>
                                > > [drm] UVD VCPU state may lost
                                due to RAS ERREVENT_ATHUB_INTERRUPT<br>
                                > > amdgpu: [powerplay] Failed to
                                send message 0x26, response 0x0<br>
                                > > amdgpu: [powerplay] Failed to
                                set soft min gfxclk !<br>
                                > > amdgpu: [powerplay] Failed to
                                upload DPM Bootup Levels!<br>
                                > > amdgpu: [powerplay] Failed to
                                send message 0x7, response 0x0<br>
                                > > amdgpu: [powerplay]
                                [DisableAllSMUFeatures] Failed to
                                disable all smu features!<br>
                                > > amdgpu: [powerplay]
                                [DisableDpmTasks] Failed to disable all
                                smu features!<br>
                                > > amdgpu: [powerplay]
                                [PowerOffAsic] Failed to disable DPM!<br>
                                > >
                                [drm:amdgpu_device_ip_suspend_phase2
                                [amdgpu]] *ERROR* suspend of IP<br>
                                > > block <powerplay> failed
                                -5<br>
                                > <br>
                                > Do you have more information on
                                what's going wrong here? This is a
                                really important patch for KFD debugging.<br>
                                > <br>
                                > ><br>
                                > > Signed-off-by: Kent Russell
                                <<a href="mailto:kent.russell@amd.com" moz-do-not-send="true">kent.russell@amd.com</a>><br>
                                > <br>
                                > Reviewed-by: Christian König <<a href="mailto:christian.koenig@amd.com" moz-do-not-send="true">christian.koenig@amd.com</a>><br>
                                > <br>
                                > > ---<br>
                                > >  
                                drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
                                | 26 ----------------------<br>
                                > >   1 file changed, 26
                                deletions(-)<br>
                                > ><br>
                                > > diff --git
                                a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c<br>
                                > >
                                b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c<br>
                                > > index
                                cf5d6e585634..a3f997f84020 100644<br>
                                > > ---
                                a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c<br>
                                > > +++
                                b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c<br>
                                > > @@ -254,32 +254,6 @@ void
                                amdgpu_device_vram_access(struct<br>
                                > amdgpu_device *adev, loff_t pos,<br>
                                > >      uint32_t hi = ~0;<br>
                                > >      uint64_t last;<br>
                                > ><br>
                                > > -<br>
                                > > -#ifdef CONFIG_64BIT<br>
                                > > -   last = min(pos + size,
                                adev->gmc.visible_vram_size);<br>
                                > > -   if (last > pos) {<br>
                                > > -           void __iomem *addr
                                = adev->mman.aper_base_kaddr + pos;<br>
                                > > -           size_t count =
                                last - pos;<br>
                                > > -<br>
                                > > -           if (write) {<br>
                                > > -                  
                                memcpy_toio(addr, buf, count);<br>
                                > > -                   mb();<br>
                                > > -                  
                                amdgpu_asic_flush_hdp(adev, NULL);<br>
                                > > -           } else {<br>
                                > > -                  
                                amdgpu_asic_invalidate_hdp(adev, NULL);<br>
                                > > -                   mb();<br>
                                > > -                  
                                memcpy_fromio(buf, addr, count);<br>
                                > > -           }<br>
                                > > -<br>
                                > > -           if (count == size)<br>
                                > > -                   return;<br>
                                > > -<br>
                                > > -           pos += count;<br>
                                > > -           buf += count / 4;<br>
                                > > -           size -= count;<br>
                                > > -   }<br>
                                > > -#endif<br>
                                > > -<br>
                                > >     
                                spin_lock_irqsave(&adev->mmio_idx_lock,
                                flags);<br>
                                > >      for (last = pos + size;
                                pos < last; pos += 4) {<br>
                                > >              uint32_t tmp =
                                pos >> 31;<br>
_______________________________________________<br>
                                amd-gfx mailing list<br>
                                <a href="mailto:amd-gfx@lists.freedesktop.org" moz-do-not-send="true">amd-gfx@lists.freedesktop.org</a><br>
                                <a href="https://lists.freedesktop.org/mailman/listinfo/amd-gfx" moz-do-not-send="true">https://lists.freedesktop.org/mailman/listinfo/amd-gfx</a><o:p></o:p></p>
                            </div>
                          </div>
                        </div>
                      </div>
                    </blockquote>
                  </div>
                  <p class="MsoNormal"> <o:p></o:p></p>
                </div>
              </div>
            </div>
            <div>
              <p class="MsoNormal"> <o:p></o:p></p>
              <div>
                <p class="MsoNormal">On 14.04.2020 at 16:35,
                  "Deucher, Alexander" <<a href="mailto:Alexander.Deucher@amd.com" moz-do-not-send="true">Alexander.Deucher@amd.com</a>> wrote:<o:p></o:p></p>
                <blockquote style="border:none;border-left:solid #CCCCCC
                  1.0pt;padding:0in 0in 0in
6.0pt;margin-left:4.8pt;margin-top:5.0pt;margin-right:0in;margin-bottom:5.0pt">
                  <div>
                    <p style="margin:15.0pt"><span style="font-size:10.0pt;font-family:"Arial",sans-serif;color:#317100">[AMD
                        Public Use]</span><o:p></o:p></p>
                    <p class="MsoNormal"> <o:p></o:p></p>
                    <div>
                      <div>
                        <p class="MsoNormal"><span style="font-size:12.0pt;color:black">If this
                            causes an issue, any access to vram via the
                            BAR could cause an issue.</span><o:p></o:p></p>
                      </div>
                      <div>
                        <p class="MsoNormal"><span style="font-size:12.0pt;color:black"> </span><o:p></o:p></p>
                      </div>
                      <div>
                        <p class="MsoNormal"><span style="font-size:12.0pt;color:black">Alex</span><o:p></o:p></p>
                      </div>
                      <div class="MsoNormal" style="text-align:center" align="center">
                        <hr width="98%" size="2" align="center">
                      </div>
                      <div>
                        <p class="MsoNormal"><b><span style="color:black">From:</span></b><span style="color:black"> amd-gfx <<a href="mailto:amd-gfx-bounces@lists.freedesktop.org" moz-do-not-send="true">amd-gfx-bounces@lists.freedesktop.org</a>>
                            on behalf of Russell, Kent <<a href="mailto:Kent.Russell@amd.com" moz-do-not-send="true">Kent.Russell@amd.com</a>><br>
                            <b>Sent:</b> Tuesday, April 14, 2020 10:19
                            AM<br>
                            <b>To:</b> Koenig, Christian <<a href="mailto:Christian.Koenig@amd.com" moz-do-not-send="true">Christian.Koenig@amd.com</a>>;
                            <a href="mailto:amd-gfx@lists.freedesktop.org" moz-do-not-send="true">amd-gfx@lists.freedesktop.org</a>
                            <<a href="mailto:amd-gfx@lists.freedesktop.org" moz-do-not-send="true">amd-gfx@lists.freedesktop.org</a>><br>
                            <b>Cc:</b> Kuehling, Felix <<a href="mailto:Felix.Kuehling@amd.com" moz-do-not-send="true">Felix.Kuehling@amd.com</a>>;
                            Kim, Jonathan <<a href="mailto:Jonathan.Kim@amd.com" moz-do-not-send="true">Jonathan.Kim@amd.com</a>><br>
                            <b>Subject:</b> RE: [PATCH] Revert
                            "drm/amdgpu: use the BAR if possible in
                            amdgpu_device_vram_access v2"</span>
                          <o:p></o:p></p>
                        <div>
                          <p class="MsoNormal"> <o:p></o:p></p>
                        </div>
                      </div>
                      <div>
                        <div>
                          <p class="MsoNormal">[AMD Official Use Only -
                            Internal Distribution Only]<br>
                            <br>
                            On VG20 or MI100, as soon as we run the
                            subtest, we get the dmesg output below, and
                            then the kernel ends up hanging. I don't
                            know enough about the test itself to know
                            why this is occurring, but Jon Kim and Felix
                            were discussing it on a separate thread when
                            the issue was first reported, so they can
                            hopefully provide some additional
                            information.<br>
                            <br>
                             Kent<br>
                            <br>
                            > -----Original Message-----<br>
                            > From: Christian König <<a href="mailto:ckoenig.leichtzumerken@gmail.com" moz-do-not-send="true">ckoenig.leichtzumerken@gmail.com</a>><br>
                            > Sent: Tuesday, April 14, 2020 9:52 AM<br>
                            > To: Russell, Kent <<a href="mailto:Kent.Russell@amd.com" moz-do-not-send="true">Kent.Russell@amd.com</a>>;
                            <a href="mailto:amd-gfx@lists.freedesktop.org" moz-do-not-send="true">amd-gfx@lists.freedesktop.org</a><br>
                            > Subject: Re: [PATCH] Revert
                            "drm/amdgpu: use the BAR if possible in<br>
                            > amdgpu_device_vram_access v2"<br>
                            > <br>
                            > On 13.04.20 at 20:20, Kent Russell
                            wrote:<br>
                            > > This reverts commit
                            c12b84d6e0d70f1185e6daddfd12afb671791b6e.<br>
                            > > The original patch causes a RAS
                            event and subsequent kernel hard-hang<br>
                            > > when running the
                            KFDMemoryTest.PtraceAccessInvisibleVram on
                            VG20 and<br>
                            > > Arcturus<br>
                            > ><br>
                            > > dmesg output at hang time:<br>
                            > > [drm] RAS event of type
                            ERREVENT_ATHUB_INTERRUPT detected!<br>
                            > > amdgpu 0000:67:00.0: GPU reset
                            begin!<br>
                            > > Evicting PASID 0x8000 queues<br>
                            > > Started evicting pasid 0x8000<br>
                            > > qcm fence wait loop timeout
                            expired<br>
                            > > The cp might be in an
                            unrecoverable state due to an unsuccessful
                            queues preemption<br>
                            > > Failed to evict process queues<br>
                            > > Failed to suspend process 0x8000<br>
                            > > Finished evicting pasid 0x8000<br>
                            > > Started restoring pasid 0x8000<br>
                            > > Finished restoring pasid 0x8000<br>
                            > > [drm] UVD VCPU state may lost
                            due to RAS ERREVENT_ATHUB_INTERRUPT<br>
                            > > amdgpu: [powerplay] Failed to send
                            message 0x26, response 0x0<br>
                            > > amdgpu: [powerplay] Failed to set
                            soft min gfxclk !<br>
                            > > amdgpu: [powerplay] Failed to
                            upload DPM Bootup Levels!<br>
                            > > amdgpu: [powerplay] Failed to send
                            message 0x7, response 0x0<br>
                            > > amdgpu: [powerplay]
                            [DisableAllSMUFeatures] Failed to disable
                            all smu features!<br>
                            > > amdgpu: [powerplay]
                            [DisableDpmTasks] Failed to disable all smu
                            features!<br>
                            > > amdgpu: [powerplay] [PowerOffAsic]
                            Failed to disable DPM!<br>
                            > >
                            [drm:amdgpu_device_ip_suspend_phase2
                            [amdgpu]] *ERROR* suspend of IP<br>
                            > > block <powerplay> failed -5<br>
                            > <br>
                            > Do you have more information on what's
                            going wrong here? This is a really
                            important patch for KFD debugging.<br>
                            > <br>
                            > ><br>
                            > > Signed-off-by: Kent Russell <<a href="mailto:kent.russell@amd.com" moz-do-not-send="true">kent.russell@amd.com</a>><br>
                            > <br>
                            > Reviewed-by: Christian König <<a href="mailto:christian.koenig@amd.com" moz-do-not-send="true">christian.koenig@amd.com</a>><br>
                            > <br>
                            > > ---<br>
                            > >  
                            drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |
                            26 ----------------------<br>
                            > >   1 file changed, 26 deletions(-)<br>
                            > ><br>
                            > > diff --git
                            a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c<br>
                            > >
                            b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c<br>
                            > > index cf5d6e585634..a3f997f84020
                            100644<br>
                            > > ---
                            a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c<br>
                            > > +++
                            b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c<br>
                            > > @@ -254,32 +254,6 @@ void
                            amdgpu_device_vram_access(struct<br>
                            > amdgpu_device *adev, loff_t pos,<br>
                            > >      uint32_t hi = ~0;<br>
                            > >      uint64_t last;<br>
                            > ><br>
                            > > -<br>
                            > > -#ifdef CONFIG_64BIT<br>
                            > > -   last = min(pos + size,
                            adev->gmc.visible_vram_size);<br>
                            > > -   if (last > pos) {<br>
                            > > -           void __iomem *addr =
                            adev->mman.aper_base_kaddr + pos;<br>
                            > > -           size_t count = last -
                            pos;<br>
                            > > -<br>
                            > > -           if (write) {<br>
                            > > -                  
                            memcpy_toio(addr, buf, count);<br>
                            > > -                   mb();<br>
                            > > -                  
                            amdgpu_asic_flush_hdp(adev, NULL);<br>
                            > > -           } else {<br>
                            > > -                  
                            amdgpu_asic_invalidate_hdp(adev, NULL);<br>
                            > > -                   mb();<br>
                            > > -                  
                            memcpy_fromio(buf, addr, count);<br>
                            > > -           }<br>
                            > > -<br>
                            > > -           if (count == size)<br>
                            > > -                   return;<br>
                            > > -<br>
                            > > -           pos += count;<br>
                            > > -           buf += count / 4;<br>
                            > > -           size -= count;<br>
                            > > -   }<br>
                            > > -#endif<br>
                            > > -<br>
                            > >     
                            spin_lock_irqsave(&adev->mmio_idx_lock,
                            flags);<br>
                            > >      for (last = pos + size; pos
                            < last; pos += 4) {<br>
                            > >              uint32_t tmp = pos
                            >> 31;<br>
_______________________________________________<br>
                            amd-gfx mailing list<br>
                            <a href="mailto:amd-gfx@lists.freedesktop.org" moz-do-not-send="true">amd-gfx@lists.freedesktop.org</a><br>
                            <a href="https://lists.freedesktop.org/mailman/listinfo/amd-gfx" moz-do-not-send="true">https://lists.freedesktop.org/mailman/listinfo/amd-gfx</a><o:p></o:p></p>
                        </div>
                      </div>
                    </div>
                  </div>
                </blockquote>
              </div>
              <p class="MsoNormal"> <o:p></o:p></p>
            </div>
          </div>
          <div>
            <p class="MsoNormal"> <o:p></o:p></p>
            <div>
              <p class="MsoNormal">On 14.04.2020 at 16:35, "Deucher,
                Alexander" <<a href="mailto:Alexander.Deucher@amd.com" moz-do-not-send="true">Alexander.Deucher@amd.com</a>> wrote:<o:p></o:p></p>
            </div>
          </div>
          <div>
            <p style="margin:15.0pt"><span style="font-size:10.0pt;font-family:"Arial",sans-serif;color:#317100">[AMD
                Public Use]</span><o:p></o:p></p>
            <p class="MsoNormal"> <o:p></o:p></p>
            <div>
              <div>
                <p class="MsoNormal"><span style="font-size:12.0pt;color:black">If this causes
                    an issue, any access to vram via the BAR could cause
                    an issue.</span><o:p></o:p></p>
              </div>
              <div>
                <p class="MsoNormal"><span style="font-size:12.0pt;color:black"> </span><o:p></o:p></p>
              </div>
              <div>
                <p class="MsoNormal"><span style="font-size:12.0pt;color:black">Alex</span><o:p></o:p></p>
              </div>
              <div class="MsoNormal" style="text-align:center" align="center">
                <hr width="98%" size="2" align="center">
              </div>
              <div id="divRplyFwdMsg">
                <p class="MsoNormal"><b><span style="color:black">From:</span></b><span style="color:black"> amd-gfx <<a href="mailto:amd-gfx-bounces@lists.freedesktop.org" moz-do-not-send="true">amd-gfx-bounces@lists.freedesktop.org</a>>
                    on behalf of Russell, Kent <<a href="mailto:Kent.Russell@amd.com" moz-do-not-send="true">Kent.Russell@amd.com</a>><br>
                    <b>Sent:</b> Tuesday, April 14, 2020 10:19 AM<br>
                    <b>To:</b> Koenig, Christian <<a href="mailto:Christian.Koenig@amd.com" moz-do-not-send="true">Christian.Koenig@amd.com</a>>;
                    <a href="mailto:amd-gfx@lists.freedesktop.org" moz-do-not-send="true">amd-gfx@lists.freedesktop.org</a>
                    <<a href="mailto:amd-gfx@lists.freedesktop.org" moz-do-not-send="true">amd-gfx@lists.freedesktop.org</a>><br>
                    <b>Cc:</b> Kuehling, Felix <<a href="mailto:Felix.Kuehling@amd.com" moz-do-not-send="true">Felix.Kuehling@amd.com</a>>;
                    Kim, Jonathan <<a href="mailto:Jonathan.Kim@amd.com" moz-do-not-send="true">Jonathan.Kim@amd.com</a>><br>
                    <b>Subject:</b> RE: [PATCH] Revert "drm/amdgpu: use
                    the BAR if possible in amdgpu_device_vram_access v2"</span>
                  <o:p></o:p></p>
                <div>
                  <p class="MsoNormal"> <o:p></o:p></p>
                </div>
              </div>
              <div>
                <div>
                  <p class="MsoNormal">[AMD Official Use Only - Internal
                    Distribution Only]<br>
                    <br>
                    On VG20 or MI100, as soon as we run the subtest, we
                    get the dmesg output below, and then the kernel ends
                    up hanging. I don't know enough about the test
                    itself to know why this is occurring, but Jon Kim
                    and Felix were discussing it on a separate thread
                    when the issue was first reported, so they can
                    hopefully provide some additional information.<br>
                    <br>
                     Kent<br>
                    <br>
                    > -----Original Message-----<br>
                    > From: Christian König <<a href="mailto:ckoenig.leichtzumerken@gmail.com" moz-do-not-send="true">ckoenig.leichtzumerken@gmail.com</a>><br>
                    > Sent: Tuesday, April 14, 2020 9:52 AM<br>
                    > To: Russell, Kent <<a href="mailto:Kent.Russell@amd.com" moz-do-not-send="true">Kent.Russell@amd.com</a>>;
                    <a href="mailto:amd-gfx@lists.freedesktop.org" moz-do-not-send="true">amd-gfx@lists.freedesktop.org</a><br>
                    > Subject: Re: [PATCH] Revert "drm/amdgpu: use
                    the BAR if possible in<br>
                    > amdgpu_device_vram_access v2"<br>
                    > <br>
                    > On 13.04.20 at 20:20, Kent Russell wrote:<br>
                    > > This reverts commit
                    c12b84d6e0d70f1185e6daddfd12afb671791b6e.<br>
                    > > The original patch causes a RAS event and
                    subsequent kernel hard-hang<br>
                    > > when running the
                    KFDMemoryTest.PtraceAccessInvisibleVram on VG20 and<br>
                    > > Arcturus<br>
                    > ><br>
                    > > dmesg output at hang time:<br>
                    > > [drm] RAS event of type
                    ERREVENT_ATHUB_INTERRUPT detected!<br>
                    > > amdgpu 0000:67:00.0: GPU reset begin!<br>
                    > > Evicting PASID 0x8000 queues<br>
                    > > Started evicting pasid 0x8000<br>
                    > > qcm fence wait loop timeout expired<br>
                    > > The cp might be in an unrecoverable state
                    due to an unsuccessful queues preemption<br>
                    > > Failed to evict process queues<br>
                    > > Failed to suspend process 0x8000<br>
                    > > Finished evicting pasid 0x8000<br>
                    > > Started restoring pasid 0x8000<br>
                    > > Finished restoring pasid 0x8000<br>
                    > > [drm] UVD VCPU state may lost due to RAS
                    ERREVENT_ATHUB_INTERRUPT<br>
                    > > amdgpu: [powerplay] Failed to send message
                    0x26, response 0x0<br>
                    > > amdgpu: [powerplay] Failed to set soft min
                    gfxclk !<br>
                    > > amdgpu: [powerplay] Failed to upload DPM
                    Bootup Levels!<br>
                    > > amdgpu: [powerplay] Failed to send message
                    0x7, response 0x0<br>
                    > > amdgpu: [powerplay]
                    [DisableAllSMUFeatures] Failed to disable all smu<br>
                    > features!<br>
                    > > amdgpu: [powerplay] [DisableDpmTasks]
                    Failed to disable all smu features!<br>
                    > > amdgpu: [powerplay] [PowerOffAsic] Failed
                    to disable DPM!<br>
                    > > [drm:amdgpu_device_ip_suspend_phase2
                    [amdgpu]] *ERROR* suspend of IP<br>
                    > > block <powerplay> failed -5<br>
                    > <br>
                    > Do you have more information on what's going
                    wrong here? This is a really<br>
                    > important patch for KFD debugging.<br>
                    > <br>
                    > ><br>
                    > > Signed-off-by: Kent Russell <<a href="mailto:kent.russell@amd.com" moz-do-not-send="true">kent.russell@amd.com</a>><br>
                    > <br>
                    > Reviewed-by: Christian König <<a href="mailto:christian.koenig@amd.com" moz-do-not-send="true">christian.koenig@amd.com</a>><br>
                    > <br>
                    > > ---<br>
                    > >  
                    drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 26
                    ----------------------<br>
                    > >   1 file changed, 26 deletions(-)<br>
                    > ><br>
                    > > diff --git
                    a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c<br>
                    > >
                    b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c<br>
                    > > index cf5d6e585634..a3f997f84020 100644<br>
                    > > ---
                    a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c<br>
                    > > +++
                    b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c<br>
                    > > @@ -254,32 +254,6 @@ void
                    amdgpu_device_vram_access(struct<br>
                    > amdgpu_device *adev, loff_t pos,<br>
                    > >      uint32_t hi = ~0;<br>
                    > >      uint64_t last;<br>
                    > ><br>
                    > > -<br>
                    > > -#ifdef CONFIG_64BIT<br>
                    > > -   last = min(pos + size,
                    adev->gmc.visible_vram_size);<br>
                    > > -   if (last > pos) {<br>
                    > > -           void __iomem *addr =
                    adev->mman.aper_base_kaddr + pos;<br>
                    > > -           size_t count = last - pos;<br>
                    > > -<br>
                    > > -           if (write) {<br>
                    > > -                   memcpy_toio(addr, buf,
                    count);<br>
                    > > -                   mb();<br>
                    > > -                  
                    amdgpu_asic_flush_hdp(adev, NULL);<br>
                    > > -           } else {<br>
                    > > -                  
                    amdgpu_asic_invalidate_hdp(adev, NULL);<br>
                    > > -                   mb();<br>
                    > > -                   memcpy_fromio(buf,
                    addr, count);<br>
                    > > -           }<br>
                    > > -<br>
                    > > -           if (count == size)<br>
                    > > -                   return;<br>
                    > > -<br>
                    > > -           pos += count;<br>
                    > > -           buf += count / 4;<br>
                    > > -           size -= count;<br>
                    > > -   }<br>
                    > > -#endif<br>
                    > > -<br>
                    > >     
                    spin_lock_irqsave(&adev->mmio_idx_lock,
                    flags);<br>
                    > >      for (last = pos + size; pos <
                    last; pos += 4) {<br>
                    > >              uint32_t tmp = pos >>
                    31;<br>
                    _______________________________________________<br>
                    amd-gfx mailing list<br>
                    <a href="mailto:amd-gfx@lists.freedesktop.org" moz-do-not-send="true">amd-gfx@lists.freedesktop.org</a><br>
                    <a href="https://lists.freedesktop.org/mailman/listinfo/amd-gfx" moz-do-not-send="true">https://lists.freedesktop.org/mailman/listinfo/amd-gfx</a><o:p></o:p></p>
                </div>
              </div>
            </div>
          </div>
        </blockquote>
      </div>
    </blockquote>
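    <p>The hunk removed by the revert serves the part of an access that
      lies below visible_vram_size through the BAR mapping and leaves
      the remainder to the indexed MMIO loop (the "MM path" mentioned
      above). A minimal user-space sketch of that split follows. It
      only reproduces the range arithmetic from the quoted diff and
      touches no hardware; struct vram_split and split_vram_access are
      hypothetical names used for illustration, not amdgpu functions.</p>

```c
#include <stdint.h>

/* Hypothetical helper, not part of amdgpu: describes how one VRAM access
 * would be divided between the two paths discussed in the thread. */
struct vram_split {
    uint64_t bar_bytes; /* bytes inside the CPU-visible BAR window */
    uint64_t mm_bytes;  /* bytes left for the indexed MMIO ("MM") path */
};

static struct vram_split split_vram_access(uint64_t pos, uint64_t size,
                                           uint64_t visible_vram_size)
{
    /* Default: nothing is reachable through the BAR. */
    struct vram_split s = { 0, size };

    /* Same clamp as the reverted hunk:
     * last = min(pos + size, visible_vram_size). */
    uint64_t end = pos + size;
    uint64_t last = end < visible_vram_size ? end : visible_vram_size;

    /* Only a range that starts below the visible limit uses the BAR. */
    if (last > pos) {
        s.bar_bytes = last - pos;
        s.mm_bytes = size - s.bar_bytes;
    }
    return s;
}
```

    <p>With a visible window of 256 bytes, a 16-byte access at offset
      248 would be split into 8 bytes through the BAR and 8 bytes
      through the MM path, which is the case the "pos += count; size -=
      count" lines in the diff handle. An access that lies entirely
      above the visible window, as the name of the
      PtraceAccessInvisibleVram test suggests, would skip the BAR branch
      altogether under this sketch.</p>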
    <br>
  </body>
</html>