<html><head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
  </head>
  <body text="#000000" bgcolor="#FFFFFF">
    <div class="moz-cite-prefix">Why do you think that 3efed000 and
      befed000 are misaligned addresses?<br>
      <br>
      Also, see amdgpu_ttm_access_memory(): misaligned accesses are always
      routed to the MM path.<br>
      <br>
      Regards,<br>
      Christian.<br>
      <br>
      On 16.04.20 at 18:08, Kim, Jonathan wrote:<br>
    </div>
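    <div>
      <p>As a quick sanity check on the alignment question, the minimal C
        sketch below computes the byte offset of an address within a
        naturally aligned access. The helper name misalignment is
        hypothetical and is not part of amdgpu; the sketch only assumes
        that "aligned" means the address is a multiple of the access
        size.</p>
<pre>
```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper (not an amdgpu function): returns the byte offset of
 * addr within an access of `size` bytes. A result of 0 means the address is
 * aligned to `size`. `size` must be a nonzero power of two.
 */
static unsigned misalignment(uint64_t addr, unsigned size)
{
    assert(size != 0 && (size & (size - 1)) == 0);
    return (unsigned)(addr & (uint64_t)(size - 1));
}
```
</pre>
      <p>Under that assumption, misalignment(0x3efed000, 4) and
        misalignment(0xbefed000, 4) both return 0, so both addresses are
        dword aligned (they are also multiples of 4096). An address such
        as 0x3efed002 would return 2.</p>
    </div>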
    <blockquote type="cite" cite="mid:MN2PR12MB45181F2428A410E2DD157DE085D80@MN2PR12MB4518.namprd12.prod.outlook.com">
      
      <meta name="Generator" content="Microsoft Word 15 (filtered
        medium)">
      <style><!--
/* Font Definitions */
@font-face
        {font-family:"Cambria Math";
        panose-1:2 4 5 3 5 4 6 3 2 4;}
@font-face
        {font-family:Calibri;
        panose-1:2 15 5 2 2 2 4 3 2 4;}
/* Style Definitions */
p.MsoNormal, li.MsoNormal, div.MsoNormal
        {margin:0in;
        margin-bottom:.0001pt;
        font-size:11.0pt;
        font-family:"Calibri",sans-serif;}
a:link, span.MsoHyperlink
        {mso-style-priority:99;
        color:blue;
        text-decoration:underline;}
p.xmsonormal, li.xmsonormal, div.xmsonormal
        {mso-style-name:x_msonormal;
        margin:0in;
        margin-bottom:.0001pt;
        font-size:11.0pt;
        font-family:"Calibri",sans-serif;
        color:black;}
p.xmsipheader4d0fcdd7, li.xmsipheader4d0fcdd7, div.xmsipheader4d0fcdd7
        {mso-style-name:x_msipheader4d0fcdd7;
        mso-margin-top-alt:auto;
        margin-right:0in;
        mso-margin-bottom-alt:auto;
        margin-left:0in;
        font-size:11.0pt;
        font-family:"Calibri",sans-serif;
        color:black;}
p.xmsipheader87abd423, li.xmsipheader87abd423, div.xmsipheader87abd423
        {mso-style-name:x_msipheader87abd423;
        mso-margin-top-alt:auto;
        margin-right:0in;
        mso-margin-bottom-alt:auto;
        margin-left:0in;
        font-size:11.0pt;
        font-family:"Calibri",sans-serif;}
p.msipheader4d0fcdd7, li.msipheader4d0fcdd7, div.msipheader4d0fcdd7
        {mso-style-name:msipheader4d0fcdd7;
        mso-margin-top-alt:auto;
        margin-right:0in;
        mso-margin-bottom-alt:auto;
        margin-left:0in;
        font-size:11.0pt;
        font-family:"Calibri",sans-serif;}
span.EmailStyle23
        {mso-style-type:personal-compose;
        font-family:"Arial",sans-serif;
        color:#0078D7;}
.MsoChpDefault
        {mso-style-type:export-only;
        font-size:10.0pt;}
@page WordSection1
        {size:8.5in 11.0in;
        margin:1.0in 1.0in 1.0in 1.0in;}
div.WordSection1
        {page:WordSection1;}
--></style>
      <div class="WordSection1">
        <p class="msipheader4d0fcdd7" style="margin:0in;margin-bottom:.0001pt"><span style="font-size:10.0pt;font-family:"Arial",sans-serif;color:#0078D7">[AMD
            Official Use Only - Internal Distribution Only]</span><o:p></o:p></p>
        <p class="MsoNormal"><o:p> </o:p></p>
        <p class="MsoNormal">Hi Felix,<o:p></o:p></p>
        <p class="MsoNormal"><o:p> </o:p></p>
        <p class="MsoNormal">You’re probably right.<o:p></o:p></p>
        <p class="MsoNormal"><o:p> </o:p></p>
        <p class="MsoNormal">Passing Vega20 system:<o:p></o:p></p>
        <p class="MsoNormal">[   56.683273] amdgpu: [vram dbg]
          addr         3e7ffff8, val         deadbeef<o:p></o:p></p>
        <p class="MsoNormal">[   56.683349] amdgpu: [vram dbg]
          addr         3efed000, val         cafebabe <- potential
          misaligned access<o:p></o:p></p>
        <p class="MsoNormal"><o:p> </o:p></p>
        <p class="MsoNormal">Failing Vega20 system:<o:p></o:p></p>
        <p class="MsoNormal">[Apr16 12:00] amdgpu: [vram dbg]
          addr         be7ffff8, val         deadbeef<o:p></o:p></p>
        <p class="MsoNormal">[  +0.000082] amdgpu: [vram dbg]
          addr         befed000, val         ffffffff <- potential
          misaligned access<o:p></o:p></p>
        <p class="MsoNormal"><o:p> </o:p></p>
        <p class="MsoNormal">Thanks,<o:p></o:p></p>
        <p class="MsoNormal"><o:p> </o:p></p>
        <p class="MsoNormal">Jon<o:p></o:p></p>
        <p class="MsoNormal"><o:p> </o:p></p>
        <div>
          <div style="border:none;border-top:solid #E1E1E1
            1.0pt;padding:3.0pt 0in 0in 0in">
            <p class="MsoNormal"><b>From:</b> Kuehling, Felix
              <a class="moz-txt-link-rfc2396E" href="mailto:Felix.Kuehling@amd.com"><Felix.Kuehling@amd.com></a> <br>
              <b>Sent:</b> Wednesday, April 15, 2020 11:02 AM<br>
              <b>To:</b> Koenig, Christian
              <a class="moz-txt-link-rfc2396E" href="mailto:Christian.Koenig@amd.com"><Christian.Koenig@amd.com></a>; Kim, Jonathan
              <a class="moz-txt-link-rfc2396E" href="mailto:Jonathan.Kim@amd.com"><Jonathan.Kim@amd.com></a>; Deucher, Alexander
              <a class="moz-txt-link-rfc2396E" href="mailto:Alexander.Deucher@amd.com"><Alexander.Deucher@amd.com></a><br>
              <b>Cc:</b> Russell, Kent <a class="moz-txt-link-rfc2396E" href="mailto:Kent.Russell@amd.com"><Kent.Russell@amd.com></a>;
              <a class="moz-txt-link-abbreviated" href="mailto:amd-gfx@lists.freedesktop.org">amd-gfx@lists.freedesktop.org</a><br>
              <b>Subject:</b> Re: [PATCH] Revert "drm/amdgpu: use the
              BAR if possible in amdgpu_device_vram_access v2"<o:p></o:p></p>
          </div>
        </div>
        <p class="MsoNormal"><o:p> </o:p></p>
        <p class="MsoNormal"><o:p> </o:p></p>
        <div>
          <div>
            <p class="MsoNormal"><span style="font-size:12.0pt;color:black">The test does not
                access outside of the allocated memory, but it
                deliberately crosses a boundary where memory can be
                allocated non-contiguously. This is meant to catch
                problems where the access function doesn't handle
                non-contiguous VRAM allocations correctly, although
                with the way VRAM allocation has been optimized, I
                expect that most allocations are contiguous nowadays.
                The more interesting aspect of the test is that it
                performs misaligned memory accesses. The MMIO method of
                accessing VRAM explicitly handles misaligned accesses
                and breaks them down into dword-aligned accesses with
                proper masking and shifting.<o:p></o:p></span></p>
          </div>
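          <div>
            <p class="MsoNormal">The masking and shifting described above
              can be sketched roughly as follows. This is a simplified
              illustration, not the amdgpu implementation: sim_vram,
              sim_read_dword, sim_write_dword and sim_vram_access are
              hypothetical names, VRAM is modelled as a small array, and
              byte 0 of a dword is assumed to be its least significant
              byte.</p>
<pre>
```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the hardware: VRAM modelled as an array that
 * can only be read or written one aligned dword at a time. */
#define SIM_VRAM_DWORDS 16
static uint32_t sim_vram[SIM_VRAM_DWORDS];

static uint32_t sim_read_dword(uint64_t aligned_addr)
{
    assert(aligned_addr / 4 < SIM_VRAM_DWORDS);
    return sim_vram[aligned_addr / 4];
}

static void sim_write_dword(uint64_t aligned_addr, uint32_t value)
{
    assert(aligned_addr / 4 < SIM_VRAM_DWORDS);
    sim_vram[aligned_addr / 4] = value;
}

/* Copy len bytes between buf and the simulated VRAM at byte address pos,
 * issuing only dword-aligned accesses. write != 0 copies buf into VRAM. */
static void sim_vram_access(uint64_t pos, uint8_t *buf, size_t len, int write)
{
    while (len > 0) {
        uint64_t aligned = pos & ~(uint64_t)3;  /* round down to a dword */
        unsigned offset = (unsigned)(pos & 3);  /* byte offset in the dword */
        size_t bytes = 4 - offset;              /* bytes left in this dword */
        unsigned shift = offset * 8;
        uint32_t mask, value, data = 0;
        size_t i;

        if (bytes > len)
            bytes = len;

        /* Mask selecting `bytes` bytes starting at `offset`. */
        mask = (bytes == 4) ? 0xffffffffu
                            : (((1u << (bytes * 8)) - 1u) << shift);

        value = sim_read_dword(aligned);
        if (write) {
            /* Read-modify-write: keep bytes outside the mask untouched. */
            for (i = 0; i < bytes; i++)
                data |= (uint32_t)buf[i] << (8 * i);
            value = (value & ~mask) | ((data << shift) & mask);
            sim_write_dword(aligned, value);
        } else {
            data = (value & mask) >> shift;
            for (i = 0; i < bytes; i++)
                buf[i] = (uint8_t)(data >> (8 * i));
        }

        pos += bytes;
        buf += bytes;
        len -= bytes;
    }
}
```
</pre>
            <p class="MsoNormal">For example, with sim_vram[0] =
              0x11223344 and sim_vram[1] = 0x55667788, reading 4 bytes at
              byte address 2 yields 0x22, 0x11, 0x88, 0x77, and the
              request is served by two aligned dword reads.</p>
          </div>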
          <div>
            <p class="MsoNormal"><span style="font-size:12.0pt;color:black"><o:p> </o:p></span></p>
          </div>
          <div>
            <p class="MsoNormal"><span style="font-size:12.0pt;color:black">Could the unaligned
                nature of the memory access have something to do with
                hitting RAS errors? That's something unique to this test
                that we wouldn't see on a normal page table update or
                memory eviction.<o:p></o:p></span></p>
          </div>
          <div>
            <p class="MsoNormal"><span style="font-size:12.0pt;color:black"><o:p> </o:p></span></p>
          </div>
          <div>
            <p class="MsoNormal"><span style="font-size:12.0pt;color:black">Regards,<br>
                  Felix<o:p></o:p></span></p>
          </div>
          <div>
            <p class="MsoNormal"><span style="font-size:12.0pt;color:black"><o:p> </o:p></span></p>
          </div>
          <div class="MsoNormal" style="text-align:center" align="center">
            <hr width="98%" size="2" align="center">
          </div>
          <div id="divRplyFwdMsg">
            <p class="MsoNormal"><b><span style="color:black">From:</span></b><span style="color:black"> Koenig, Christian <<a href="mailto:Christian.Koenig@amd.com" moz-do-not-send="true">Christian.Koenig@amd.com</a>><br>
                <b>Sent:</b> Wednesday, April 15, 2020 6:58 AM<br>
                <b>To:</b> Kim, Jonathan <<a href="mailto:Jonathan.Kim@amd.com" moz-do-not-send="true">Jonathan.Kim@amd.com</a>>;
                Kuehling, Felix <<a href="mailto:Felix.Kuehling@amd.com" moz-do-not-send="true">Felix.Kuehling@amd.com</a>>;
                Deucher, Alexander <<a href="mailto:Alexander.Deucher@amd.com" moz-do-not-send="true">Alexander.Deucher@amd.com</a>><br>
                <b>Cc:</b> Russell, Kent <<a href="mailto:Kent.Russell@amd.com" moz-do-not-send="true">Kent.Russell@amd.com</a>>;
                <a href="mailto:amd-gfx@lists.freedesktop.org" moz-do-not-send="true">amd-gfx@lists.freedesktop.org</a>
                <<a href="mailto:amd-gfx@lists.freedesktop.org" moz-do-not-send="true">amd-gfx@lists.freedesktop.org</a>><br>
                <b>Subject:</b> Re: [PATCH] Revert "drm/amdgpu: use the
                BAR if possible in amdgpu_device_vram_access v2"</span>
              <o:p></o:p></p>
            <div>
              <p class="MsoNormal"> <o:p></o:p></p>
            </div>
          </div>
          <div>
            <div>
              <blockquote style="margin-top:5.0pt;margin-bottom:5.0pt">
                <p class="xmsonormal" style="background:white">To
                  elaborate on the PTRACE test, we PEEK 2 DWORDs inside
                  thunk-allocated mapped memory and 2 DWORDs outside
                  that boundary (it’s only about 4MB to the boundary). 
                  Then we POKE to swap the DWORD positions across the
                  boundary.  The RAS event on the single failing machine
                  happens on the out-of-boundary PEEK.<o:p></o:p></p>
              </blockquote>
              <p class="MsoNormal" style="background:white"><span style="color:black"><br>
                  Well, when you access outside of an allocated buffer, I
                  would expect that we never get as far as even touching
                  the hardware, because the kernel should block the
                  access with an -EPERM or -EFAULT. So it sounds like I'm
                  not understanding something correctly here.<br>
                  <br>
                  Apart from that, I completely agree that we need to
                  sort out any other RAS event first to make sure that
                  the system is not simply failing randomly.<br>
                  <br>
                  Regards,<br>
                  Christian.<br>
                  <br>
                  On 15.04.20 at 11:49, Kim, Jonathan wrote:</span><o:p></o:p></p>
            </div>
            <blockquote style="margin-top:5.0pt;margin-bottom:5.0pt">
              <div>
                <p class="xmsipheader87abd423" style="margin:0in;margin-bottom:.0001pt;background:white">
                  <span style="font-size:10.0pt;font-family:"Arial",sans-serif;color:#317100">[AMD
                    Public Use]</span><o:p></o:p></p>
                <p class="xmsonormal" style="background:white"> <o:p></o:p></p>
                <p class="xmsonormal" style="background:white">Hi
                  Christian,<o:p></o:p></p>
                <p class="xmsonormal" style="background:white"> <o:p></o:p></p>
                <p class="xmsonormal" style="background:white">That
                  could potentially be it.  With additional testing, 2
                  of 3 Vega20 machines never hit an error over BAR access
                  with the PTRACE test.  3 of 3 machines (from the same
                  pool) always hit an error with CWSR.<o:p></o:p></p>
                <p class="xmsonormal" style="background:white">To
                  elaborate on the PTRACE test, we PEEK 2 DWORDs inside
                  thunk-allocated mapped memory and 2 DWORDs outside
                  that boundary (it’s only about 4MB to the boundary). 
                  Then we POKE to swap the DWORD positions across the
                  boundary.  The RAS event on the single failing machine
                  happens on the out-of-boundary PEEK.<o:p></o:p></p>
                <p class="xmsonormal" style="background:white"> <o:p></o:p></p>
                <p class="xmsonormal" style="background:white">Felix
                  mentioned we don’t hit errors over general HDP access,
                  but that may not be true.  A posted Arcturus failure
                  system log (which I did not test myself) shows that
                  someone launched the ROCm bandwidth test, hit a VM
                  fault, and a RAS event ensued during evictions (I can
                  point to the internal ticket or log snippet offline if
                  interested).  Whether the RAS event was triggered by
                  BAR access or is the result of HW instability is beyond
                  me, since I don’t have access to the machine.<o:p></o:p></p>
                <p class="xmsonormal" style="background:white"> <o:p></o:p></p>
                <p class="xmsonormal" style="background:white">Thanks,<o:p></o:p></p>
                <p class="xmsonormal" style="background:white"> <o:p></o:p></p>
                <p class="xmsonormal" style="background:white">Jon<o:p></o:p></p>
                <p class="xmsonormal" style="background:white"> <o:p></o:p></p>
                <div>
                  <div style="border:none;border-top:solid #E1E1E1
                    1.0pt;padding:3.0pt 0in 0in 0in">
                    <p class="xmsonormal" style="background:white"><b>From:</b>
                      Koenig, Christian <a href="mailto:Christian.Koenig@amd.com" moz-do-not-send="true">
                        <Christian.Koenig@amd.com></a> <br>
                      <b>Sent:</b> Wednesday, April 15, 2020 4:11 AM<br>
                      <b>To:</b> Kim, Jonathan <a href="mailto:Jonathan.Kim@amd.com" moz-do-not-send="true"><Jonathan.Kim@amd.com></a>;
                      Kuehling, Felix
                      <a href="mailto:Felix.Kuehling@amd.com" moz-do-not-send="true"><Felix.Kuehling@amd.com></a>;
                      Deucher, Alexander
                      <a href="mailto:Alexander.Deucher@amd.com" moz-do-not-send="true"><Alexander.Deucher@amd.com></a><br>
                      <b>Cc:</b> Russell, Kent <a href="mailto:Kent.Russell@amd.com" moz-do-not-send="true"><Kent.Russell@amd.com></a>;
                      <a href="mailto:amd-gfx@lists.freedesktop.org" moz-do-not-send="true">amd-gfx@lists.freedesktop.org</a><br>
                      <b>Subject:</b> Re: [PATCH] Revert "drm/amdgpu:
                      use the BAR if possible in
                      amdgpu_device_vram_access v2"<o:p></o:p></p>
                  </div>
                </div>
                <p class="xmsonormal" style="background:white"> <o:p></o:p></p>
                <div>
                  <p class="xmsonormal" style="margin-bottom:12.0pt;background:white">Hi
                    Jon,<o:p></o:p></p>
                  <blockquote style="margin-top:5.0pt;margin-bottom:5.0pt">
                    <p class="xmsonormal" style="background:white">Also
                      cwsr tests fail on Vega20 with or without the
                      revert with the same RAS error.<o:p></o:p></p>
                  </blockquote>
                  <p class="xmsonormal" style="background:white"><br>
                    That sounds like the system/setup has a more general
                    problem.<br>
                    <br>
                    Could it be that we are seeing RAS errors because
                    there really is some hardware failure, but with the
                    MM path we don't trigger a RAS interrupt?<br>
                    <br>
                    Thanks,<br>
                    Christian.<br>
                    <br>
                    On 14.04.20 at 22:30, Kim, Jonathan wrote:<o:p></o:p></p>
                </div>
                <blockquote style="margin-top:5.0pt;margin-bottom:5.0pt">
                  <p class="xmsonormal" style="background:white"> <o:p></o:p></p>
                  <p class="xmsonormal" style="background:white">If
                    we’re passing the test on the revert, then the only
                    thing that’s different is that we’re no longer
                    invalidating HDP and doing a copy to host in
                    amdgpu_device_vram_access, since the function is
                    still called in ttm access_memory with BAR.<o:p></o:p></p>
                  <p class="xmsonormal" style="background:white"> <o:p></o:p></p>
                  <p class="xmsonormal" style="background:white">Also
                    cwsr tests fail on Vega20 with or without the revert
                    with the same RAS error.<o:p></o:p></p>
                  <p class="xmsonormal" style="background:white"> <o:p></o:p></p>
                  <p class="xmsonormal" style="background:white">Thanks,<o:p></o:p></p>
                  <p class="xmsonormal" style="background:white"> <o:p></o:p></p>
                  <p class="xmsonormal" style="background:white">Jon<o:p></o:p></p>
                  <p class="xmsonormal" style="background:white"> <o:p></o:p></p>
                  <div>
                    <div style="border:none;border-top:solid #E1E1E1
                      1.0pt;padding:3.0pt 0in 0in 0in">
                      <p class="xmsonormal" style="background:white"><b>From:</b>
                        Kuehling, Felix <a href="mailto:Felix.Kuehling@amd.com" moz-do-not-send="true">
                          <Felix.Kuehling@amd.com></a> <br>
                        <b>Sent:</b> Tuesday, April 14, 2020 2:32 PM<br>
                        <b>To:</b> Kim, Jonathan <a href="mailto:Jonathan.Kim@amd.com" moz-do-not-send="true"><Jonathan.Kim@amd.com></a>;
                        Koenig, Christian
                        <a href="mailto:Christian.Koenig@amd.com" moz-do-not-send="true"><Christian.Koenig@amd.com></a>;
                        Deucher, Alexander
                        <a href="mailto:Alexander.Deucher@amd.com" moz-do-not-send="true"><Alexander.Deucher@amd.com></a><br>
                        <b>Cc:</b> Russell, Kent <a href="mailto:Kent.Russell@amd.com" moz-do-not-send="true"><Kent.Russell@amd.com></a>;
                        <a href="mailto:amd-gfx@lists.freedesktop.org" moz-do-not-send="true">amd-gfx@lists.freedesktop.org</a><br>
                        <b>Subject:</b> Re: [PATCH] Revert "drm/amdgpu:
                        use the BAR if possible in
                        amdgpu_device_vram_access v2"<o:p></o:p></p>
                    </div>
                  </div>
                  <p class="xmsonormal" style="background:white"> <o:p></o:p></p>
                  <p style="background:white"><span style="color:black">I
                      wouldn't call it premature. Reverting is the usual
                      practice when there is a serious regression that
                      isn't fully understood or root-caused. As far as I
                      can tell, the problem has been reproduced on
                      multiple systems and different GPUs, and has
                      clearly been traced to Christian's commit. I think
                      that justifies reverting it for now.</span><o:p></o:p></p>
                  <p style="background:white"><span style="color:black">I
                      agree with Christian that a general HDP memory
                      access problem causing RAS errors would
                      potentially cause problems in other tests as well.
                      For example, common operations like GART table
                      updates, GPUVM page table updates, and PCIe
                      peer2peer accesses in ROCm applications use HDP,
                      but we're not seeing obvious problems from those.
                      So we need to understand what's special about this
                      test. I asked questions to that effect on our
                      other email thread.</span><o:p></o:p></p>
                  <p style="background:white"><span style="color:black">Regards,<br>
                        Felix</span><o:p></o:p></p>
                  <div>
                    <p class="xmsonormal" style="background:white">On
                      2020-04-14 at 10:51 a.m., Kim, Jonathan wrote:<o:p></o:p></p>
                  </div>
                  <blockquote style="margin-top:5.0pt;margin-bottom:5.0pt">
                    <p class="xmsonormal" style="background:white"> <o:p></o:p></p>
                    <p class="xmsonormal" style="background:white">I
                      think it’s premature to push this revert.<o:p></o:p></p>
                    <p class="xmsonormal" style="background:white"> <o:p></o:p></p>
                    <p class="xmsonormal" style="background:white">With
                      more testing, I’m getting failures from different
                      tests or sometimes none at all on my machine.<o:p></o:p></p>
                    <p class="xmsonormal" style="background:white"> <o:p></o:p></p>
                    <p class="xmsonormal" style="background:white">Kent,
                      let’s continue the discussion on the original
                      thread.<o:p></o:p></p>
                    <p class="xmsonormal" style="background:white"> <o:p></o:p></p>
                    <p class="xmsonormal" style="background:white">Thanks,<o:p></o:p></p>
                    <p class="xmsonormal" style="background:white"> <o:p></o:p></p>
                    <p class="xmsonormal" style="background:white">Jon<o:p></o:p></p>
                    <p class="xmsonormal" style="background:white"> <o:p></o:p></p>
                    <div>
                      <div style="border:none;border-top:solid #E1E1E1
                        1.0pt;padding:3.0pt 0in 0in 0in">
                        <p class="xmsonormal" style="background:white"><b>From:</b>
                          Koenig, Christian <a href="mailto:Christian.Koenig@amd.com" moz-do-not-send="true">
                            <Christian.Koenig@amd.com></a> <br>
                          <b>Sent:</b> Tuesday, April 14, 2020 10:47 AM<br>
                          <b>To:</b> Deucher, Alexander <a href="mailto:Alexander.Deucher@amd.com" moz-do-not-send="true"><Alexander.Deucher@amd.com></a><br>
                          <b>Cc:</b> Russell, Kent <a href="mailto:Kent.Russell@amd.com" moz-do-not-send="true"><Kent.Russell@amd.com></a>;
                          <a href="mailto:amd-gfx@lists.freedesktop.org" moz-do-not-send="true">amd-gfx@lists.freedesktop.org</a>;
                          Kuehling, Felix
                          <a href="mailto:Felix.Kuehling@amd.com" moz-do-not-send="true"><Felix.Kuehling@amd.com></a>;
                          Kim, Jonathan
                          <a href="mailto:Jonathan.Kim@amd.com" moz-do-not-send="true"><Jonathan.Kim@amd.com></a><br>
                          <b>Subject:</b> Re: [PATCH] Revert
                          "drm/amdgpu: use the BAR if possible in
                          amdgpu_device_vram_access v2"<o:p></o:p></p>
                      </div>
                    </div>
                    <p class="xmsonormal" style="background:white"> <o:p></o:p></p>
                    <div>
                      <div>
                        <div>
                          <div>
                            <div>
                              <p class="xmsonormal" style="background:white">That's exactly
                                my concern as well.
                                <o:p></o:p></p>
                              <div>
                                <p class="xmsonormal" style="background:white"> <o:p></o:p></p>
                              </div>
                              <div>
                                <p class="xmsonormal" style="background:white">This looks a
                                  bit like the test creates erroneous
                                  data somehow, but there doesn't seem
                                  to be a RAS check in the MM data path.<o:p></o:p></p>
                              </div>
                              <div>
                                <p class="xmsonormal" style="background:white"> <o:p></o:p></p>
                              </div>
                              <div>
                                <p class="xmsonormal" style="background:white">And now that
                                  we use the BAR path it goes up in
                                  flames.<o:p></o:p></p>
                              </div>
                              <div>
                                <p class="xmsonormal" style="background:white"> <o:p></o:p></p>
                              </div>
                              <div>
                                <p class="xmsonormal" style="background:white">I just don't
                                  see how we can create erroneous data
                                  in a test case?<o:p></o:p></p>
                              </div>
                              <div>
                                <p class="xmsonormal" style="background:white"> <o:p></o:p></p>
                              </div>
                              <div>
                                <p class="xmsonormal" style="background:white">Christian.<o:p></o:p></p>
                              </div>
                            </div>
                            <div>
                              <p class="xmsonormal" style="background:white"> <o:p></o:p></p>
                              <div>
                                <p class="xmsonormal" style="background:white">On 14.04.2020
                                  at 16:35, "Deucher, Alexander"
                                  <<a href="mailto:Alexander.Deucher@amd.com" moz-do-not-send="true">Alexander.Deucher@amd.com</a>> wrote:<o:p></o:p></p>
                                <blockquote style="border:none;border-left:solid
                                  #CCCCCC 1.0pt;padding:0in 0in 0in
6.0pt;margin-left:4.8pt;margin-top:5.0pt;margin-right:0in;margin-bottom:5.0pt">
                                  <div>
                                    <p class="xmsonormal" style="background:white"> <o:p></o:p></p>
                                    <div>
                                      <div>
                                        <p class="xmsonormal" style="background:white"><span style="font-size:12.0pt">If
                                            this causes an issue, any
                                            access to vram via the BAR
                                            could cause an issue.</span><o:p></o:p></p>
                                      </div>
                                      <div>
                                        <p class="xmsonormal" style="background:white"><span style="font-size:12.0pt"> </span><o:p></o:p></p>
                                      </div>
                                      <div>
                                        <p class="xmsonormal" style="background:white"><span style="font-size:12.0pt">Alex</span><o:p></o:p></p>
                                      </div>
                                      <div class="MsoNormal" style="text-align:center;background:white" align="center">
                                        <span style="color:black">
                                          <hr width="98%" size="2" align="center">
                                        </span></div>
                                      <div>
                                        <p class="xmsonormal" style="background:white"><b>From:</b>
                                          amd-gfx <<a href="mailto:amd-gfx-bounces@lists.freedesktop.org" moz-do-not-send="true">amd-gfx-bounces@lists.freedesktop.org</a>>
                                          on behalf of Russell, Kent
                                          <<a href="mailto:Kent.Russell@amd.com" moz-do-not-send="true">Kent.Russell@amd.com</a>><br>
                                          <b>Sent:</b> Tuesday, April
                                          14, 2020 10:19 AM<br>
                                          <b>To:</b> Koenig, Christian
                                          <<a href="mailto:Christian.Koenig@amd.com" moz-do-not-send="true">Christian.Koenig@amd.com</a>>;
                                          <a href="mailto:amd-gfx@lists.freedesktop.org" moz-do-not-send="true">amd-gfx@lists.freedesktop.org</a>
                                          <<a href="mailto:amd-gfx@lists.freedesktop.org" moz-do-not-send="true">amd-gfx@lists.freedesktop.org</a>><br>
                                          <b>Cc:</b> Kuehling, Felix
                                          <<a href="mailto:Felix.Kuehling@amd.com" moz-do-not-send="true">Felix.Kuehling@amd.com</a>>;
                                          Kim, Jonathan <<a href="mailto:Jonathan.Kim@amd.com" moz-do-not-send="true">Jonathan.Kim@amd.com</a>><br>
                                          <b>Subject:</b> RE: [PATCH]
                                          Revert "drm/amdgpu: use the
                                          BAR if possible in
                                          amdgpu_device_vram_access v2"
                                          <o:p></o:p></p>
                                        <div>
                                          <p class="xmsonormal" style="background:white"> <o:p></o:p></p>
                                        </div>
                                      </div>
                                      <div>
                                        <div>
                                          <p class="xmsonormal" style="background:white">
                                            On VG20 or MI100, as soon as
                                            we run the subtest, we get
                                            the dmesg output below, and
                                            then the kernel ends up
                                            hanging. I don't know enough
                                            about the test itself to
                                            know why this is occurring,
                                            but Jon Kim and Felix were
                                            discussing it on a separate
                                            thread when the issue was
                                            first reported, so they can
                                            hopefully provide some
                                            additional information.<br>
                                            <br>
                                             Kent<br>
                                            <br>
                                            > -----Original
                                            Message-----<br>
                                            > From: Christian König
                                            <<a href="mailto:ckoenig.leichtzumerken@gmail.com" moz-do-not-send="true">ckoenig.leichtzumerken@gmail.com</a>><br>
                                            > Sent: Tuesday, April
                                            14, 2020 9:52 AM<br>
                                            > To: Russell, Kent <<a href="mailto:Kent.Russell@amd.com" moz-do-not-send="true">Kent.Russell@amd.com</a>>;
                                            <a href="mailto:amd-gfx@lists.freedesktop.org" moz-do-not-send="true">amd-gfx@lists.freedesktop.org</a><br>
                                            > Subject: Re: [PATCH]
                                            Revert "drm/amdgpu: use the
                                            BAR if possible in<br>
                                            >
                                            amdgpu_device_vram_access
                                            v2"<br>
                                            > <br>
                                            > On 13.04.20 at 20:20,
                                            Kent Russell wrote:<br>
                                            > > This reverts
                                            commit
                                            c12b84d6e0d70f1185e6daddfd12afb671791b6e.<br>
                                            > > The original patch
                                            causes a RAS event and
                                            subsequent kernel hard-hang<br>
                                            > > when running the
                                            KFDMemoryTest.PtraceAccessInvisibleVram
                                            on VG20 and<br>
                                            > > Arcturus<br>
                                            > ><br>
                                            > > dmesg output at
                                            hang time:<br>
                                            > > [drm] RAS event of
                                            type
                                            ERREVENT_ATHUB_INTERRUPT
                                            detected!<br>
                                            > > amdgpu
                                            0000:67:00.0: GPU reset
                                            begin!<br>
                                            > > Evicting PASID
                                            0x8000 queues<br>
                                            > > Started evicting
                                            pasid 0x8000<br>
                                            > > qcm fence wait
                                            loop timeout expired<br>
                                            > > The cp might be in an unrecoverable state due to an unsuccessful queues preemption<br>
                                            > > Failed to evict process queues<br>
                                            > > Failed to suspend process 0x8000<br>
                                            > > Finished evicting pasid 0x8000<br>
                                            > > Started restoring pasid 0x8000<br>
                                            > > Finished restoring pasid 0x8000<br>
                                            > > [drm] UVD VCPU state may lost due to RAS ERREVENT_ATHUB_INTERRUPT<br>
                                            > > amdgpu:
                                            [powerplay] Failed to send
                                            message 0x26, response 0x0<br>
                                            > > amdgpu:
                                            [powerplay] Failed to set
                                            soft min gfxclk !<br>
                                            > > amdgpu:
                                            [powerplay] Failed to upload
                                            DPM Bootup Levels!<br>
                                            > > amdgpu:
                                            [powerplay] Failed to send
                                            message 0x7, response 0x0<br>
                                            > > amdgpu: [powerplay] [DisableAllSMUFeatures] Failed to disable all smu features!<br>
                                            > > amdgpu:
                                            [powerplay]
                                            [DisableDpmTasks] Failed to
                                            disable all smu features!<br>
                                            > > amdgpu:
                                            [powerplay] [PowerOffAsic]
                                            Failed to disable DPM!<br>
                                            > > [drm:amdgpu_device_ip_suspend_phase2 [amdgpu]] *ERROR* suspend of IP block &lt;powerplay&gt; failed -5<br>
                                            > <br>
                                            > Do you have more information on what's going wrong here?<br>
                                            > This is a really important patch for KFD debugging.<br>
                                            > <br>
                                            > ><br>
                                            > > Signed-off-by:
                                            Kent Russell <<a href="mailto:kent.russell@amd.com" moz-do-not-send="true">kent.russell@amd.com</a>><br>
                                            > <br>
                                            > Reviewed-by: Christian
                                            König <<a href="mailto:christian.koenig@amd.com" moz-do-not-send="true">christian.koenig@amd.com</a>><br>
                                            > <br>
                                            > > ---<br>
                                            > >  
                                            drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
                                            | 26 ----------------------<br>
                                            > >   1 file changed,
                                            26 deletions(-)<br>
                                            > ><br>
                                            > > diff --git
                                            a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c<br>
                                            > >
                                            b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c<br>
                                            > > index
                                            cf5d6e585634..a3f997f84020
                                            100644<br>
                                            > > ---
                                            a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c<br>
                                            > > +++
                                            b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c<br>
                                            > > @@ -254,32 +254,6 @@ void amdgpu_device_vram_access(struct amdgpu_device *adev, loff_t pos,<br>
                                            > >      uint32_t hi =
                                            ~0;<br>
                                            > >      uint64_t
                                            last;<br>
                                            > ><br>
                                            > > -<br>
                                            > > -#ifdef
                                            CONFIG_64BIT<br>
                                            > > -   last = min(pos
                                            + size,
                                            adev->gmc.visible_vram_size);<br>
                                            > > -   if (last >
                                            pos) {<br>
                                            > > -           void
                                            __iomem *addr =
                                            adev->mman.aper_base_kaddr
                                            + pos;<br>
                                            > > -           size_t
                                            count = last - pos;<br>
                                            > > -<br>
                                            > > -           if
                                            (write) {<br>
                                            > >
                                            -                  
                                            memcpy_toio(addr, buf,
                                            count);<br>
                                            > >
                                            -                   mb();<br>
                                            > >
                                            -                  
                                            amdgpu_asic_flush_hdp(adev,
                                            NULL);<br>
                                            > > -           } else
                                            {<br>
                                            > >
                                            -                  
                                            amdgpu_asic_invalidate_hdp(adev,
                                            NULL);<br>
                                            > >
                                            -                   mb();<br>
                                            > >
                                            -                  
                                            memcpy_fromio(buf, addr,
                                            count);<br>
                                            > > -           }<br>
                                            > > -<br>
                                            > > -           if
                                            (count == size)<br>
                                            > >
                                            -                   return;<br>
                                            > > -<br>
                                            > > -           pos +=
                                            count;<br>
                                            > > -           buf +=
                                            count / 4;<br>
                                            > > -           size
                                            -= count;<br>
                                            > > -   }<br>
                                            > > -#endif<br>
                                            > > -<br>
                                            > >     
                                            spin_lock_irqsave(&adev->mmio_idx_lock,
                                            flags);<br>
                                            > >      for (last =
                                            pos + size; pos < last;
                                            pos += 4) {<br>
                                            > >             
                                            uint32_t tmp = pos >>
                                            31;<br>
_______________________________________________<br>
                                            amd-gfx mailing list<br>
                                            <a href="mailto:amd-gfx@lists.freedesktop.org" moz-do-not-send="true">amd-gfx@lists.freedesktop.org</a><br>
                                            <a href="https://lists.freedesktop.org/mailman/listinfo/amd-gfx" moz-do-not-send="true">https://lists.freedesktop.org/mailman/listinfo/amd-gfx</a><o:p></o:p></p>
                                        </div>
                                      </div>
                                    </div>
                                  </div>
                                </blockquote>
                              </div>
                              <p class="xmsonormal" style="background:white"> <o:p></o:p></p>
                            </div>
                          </div>
                          <div>
                            <p class="xmsonormal" style="background:white"> <o:p></o:p></p>
                            <div>
                              <p class="xmsonormal" style="background:white">On 14.04.2020 at
                                16:35, "Deucher, Alexander" <<a href="mailto:Alexander.Deucher@amd.com" moz-do-not-send="true">Alexander.Deucher@amd.com</a>> wrote:<o:p></o:p></p>
                              <blockquote style="border:none;border-left:solid
                                #CCCCCC 1.0pt;padding:0in 0in 0in
6.0pt;margin-left:4.8pt;margin-top:5.0pt;margin-right:0in;margin-bottom:5.0pt">
                                <div>
                                  <p style="margin:15.0pt;background:white"><span style="font-size:10.0pt;font-family:"Arial",sans-serif;color:#317100">[AMD
                                      Public Use]</span><o:p></o:p></p>
                                  <p class="xmsonormal" style="background:white"> <o:p></o:p></p>
                                  <div>
                                    <div>
                                      <p class="xmsonormal" style="background:white"><span style="font-size:12.0pt">If
                                          this causes an issue, any
                                          access to vram via the BAR
                                          could cause an issue.</span><o:p></o:p></p>
                                    </div>
                                    <div>
                                      <p class="xmsonormal" style="background:white"><span style="font-size:12.0pt"> </span><o:p></o:p></p>
                                    </div>
                                    <div>
                                      <p class="xmsonormal" style="background:white"><span style="font-size:12.0pt">Alex</span><o:p></o:p></p>
                                    </div>
                                    <div class="MsoNormal" style="text-align:center;background:white" align="center">
                                      <span style="color:black">
                                        <hr width="98%" size="2" align="center">
                                      </span></div>
                                    <div>
                                      <p class="xmsonormal" style="background:white"><b>From:</b>
                                        amd-gfx <<a href="mailto:amd-gfx-bounces@lists.freedesktop.org" moz-do-not-send="true">amd-gfx-bounces@lists.freedesktop.org</a>>
                                        on behalf of Russell, Kent <<a href="mailto:Kent.Russell@amd.com" moz-do-not-send="true">Kent.Russell@amd.com</a>><br>
                                        <b>Sent:</b> Tuesday, April 14,
                                        2020 10:19 AM<br>
                                        <b>To:</b> Koenig, Christian
                                        <<a href="mailto:Christian.Koenig@amd.com" moz-do-not-send="true">Christian.Koenig@amd.com</a>>;
                                        <a href="mailto:amd-gfx@lists.freedesktop.org" moz-do-not-send="true">amd-gfx@lists.freedesktop.org</a>
                                        <<a href="mailto:amd-gfx@lists.freedesktop.org" moz-do-not-send="true">amd-gfx@lists.freedesktop.org</a>><br>
                                        <b>Cc:</b> Kuehling, Felix <<a href="mailto:Felix.Kuehling@amd.com" moz-do-not-send="true">Felix.Kuehling@amd.com</a>>;
                                        Kim, Jonathan <<a href="mailto:Jonathan.Kim@amd.com" moz-do-not-send="true">Jonathan.Kim@amd.com</a>><br>
                                        <b>Subject:</b> RE: [PATCH]
                                        Revert "drm/amdgpu: use the BAR
                                        if possible in
                                        amdgpu_device_vram_access v2"
                                        <o:p></o:p></p>
                                      <div>
                                        <p class="xmsonormal" style="background:white"> <o:p></o:p></p>
                                      </div>
                                    </div>
                                    <div>
                                      <div>
                                        <p class="xmsonormal" style="background:white">[AMD
                                          Official Use Only - Internal
                                          Distribution Only]<br>
                                          <br>
                                          On VG20 or MI100, as soon as
                                          we run the subtest, we get the
                                          dmesg output below, and then
                                          the kernel ends up hanging. I
                                          don't know enough about the
                                          test itself to know why this
                                          is occurring, but Jon Kim and
                                          Felix were discussing it on a
                                          separate thread when the issue
                                          was first reported, so they
                                          can hopefully provide some
                                          additional information.<br>
                                          <br>
                                           Kent<br>
                                          <br>
                                          > -----Original
                                          Message-----<br>
                                          > From: Christian König
                                          <<a href="mailto:ckoenig.leichtzumerken@gmail.com" moz-do-not-send="true">ckoenig.leichtzumerken@gmail.com</a>><br>
                                          > Sent: Tuesday, April 14,
                                          2020 9:52 AM<br>
                                          > To: Russell, Kent <<a href="mailto:Kent.Russell@amd.com" moz-do-not-send="true">Kent.Russell@amd.com</a>>;
                                          <a href="mailto:amd-gfx@lists.freedesktop.org" moz-do-not-send="true">amd-gfx@lists.freedesktop.org</a><br>
                                          > Subject: Re: [PATCH]
                                          Revert "drm/amdgpu: use the
                                          BAR if possible in<br>
                                          > amdgpu_device_vram_access
                                          v2"<br>
                                          > <br>
                                          > Am 13.04.20 um 20:20
                                          schrieb Kent Russell:<br>
                                          > > This reverts commit
c12b84d6e0d70f1185e6daddfd12afb671791b6e.<br>
                                          > > The original patch
                                          causes a RAS event and
                                          subsequent kernel hard-hang<br>
                                          > > when running the
                                          KFDMemoryTest.PtraceAccessInvisibleVram
                                          on VG20 and<br>
                                          > > Arcturus<br>
                                          > ><br>
                                          > > dmesg output at hang
                                          time:<br>
                                          > > [drm] RAS event of
                                          type ERREVENT_ATHUB_INTERRUPT
                                          detected!<br>
                                          > > amdgpu 0000:67:00.0:
                                          GPU reset begin!<br>
                                          > > Evicting PASID
                                          0x8000 queues<br>
                                          > > Started evicting
                                          pasid 0x8000<br>
                                          > > qcm fence wait loop
                                          timeout expired<br>
                                          > > The cp might be in
                                          an unrecoverable state due to
                                          an unsuccessful<br>
                                          > > queues preemption
                                          Failed to evict process queues
                                          Failed to suspend<br>
                                          > > process 0x8000
                                          Finished evicting pasid 0x8000
                                          Started restoring pasid<br>
                                          > > 0x8000 Finished
                                          restoring pasid 0x8000 [drm]
                                          UVD VCPU state may lost<br>
                                          > > due to RAS
                                          ERREVENT_ATHUB_INTERRUPT<br>
                                          > > amdgpu: [powerplay]
                                          Failed to send message 0x26,
                                          response 0x0<br>
                                          > > amdgpu: [powerplay]
                                          Failed to set soft min gfxclk
                                          !<br>
                                          > > amdgpu: [powerplay]
                                          Failed to upload DPM Bootup
                                          Levels!<br>
                                          > > amdgpu: [powerplay]
                                          Failed to send message 0x7,
                                          response 0x0<br>
                                          > > amdgpu: [powerplay]
                                          [DisableAllSMUFeatures] Failed
                                          to disable all smu<br>
                                          > features!<br>
                                          > > amdgpu: [powerplay]
                                          [DisableDpmTasks] Failed to
                                          disable all smu features!<br>
                                          > > amdgpu: [powerplay]
                                          [PowerOffAsic] Failed to
                                          disable DPM!<br>
                                          > >
                                          [drm:amdgpu_device_ip_suspend_phase2
                                          [amdgpu]] *ERROR* suspend of
                                          IP<br>
                                          > > block
                                          <powerplay> failed -5<br>
                                          > <br>
                                          > Do you have more
                                          information on what's going
                                          wrong here since this is a
                                          really<br>
                                          > important patch for KFD
                                          debugging.<br>
                                          > <br>
                                          > ><br>
                                          > > Signed-off-by: Kent
                                          Russell <<a href="mailto:kent.russell@amd.com" moz-do-not-send="true">kent.russell@amd.com</a>><br>
                                          > <br>
                                          > Reviewed-by: Christian
                                          König <<a href="mailto:christian.koenig@amd.com" moz-do-not-send="true">christian.koenig@amd.com</a>><br>
                                          > <br>
                                          > > ---<br>
                                          > >  
                                          drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
                                          | 26 ----------------------<br>
                                          > >   1 file changed, 26
                                          deletions(-)<br>
                                          > ><br>
                                          > > diff --git
                                          a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c<br>
                                          > >
                                          b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c<br>
                                          > > index
                                          cf5d6e585634..a3f997f84020
                                          100644<br>
                                          > > ---
                                          a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c<br>
                                          > > +++
                                          b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c<br>
                                          > > @@ -254,32 +254,6 @@
                                          void
                                          amdgpu_device_vram_access(struct<br>
                                          > amdgpu_device *adev,
                                          loff_t pos,<br>
                                          > >      uint32_t hi =
                                          ~0;<br>
                                          > >      uint64_t last;<br>
                                          > ><br>
                                          > > -<br>
                                          > > -#ifdef CONFIG_64BIT<br>
                                          > > -   last = min(pos +
                                          size,
                                          adev->gmc.visible_vram_size);<br>
                                          > > -   if (last >
                                          pos) {<br>
                                          > > -           void
                                          __iomem *addr =
                                          adev->mman.aper_base_kaddr
                                          + pos;<br>
                                          > > -           size_t
                                          count = last - pos;<br>
                                          > > -<br>
                                          > > -           if
                                          (write) {<br>
                                          > > -                  
                                          memcpy_toio(addr, buf, count);<br>
                                          > > -                  
                                          mb();<br>
                                          > > -                  
                                          amdgpu_asic_flush_hdp(adev,
                                          NULL);<br>
                                          > > -           } else {<br>
                                          > > -                  
amdgpu_asic_invalidate_hdp(adev, NULL);<br>
                                          > > -                  
                                          mb();<br>
                                          > > -                  
                                          memcpy_fromio(buf, addr,
                                          count);<br>
                                          > > -           }<br>
                                          > > -<br>
                                          > > -           if
                                          (count == size)<br>
                                          > > -                  
                                          return;<br>
                                          > > -<br>
                                          > > -           pos +=
                                          count;<br>
                                          > > -           buf +=
                                          count / 4;<br>
                                          > > -           size -=
                                          count;<br>
                                          > > -   }<br>
                                          > > -#endif<br>
                                          > > -<br>
                                          > >     
                                          spin_lock_irqsave(&adev->mmio_idx_lock,
                                          flags);<br>
                                          > >      for (last = pos
                                          + size; pos < last; pos +=
                                          4) {<br>
                                          > >             
                                          uint32_t tmp = pos >>
                                          31;<br>
_______________________________________________<br>
                                          amd-gfx mailing list<br>
                                          <a href="mailto:amd-gfx@lists.freedesktop.org" moz-do-not-send="true">amd-gfx@lists.freedesktop.org</a><br>
                                          <a href="https://lists.freedesktop.org/mailman/listinfo/amd-gfx" moz-do-not-send="true">https://lists.freedesktop.org/mailman/listinfo/amd-gfx</a><o:p></o:p></p>
                                      </div>
                                    </div>
                                  </div>
                                </div>
                              </blockquote>
                            </div>
                            <p class="xmsonormal" style="background:white"> <o:p></o:p></p>
                          </div>
                        </div>
                        <div>
                          <p class="xmsonormal" style="background:white"> <o:p></o:p></p>
                          <div>
                            <p class="xmsonormal" style="background:white">On 14.04.2020
                              at 16:35, "Deucher, Alexander" <<a href="mailto:Alexander.Deucher@amd.com" moz-do-not-send="true">Alexander.Deucher@amd.com</a>> wrote:<o:p></o:p></p>
                            <blockquote style="border:none;border-left:solid
                              #CCCCCC 1.0pt;padding:0in 0in 0in
6.0pt;margin-left:4.8pt;margin-top:5.0pt;margin-right:0in;margin-bottom:5.0pt">
                              <div>
                                <p style="margin:15.0pt;background:white"><span style="font-size:10.0pt;font-family:"Arial",sans-serif;color:#317100">[AMD
                                    Public Use]</span><o:p></o:p></p>
                                <p class="xmsonormal" style="background:white"> <o:p></o:p></p>
                                <div>
                                  <div>
                                    <p class="xmsonormal" style="background:white"><span style="font-size:12.0pt">If this
                                        causes an issue, any access to
                                        vram via the BAR could cause an
                                        issue.</span><o:p></o:p></p>
                                  </div>
                                  <div>
                                    <p class="xmsonormal" style="background:white"><span style="font-size:12.0pt"> </span><o:p></o:p></p>
                                  </div>
                                  <div>
                                    <p class="xmsonormal" style="background:white"><span style="font-size:12.0pt">Alex</span><o:p></o:p></p>
                                  </div>
                                  <div class="MsoNormal" style="text-align:center;background:white" align="center">
                                    <span style="color:black">
                                      <hr width="98%" size="2" align="center">
                                    </span></div>
                                  <div>
                                    <p class="xmsonormal" style="background:white"><b>From:</b>
                                      amd-gfx <<a href="mailto:amd-gfx-bounces@lists.freedesktop.org" moz-do-not-send="true">amd-gfx-bounces@lists.freedesktop.org</a>>
                                      on behalf of Russell, Kent <<a href="mailto:Kent.Russell@amd.com" moz-do-not-send="true">Kent.Russell@amd.com</a>><br>
                                      <b>Sent:</b> Tuesday, April 14,
                                      2020 10:19 AM<br>
                                      <b>To:</b> Koenig, Christian <<a href="mailto:Christian.Koenig@amd.com" moz-do-not-send="true">Christian.Koenig@amd.com</a>>;
                                      <a href="mailto:amd-gfx@lists.freedesktop.org" moz-do-not-send="true">amd-gfx@lists.freedesktop.org</a>
                                      <<a href="mailto:amd-gfx@lists.freedesktop.org" moz-do-not-send="true">amd-gfx@lists.freedesktop.org</a>><br>
                                      <b>Cc:</b> Kuehling, Felix <<a href="mailto:Felix.Kuehling@amd.com" moz-do-not-send="true">Felix.Kuehling@amd.com</a>>;
                                      Kim, Jonathan <<a href="mailto:Jonathan.Kim@amd.com" moz-do-not-send="true">Jonathan.Kim@amd.com</a>><br>
                                      <b>Subject:</b> RE: [PATCH] Revert
                                      "drm/amdgpu: use the BAR if
                                      possible in
                                      amdgpu_device_vram_access v2"
                                      <o:p></o:p></p>
                                    <div>
                                      <p class="xmsonormal" style="background:white"> <o:p></o:p></p>
                                    </div>
                                  </div>
                                  <div>
                                    <div>
                                      <p class="xmsonormal" style="background:white">[AMD
                                        Official Use Only - Internal
                                        Distribution Only]<br>
                                        <br>
                                        On VG20 or MI100, as soon as we
                                        run the subtest, we get the
                                        dmesg output below, and then the
                                        kernel ends up hanging. I don't
                                        know enough about the test
                                        itself to know why this is
                                        occurring, but Jon Kim and Felix
                                        were discussing it on a separate
                                        thread when the issue was first
                                        reported, so they can hopefully
                                        provide some additional
                                        information.<br>
                                        <br>
                                         Kent<br>
                                        <br>
                                        > -----Original Message-----<br>
                                        > From: Christian König <<a href="mailto:ckoenig.leichtzumerken@gmail.com" moz-do-not-send="true">ckoenig.leichtzumerken@gmail.com</a>><br>
                                        > Sent: Tuesday, April 14,
                                        2020 9:52 AM<br>
                                        > To: Russell, Kent <<a href="mailto:Kent.Russell@amd.com" moz-do-not-send="true">Kent.Russell@amd.com</a>>;
                                        <a href="mailto:amd-gfx@lists.freedesktop.org" moz-do-not-send="true">amd-gfx@lists.freedesktop.org</a><br>
                                        > Subject: Re: [PATCH] Revert
                                        "drm/amdgpu: use the BAR if
                                        possible in<br>
                                        > amdgpu_device_vram_access
                                        v2"<br>
                                        > <br>
                                        > On 13.04.20 at 20:20,
                                        Kent Russell wrote:<br>
                                        > > This reverts commit
                                        c12b84d6e0d70f1185e6daddfd12afb671791b6e.<br>
                                        > > The original patch
                                        causes a RAS event and
                                        subsequent kernel hard-hang<br>
                                        > > when running the
                                        KFDMemoryTest.PtraceAccessInvisibleVram
                                        on VG20 and<br>
                                        > > Arcturus<br>
                                        > ><br>
                                        > > dmesg output at hang
                                        time:<br>
                                        > > [drm] RAS event of
                                        type ERREVENT_ATHUB_INTERRUPT
                                        detected!<br>
                                        > > amdgpu 0000:67:00.0:
                                        GPU reset begin!<br>
                                        > > Evicting PASID 0x8000
                                        queues<br>
                                        > > Started evicting pasid
                                        0x8000<br>
                                        > > qcm fence wait loop timeout expired<br>
                                        > > The cp might be in an unrecoverable state due to an unsuccessful queues preemption<br>
                                        > > Failed to evict process queues<br>
                                        > > Failed to suspend process 0x8000<br>
                                        > > Finished evicting pasid 0x8000<br>
                                        > > Started restoring pasid 0x8000<br>
                                        > > Finished restoring pasid 0x8000<br>
                                        > > [drm] UVD VCPU state may lost due to RAS ERREVENT_ATHUB_INTERRUPT<br>
                                        > > amdgpu: [powerplay]
                                        Failed to send message 0x26,
                                        response 0x0<br>
                                        > > amdgpu: [powerplay]
                                        Failed to set soft min gfxclk !<br>
                                        > > amdgpu: [powerplay]
                                        Failed to upload DPM Bootup
                                        Levels!<br>
                                        > > amdgpu: [powerplay]
                                        Failed to send message 0x7,
                                        response 0x0<br>
                                        > > amdgpu: [powerplay]
                                        [DisableAllSMUFeatures] Failed
                                        to disable all smu features!<br>
                                        > > amdgpu: [powerplay]
                                        [DisableDpmTasks] Failed to
                                        disable all smu features!<br>
                                        > > amdgpu: [powerplay]
                                        [PowerOffAsic] Failed to disable
                                        DPM!<br>
                                        > >
                                        [drm:amdgpu_device_ip_suspend_phase2
                                        [amdgpu]] *ERROR* suspend of IP<br>
                                        > > block
                                        <powerplay> failed -5<br>
                                        > <br>
                                        > Do you have more
                                        information on what's going
                                        wrong here? This is a
                                        really<br>
                                        > important patch for KFD
                                        debugging.<br>
                                        > <br>
                                        > ><br>
                                        > > Signed-off-by: Kent
                                        Russell <<a href="mailto:kent.russell@amd.com" moz-do-not-send="true">kent.russell@amd.com</a>><br>
                                        > <br>
                                        > Reviewed-by: Christian
                                        König <<a href="mailto:christian.koenig@amd.com" moz-do-not-send="true">christian.koenig@amd.com</a>><br>
                                        > <br>
                                        > > ---<br>
                                        > >  
                                        drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
                                        | 26 ----------------------<br>
                                        > >   1 file changed, 26
                                        deletions(-)<br>
                                        > ><br>
                                        > > diff --git
                                        a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c<br>
                                        > >
                                        b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c<br>
                                        > > index
                                        cf5d6e585634..a3f997f84020
                                        100644<br>
                                        > > ---
                                        a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c<br>
                                        > > +++
                                        b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c<br>
                                        > > @@ -254,32 +254,6 @@
                                        void
                                        amdgpu_device_vram_access(struct
                                        amdgpu_device *adev, loff_t
                                        pos,<br>
                                        > >      uint32_t hi = ~0;<br>
                                        > >      uint64_t last;<br>
                                        > ><br>
                                        > > -<br>
                                        > > -#ifdef CONFIG_64BIT<br>
                                        > > -   last = min(pos +
                                        size,
                                        adev->gmc.visible_vram_size);<br>
                                        > > -   if (last > pos)
                                        {<br>
                                        > > -           void
                                        __iomem *addr =
                                        adev->mman.aper_base_kaddr +
                                        pos;<br>
                                        > > -           size_t
                                        count = last - pos;<br>
                                        > > -<br>
                                        > > -           if (write)
                                        {<br>
                                        > > -                  
                                        memcpy_toio(addr, buf, count);<br>
                                        > > -                  
                                        mb();<br>
                                        > > -                  
                                        amdgpu_asic_flush_hdp(adev,
                                        NULL);<br>
                                        > > -           } else {<br>
                                        > > -                  
                                        amdgpu_asic_invalidate_hdp(adev,
                                        NULL);<br>
                                        > > -                  
                                        mb();<br>
                                        > > -                  
                                        memcpy_fromio(buf, addr, count);<br>
                                        > > -           }<br>
                                        > > -<br>
                                        > > -           if (count
                                        == size)<br>
                                        > > -                  
                                        return;<br>
                                        > > -<br>
                                        > > -           pos +=
                                        count;<br>
                                        > > -           buf +=
                                        count / 4;<br>
                                        > > -           size -=
                                        count;<br>
                                        > > -   }<br>
                                        > > -#endif<br>
                                        > > -<br>
                                        > >     
                                        spin_lock_irqsave(&adev->mmio_idx_lock,
                                        flags);<br>
                                        > >      for (last = pos +
                                        size; pos < last; pos += 4) {<br>
                                        > >              uint32_t
                                        tmp = pos >> 31;<br>
_______________________________________________<br>
                                        amd-gfx mailing list<br>
                                        <a href="mailto:amd-gfx@lists.freedesktop.org" moz-do-not-send="true">amd-gfx@lists.freedesktop.org</a><br>
                                        <a href="https://lists.freedesktop.org/mailman/listinfo/amd-gfx" moz-do-not-send="true">https://lists.freedesktop.org/mailman/listinfo/amd-gfx</a><o:p></o:p></p>
                                    </div>
                                  </div>
                                </div>
                              </div>
                            </blockquote>
                          </div>
                          <p class="xmsonormal" style="background:white"> <o:p></o:p></p>
                        </div>
                      </div>
                      <div>
                        <p class="xmsonormal" style="background:white"> <o:p></o:p></p>
                        <div>
                          <p class="xmsonormal" style="background:white">On
                            14.04.2020 at 16:35, "Deucher,
                            Alexander" <<a href="mailto:Alexander.Deucher@amd.com" moz-do-not-send="true">Alexander.Deucher@amd.com</a>> wrote:<o:p></o:p></p>
                          <blockquote style="border:none;border-left:solid #CCCCCC
                            1.0pt;padding:0in 0in 0in
6.0pt;margin-left:4.8pt;margin-top:5.0pt;margin-right:0in;margin-bottom:5.0pt">
                            <div>
                              <p style="margin:15.0pt;background:white"><span style="font-size:10.0pt;font-family:"Arial",sans-serif;color:#317100">[AMD
                                  Public Use]</span><o:p></o:p></p>
                              <p class="xmsonormal" style="background:white"> <o:p></o:p></p>
                              <div>
                                <div>
                                  <p class="xmsonormal" style="background:white"><span style="font-size:12.0pt">If this
                                      causes an issue, any access to
                                      vram via the BAR could cause an
                                      issue.</span><o:p></o:p></p>
                                </div>
                                <div>
                                  <p class="xmsonormal" style="background:white"><span style="font-size:12.0pt"> </span><o:p></o:p></p>
                                </div>
                                <div>
                                  <p class="xmsonormal" style="background:white"><span style="font-size:12.0pt">Alex</span><o:p></o:p></p>
                                </div>
                                <div class="MsoNormal" style="text-align:center;background:white" align="center">
                                  <span style="color:black">
                                    <hr width="98%" size="2" align="center">
                                  </span></div>
                                <div>
                                  <p class="xmsonormal" style="background:white"><b>From:</b>
                                    amd-gfx <<a href="mailto:amd-gfx-bounces@lists.freedesktop.org" moz-do-not-send="true">amd-gfx-bounces@lists.freedesktop.org</a>>
                                    on behalf of Russell, Kent <<a href="mailto:Kent.Russell@amd.com" moz-do-not-send="true">Kent.Russell@amd.com</a>><br>
                                    <b>Sent:</b> Tuesday, April 14, 2020
                                    10:19 AM<br>
                                    <b>To:</b> Koenig, Christian <<a href="mailto:Christian.Koenig@amd.com" moz-do-not-send="true">Christian.Koenig@amd.com</a>>;
                                    <a href="mailto:amd-gfx@lists.freedesktop.org" moz-do-not-send="true">amd-gfx@lists.freedesktop.org</a>
                                    <<a href="mailto:amd-gfx@lists.freedesktop.org" moz-do-not-send="true">amd-gfx@lists.freedesktop.org</a>><br>
                                    <b>Cc:</b> Kuehling, Felix <<a href="mailto:Felix.Kuehling@amd.com" moz-do-not-send="true">Felix.Kuehling@amd.com</a>>;
                                    Kim, Jonathan <<a href="mailto:Jonathan.Kim@amd.com" moz-do-not-send="true">Jonathan.Kim@amd.com</a>><br>
                                    <b>Subject:</b> RE: [PATCH] Revert
                                    "drm/amdgpu: use the BAR if possible
                                    in amdgpu_device_vram_access v2"
                                    <o:p></o:p></p>
                                  <div>
                                    <p class="xmsonormal" style="background:white"> <o:p></o:p></p>
                                  </div>
                                </div>
                                <div>
                                  <div>
                                    <p class="xmsonormal" style="background:white">[AMD
                                      Official Use Only - Internal
                                      Distribution Only]<br>
                                      <br>
                                      On VG20 or MI100, as soon as we
                                      run the subtest, we get the dmesg
                                      output below, and then the kernel
                                      ends up hanging. I don't know
                                      enough about the test itself to
                                      know why this is occurring, but
                                      Jon Kim and Felix were discussing
                                      it on a separate thread when the
                                      issue was first reported, so they
                                      can hopefully provide some
                                      additional information.<br>
                                      <br>
                                       Kent<br>
                                      <br>
                                      > -----Original Message-----<br>
                                      > From: Christian König <<a href="mailto:ckoenig.leichtzumerken@gmail.com" moz-do-not-send="true">ckoenig.leichtzumerken@gmail.com</a>><br>
                                      > Sent: Tuesday, April 14, 2020
                                      9:52 AM<br>
                                      > To: Russell, Kent <<a href="mailto:Kent.Russell@amd.com" moz-do-not-send="true">Kent.Russell@amd.com</a>>;
                                      <a href="mailto:amd-gfx@lists.freedesktop.org" moz-do-not-send="true">amd-gfx@lists.freedesktop.org</a><br>
                                      > Subject: Re: [PATCH] Revert
                                      "drm/amdgpu: use the BAR if
                                      possible in<br>
                                      > amdgpu_device_vram_access v2"<br>
                                      > <br>
                                      > On 13.04.20 at 20:20,
                                      Kent Russell wrote:<br>
                                      > > This reverts commit
                                      c12b84d6e0d70f1185e6daddfd12afb671791b6e.<br>
                                      > > The original patch
                                      causes a RAS event and subsequent
                                      kernel hard-hang<br>
                                      > > when running the
                                      KFDMemoryTest.PtraceAccessInvisibleVram
                                      on VG20 and<br>
                                      > > Arcturus<br>
                                      > ><br>
                                      > > dmesg output at hang
                                      time:<br>
                                      > > [drm] RAS event of type
                                      ERREVENT_ATHUB_INTERRUPT detected!<br>
                                      > > amdgpu 0000:67:00.0: GPU
                                      reset begin!<br>
                                      > > Evicting PASID 0x8000
                                      queues<br>
                                      > > Started evicting pasid
                                      0x8000<br>
                                      > > qcm fence wait loop timeout expired<br>
                                      > > The cp might be in an unrecoverable state due to an unsuccessful queues preemption<br>
                                      > > Failed to evict process queues<br>
                                      > > Failed to suspend process 0x8000<br>
                                      > > Finished evicting pasid 0x8000<br>
                                      > > Started restoring pasid 0x8000<br>
                                      > > Finished restoring pasid 0x8000<br>
                                      > > [drm] UVD VCPU state may lost due to RAS ERREVENT_ATHUB_INTERRUPT<br>
                                      > > amdgpu: [powerplay]
                                      Failed to send message 0x26,
                                      response 0x0<br>
                                      > > amdgpu: [powerplay]
                                      Failed to set soft min gfxclk !<br>
                                      > > amdgpu: [powerplay]
                                      Failed to upload DPM Bootup
                                      Levels!<br>
                                      > > amdgpu: [powerplay]
                                      Failed to send message 0x7,
                                      response 0x0<br>
                                      > > amdgpu: [powerplay]
                                      [DisableAllSMUFeatures] Failed to
                                      disable all smu features!<br>
                                      > > amdgpu: [powerplay]
                                      [DisableDpmTasks] Failed to
                                      disable all smu features!<br>
                                      > > amdgpu: [powerplay]
                                      [PowerOffAsic] Failed to disable
                                      DPM!<br>
                                      > >
                                      [drm:amdgpu_device_ip_suspend_phase2
                                      [amdgpu]] *ERROR* suspend of IP<br>
                                      > > block <powerplay>
                                      failed -5<br>
                                      > <br>
                                      > Do you have more information
                                      on what's going wrong here?
                                      This is a really<br>
                                      > important patch for KFD
                                      debugging.<br>
                                      > <br>
                                      > ><br>
                                      > > Signed-off-by: Kent
                                      Russell <<a href="mailto:kent.russell@amd.com" moz-do-not-send="true">kent.russell@amd.com</a>><br>
                                      > <br>
                                      > Reviewed-by: Christian König
                                      <<a href="mailto:christian.koenig@amd.com" moz-do-not-send="true">christian.koenig@amd.com</a>><br>
                                      > <br>
                                      > > ---<br>
                                      > >  
                                      drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
                                      | 26 ----------------------<br>
                                      > >   1 file changed, 26
                                      deletions(-)<br>
                                      > ><br>
                                      > > diff --git
                                      a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c<br>
                                      > >
                                      b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c<br>
                                      > > index
                                      cf5d6e585634..a3f997f84020 100644<br>
                                      > > ---
                                      a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c<br>
                                      > > +++
                                      b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c<br>
                                      > > @@ -254,32 +254,6 @@
                                      void
                                      amdgpu_device_vram_access(struct
                                      amdgpu_device *adev, loff_t
                                      pos,<br>
                                      > >      uint32_t hi = ~0;<br>
                                      > >      uint64_t last;<br>
                                      > ><br>
                                      > > -<br>
                                      > > -#ifdef CONFIG_64BIT<br>
                                      > > -   last = min(pos +
                                      size,
                                      adev->gmc.visible_vram_size);<br>
                                      > > -   if (last > pos) {<br>
                                      > > -           void __iomem
                                      *addr =
                                      adev->mman.aper_base_kaddr +
                                      pos;<br>
                                      > > -           size_t count
                                      = last - pos;<br>
                                      > > -<br>
                                      > > -           if (write) {<br>
                                      > > -                  
                                      memcpy_toio(addr, buf, count);<br>
                                      > > -                  
                                      mb();<br>
                                      > > -                  
                                      amdgpu_asic_flush_hdp(adev, NULL);<br>
                                      > > -           } else {<br>
                                      > > -                  
                                      amdgpu_asic_invalidate_hdp(adev,
                                      NULL);<br>
                                      > > -                  
                                      mb();<br>
                                      > > -                  
                                      memcpy_fromio(buf, addr, count);<br>
                                      > > -           }<br>
                                      > > -<br>
                                      > > -           if (count ==
                                      size)<br>
                                      > > -                  
                                      return;<br>
                                      > > -<br>
                                      > > -           pos +=
                                      count;<br>
                                      > > -           buf += count
                                      / 4;<br>
                                      > > -           size -=
                                      count;<br>
                                      > > -   }<br>
                                      > > -#endif<br>
                                      > > -<br>
                                      > >     
                                      spin_lock_irqsave(&adev->mmio_idx_lock,
                                      flags);<br>
                                      > >      for (last = pos +
                                      size; pos < last; pos += 4) {<br>
                                      > >              uint32_t
                                      tmp = pos >> 31;<br>
_______________________________________________<br>
                                      amd-gfx mailing list<br>
                                      <a href="mailto:amd-gfx@lists.freedesktop.org" moz-do-not-send="true">amd-gfx@lists.freedesktop.org</a><br>
                                      <a href="https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flists.freedesktop.org%2Fmailman%2Flistinfo%2Famd-gfx&amp;data=02%7C01%7Calexander.deucher%40amd.com%7C68e0bfea2a5f4a909ab108d7e07ed164%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637224707637289768&amp;sdata=ttNOHJt0IwywpOIWahKjjuC6OkT1jxduc6iMzYzndpg%3D&amp;reserved=0" moz-do-not-send="true">https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flists.freedesktop.org%2Fmailman%2Flistinfo%2Famd-gfx&amp;data=02%7C01%7Calexander.deucher%40amd.com%7C68e0bfea2a5f4a909ab108d7e07ed164%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637224707637289768&amp;sdata=ttNOHJt0IwywpOIWahKjjuC6OkT1jxduc6iMzYzndpg%3D&amp;reserved=0</a><o:p></o:p></p>
                                  </div>
                                </div>
                              </div>
                            </div>
                          </blockquote>
                        </div>
                        <p class="xmsonormal" style="background:white"> <o:p></o:p></p>
                      </div>
                    </div>
                    <div>
                      <p class="xmsonormal" style="background:white"> <o:p></o:p></p>
                      <div>
                        <p class="xmsonormal" style="background:white">On
                          14.04.2020 at 16:35, "Deucher, Alexander"
                          <<a href="mailto:Alexander.Deucher@amd.com" moz-do-not-send="true">Alexander.Deucher@amd.com</a>> wrote:<o:p></o:p></p>
                      </div>
                    </div>
                    <div>
                      <p style="margin:15.0pt;background:white"><span style="font-size:10.0pt;font-family:"Arial",sans-serif;color:#317100">[AMD
                          Public Use]</span><o:p></o:p></p>
                      <p class="xmsonormal" style="background:white"> <o:p></o:p></p>
                      <div>
                        <div>
                          <p class="xmsonormal" style="background:white"><span style="font-size:12.0pt">If this causes an
                              issue, any access to vram via the BAR
                              could cause an issue.</span><o:p></o:p></p>
                        </div>
                        <div>
                          <p class="xmsonormal" style="background:white"><span style="font-size:12.0pt"> </span><o:p></o:p></p>
                        </div>
                        <div>
                          <p class="xmsonormal" style="background:white"><span style="font-size:12.0pt">Alex</span><o:p></o:p></p>
                        </div>
                        <div class="MsoNormal" style="text-align:center;background:white" align="center">
                          <span style="color:black">
                            <hr width="98%" size="2" align="center">
                          </span></div>
                        <div id="x_divRplyFwdMsg">
                          <p class="xmsonormal" style="background:white"><b>From:</b>
                            amd-gfx <<a href="mailto:amd-gfx-bounces@lists.freedesktop.org" moz-do-not-send="true">amd-gfx-bounces@lists.freedesktop.org</a>>
                            on behalf of Russell, Kent <<a href="mailto:Kent.Russell@amd.com" moz-do-not-send="true">Kent.Russell@amd.com</a>><br>
                            <b>Sent:</b> Tuesday, April 14, 2020 10:19
                            AM<br>
                            <b>To:</b> Koenig, Christian <<a href="mailto:Christian.Koenig@amd.com" moz-do-not-send="true">Christian.Koenig@amd.com</a>>;
                            <a href="mailto:amd-gfx@lists.freedesktop.org" moz-do-not-send="true">amd-gfx@lists.freedesktop.org</a>
                            <<a href="mailto:amd-gfx@lists.freedesktop.org" moz-do-not-send="true">amd-gfx@lists.freedesktop.org</a>><br>
                            <b>Cc:</b> Kuehling, Felix <<a href="mailto:Felix.Kuehling@amd.com" moz-do-not-send="true">Felix.Kuehling@amd.com</a>>;
                            Kim, Jonathan <<a href="mailto:Jonathan.Kim@amd.com" moz-do-not-send="true">Jonathan.Kim@amd.com</a>><br>
                            <b>Subject:</b> RE: [PATCH] Revert
                            "drm/amdgpu: use the BAR if possible in
                            amdgpu_device_vram_access v2"
                            <o:p></o:p></p>
                          <div>
                            <p class="xmsonormal" style="background:white"> <o:p></o:p></p>
                          </div>
                        </div>
                        <div>
                          <div>
                            <p class="xmsonormal" style="background:white">[AMD Official Use
                              Only - Internal Distribution Only]<br>
                              <br>
                              On VG20 or MI100, as soon as we run the
                              subtest, we get the dmesg output below,
                              and then the kernel ends up hanging. I
                              don't know enough about the test itself to
                              know why this is occurring, but Jon Kim
                              and Felix were discussing it on a separate
                              thread when the issue was first reported,
                              so they can hopefully provide some
                              additional information.<br>
                              <br>
                               Kent<br>
                              <br>
                              > -----Original Message-----<br>
                              > From: Christian König <<a href="mailto:ckoenig.leichtzumerken@gmail.com" moz-do-not-send="true">ckoenig.leichtzumerken@gmail.com</a>><br>
                              > Sent: Tuesday, April 14, 2020 9:52 AM<br>
                              > To: Russell, Kent <<a href="mailto:Kent.Russell@amd.com" moz-do-not-send="true">Kent.Russell@amd.com</a>>;
                              <a href="mailto:amd-gfx@lists.freedesktop.org" moz-do-not-send="true">amd-gfx@lists.freedesktop.org</a><br>
                              > Subject: Re: [PATCH] Revert
                              "drm/amdgpu: use the BAR if possible in<br>
                              > amdgpu_device_vram_access v2"<br>
                              > <br>
                              > On 13.04.20 at 20:20, Kent
                              Russell wrote:<br>
                              > > This reverts commit
                              c12b84d6e0d70f1185e6daddfd12afb671791b6e.<br>
                              > > The original patch causes a RAS
                              event and a subsequent kernel hard hang<br>
                              > > when running the
                              KFDMemoryTest.PtraceAccessInvisibleVram subtest on
                              VG20 and<br>
                              > > Arcturus.<br>
                              > ><br>
                              > > dmesg output at hang time:<br>
                              > > [drm] RAS event of type
                              ERREVENT_ATHUB_INTERRUPT detected!<br>
                              > > amdgpu 0000:67:00.0: GPU reset
                              begin!<br>
                              > > Evicting PASID 0x8000 queues<br>
                              > > Started evicting pasid 0x8000<br>
                              > > qcm fence wait loop timeout
                              expired<br>
                              > > The cp might be in an
                              unrecoverable state due to an unsuccessful
                              queues preemption<br>
                              > > Failed to evict process queues<br>
                              > > Failed to suspend process 0x8000<br>
                              > > Finished evicting pasid 0x8000<br>
                              > > Started restoring pasid 0x8000<br>
                              > > Finished restoring pasid 0x8000<br>
                              > > [drm] UVD VCPU state may lost
                              due to RAS ERREVENT_ATHUB_INTERRUPT<br>
                              > > amdgpu: [powerplay] Failed to
                              send message 0x26, response 0x0<br>
                              > > amdgpu: [powerplay] Failed to
                              set soft min gfxclk !<br>
                              > > amdgpu: [powerplay] Failed to
                              upload DPM Bootup Levels!<br>
                              > > amdgpu: [powerplay] Failed to
                              send message 0x7, response 0x0<br>
                              > > amdgpu: [powerplay]
                              [DisableAllSMUFeatures] Failed to disable
                              all smu<br>
                              > features!<br>
                              > > amdgpu: [powerplay]
                              [DisableDpmTasks] Failed to disable all
                              smu features!<br>
                              > > amdgpu: [powerplay]
                              [PowerOffAsic] Failed to disable DPM!<br>
                              > >
                              [drm:amdgpu_device_ip_suspend_phase2
                              [amdgpu]] *ERROR* suspend of IP<br>
                              > > block &lt;powerplay&gt; failed
                              -5<br>
                              > <br>
                              > Do you have more information on
                              what's going wrong here? This is a
                              really<br>
                              > important patch for KFD debugging.<br>
                              > <br>
                              > ><br>
                              > > Signed-off-by: Kent Russell <<a href="mailto:kent.russell@amd.com" moz-do-not-send="true">kent.russell@amd.com</a>><br>
                              > <br>
                              > Reviewed-by: Christian König <<a href="mailto:christian.koenig@amd.com" moz-do-not-send="true">christian.koenig@amd.com</a>><br>
                              > <br>
                              > > ---<br>
                              > >  
                              drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
                              | 26 ----------------------<br>
                              > >   1 file changed, 26
                              deletions(-)<br>
                              > ><br>
                              > > diff --git
                              a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c<br>
                              > >
                              b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c<br>
                              > > index cf5d6e585634..a3f997f84020
                              100644<br>
                              > > ---
                              a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c<br>
                              > > +++
                              b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c<br>
                              > > @@ -254,32 +254,6 @@ void
                              amdgpu_device_vram_access(struct<br>
                              > amdgpu_device *adev, loff_t pos,<br>
                              > >      uint32_t hi = ~0;<br>
                              > >      uint64_t last;<br>
                              > ><br>
                              > > -<br>
                              > > -#ifdef CONFIG_64BIT<br>
                              > > -   last = min(pos + size,
                              adev->gmc.visible_vram_size);<br>
                              > > -   if (last > pos) {<br>
                              > > -           void __iomem *addr =
                              adev->mman.aper_base_kaddr + pos;<br>
                              > > -           size_t count = last
                              - pos;<br>
                              > > -<br>
                              > > -           if (write) {<br>
                              > > -                  
                              memcpy_toio(addr, buf, count);<br>
                              > > -                   mb();<br>
                              > > -                  
                              amdgpu_asic_flush_hdp(adev, NULL);<br>
                              > > -           } else {<br>
                              > > -                  
                              amdgpu_asic_invalidate_hdp(adev, NULL);<br>
                              > > -                   mb();<br>
                              > > -                  
                              memcpy_fromio(buf, addr, count);<br>
                              > > -           }<br>
                              > > -<br>
                              > > -           if (count == size)<br>
                              > > -                   return;<br>
                              > > -<br>
                              > > -           pos += count;<br>
                              > > -           buf += count / 4;<br>
                              > > -           size -= count;<br>
                              > > -   }<br>
                              > > -#endif<br>
                              > > -<br>
                              > >     
                              spin_lock_irqsave(&adev->mmio_idx_lock,
                              flags);<br>
                              > >      for (last = pos + size; pos
                              < last; pos += 4) {<br>
                              > >              uint32_t tmp = pos
                              >> 31;<br>
                              <o:p></o:p></p>
                          </div>
                        </div>
                      </div>
                    </div>
                  </blockquote>
                </blockquote>
                <p class="xmsonormal" style="background:white"> <o:p></o:p></p>
              </div>
            </blockquote>
            <p class="MsoNormal" style="background:white"><o:p> </o:p></p>
          </div>
        </div>
      </div>
    </blockquote>
    <br>
  </body>
</html>