<html><head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
  </head>
  <body text="#000000" bgcolor="#FFFFFF">
    <div class="moz-cite-prefix">
      <blockquote type="cite">
        <p class="MsoNormal"><span style="color:windowtext">To elaborate
            on the PTRACE test, we PEEK 2 DWORDs inside thunk-allocated
            mapped memory and 2 DWORDs outside that boundary (it’s only
            about 4MB to the boundary).  Then we POKE to swap the DWORD
            positions across the boundary.  The RAS event on the single
            failing machine happens on the out-of-boundary PEEK.</span></p>
        <span style="color:windowtext"></span></blockquote>
      <br>
      Well, when you access memory outside of an allocated buffer, I would
      expect that we never get as far as touching the hardware, because the
      kernel should block the access with -EPERM or -EFAULT. So it
      sounds like I'm not understanding something correctly here.<br>
      <br>
      Apart from that, I completely agree that we need to sort out any
      other RAS events first to make sure that the system is not simply
      failing randomly.<br>
      <br>
      Regards,<br>
      Christian.<br>
      <br>
      On 15.04.20 at 11:49, Kim, Jonathan wrote:<br>
    </div>
    <blockquote type="cite" cite="mid:MN2PR12MB4518963F186CF8528A620A7D85DB0@MN2PR12MB4518.namprd12.prod.outlook.com">
      
      <style><!--
/* Font Definitions */
@font-face
        {font-family:"Cambria Math";
        panose-1:2 4 5 3 5 4 6 3 2 4;}
@font-face
        {font-family:Calibri;
        panose-1:2 15 5 2 2 2 4 3 2 4;}
/* Style Definitions */
p.MsoNormal, li.MsoNormal, div.MsoNormal
        {margin:0in;
        margin-bottom:.0001pt;
        font-size:11.0pt;
        font-family:"Calibri",sans-serif;
        color:black;}
a:link, span.MsoHyperlink
        {mso-style-priority:99;
        color:blue;
        text-decoration:underline;}
p.msipheader4d0fcdd7, li.msipheader4d0fcdd7, div.msipheader4d0fcdd7
        {mso-style-name:msipheader4d0fcdd7;
        mso-margin-top-alt:auto;
        margin-right:0in;
        mso-margin-bottom-alt:auto;
        margin-left:0in;
        font-size:11.0pt;
        font-family:"Calibri",sans-serif;
        color:black;}
p.msipheader87abd423, li.msipheader87abd423, div.msipheader87abd423
        {mso-style-name:msipheader87abd423;
        mso-margin-top-alt:auto;
        margin-right:0in;
        mso-margin-bottom-alt:auto;
        margin-left:0in;
        font-size:11.0pt;
        font-family:"Calibri",sans-serif;}
span.EmailStyle21
        {mso-style-type:personal-compose;
        font-family:"Arial",sans-serif;
        color:#317100;}
.MsoChpDefault
        {mso-style-type:export-only;
        font-size:10.0pt;}
@page WordSection1
        {size:8.5in 11.0in;
        margin:1.0in 1.0in 1.0in 1.0in;}
div.WordSection1
        {page:WordSection1;}
--></style>
      <div class="WordSection1">
        <p class="msipheader87abd423" style="margin:0in;margin-bottom:.0001pt"><span style="font-size:10.0pt;font-family:'Arial',sans-serif;color:#317100">[AMD
            Public Use]</span><o:p></o:p></p>
        <p class="MsoNormal"><span style="color:windowtext"><o:p> </o:p></span></p>
        <p class="MsoNormal"><span style="color:windowtext">Hi
            Christian,<o:p></o:p></span></p>
        <p class="MsoNormal"><span style="color:windowtext"><o:p> </o:p></span></p>
        <p class="MsoNormal"><span style="color:windowtext">That could
            potentially be it.  With additional testing, 2 of 3 Vega20
            machines never hit an error over BAR access with the PTRACE
            test.  3 of 3 machines (from the same pool) always hit an error
            with CWSR.<o:p></o:p></span></p>
        <p class="MsoNormal"><span style="color:windowtext">To elaborate
            on the PTRACE test, we PEEK 2 DWORDs inside thunk-allocated
            mapped memory and 2 DWORDs outside that boundary (it’s only
            about 4MB to the boundary).  Then we POKE to swap the DWORD
            positions across the boundary.  The RAS event on the single
            failing machine happens on the out-of-boundary PEEK.<o:p></o:p></span></p>
        <p class="MsoNormal"><span style="color:windowtext"><o:p> </o:p></span></p>
        <p class="MsoNormal"><span style="color:windowtext">Felix
            mentioned we don’t hit errors over general HDP access, but
            that may not be true.  A posted syslog from an Arcturus failure
            (which I did not test myself) shows that someone launched the rocm
            bandwidth test, hit a VM fault, and a RAS event ensued during
            evictions (I can point to the internal ticket or log snippet
            offline if interested).  Whether the RAS event is triggered by
            BAR access or is the result of HW instability is beyond me, since
            I don’t have access to the machine.<o:p></o:p></span></p>
        <p class="MsoNormal"><span style="color:windowtext"><o:p> </o:p></span></p>
        <p class="MsoNormal"><span style="color:windowtext">Thanks,<o:p></o:p></span></p>
        <p class="MsoNormal"><span style="color:windowtext"><o:p> </o:p></span></p>
        <p class="MsoNormal"><span style="color:windowtext">Jon<o:p></o:p></span></p>
        <p class="MsoNormal"><span style="color:windowtext"><o:p> </o:p></span></p>
        <div>
          <div style="border:none;border-top:solid #E1E1E1
            1.0pt;padding:3.0pt 0in 0in 0in">
            <p class="MsoNormal"><b><span style="color:windowtext">From:</span></b><span style="color:windowtext"> Koenig, Christian
                <a class="moz-txt-link-rfc2396E" href="mailto:Christian.Koenig@amd.com">&lt;Christian.Koenig@amd.com&gt;</a>
                <br>
                <b>Sent:</b> Wednesday, April 15, 2020 4:11 AM<br>
                <b>To:</b> Kim, Jonathan <a class="moz-txt-link-rfc2396E" href="mailto:Jonathan.Kim@amd.com">&lt;Jonathan.Kim@amd.com&gt;</a>;
                Kuehling, Felix <a class="moz-txt-link-rfc2396E" href="mailto:Felix.Kuehling@amd.com">&lt;Felix.Kuehling@amd.com&gt;</a>; Deucher,
                Alexander <a class="moz-txt-link-rfc2396E" href="mailto:Alexander.Deucher@amd.com">&lt;Alexander.Deucher@amd.com&gt;</a><br>
                <b>Cc:</b> Russell, Kent <a class="moz-txt-link-rfc2396E" href="mailto:Kent.Russell@amd.com">&lt;Kent.Russell@amd.com&gt;</a>;
                <a class="moz-txt-link-abbreviated" href="mailto:amd-gfx@lists.freedesktop.org">amd-gfx@lists.freedesktop.org</a><br>
                <b>Subject:</b> Re: [PATCH] Revert "drm/amdgpu: use the
                BAR if possible in amdgpu_device_vram_access v2"<o:p></o:p></span></p>
          </div>
        </div>
        <p class="MsoNormal"><o:p> </o:p></p>
        <div>
          <p class="MsoNormal" style="margin-bottom:12.0pt">Hi Jon,<br>
            <br>
            <o:p></o:p></p>
          <blockquote style="margin-top:5.0pt;margin-bottom:5.0pt">
            <p class="MsoNormal">Also cwsr tests fail on Vega20 with or
              without the revert with the same RAS error.<o:p></o:p></p>
          </blockquote>
          <p class="MsoNormal"><br>
            That sounds like the system/setup has a more general
            problem.<br>
            <br>
            Could it be that we are seeing RAS errors because there
            really is some hardware failure, but with the MM path we
            don't trigger a RAS interrupt?<br>
            <br>
            Thanks,<br>
            Christian.<br>
            <br>
            On 14.04.20 at 22:30, Kim, Jonathan wrote:<o:p></o:p></p>
        </div>
        <blockquote style="margin-top:5.0pt;margin-bottom:5.0pt">
          <p class="msipheader4d0fcdd7" style="margin:0in;margin-bottom:.0001pt"><span style="font-size:10.0pt;font-family:'Arial',sans-serif;color:#0078D7">[AMD
              Official Use Only - Internal Distribution Only]</span><o:p></o:p></p>
          <p class="MsoNormal"> <o:p></o:p></p>
          <p class="MsoNormal">If we’re passing the test with the revert,
            then the only difference is that we’re no longer
            invalidating HDP and doing a copy to host in
            amdgpu_device_vram_access, since the function is still called
            in ttm access_memory with BAR.<o:p></o:p></p>
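          <p class="MsoNormal">As a rough sketch of the range arithmetic in the reverted hunk (not driver code): the part of a request below visible_vram_size went through the BAR, and any remainder fell through to the MM path. The helper name split_vram_access, the struct vram_split, and the 256 MB visible size used in the note below are made up for illustration.<o:p></o:p></p>

<pre>
```c
/* Hypothetical result type: bytes handled by each access path. */
struct vram_split {
    unsigned long long bar_bytes; /* copied through the CPU-visible BAR */
    unsigned long long mm_bytes;  /* left over for the MM register path */
};

/*
 * Same arithmetic as the reverted hunk:
 *   last = min(pos + size, visible_vram_size);
 *   if (last > pos) count = last - pos;
 * Assumes pos + size does not overflow.
 */
static struct vram_split split_vram_access(unsigned long long pos,
                                           unsigned long long size,
                                           unsigned long long visible_vram_size)
{
    struct vram_split s = { 0, size };
    unsigned long long end = pos + size;
    unsigned long long last = end < visible_vram_size ? end : visible_vram_size;

    if (last > pos) {
        s.bar_bytes = last - pos;        /* portion inside the aperture */
        s.mm_bytes = size - s.bar_bytes; /* portion beyond it, if any */
    }
    return s;
}
```
</pre>

          <p class="MsoNormal">With an assumed visible size of 256 MB, a 16-byte request starting 8 bytes below that limit would put 8 bytes (2 DWORDs) on each path. With the revert applied, all 16 bytes would take the MM path instead.<o:p></o:p></p>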
          <p class="MsoNormal"> <o:p></o:p></p>
          <p class="MsoNormal">Also cwsr tests fail on Vega20 with or
            without the revert with the same RAS error.<o:p></o:p></p>
          <p class="MsoNormal"> <o:p></o:p></p>
          <p class="MsoNormal">Thanks,<o:p></o:p></p>
          <p class="MsoNormal"> <o:p></o:p></p>
          <p class="MsoNormal">Jon<o:p></o:p></p>
          <p class="MsoNormal"> <o:p></o:p></p>
          <div>
            <div style="border:none;border-top:solid #E1E1E1
              1.0pt;padding:3.0pt 0in 0in 0in">
              <p class="MsoNormal"><b>From:</b> Kuehling, Felix <a href="mailto:Felix.Kuehling@amd.com" moz-do-not-send="true">
                  &lt;Felix.Kuehling@amd.com&gt;</a> <br>
                <b>Sent:</b> Tuesday, April 14, 2020 2:32 PM<br>
                <b>To:</b> Kim, Jonathan <a href="mailto:Jonathan.Kim@amd.com" moz-do-not-send="true">&lt;Jonathan.Kim@amd.com&gt;</a>;
                Koenig, Christian
                <a href="mailto:Christian.Koenig@amd.com" moz-do-not-send="true">&lt;Christian.Koenig@amd.com&gt;</a>;
                Deucher, Alexander
                <a href="mailto:Alexander.Deucher@amd.com" moz-do-not-send="true">&lt;Alexander.Deucher@amd.com&gt;</a><br>
                <b>Cc:</b> Russell, Kent <a href="mailto:Kent.Russell@amd.com" moz-do-not-send="true">&lt;Kent.Russell@amd.com&gt;</a>;
                <a href="mailto:amd-gfx@lists.freedesktop.org" moz-do-not-send="true">amd-gfx@lists.freedesktop.org</a><br>
                <b>Subject:</b> Re: [PATCH] Revert "drm/amdgpu: use the
                BAR if possible in amdgpu_device_vram_access v2"<o:p></o:p></p>
            </div>
          </div>
          <p class="MsoNormal"> <o:p></o:p></p>
          <p>I wouldn't call it premature. Reverting is the usual practice
            when there is a serious regression that isn't fully
            understood or root-caused. As far as I can tell, the problem
            has been reproduced on multiple systems and different GPUs, and
            it clearly traces back to Christian's commit. I think that
            justifies reverting it for now.<o:p></o:p></p>
          <p>I agree with Christian that a general HDP memory access
            problem causing RAS errors would potentially cause problems
            in other tests as well. For example, common operations like
            GART table updates, GPUVM page table updates, and PCIe
            peer2peer accesses in ROCm applications use HDP. But we're
            not seeing obvious problems from those. So we need to
            understand what's special about this test. I asked questions
            to that effect on our other email thread.<o:p></o:p></p>
          <p>Regards,<br>
              Felix<o:p></o:p></p>
          <div>
            <p class="MsoNormal">On 2020-04-14 at 10:51 a.m.,
              Kim, Jonathan wrote:<o:p></o:p></p>
          </div>
          <blockquote style="margin-top:5.0pt;margin-bottom:5.0pt">
            <p class="msipheader4d0fcdd7" style="margin:0in;margin-bottom:.0001pt"><span style="font-size:10.0pt;font-family:'Arial',sans-serif;color:#0078D7">[AMD
                Official Use Only - Internal Distribution Only]</span><o:p></o:p></p>
            <p class="MsoNormal"> <o:p></o:p></p>
            <p class="MsoNormal">I think it’s premature to push this
              revert.<o:p></o:p></p>
            <p class="MsoNormal"> <o:p></o:p></p>
            <p class="MsoNormal">With more testing, I’m getting failures
              from different tests or sometimes none at all on my
              machine.<o:p></o:p></p>
            <p class="MsoNormal"> <o:p></o:p></p>
            <p class="MsoNormal">Kent, let’s continue the discussion on
              the original thread.<o:p></o:p></p>
            <p class="MsoNormal"> <o:p></o:p></p>
            <p class="MsoNormal">Thanks,<o:p></o:p></p>
            <p class="MsoNormal"> <o:p></o:p></p>
            <p class="MsoNormal">Jon<o:p></o:p></p>
            <p class="MsoNormal"> <o:p></o:p></p>
            <div>
              <div style="border:none;border-top:solid #E1E1E1
                1.0pt;padding:3.0pt 0in 0in 0in">
                <p class="MsoNormal"><b>From:</b> Koenig, Christian <a href="mailto:Christian.Koenig@amd.com" moz-do-not-send="true">
                    &lt;Christian.Koenig@amd.com&gt;</a> <br>
                  <b>Sent:</b> Tuesday, April 14, 2020 10:47 AM<br>
                  <b>To:</b> Deucher, Alexander <a href="mailto:Alexander.Deucher@amd.com" moz-do-not-send="true">&lt;Alexander.Deucher@amd.com&gt;</a><br>
                  <b>Cc:</b> Russell, Kent <a href="mailto:Kent.Russell@amd.com" moz-do-not-send="true">&lt;Kent.Russell@amd.com&gt;</a>;
                  <a href="mailto:amd-gfx@lists.freedesktop.org" moz-do-not-send="true">amd-gfx@lists.freedesktop.org</a>;
                  Kuehling, Felix
                  <a href="mailto:Felix.Kuehling@amd.com" moz-do-not-send="true">&lt;Felix.Kuehling@amd.com&gt;</a>;
                  Kim, Jonathan
                  <a href="mailto:Jonathan.Kim@amd.com" moz-do-not-send="true">&lt;Jonathan.Kim@amd.com&gt;</a><br>
                  <b>Subject:</b> Re: [PATCH] Revert "drm/amdgpu: use
                  the BAR if possible in amdgpu_device_vram_access v2"<o:p></o:p></p>
              </div>
            </div>
            <p class="MsoNormal"> <o:p></o:p></p>
            <div>
              <div>
                <div>
                  <div>
                    <div>
                      <p class="MsoNormal">That's exactly my concern as
                        well. <o:p></o:p></p>
                      <div>
                        <p class="MsoNormal"> <o:p></o:p></p>
                      </div>
                      <div>
                        <p class="MsoNormal">This looks a bit like the
                          test creates erroneous data somehow, but there
                          doesn't seem to be a RAS check in the MM data
                          path.<o:p></o:p></p>
                      </div>
                      <div>
                        <p class="MsoNormal"> <o:p></o:p></p>
                      </div>
                      <div>
                        <p class="MsoNormal">And now that we use the BAR
                          path it goes up in flames.<o:p></o:p></p>
                      </div>
                      <div>
                        <p class="MsoNormal"> <o:p></o:p></p>
                      </div>
                      <div>
                        <p class="MsoNormal">I just don't see how we can
                          create erroneous data in a test case?<o:p></o:p></p>
                      </div>
                      <div>
                        <p class="MsoNormal"> <o:p></o:p></p>
                      </div>
                      <div>
                        <p class="MsoNormal">Christian.<o:p></o:p></p>
                      </div>
                    </div>
                    <div>
                      <p class="MsoNormal"> <o:p></o:p></p>
                      <div>
                        <p class="MsoNormal">On 14.04.2020 at 16:35,
                          "Deucher, Alexander" &lt;<a href="mailto:Alexander.Deucher@amd.com" moz-do-not-send="true">Alexander.Deucher@amd.com</a>&gt; wrote:<o:p></o:p></p>
                        <blockquote style="border:none;border-left:solid
                          #CCCCCC 1.0pt;padding:0in 0in 0in
6.0pt;margin-left:4.8pt;margin-top:5.0pt;margin-right:0in;margin-bottom:5.0pt">
                          <div>
                            <p style="margin:15.0pt"><span style="font-size:10.0pt;font-family:'Arial',sans-serif;color:#317100">[AMD
                                Public Use]</span><o:p></o:p></p>
                            <p class="MsoNormal"> <o:p></o:p></p>
                            <div>
                              <div>
                                <p class="MsoNormal"><span style="font-size:12.0pt">If this
                                    causes an issue, any access to vram
                                    via the BAR could cause an issue.</span><o:p></o:p></p>
                              </div>
                              <div>
                                <p class="MsoNormal"><span style="font-size:12.0pt"> </span><o:p></o:p></p>
                              </div>
                              <div>
                                <p class="MsoNormal"><span style="font-size:12.0pt">Alex</span><o:p></o:p></p>
                              </div>
                              <div class="MsoNormal" style="text-align:center" align="center">
                                <hr width="98%" size="2" align="center">
                              </div>
                              <div>
                                <p class="MsoNormal"><b>From:</b>
                                  amd-gfx &lt;<a href="mailto:amd-gfx-bounces@lists.freedesktop.org" moz-do-not-send="true">amd-gfx-bounces@lists.freedesktop.org</a>&gt;
                                  on behalf of Russell, Kent &lt;<a href="mailto:Kent.Russell@amd.com" moz-do-not-send="true">Kent.Russell@amd.com</a>&gt;<br>
                                  <b>Sent:</b> Tuesday, April 14, 2020
                                  10:19 AM<br>
                                  <b>To:</b> Koenig, Christian &lt;<a href="mailto:Christian.Koenig@amd.com" moz-do-not-send="true">Christian.Koenig@amd.com</a>&gt;;
                                  <a href="mailto:amd-gfx@lists.freedesktop.org" moz-do-not-send="true">amd-gfx@lists.freedesktop.org</a>
                                  &lt;<a href="mailto:amd-gfx@lists.freedesktop.org" moz-do-not-send="true">amd-gfx@lists.freedesktop.org</a>&gt;<br>
                                  <b>Cc:</b> Kuehling, Felix &lt;<a href="mailto:Felix.Kuehling@amd.com" moz-do-not-send="true">Felix.Kuehling@amd.com</a>&gt;;
                                  Kim, Jonathan &lt;<a href="mailto:Jonathan.Kim@amd.com" moz-do-not-send="true">Jonathan.Kim@amd.com</a>&gt;<br>
                                  <b>Subject:</b> RE: [PATCH] Revert
                                  "drm/amdgpu: use the BAR if possible
                                  in amdgpu_device_vram_access v2"
                                  <o:p></o:p></p>
                                <div>
                                  <p class="MsoNormal"> <o:p></o:p></p>
                                </div>
                              </div>
                              <div>
                                <div>
                                  <p class="MsoNormal">[AMD Official Use
                                    Only - Internal Distribution Only]<br>
                                    <br>
                                    On VG20 or MI100, as soon as we run
                                    the subtest, we get the dmesg output
                                    below, and then the kernel ends up
                                    hanging. I don't know enough about
                                    the test itself to know why this is
                                    occurring, but Jon Kim and Felix
                                    were discussing it on a separate
                                    thread when the issue was first
                                    reported, so they can hopefully
                                    provide some additional information.<br>
                                    <br>
                                     Kent<br>
                                    <br>
                                    > -----Original Message-----<br>
                                    > From: Christian König <<a href="mailto:ckoenig.leichtzumerken@gmail.com" moz-do-not-send="true">ckoenig.leichtzumerken@gmail.com</a>><br>
                                    > Sent: Tuesday, April 14, 2020
                                    9:52 AM<br>
                                    > To: Russell, Kent <<a href="mailto:Kent.Russell@amd.com" moz-do-not-send="true">Kent.Russell@amd.com</a>>;
                                    <a href="mailto:amd-gfx@lists.freedesktop.org" moz-do-not-send="true">amd-gfx@lists.freedesktop.org</a><br>
                                    > Subject: Re: [PATCH] Revert
                                    "drm/amdgpu: use the BAR if possible
                                    in<br>
                                    > amdgpu_device_vram_access v2"<br>
                                    > <br>
                                    > On 13.04.20 at 20:20,
                                    Kent Russell wrote:<br>
                                    > > This reverts commit
                                    c12b84d6e0d70f1185e6daddfd12afb671791b6e.<br>
                                    > > The original patch causes
                                    a RAS event and subsequent kernel
                                    hard-hang<br>
                                    > > when running the
                                    KFDMemoryTest.PtraceAccessInvisibleVram
                                    on VG20 and<br>
                                    > > Arcturus<br>
                                    > ><br>
                                    > > dmesg output at hang time:<br>
                                    > > [drm] RAS event of type
                                    ERREVENT_ATHUB_INTERRUPT detected!<br>
                                    > > amdgpu 0000:67:00.0: GPU
                                    reset begin!<br>
                                    > > Evicting PASID 0x8000
                                    queues<br>
                                    > > Started evicting pasid
                                    0x8000<br>
                                    > > qcm fence wait loop
                                    timeout expired<br>
                                    > > The cp might be in an
                                    unrecoverable state due to an
                                    unsuccessful<br>
                                    > > queues preemption Failed
                                    to evict process queues Failed to
                                    suspend<br>
                                    > > process 0x8000 Finished
                                    evicting pasid 0x8000 Started
                                    restoring pasid<br>
                                    > > 0x8000 Finished restoring
                                    pasid 0x8000 [drm] UVD VCPU state
                                    may lost<br>
                                    > > due to RAS
                                    ERREVENT_ATHUB_INTERRUPT<br>
                                    > > amdgpu: [powerplay] Failed
                                    to send message 0x26, response 0x0<br>
                                    > > amdgpu: [powerplay] Failed
                                    to set soft min gfxclk !<br>
                                    > > amdgpu: [powerplay] Failed
                                    to upload DPM Bootup Levels!<br>
                                    > > amdgpu: [powerplay] Failed
                                    to send message 0x7, response 0x0<br>
                                    > > amdgpu: [powerplay]
                                    [DisableAllSMUFeatures] Failed to
                                    disable all smu<br>
                                    > features!<br>
                                    > > amdgpu: [powerplay]
                                    [DisableDpmTasks] Failed to disable
                                    all smu features!<br>
                                    > > amdgpu: [powerplay]
                                    [PowerOffAsic] Failed to disable
                                    DPM!<br>
                                    > >
                                    [drm:amdgpu_device_ip_suspend_phase2
                                    [amdgpu]] *ERROR* suspend of IP<br>
                                    > > block <powerplay>
                                    failed -5<br>
                                    > <br>
                                    > Do you have more information on
                                    what's going wrong here? This
                                    is a really<br>
                                    > important patch for KFD
                                    debugging.<br>
                                    > <br>
                                    > ><br>
                                    > > Signed-off-by: Kent
                                    Russell &lt;<a href="mailto:kent.russell@amd.com" moz-do-not-send="true">kent.russell@amd.com</a>&gt;<br>
                                    > <br>
                                    > Reviewed-by: Christian König
                                    &lt;<a href="mailto:christian.koenig@amd.com" moz-do-not-send="true">christian.koenig@amd.com</a>&gt;<br>
                                    > <br>
                                    > > ---<br>
                                    > >  
                                    drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
                                    | 26 ----------------------<br>
                                    > >   1 file changed, 26
                                    deletions(-)<br>
                                    > ><br>
                                    > > diff --git
                                    a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c<br>
                                    > >
                                    b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c<br>
                                    > > index
                                    cf5d6e585634..a3f997f84020 100644<br>
                                    > > ---
                                    a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c<br>
                                    > > +++
                                    b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c<br>
                                    > > @@ -254,32 +254,6 @@ void
                                    amdgpu_device_vram_access(struct<br>
                                    > amdgpu_device *adev, loff_t
                                    pos,<br>
                                    > >      uint32_t hi = ~0;<br>
                                    > >      uint64_t last;<br>
                                    > ><br>
                                    > > -<br>
                                    > > -#ifdef CONFIG_64BIT<br>
                                    > > -   last = min(pos + size,
                                    adev->gmc.visible_vram_size);<br>
                                    > > -   if (last > pos) {<br>
                                    > > -           void __iomem
                                    *addr =
                                    adev->mman.aper_base_kaddr + pos;<br>
                                    > > -           size_t count =
                                    last - pos;<br>
                                    > > -<br>
                                    > > -           if (write) {<br>
                                    > > -                  
                                    memcpy_toio(addr, buf, count);<br>
                                    > > -                   mb();<br>
                                    > > -                  
                                    amdgpu_asic_flush_hdp(adev, NULL);<br>
                                    > > -           } else {<br>
                                    > > -                  
                                    amdgpu_asic_invalidate_hdp(adev,
                                    NULL);<br>
                                    > > -                   mb();<br>
                                    > > -                  
                                    memcpy_fromio(buf, addr, count);<br>
                                    > > -           }<br>
                                    > > -<br>
                                    > > -           if (count ==
                                    size)<br>
                                    > > -                  
                                    return;<br>
                                    > > -<br>
                                    > > -           pos += count;<br>
                                    > > -           buf += count /
                                    4;<br>
                                    > > -           size -= count;<br>
                                    > > -   }<br>
                                    > > -#endif<br>
                                    > > -<br>
                                    > >     
                                    spin_lock_irqsave(&adev->mmio_idx_lock,
                                    flags);<br>
                                    > >      for (last = pos +
                                    size; pos < last; pos += 4) {<br>
                                    > >              uint32_t tmp
                                    = pos >> 31;<br>
_______________________________________________<br>
                                    amd-gfx mailing list<br>
                                    <a href="mailto:amd-gfx@lists.freedesktop.org" moz-do-not-send="true">amd-gfx@lists.freedesktop.org</a><br>
                                    <a href="https://lists.freedesktop.org/mailman/listinfo/amd-gfx" moz-do-not-send="true">https://lists.freedesktop.org/mailman/listinfo/amd-gfx</a><o:p></o:p></p>
                                </div>
                              </div>
                            </div>
                          </div>
                        </blockquote>
                      </div>
                      <p class="MsoNormal"> <o:p></o:p></p>
                    </div>
                  </div>
                  <div>
                    <p class="MsoNormal"> <o:p></o:p></p>
                    <div>
                      <p class="MsoNormal">On 14.04.2020 at 16:35,
                        "Deucher, Alexander" &lt;<a href="mailto:Alexander.Deucher@amd.com" moz-do-not-send="true">Alexander.Deucher@amd.com</a>&gt; wrote:<o:p></o:p></p>
                      <blockquote style="border:none;border-left:solid
                        #CCCCCC 1.0pt;padding:0in 0in 0in
6.0pt;margin-left:4.8pt;margin-top:5.0pt;margin-right:0in;margin-bottom:5.0pt">
                        <div>
                          <p style="margin:15.0pt"><span style="font-size:10.0pt;font-family:'Arial',sans-serif;color:#317100">[AMD
                              Public Use]</span><o:p></o:p></p>
                          <p class="MsoNormal"> <o:p></o:p></p>
                          <div>
                            <div>
                              <p class="MsoNormal"><span style="font-size:12.0pt">If this
                                  causes an issue, then any access to VRAM
                                  via the BAR could cause the same issue.</span><o:p></o:p></p>
                            </div>
                            <div>
                              <p class="MsoNormal"><span style="font-size:12.0pt"> </span><o:p></o:p></p>
                            </div>
                            <div>
                              <p class="MsoNormal"><span style="font-size:12.0pt">Alex</span><o:p></o:p></p>
                            </div>
                            <div class="MsoNormal" style="text-align:center" align="center">
                              <hr width="98%" size="2" align="center">
                            </div>
                            <div>
                              <p class="MsoNormal"><b>From:</b> amd-gfx
                                <<a href="mailto:amd-gfx-bounces@lists.freedesktop.org" moz-do-not-send="true">amd-gfx-bounces@lists.freedesktop.org</a>>
                                on behalf of Russell, Kent <<a href="mailto:Kent.Russell@amd.com" moz-do-not-send="true">Kent.Russell@amd.com</a>><br>
                                <b>Sent:</b> Tuesday, April 14, 2020
                                10:19 AM<br>
                                <b>To:</b> Koenig, Christian <<a href="mailto:Christian.Koenig@amd.com" moz-do-not-send="true">Christian.Koenig@amd.com</a>>;
                                <a href="mailto:amd-gfx@lists.freedesktop.org" moz-do-not-send="true">amd-gfx@lists.freedesktop.org</a>
                                <<a href="mailto:amd-gfx@lists.freedesktop.org" moz-do-not-send="true">amd-gfx@lists.freedesktop.org</a>><br>
                                <b>Cc:</b> Kuehling, Felix <<a href="mailto:Felix.Kuehling@amd.com" moz-do-not-send="true">Felix.Kuehling@amd.com</a>>;
                                Kim, Jonathan <<a href="mailto:Jonathan.Kim@amd.com" moz-do-not-send="true">Jonathan.Kim@amd.com</a>><br>
                                <b>Subject:</b> RE: [PATCH] Revert
                                "drm/amdgpu: use the BAR if possible in
                                amdgpu_device_vram_access v2"
                                <o:p></o:p></p>
                              <div>
                                <p class="MsoNormal"> <o:p></o:p></p>
                              </div>
                            </div>
                            <div>
                              <div>
                                <p class="MsoNormal">[AMD Official Use
                                  Only - Internal Distribution Only]<br>
                                  <br>
                                  On VG20 or MI100, as soon as we run
                                  the subtest, we get the dmesg output
                                  below, and then the kernel ends up
                                  hanging. I don't know enough about the
                                  test itself to know why this is
                                  occurring, but Jon Kim and Felix were
                                  discussing it on a separate thread
                                  when the issue was first reported, so
                                  they can hopefully provide some
                                  additional information.<br>
                                  <br>
                                   Kent<br>
                                  <br>
                                  > -----Original Message-----<br>
                                  > From: Christian König <<a href="mailto:ckoenig.leichtzumerken@gmail.com" moz-do-not-send="true">ckoenig.leichtzumerken@gmail.com</a>><br>
                                  > Sent: Tuesday, April 14, 2020
                                  9:52 AM<br>
                                  > To: Russell, Kent <<a href="mailto:Kent.Russell@amd.com" moz-do-not-send="true">Kent.Russell@amd.com</a>>;
                                  <a href="mailto:amd-gfx@lists.freedesktop.org" moz-do-not-send="true">amd-gfx@lists.freedesktop.org</a><br>
                                  > Subject: Re: [PATCH] Revert
                                  "drm/amdgpu: use the BAR if possible
                                  in<br>
                                  > amdgpu_device_vram_access v2"<br>
                                  > <br>
                                  > On 13.04.20 at 20:20, Kent
                                  Russell wrote:<br>
                                  > > This reverts commit
                                  c12b84d6e0d70f1185e6daddfd12afb671791b6e.<br>
                                  > > The original patch causes a
                                  RAS event and subsequent kernel
                                  hard-hang<br>
                                  > > when running the
                                  KFDMemoryTest.PtraceAccessInvisibleVram
                                  on VG20 and<br>
                                  > > Arcturus<br>
                                  > ><br>
                                  > > dmesg output at hang time:<br>
                                  > > [drm] RAS event of type
                                  ERREVENT_ATHUB_INTERRUPT detected!<br>
                                  > > amdgpu 0000:67:00.0: GPU
                                  reset begin!<br>
                                  > > Evicting PASID 0x8000 queues<br>
                                  > > Started evicting pasid
                                  0x8000<br>
                                  > > qcm fence wait loop timeout
                                  expired<br>
                                  > > The cp might be in an
                                  unrecoverable state due to an
                                  unsuccessful<br>
                                  > > queues preemption Failed to
                                  evict process queues Failed to suspend<br>
                                  > > process 0x8000 Finished
                                  evicting pasid 0x8000 Started
                                  restoring pasid<br>
                                  > > 0x8000 Finished restoring
                                  pasid 0x8000 [drm] UVD VCPU state may
                                  lost<br>
                                  > > due to RAS
                                  ERREVENT_ATHUB_INTERRUPT<br>
                                  > > amdgpu: [powerplay] Failed
                                  to send message 0x26, response 0x0<br>
                                  > > amdgpu: [powerplay] Failed
                                  to set soft min gfxclk !<br>
                                  > > amdgpu: [powerplay] Failed
                                  to upload DPM Bootup Levels!<br>
                                  > > amdgpu: [powerplay] Failed
                                  to send message 0x7, response 0x0<br>
                                  > > amdgpu: [powerplay]
                                  [DisableAllSMUFeatures] Failed to
                                  disable all smu<br>
                                  > features!<br>
                                  > > amdgpu: [powerplay]
                                  [DisableDpmTasks] Failed to disable
                                  all smu features!<br>
                                  > > amdgpu: [powerplay]
                                  [PowerOffAsic] Failed to disable DPM!<br>
                                  > >
                                  [drm:amdgpu_device_ip_suspend_phase2
                                  [amdgpu]] *ERROR* suspend of IP<br>
                                  > > block <powerplay>
                                  failed -5<br>
                                  > <br>
                                  > Do you have more information on
                                  what's going wrong here? This is
                                  a really<br>
                                  > important patch for KFD
                                  debugging.<br>
                                  > <br>
                                  > ><br>
                                  > > Signed-off-by: Kent Russell
                                  <<a href="mailto:kent.russell@amd.com" moz-do-not-send="true">kent.russell@amd.com</a>><br>
                                  > <br>
                                  > Reviewed-by: Christian König <<a href="mailto:christian.koenig@amd.com" moz-do-not-send="true">christian.koenig@amd.com</a>><br>
                                  > <br>
                                  > > ---<br>
                                  > >  
                                  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
                                  | 26 ----------------------<br>
                                  > >   1 file changed, 26
                                  deletions(-)<br>
                                  > ><br>
                                  > > diff --git
                                  a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c<br>
                                  > >
                                  b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c<br>
                                  > > index
                                  cf5d6e585634..a3f997f84020 100644<br>
                                  > > ---
                                  a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c<br>
                                  > > +++
                                  b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c<br>
                                  > > @@ -254,32 +254,6 @@ void
                                  amdgpu_device_vram_access(struct<br>
                                  > amdgpu_device *adev, loff_t pos,<br>
                                  > >      uint32_t hi = ~0;<br>
                                  > >      uint64_t last;<br>
                                  > ><br>
                                  > > -<br>
                                  > > -#ifdef CONFIG_64BIT<br>
                                  > > -   last = min(pos + size,
                                  adev->gmc.visible_vram_size);<br>
                                  > > -   if (last > pos) {<br>
                                  > > -           void __iomem
                                  *addr = adev->mman.aper_base_kaddr
                                  + pos;<br>
                                  > > -           size_t count =
                                  last - pos;<br>
                                  > > -<br>
                                  > > -           if (write) {<br>
                                  > > -                  
                                  memcpy_toio(addr, buf, count);<br>
                                  > > -                   mb();<br>
                                  > > -                  
                                  amdgpu_asic_flush_hdp(adev, NULL);<br>
                                  > > -           } else {<br>
                                  > > -                  
                                  amdgpu_asic_invalidate_hdp(adev,
                                  NULL);<br>
                                  > > -                   mb();<br>
                                  > > -                  
                                  memcpy_fromio(buf, addr, count);<br>
                                  > > -           }<br>
                                  > > -<br>
                                  > > -           if (count ==
                                  size)<br>
                                  > > -                   return;<br>
                                  > > -<br>
                                  > > -           pos += count;<br>
                                  > > -           buf += count /
                                  4;<br>
                                  > > -           size -= count;<br>
                                  > > -   }<br>
                                  > > -#endif<br>
                                  > > -<br>
                                  > >     
                                  spin_lock_irqsave(&adev->mmio_idx_lock,
                                  flags);<br>
                                  > >      for (last = pos + size;
                                  pos < last; pos += 4) {<br>
                                  > >              uint32_t tmp =
                                  pos >> 31;<br>
                                  <o:p></o:p></p>
                              </div>
                            </div>
                          </div>
                        </div>
                      </blockquote>
                    </div>
                    <p class="MsoNormal"> <o:p></o:p></p>
                  </div>
                </div>
              </div>
              <div>
                <p class="MsoNormal"> <o:p></o:p></p>
                <div>
                  <p class="MsoNormal">On 14.04.2020 at 16:35,
                    "Deucher, Alexander" <<a href="mailto:Alexander.Deucher@amd.com" moz-do-not-send="true">Alexander.Deucher@amd.com</a>> wrote:<o:p></o:p></p>
                  <blockquote style="border:none;border-left:solid
                    #CCCCCC 1.0pt;padding:0in 0in 0in
6.0pt;margin-left:4.8pt;margin-top:5.0pt;margin-right:0in;margin-bottom:5.0pt">
                    <div>
                      <p style="margin:15.0pt"><span style="font-size:10.0pt;font-family:"Arial",sans-serif;color:#317100">[AMD
                          Public Use]</span><o:p></o:p></p>
                      <p class="MsoNormal"> <o:p></o:p></p>
                      <div>
                        <div>
                          <p class="MsoNormal"><span style="font-size:12.0pt">If this causes an
                              issue, then any access to VRAM via the BAR
                              could cause the same issue.</span><o:p></o:p></p>
                        </div>
                        <div>
                          <p class="MsoNormal"><span style="font-size:12.0pt"> </span><o:p></o:p></p>
                        </div>
                        <div>
                          <p class="MsoNormal"><span style="font-size:12.0pt">Alex</span><o:p></o:p></p>
                        </div>
                        <div class="MsoNormal" style="text-align:center" align="center">
                          <hr width="98%" size="2" align="center">
                        </div>
                        <div>
                          <p class="MsoNormal"><b>From:</b> amd-gfx <<a href="mailto:amd-gfx-bounces@lists.freedesktop.org" moz-do-not-send="true">amd-gfx-bounces@lists.freedesktop.org</a>>
                            on behalf of Russell, Kent <<a href="mailto:Kent.Russell@amd.com" moz-do-not-send="true">Kent.Russell@amd.com</a>><br>
                            <b>Sent:</b> Tuesday, April 14, 2020 10:19
                            AM<br>
                            <b>To:</b> Koenig, Christian <<a href="mailto:Christian.Koenig@amd.com" moz-do-not-send="true">Christian.Koenig@amd.com</a>>;
                            <a href="mailto:amd-gfx@lists.freedesktop.org" moz-do-not-send="true">amd-gfx@lists.freedesktop.org</a>
                            <<a href="mailto:amd-gfx@lists.freedesktop.org" moz-do-not-send="true">amd-gfx@lists.freedesktop.org</a>><br>
                            <b>Cc:</b> Kuehling, Felix <<a href="mailto:Felix.Kuehling@amd.com" moz-do-not-send="true">Felix.Kuehling@amd.com</a>>;
                            Kim, Jonathan <<a href="mailto:Jonathan.Kim@amd.com" moz-do-not-send="true">Jonathan.Kim@amd.com</a>><br>
                            <b>Subject:</b> RE: [PATCH] Revert
                            "drm/amdgpu: use the BAR if possible in
                            amdgpu_device_vram_access v2"
                            <o:p></o:p></p>
                          <div>
                            <p class="MsoNormal"> <o:p></o:p></p>
                          </div>
                        </div>
                        <div>
                          <div>
                            <p class="MsoNormal">[AMD Official Use Only
                              - Internal Distribution Only]<br>
                              <br>
                              On VG20 or MI100, as soon as we run the
                              subtest, we get the dmesg output below,
                              and then the kernel ends up hanging. I
                              don't know enough about the test itself to
                              know why this is occurring, but Jon Kim
                              and Felix were discussing it on a separate
                              thread when the issue was first reported,
                              so they can hopefully provide some
                              additional information.<br>
                              <br>
                               Kent<br>
                              <br>
                              > -----Original Message-----<br>
                              > From: Christian König <<a href="mailto:ckoenig.leichtzumerken@gmail.com" moz-do-not-send="true">ckoenig.leichtzumerken@gmail.com</a>><br>
                              > Sent: Tuesday, April 14, 2020 9:52 AM<br>
                              > To: Russell, Kent <<a href="mailto:Kent.Russell@amd.com" moz-do-not-send="true">Kent.Russell@amd.com</a>>;
                              <a href="mailto:amd-gfx@lists.freedesktop.org" moz-do-not-send="true">amd-gfx@lists.freedesktop.org</a><br>
                              > Subject: Re: [PATCH] Revert
                              "drm/amdgpu: use the BAR if possible in<br>
                              > amdgpu_device_vram_access v2"<br>
                              > <br>
                              > On 13.04.20 at 20:20, Kent Russell wrote:<br>
                              > > This reverts commit
                              c12b84d6e0d70f1185e6daddfd12afb671791b6e.<br>
                              > > The original patch causes a RAS
                              event and subsequent kernel hard-hang<br>
                              > > when running the
                              KFDMemoryTest.PtraceAccessInvisibleVram on
                              VG20 and<br>
                              > > Arcturus<br>
                              > ><br>
                              > > dmesg output at hang time:<br>
                              > > [drm] RAS event of type
                              ERREVENT_ATHUB_INTERRUPT detected!<br>
                              > > amdgpu 0000:67:00.0: GPU reset
                              begin!<br>
                              > > Evicting PASID 0x8000 queues<br>
                              > > Started evicting pasid 0x8000<br>
                              > > qcm fence wait loop timeout
                              expired<br>
                              > > The cp might be in an unrecoverable state due to an unsuccessful queues preemption<br>
                              > > Failed to evict process queues<br>
                              > > Failed to suspend process 0x8000<br>
                              > > Finished evicting pasid 0x8000<br>
                              > > Started restoring pasid 0x8000<br>
                              > > Finished restoring pasid 0x8000<br>
                              > > [drm] UVD VCPU state may lost due to RAS ERREVENT_ATHUB_INTERRUPT<br>
                              > > amdgpu: [powerplay] Failed to
                              send message 0x26, response 0x0<br>
                              > > amdgpu: [powerplay] Failed to
                              set soft min gfxclk !<br>
                              > > amdgpu: [powerplay] Failed to
                              upload DPM Bootup Levels!<br>
                              > > amdgpu: [powerplay] Failed to
                              send message 0x7, response 0x0<br>
                              > > amdgpu: [powerplay] [DisableAllSMUFeatures] Failed to disable all smu features!<br>
                              > > amdgpu: [powerplay]
                              [DisableDpmTasks] Failed to disable all
                              smu features!<br>
                              > > amdgpu: [powerplay]
                              [PowerOffAsic] Failed to disable DPM!<br>
                              > >
                              [drm:amdgpu_device_ip_suspend_phase2
                              [amdgpu]] *ERROR* suspend of IP<br>
                              > > block <powerplay> failed
                              -5<br>
                              > <br>
                              > Do you have more information on what's going wrong here? This is a really<br>
                              > important patch for KFD debugging.<br>
                              > <br>
                              > ><br>
                              > > Signed-off-by: Kent Russell <<a href="mailto:kent.russell@amd.com" moz-do-not-send="true">kent.russell@amd.com</a>><br>
                              > <br>
                              > Reviewed-by: Christian König <<a href="mailto:christian.koenig@amd.com" moz-do-not-send="true">christian.koenig@amd.com</a>><br>
                              > <br>
                              > > ---<br>
                              > >  
                              drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
                              | 26 ----------------------<br>
                              > >   1 file changed, 26
                              deletions(-)<br>
                              > ><br>
                              > > diff --git
                              a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c<br>
                              > >
                              b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c<br>
                              > > index cf5d6e585634..a3f997f84020
                              100644<br>
                              > > ---
                              a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c<br>
                              > > +++
                              b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c<br>
                              > > @@ -254,32 +254,6 @@ void
                              amdgpu_device_vram_access(struct<br>
                              > amdgpu_device *adev, loff_t pos,<br>
                              > >      uint32_t hi = ~0;<br>
                              > >      uint64_t last;<br>
                              > ><br>
                              > > -<br>
                              > > -#ifdef CONFIG_64BIT<br>
                              > > -   last = min(pos + size,
                              adev->gmc.visible_vram_size);<br>
                              > > -   if (last > pos) {<br>
                              > > -           void __iomem *addr =
                              adev->mman.aper_base_kaddr + pos;<br>
                              > > -           size_t count = last
                              - pos;<br>
                              > > -<br>
                              > > -           if (write) {<br>
                              > > -                  
                              memcpy_toio(addr, buf, count);<br>
                              > > -                   mb();<br>
                              > > -                  
                              amdgpu_asic_flush_hdp(adev, NULL);<br>
                              > > -           } else {<br>
                              > > -                  
                              amdgpu_asic_invalidate_hdp(adev, NULL);<br>
                              > > -                   mb();<br>
                              > > -                  
                              memcpy_fromio(buf, addr, count);<br>
                              > > -           }<br>
                              > > -<br>
                              > > -           if (count == size)<br>
                              > > -                   return;<br>
                              > > -<br>
                              > > -           pos += count;<br>
                              > > -           buf += count / 4;<br>
                              > > -           size -= count;<br>
                              > > -   }<br>
                              > > -#endif<br>
                              > > -<br>
                              > >     
                              spin_lock_irqsave(&adev->mmio_idx_lock,
                              flags);<br>
                              > >      for (last = pos + size; pos
                              < last; pos += 4) {<br>
                              > >              uint32_t tmp = pos
                              >> 31;<br>
_______________________________________________<br>
                              amd-gfx mailing list<br>
                              <a href="mailto:amd-gfx@lists.freedesktop.org" moz-do-not-send="true">amd-gfx@lists.freedesktop.org</a><br>
                              <a href="https://lists.freedesktop.org/mailman/listinfo/amd-gfx" moz-do-not-send="true">https://lists.freedesktop.org/mailman/listinfo/amd-gfx</a><o:p></o:p></p>
                          </div>
                        </div>
                      </div>
                    </div>
                  </blockquote>
                </div>
                <p class="MsoNormal"> <o:p></o:p></p>
              </div>
            </div>
            <div>
              <p class="MsoNormal"> <o:p></o:p></p>
              <div>
                <p class="MsoNormal">On 14.04.2020 at 16:35,
                  "Deucher, Alexander" <<a href="mailto:Alexander.Deucher@amd.com" moz-do-not-send="true">Alexander.Deucher@amd.com</a>> wrote:<o:p></o:p></p>
              </div>
            </div>
            <div>
              <p style="margin:15.0pt"><span style="font-size:10.0pt;font-family:"Arial",sans-serif;color:#317100">[AMD
                  Public Use]</span><o:p></o:p></p>
              <p class="MsoNormal"> <o:p></o:p></p>
              <div>
                <div>
                  <p class="MsoNormal"><span style="font-size:12.0pt">If
                      this patch causes an issue, then any access to VRAM
                      via the BAR could cause an issue.</span><o:p></o:p></p>
                </div>
                <div>
                  <p class="MsoNormal"><span style="font-size:12.0pt"> </span><o:p></o:p></p>
                </div>
                <div>
                  <p class="MsoNormal"><span style="font-size:12.0pt">Alex</span><o:p></o:p></p>
                </div>
              </div>
            </div>
          </blockquote>
        </blockquote>
        <p class="MsoNormal"><o:p> </o:p></p>
      </div>
    </blockquote>
    <br>
  </body>
</html>