<html><head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
</head>
<body text="#000000" bgcolor="#FFFFFF">
<div class="moz-cite-prefix">
<blockquote type="cite">
<p class="MsoNormal"><span style="color:windowtext">To elaborate
on the PTRACE test, we PEEK 2 DWORDs inside thunk-allocated
mapped memory and 2 DWORDs outside that boundary (it’s only
about 4MB to the boundary). Then we POKE to swap the DWORD
positions across the boundary. The RAS event on the single
failing machine happens on the out-of-boundary PEEK.</span></p>
<span style="color:windowtext"></span></blockquote>
<br>
Well, when you access memory outside of an allocated buffer, I would
expect that we never get as far as even touching the hardware, because
the kernel should block the access with -EPERM or -EFAULT. So it
sounds like I'm not understanding something correctly here.<br>
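As a rough illustration of that expectation (a sketch only, not actual
kernel code; check_access() and its parameters are hypothetical), a
bounds check of this kind would reject the access before any hardware
is touched:<br>

```c
#include <errno.h>
#include <stdint.h>

/*
 * Hypothetical check, not taken from the kernel: accept an access of
 * `size` bytes at `addr` only if it lies entirely inside the buffer
 * [base, base + len). Anything else is refused with -EFAULT.
 */
int check_access(uint64_t base, uint64_t len, uint64_t addr, uint64_t size)
{
    /* Written to avoid overflow in addr + size. */
    if (addr < base || size > len || addr - base > len - size)
        return -EFAULT;
    return 0;
}
```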
<br>
Apart from that, I completely agree that we need to sort out any
other RAS events first to make sure that the system is not simply
failing randomly.<br>
<br>
Regards,<br>
Christian.<br>
<br>
On 15.04.20 at 11:49, Kim, Jonathan wrote:<br>
</div>
<blockquote type="cite" cite="mid:MN2PR12MB4518963F186CF8528A620A7D85DB0@MN2PR12MB4518.namprd12.prod.outlook.com">
<style><!--
/* Font Definitions */
@font-face
{font-family:"Cambria Math";
panose-1:2 4 5 3 5 4 6 3 2 4;}
@font-face
{font-family:Calibri;
panose-1:2 15 5 2 2 2 4 3 2 4;}
/* Style Definitions */
p.MsoNormal, li.MsoNormal, div.MsoNormal
{margin:0in;
margin-bottom:.0001pt;
font-size:11.0pt;
font-family:"Calibri",sans-serif;
color:black;}
a:link, span.MsoHyperlink
{mso-style-priority:99;
color:blue;
text-decoration:underline;}
p.msipheader4d0fcdd7, li.msipheader4d0fcdd7, div.msipheader4d0fcdd7
{mso-style-name:msipheader4d0fcdd7;
mso-margin-top-alt:auto;
margin-right:0in;
mso-margin-bottom-alt:auto;
margin-left:0in;
font-size:11.0pt;
font-family:"Calibri",sans-serif;
color:black;}
p.msipheader87abd423, li.msipheader87abd423, div.msipheader87abd423
{mso-style-name:msipheader87abd423;
mso-margin-top-alt:auto;
margin-right:0in;
mso-margin-bottom-alt:auto;
margin-left:0in;
font-size:11.0pt;
font-family:"Calibri",sans-serif;}
span.EmailStyle21
{mso-style-type:personal-compose;
font-family:"Arial",sans-serif;
color:#317100;}
.MsoChpDefault
{mso-style-type:export-only;
font-size:10.0pt;}
@page WordSection1
{size:8.5in 11.0in;
margin:1.0in 1.0in 1.0in 1.0in;}
div.WordSection1
{page:WordSection1;}
--></style>
<div class="WordSection1">
<p class="msipheader87abd423" style="margin:0in;margin-bottom:.0001pt"><span style="font-size:10.0pt;font-family:"Arial",sans-serif;color:#317100">[AMD
Public Use]</span><o:p></o:p></p>
<p class="MsoNormal"><span style="color:windowtext"><o:p> </o:p></span></p>
<p class="MsoNormal"><span style="color:windowtext">Hi
Christian,<o:p></o:p></span></p>
<p class="MsoNormal"><span style="color:windowtext"><o:p> </o:p></span></p>
<p class="MsoNormal"><span style="color:windowtext">That could
potentially be it. With additional testing, 2 of 3 Vega20
machines never hit an error over BAR access with the PTRACE
test. 3 of 3 machines (from the same pool) always hit an error
with CWSR.<o:p></o:p></span></p>
<p class="MsoNormal"><span style="color:windowtext">To elaborate
on the PTRACE test, we PEEK 2 DWORDs inside thunk-allocated
mapped memory and 2 DWORDs outside that boundary (it’s only
about 4MB to the boundary). Then we POKE to swap the DWORD
positions across the boundary. The RAS event on the single
failing machine happens on the out-of-boundary PEEK.<o:p></o:p></span></p>
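<p class="MsoNormal"><span style="color:windowtext">A minimal sketch
of the PEEK/POKE swap described above, assuming Linux and glibc's
ptrace(). The helper names (peek_word, poke_word,
swap_across_boundary, demo_swap) are hypothetical, and the demo uses
ordinary process memory in a forked child rather than thunk-allocated
VRAM, so it only shows the access pattern, not the actual KFD test.
One PTRACE_PEEKDATA transfers one long, i.e. 2 DWORDs on a 64-bit
build:<o:p></o:p></span></p>

```c
#include <errno.h>
#include <signal.h>
#include <stddef.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Read one word from the tracee; returns 0 or a negative errno. */
int peek_word(pid_t pid, void *addr, long *out)
{
    long v;

    errno = 0;                       /* -1 can be a valid data value */
    v = ptrace(PTRACE_PEEKDATA, pid, addr, NULL);
    if (v == -1 && errno != 0)
        return -errno;
    *out = v;
    return 0;
}

/* Write one word into the tracee; returns 0 or a negative errno. */
int poke_word(pid_t pid, void *addr, long val)
{
    if (ptrace(PTRACE_POKEDATA, pid, addr, (void *)val) == -1)
        return -errno;
    return 0;
}

/* Swap the word just below `boundary` with the word starting at it. */
int swap_across_boundary(pid_t pid, long *boundary)
{
    long below, above;
    int r;

    if ((r = peek_word(pid, boundary - 1, &below)))  /* inside PEEK */
        return r;
    if ((r = peek_word(pid, boundary, &above)))      /* outside PEEK */
        return r;
    if ((r = poke_word(pid, boundary - 1, above)))
        return r;
    return poke_word(pid, boundary, below);
}

/* Stand-in for the mapped buffer; the "boundary" is &demo_words[1]. */
long demo_words[2] = { 0x1111, 0x2222 };

/* Fork a traced child, swap its two words, verify; 0 on success. */
int demo_swap(void)
{
    int status, r;
    long lo = 0, hi = 0;
    pid_t pid = fork();

    if (pid < 0)
        return -errno;
    if (pid == 0) {
        if (ptrace(PTRACE_TRACEME, 0, NULL, NULL) == -1)
            _exit(1);
        raise(SIGSTOP);              /* wait for the parent */
        _exit(0);
    }
    if (waitpid(pid, &status, 0) < 0)
        return -errno;
    if (!WIFSTOPPED(status))
        return -ECHILD;              /* child exited: tracing refused */

    r = swap_across_boundary(pid, &demo_words[1]);
    if (!r)
        r = peek_word(pid, &demo_words[0], &lo);
    if (!r)
        r = peek_word(pid, &demo_words[1], &hi);
    if (!r && (lo != 0x2222 || hi != 0x1111))
        r = -EIO;

    kill(pid, SIGKILL);
    waitpid(pid, &status, 0);
    return r;
}
```

<p class="MsoNormal"><span style="color:windowtext">demo_swap()
returns 0 when the two words were exchanged in the child.<o:p></o:p></span></p>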
<p class="MsoNormal"><span style="color:windowtext"><o:p> </o:p></span></p>
<p class="MsoNormal"><span style="color:windowtext">Felix
mentioned we don’t hit errors over general HDP access, but
that may not be true. Posted system logs from an Arcturus
failure (which I did not test myself) show that someone
launched the ROCm bandwidth test, hit a VM fault, and a RAS
event ensued during evictions (I can point to the internal
ticket or log snippet offline if interested). Whether the RAS
event is triggered by BAR access or is the result of HW
instability is beyond me, since I don’t have access to the
machine.<o:p></o:p></span></p>
<p class="MsoNormal"><span style="color:windowtext"><o:p> </o:p></span></p>
<p class="MsoNormal"><span style="color:windowtext">Thanks,<o:p></o:p></span></p>
<p class="MsoNormal"><span style="color:windowtext"><o:p> </o:p></span></p>
<p class="MsoNormal"><span style="color:windowtext">Jon<o:p></o:p></span></p>
<p class="MsoNormal"><span style="color:windowtext"><o:p> </o:p></span></p>
<div>
<div style="border:none;border-top:solid #E1E1E1
1.0pt;padding:3.0pt 0in 0in 0in">
<p class="MsoNormal"><b><span style="color:windowtext">From:</span></b><span style="color:windowtext"> Koenig, Christian
<a class="moz-txt-link-rfc2396E" href="mailto:Christian.Koenig@amd.com"><Christian.Koenig@amd.com></a>
<br>
<b>Sent:</b> Wednesday, April 15, 2020 4:11 AM<br>
<b>To:</b> Kim, Jonathan <a class="moz-txt-link-rfc2396E" href="mailto:Jonathan.Kim@amd.com"><Jonathan.Kim@amd.com></a>;
Kuehling, Felix <a class="moz-txt-link-rfc2396E" href="mailto:Felix.Kuehling@amd.com"><Felix.Kuehling@amd.com></a>; Deucher,
Alexander <a class="moz-txt-link-rfc2396E" href="mailto:Alexander.Deucher@amd.com"><Alexander.Deucher@amd.com></a><br>
<b>Cc:</b> Russell, Kent <a class="moz-txt-link-rfc2396E" href="mailto:Kent.Russell@amd.com"><Kent.Russell@amd.com></a>;
<a class="moz-txt-link-abbreviated" href="mailto:amd-gfx@lists.freedesktop.org">amd-gfx@lists.freedesktop.org</a><br>
<b>Subject:</b> Re: [PATCH] Revert "drm/amdgpu: use the
BAR if possible in amdgpu_device_vram_access v2"<o:p></o:p></span></p>
</div>
</div>
<p class="MsoNormal"><o:p> </o:p></p>
<div>
<p class="MsoNormal" style="margin-bottom:12.0pt">Hi Jon,<br>
<br>
<o:p></o:p></p>
<blockquote style="margin-top:5.0pt;margin-bottom:5.0pt">
<p class="MsoNormal">Also cwsr tests fail on Vega20 with or
without the revert with the same RAS error.<o:p></o:p></p>
</blockquote>
<p class="MsoNormal"><br>
That sounds like the system/setup has a more general
problem.<br>
<br>
Could it be that we are seeing RAS errors because there
really is some hardware failure, but with the MM path we
don't trigger a RAS interrupt?<br>
<br>
Thanks,<br>
Christian.<br>
<br>
On 14.04.20 at 22:30, Kim, Jonathan wrote:<o:p></o:p></p>
</div>
<blockquote style="margin-top:5.0pt;margin-bottom:5.0pt">
<p class="msipheader4d0fcdd7" style="margin:0in;margin-bottom:.0001pt"><span style="font-size:10.0pt;font-family:"Arial",sans-serif;color:#0078D7">[AMD
Official Use Only - Internal Distribution Only]</span><o:p></o:p></p>
<p class="MsoNormal"> <o:p></o:p></p>
<p class="MsoNormal">If we’re passing the test with the revert,
then the only difference is that we’re no longer
invalidating HDP and doing a copy to host in
amdgpu_device_vram_access, since the function is still called
in ttm access_memory with BAR.<o:p></o:p></p>
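<p class="MsoNormal">For reference, the change being reverted splits
each request into a part copied through the BAR and a remainder left
to the MM path. A minimal sketch of that split arithmetic, assuming a
hypothetical helper bar_bytes() (not a function in amdgpu) that
mirrors the min(pos + size, visible_vram_size) computation in the
patch:<o:p></o:p></p>

```c
#include <stdint.h>

/*
 * Hypothetical helper (not part of amdgpu): number of bytes of the
 * request [pos, pos + size) that lie below visible_vram_size and would
 * therefore be copied through the BAR. The remaining
 * size - bar_bytes() bytes would fall through to the MM path.
 */
uint64_t bar_bytes(uint64_t pos, uint64_t size, uint64_t visible_vram_size)
{
    uint64_t end = pos + size;
    uint64_t last = end < visible_vram_size ? end : visible_vram_size;

    return last > pos ? last - pos : 0;
}
```

<p class="MsoNormal">With a 256-byte visible window, a 16-byte request
at offset 248 would give 8 bytes through the BAR and 8 bytes through
the MM path.<o:p></o:p></p>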
<p class="MsoNormal"> <o:p></o:p></p>
<p class="MsoNormal">Also cwsr tests fail on Vega20 with or
without the revert with the same RAS error.<o:p></o:p></p>
<p class="MsoNormal"> <o:p></o:p></p>
<p class="MsoNormal">Thanks,<o:p></o:p></p>
<p class="MsoNormal"> <o:p></o:p></p>
<p class="MsoNormal">Jon<o:p></o:p></p>
<p class="MsoNormal"> <o:p></o:p></p>
<div>
<div style="border:none;border-top:solid #E1E1E1
1.0pt;padding:3.0pt 0in 0in 0in">
<p class="MsoNormal"><b>From:</b> Kuehling, Felix <a href="mailto:Felix.Kuehling@amd.com" moz-do-not-send="true">
<Felix.Kuehling@amd.com></a> <br>
<b>Sent:</b> Tuesday, April 14, 2020 2:32 PM<br>
<b>To:</b> Kim, Jonathan <a href="mailto:Jonathan.Kim@amd.com" moz-do-not-send="true"><Jonathan.Kim@amd.com></a>;
Koenig, Christian
<a href="mailto:Christian.Koenig@amd.com" moz-do-not-send="true"><Christian.Koenig@amd.com></a>;
Deucher, Alexander
<a href="mailto:Alexander.Deucher@amd.com" moz-do-not-send="true"><Alexander.Deucher@amd.com></a><br>
<b>Cc:</b> Russell, Kent <a href="mailto:Kent.Russell@amd.com" moz-do-not-send="true"><Kent.Russell@amd.com></a>;
<a href="mailto:amd-gfx@lists.freedesktop.org" moz-do-not-send="true">amd-gfx@lists.freedesktop.org</a><br>
<b>Subject:</b> Re: [PATCH] Revert "drm/amdgpu: use the
BAR if possible in amdgpu_device_vram_access v2"<o:p></o:p></p>
</div>
</div>
<p class="MsoNormal"> <o:p></o:p></p>
<p>I wouldn't call it premature. Reverting is the usual practice
when there is a serious regression that isn't fully
understood or root-caused. As far as I can tell, the problem
has been reproduced on multiple systems and different GPUs, and
it clearly regressed to Christian's commit. I think that
justifies reverting it for now.<o:p></o:p></p>
<p>I agree with Christian that a general HDP memory access
problem causing RAS errors would potentially cause problems
in other tests as well. For example, common operations like
GART table updates, GPUVM page table updates, and PCIe
peer-to-peer accesses in ROCm applications use HDP. But we're
not seeing obvious problems from those. So we need to
understand what's special about this test. I asked questions
to that effect on our other email thread.<o:p></o:p></p>
<p>Regards,<br>
Felix<o:p></o:p></p>
<div>
<p class="MsoNormal">On 2020-04-14 at 10:51 a.m.,
Kim, Jonathan wrote:<o:p></o:p></p>
</div>
<blockquote style="margin-top:5.0pt;margin-bottom:5.0pt">
<p class="msipheader4d0fcdd7" style="margin:0in;margin-bottom:.0001pt"><span style="font-size:10.0pt;font-family:"Arial",sans-serif;color:#0078D7">[AMD
Official Use Only - Internal Distribution Only]</span><o:p></o:p></p>
<p class="MsoNormal"> <o:p></o:p></p>
<p class="MsoNormal">I think it’s premature to push this
revert.<o:p></o:p></p>
<p class="MsoNormal"> <o:p></o:p></p>
<p class="MsoNormal">With more testing, I’m getting failures
from different tests or sometimes none at all on my
machine.<o:p></o:p></p>
<p class="MsoNormal"> <o:p></o:p></p>
<p class="MsoNormal">Kent, let’s continue the discussion on
the original thread.<o:p></o:p></p>
<p class="MsoNormal"> <o:p></o:p></p>
<p class="MsoNormal">Thanks,<o:p></o:p></p>
<p class="MsoNormal"> <o:p></o:p></p>
<p class="MsoNormal">Jon<o:p></o:p></p>
<p class="MsoNormal"> <o:p></o:p></p>
<div>
<div style="border:none;border-top:solid #E1E1E1
1.0pt;padding:3.0pt 0in 0in 0in">
<p class="MsoNormal"><b>From:</b> Koenig, Christian <a href="mailto:Christian.Koenig@amd.com" moz-do-not-send="true">
<Christian.Koenig@amd.com></a> <br>
<b>Sent:</b> Tuesday, April 14, 2020 10:47 AM<br>
<b>To:</b> Deucher, Alexander <a href="mailto:Alexander.Deucher@amd.com" moz-do-not-send="true"><Alexander.Deucher@amd.com></a><br>
<b>Cc:</b> Russell, Kent <a href="mailto:Kent.Russell@amd.com" moz-do-not-send="true"><Kent.Russell@amd.com></a>;
<a href="mailto:amd-gfx@lists.freedesktop.org" moz-do-not-send="true">amd-gfx@lists.freedesktop.org</a>;
Kuehling, Felix
<a href="mailto:Felix.Kuehling@amd.com" moz-do-not-send="true"><Felix.Kuehling@amd.com></a>;
Kim, Jonathan
<a href="mailto:Jonathan.Kim@amd.com" moz-do-not-send="true"><Jonathan.Kim@amd.com></a><br>
<b>Subject:</b> Re: [PATCH] Revert "drm/amdgpu: use
the BAR if possible in amdgpu_device_vram_access v2"<o:p></o:p></p>
</div>
</div>
<p class="MsoNormal"> <o:p></o:p></p>
<div>
<div>
<div>
<div>
<div>
<p class="MsoNormal">That's exactly my concern as
well. <o:p></o:p></p>
<div>
<p class="MsoNormal"> <o:p></o:p></p>
</div>
<div>
<p class="MsoNormal">This looks a bit like the
test creates erroneous data somehow, but there
doesn't seem to be a RAS check in the MM data
path.<o:p></o:p></p>
</div>
<div>
<p class="MsoNormal"> <o:p></o:p></p>
</div>
<div>
<p class="MsoNormal">And now that we use the BAR
path it goes up in flames.<o:p></o:p></p>
</div>
<div>
<p class="MsoNormal"> <o:p></o:p></p>
</div>
<div>
<p class="MsoNormal">I just don't see how we can
create erroneous data in a test case?<o:p></o:p></p>
</div>
<div>
<p class="MsoNormal"> <o:p></o:p></p>
</div>
<div>
<p class="MsoNormal">Christian.<o:p></o:p></p>
</div>
</div>
<div>
<p class="MsoNormal"> <o:p></o:p></p>
<div>
<p class="MsoNormal">On 14.04.2020 at 16:35,
"Deucher, Alexander" <<a href="mailto:Alexander.Deucher@amd.com" moz-do-not-send="true">Alexander.Deucher@amd.com</a>> wrote:<o:p></o:p></p>
<blockquote style="border:none;border-left:solid
#CCCCCC 1.0pt;padding:0in 0in 0in
6.0pt;margin-left:4.8pt;margin-top:5.0pt;margin-right:0in;margin-bottom:5.0pt">
<div>
<p style="margin:15.0pt"><span style="font-size:10.0pt;font-family:"Arial",sans-serif;color:#317100">[AMD
Public Use]</span><o:p></o:p></p>
<p class="MsoNormal"> <o:p></o:p></p>
<div>
<div>
<p class="MsoNormal"><span style="font-size:12.0pt">If this
causes an issue, any access to vram
via the BAR could cause an issue.</span><o:p></o:p></p>
</div>
<div>
<p class="MsoNormal"><span style="font-size:12.0pt"> </span><o:p></o:p></p>
</div>
<div>
<p class="MsoNormal"><span style="font-size:12.0pt">Alex</span><o:p></o:p></p>
</div>
<div class="MsoNormal" style="text-align:center" align="center">
<hr width="98%" size="2" align="center">
</div>
<div>
<p class="MsoNormal"><b>From:</b>
amd-gfx <<a href="mailto:amd-gfx-bounces@lists.freedesktop.org" moz-do-not-send="true">amd-gfx-bounces@lists.freedesktop.org</a>>
on behalf of Russell, Kent <<a href="mailto:Kent.Russell@amd.com" moz-do-not-send="true">Kent.Russell@amd.com</a>><br>
<b>Sent:</b> Tuesday, April 14, 2020
10:19 AM<br>
<b>To:</b> Koenig, Christian <<a href="mailto:Christian.Koenig@amd.com" moz-do-not-send="true">Christian.Koenig@amd.com</a>>;
<a href="mailto:amd-gfx@lists.freedesktop.org" moz-do-not-send="true">amd-gfx@lists.freedesktop.org</a>
<<a href="mailto:amd-gfx@lists.freedesktop.org" moz-do-not-send="true">amd-gfx@lists.freedesktop.org</a>><br>
<b>Cc:</b> Kuehling, Felix <<a href="mailto:Felix.Kuehling@amd.com" moz-do-not-send="true">Felix.Kuehling@amd.com</a>>;
Kim, Jonathan <<a href="mailto:Jonathan.Kim@amd.com" moz-do-not-send="true">Jonathan.Kim@amd.com</a>><br>
<b>Subject:</b> RE: [PATCH] Revert
"drm/amdgpu: use the BAR if possible
in amdgpu_device_vram_access v2"
<o:p></o:p></p>
<div>
<p class="MsoNormal"> <o:p></o:p></p>
</div>
</div>
<div>
<div>
<p class="MsoNormal">[AMD Official Use
Only - Internal Distribution Only]<br>
<br>
On VG20 or MI100, as soon as we run
the subtest, we get the dmesg output
below, and then the kernel ends up
hanging. I don't know enough about
the test itself to know why this is
occurring, but Jon Kim and Felix
were discussing it on a separate
thread when the issue was first
reported, so they can hopefully
provide some additional information.<br>
<br>
Kent<br>
<br>
> -----Original Message-----<br>
> From: Christian König <<a href="mailto:ckoenig.leichtzumerken@gmail.com" moz-do-not-send="true">ckoenig.leichtzumerken@gmail.com</a>><br>
> Sent: Tuesday, April 14, 2020
9:52 AM<br>
> To: Russell, Kent <<a href="mailto:Kent.Russell@amd.com" moz-do-not-send="true">Kent.Russell@amd.com</a>>;
<a href="mailto:amd-gfx@lists.freedesktop.org" moz-do-not-send="true">amd-gfx@lists.freedesktop.org</a><br>
> Subject: Re: [PATCH] Revert
"drm/amdgpu: use the BAR if possible
in<br>
> amdgpu_device_vram_access v2"<br>
> <br>
> On 13.04.20 at 20:20, Kent Russell wrote:<br>
> > This reverts commit
c12b84d6e0d70f1185e6daddfd12afb671791b6e.<br>
> > The original patch causes
a RAS event and subsequent kernel
hard-hang<br>
> > when running the
KFDMemoryTest.PtraceAccessInvisibleVram
on VG20 and<br>
> > Arcturus<br>
> ><br>
> > dmesg output at hang time:<br>
> > [drm] RAS event of type
ERREVENT_ATHUB_INTERRUPT detected!<br>
> > amdgpu 0000:67:00.0: GPU
reset begin!<br>
> > Evicting PASID 0x8000
queues<br>
> > Started evicting pasid
0x8000<br>
> > qcm fence wait loop
timeout expired<br>
> > The cp might be in an unrecoverable state due to an unsuccessful queues preemption<br>
> > Failed to evict process queues<br>
> > Failed to suspend process 0x8000<br>
> > Finished evicting pasid 0x8000<br>
> > Started restoring pasid 0x8000<br>
> > Finished restoring pasid 0x8000<br>
> > [drm] UVD VCPU state may lost due to RAS ERREVENT_ATHUB_INTERRUPT<br>
> > amdgpu: [powerplay] Failed
to send message 0x26, response 0x0<br>
> > amdgpu: [powerplay] Failed
to set soft min gfxclk !<br>
> > amdgpu: [powerplay] Failed
to upload DPM Bootup Levels!<br>
> > amdgpu: [powerplay] Failed
to send message 0x7, response 0x0<br>
> > amdgpu: [powerplay]
[DisableAllSMUFeatures] Failed to
disable all smu<br>
> features!<br>
> > amdgpu: [powerplay]
[DisableDpmTasks] Failed to disable
all smu features!<br>
> > amdgpu: [powerplay]
[PowerOffAsic] Failed to disable
DPM!<br>
> >
[drm:amdgpu_device_ip_suspend_phase2
[amdgpu]] *ERROR* suspend of IP<br>
> > block <powerplay>
failed -5<br>
> <br>
> Do you have more information on what's going wrong here? This is a really<br>
> important patch for KFD debugging.<br>
> <br>
> ><br>
> > Signed-off-by: Kent
Russell <<a href="mailto:kent.russell@amd.com" moz-do-not-send="true">kent.russell@amd.com</a>><br>
> <br>
> Reviewed-by: Christian König
<<a href="mailto:christian.koenig@amd.com" moz-do-not-send="true">christian.koenig@amd.com</a>><br>
> <br>
> > ---<br>
> >
drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
| 26 ----------------------<br>
> > 1 file changed, 26
deletions(-)<br>
> ><br>
> > diff --git
a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c<br>
> >
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c<br>
> > index
cf5d6e585634..a3f997f84020 100644<br>
> > ---
a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c<br>
> > +++
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c<br>
> > @@ -254,32 +254,6 @@ void
amdgpu_device_vram_access(struct<br>
> amdgpu_device *adev, loff_t
pos,<br>
> > uint32_t hi = ~0;<br>
> > uint64_t last;<br>
> ><br>
> > -<br>
> > -#ifdef CONFIG_64BIT<br>
> > - last = min(pos + size,
adev->gmc.visible_vram_size);<br>
> > - if (last > pos) {<br>
> > - void __iomem
*addr =
adev->mman.aper_base_kaddr + pos;<br>
> > - size_t count =
last - pos;<br>
> > -<br>
> > - if (write) {<br>
> > -
memcpy_toio(addr, buf, count);<br>
> > - mb();<br>
> > -
amdgpu_asic_flush_hdp(adev, NULL);<br>
> > - } else {<br>
> > -
amdgpu_asic_invalidate_hdp(adev,
NULL);<br>
> > - mb();<br>
> > -
memcpy_fromio(buf, addr, count);<br>
> > - }<br>
> > -<br>
> > - if (count ==
size)<br>
> > -
return;<br>
> > -<br>
> > - pos += count;<br>
> > - buf += count /
4;<br>
> > - size -= count;<br>
> > - }<br>
> > -#endif<br>
> > -<br>
> >
spin_lock_irqsave(&adev->mmio_idx_lock,
flags);<br>
> > for (last = pos +
size; pos < last; pos += 4) {<br>
> > uint32_t tmp
= pos >> 31;<br>
_______________________________________________<br>
amd-gfx mailing list<br>
<a href="mailto:amd-gfx@lists.freedesktop.org" moz-do-not-send="true">amd-gfx@lists.freedesktop.org</a><br>
<a href="https://lists.freedesktop.org/mailman/listinfo/amd-gfx" moz-do-not-send="true">https://lists.freedesktop.org/mailman/listinfo/amd-gfx</a><o:p></o:p></p>
</div>
</div>
</div>
</div>
</blockquote>
</div>
<p class="MsoNormal"> <o:p></o:p></p>
</div>
</div>
</div>
<div>
<p class="MsoNormal"> <o:p></o:p></p>
<div>
<p class="MsoNormal">On 14.04.2020 at 16:35,
"Deucher, Alexander" <<a href="mailto:Alexander.Deucher@amd.com" moz-do-not-send="true">Alexander.Deucher@amd.com</a>> wrote:<o:p></o:p></p>
<blockquote style="border:none;border-left:solid
#CCCCCC 1.0pt;padding:0in 0in 0in
6.0pt;margin-left:4.8pt;margin-top:5.0pt;margin-right:0in;margin-bottom:5.0pt">
<div>
<p style="margin:15.0pt"><span style="font-size:10.0pt;font-family:"Arial",sans-serif;color:#317100">[AMD
Public Use]</span><o:p></o:p></p>
<p class="MsoNormal"> <o:p></o:p></p>
<div>
<div>
<p class="MsoNormal"><span style="font-size:12.0pt">If this causes
an issue, any access to vram via the BAR
could cause an issue.</span><o:p></o:p></p>
</div>
<div>
<p class="MsoNormal"><span style="font-size:12.0pt"> </span><o:p></o:p></p>
</div>
<div>
<p class="MsoNormal"><span style="font-size:12.0pt">Alex</span><o:p></o:p></p>
</div>
<div class="MsoNormal" style="text-align:center" align="center">
<hr width="98%" size="2" align="center">
</div>
<div>
<p class="MsoNormal"><b>From:</b> amd-gfx
<<a href="mailto:amd-gfx-bounces@lists.freedesktop.org" moz-do-not-send="true">amd-gfx-bounces@lists.freedesktop.org</a>>
on behalf of Russell, Kent <<a href="mailto:Kent.Russell@amd.com" moz-do-not-send="true">Kent.Russell@amd.com</a>><br>
<b>Sent:</b> Tuesday, April 14, 2020 10:19
AM<br>
<b>To:</b> Koenig, Christian <<a href="mailto:Christian.Koenig@amd.com" moz-do-not-send="true">Christian.Koenig@amd.com</a>>;
<a href="mailto:amd-gfx@lists.freedesktop.org" moz-do-not-send="true">amd-gfx@lists.freedesktop.org</a>
<<a href="mailto:amd-gfx@lists.freedesktop.org" moz-do-not-send="true">amd-gfx@lists.freedesktop.org</a>><br>
<b>Cc:</b> Kuehling, Felix <<a href="mailto:Felix.Kuehling@amd.com" moz-do-not-send="true">Felix.Kuehling@amd.com</a>>;
Kim, Jonathan <<a href="mailto:Jonathan.Kim@amd.com" moz-do-not-send="true">Jonathan.Kim@amd.com</a>><br>
<b>Subject:</b> RE: [PATCH] Revert
"drm/amdgpu: use the BAR if possible in
amdgpu_device_vram_access v2"
<o:p></o:p></p>
<div>
<p class="MsoNormal"> <o:p></o:p></p>
</div>
</div>
<div>
<div>
<p class="MsoNormal">[AMD Official Use
Only - Internal Distribution Only]<br>
<br>
On VG20 or MI100, as soon as we run the
subtest, we get the dmesg output below,
and then the kernel ends up hanging. I
don't know enough about the test itself
to know why this is occurring, but Jon
Kim and Felix were discussing it on a
separate thread when the issue was first
reported, so they can hopefully provide
some additional information.<br>
<br>
Kent<br>
<br>
> -----Original Message-----<br>
> From: Christian König <<a href="mailto:ckoenig.leichtzumerken@gmail.com" moz-do-not-send="true">ckoenig.leichtzumerken@gmail.com</a>><br>
> Sent: Tuesday, April 14, 2020 9:52
AM<br>
> To: Russell, Kent <<a href="mailto:Kent.Russell@amd.com" moz-do-not-send="true">Kent.Russell@amd.com</a>>;
<a href="mailto:amd-gfx@lists.freedesktop.org" moz-do-not-send="true">amd-gfx@lists.freedesktop.org</a><br>
> Subject: Re: [PATCH] Revert
"drm/amdgpu: use the BAR if possible in<br>
> amdgpu_device_vram_access v2"<br>
> <br>
> On 13.04.20 at 20:20, Kent Russell wrote:<br>
> > This reverts commit
c12b84d6e0d70f1185e6daddfd12afb671791b6e.<br>
> > The original patch causes a
RAS event and subsequent kernel
hard-hang<br>
> > when running the
KFDMemoryTest.PtraceAccessInvisibleVram
on VG20 and<br>
> > Arcturus<br>
> ><br>
> > dmesg output at hang time:<br>
> > [drm] RAS event of type
ERREVENT_ATHUB_INTERRUPT detected!<br>
> > amdgpu 0000:67:00.0: GPU reset
begin!<br>
> > Evicting PASID 0x8000 queues<br>
> > Started evicting pasid 0x8000<br>
> > qcm fence wait loop timeout
expired<br>
> > The cp might be in an unrecoverable state due to an unsuccessful queues preemption<br>
> > Failed to evict process queues<br>
> > Failed to suspend process 0x8000<br>
> > Finished evicting pasid 0x8000<br>
> > Started restoring pasid 0x8000<br>
> > Finished restoring pasid 0x8000<br>
> > [drm] UVD VCPU state may lost due to RAS ERREVENT_ATHUB_INTERRUPT<br>
> > amdgpu: [powerplay] Failed to
send message 0x26, response 0x0<br>
> > amdgpu: [powerplay] Failed to
set soft min gfxclk !<br>
> > amdgpu: [powerplay] Failed to
upload DPM Bootup Levels!<br>
> > amdgpu: [powerplay] Failed to
send message 0x7, response 0x0<br>
> > amdgpu: [powerplay]
[DisableAllSMUFeatures] Failed to
disable all smu<br>
> features!<br>
> > amdgpu: [powerplay]
[DisableDpmTasks] Failed to disable all
smu features!<br>
> > amdgpu: [powerplay]
[PowerOffAsic] Failed to disable DPM!<br>
> >
[drm:amdgpu_device_ip_suspend_phase2
[amdgpu]] *ERROR* suspend of IP<br>
> > block <powerplay> failed
-5<br>
> <br>
> Do you have more information on
what's going wrong here? This is a
really<br>
> important patch for KFD debugging.<br>
> <br>
> ><br>
> > Signed-off-by: Kent Russell
<<a href="mailto:kent.russell@amd.com" moz-do-not-send="true">kent.russell@amd.com</a>><br>
> <br>
> Reviewed-by: Christian König <<a href="mailto:christian.koenig@amd.com" moz-do-not-send="true">christian.koenig@amd.com</a>><br>
> <br>
> > ---<br>
> >
drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
| 26 ----------------------<br>
> > 1 file changed, 26
deletions(-)<br>
> ><br>
> > diff --git
a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c<br>
> >
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c<br>
> > index
cf5d6e585634..a3f997f84020 100644<br>
> > ---
a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c<br>
> > +++
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c<br>
> > @@ -254,32 +254,6 @@ void
amdgpu_device_vram_access(struct<br>
> amdgpu_device *adev, loff_t pos,<br>
> > uint32_t hi = ~0;<br>
> > uint64_t last;<br>
> ><br>
> > -<br>
> > -#ifdef CONFIG_64BIT<br>
> > - last = min(pos + size,
adev->gmc.visible_vram_size);<br>
> > - if (last > pos) {<br>
> > - void __iomem *addr
= adev->mman.aper_base_kaddr + pos;<br>
> > - size_t count =
last - pos;<br>
> > -<br>
> > - if (write) {<br>
> > -
memcpy_toio(addr, buf, count);<br>
> > - mb();<br>
> > -
amdgpu_asic_flush_hdp(adev, NULL);<br>
> > - } else {<br>
> > -
amdgpu_asic_invalidate_hdp(adev, NULL);<br>
> > - mb();<br>
> > -
memcpy_fromio(buf, addr, count);<br>
> > - }<br>
> > -<br>
> > - if (count == size)<br>
> > - return;<br>
> > -<br>
> > - pos += count;<br>
> > - buf += count / 4;<br>
> > - size -= count;<br>
> > - }<br>
> > -#endif<br>
> > -<br>
> >
spin_lock_irqsave(&adev->mmio_idx_lock,
flags);<br>
> > for (last = pos + size;
pos < last; pos += 4) {<br>
> > uint32_t tmp =
pos >> 31;<br>
_______________________________________________<br>
amd-gfx mailing list<br>
<a href="mailto:amd-gfx@lists.freedesktop.org" moz-do-not-send="true">amd-gfx@lists.freedesktop.org</a><br>
<a href="https://lists.freedesktop.org/mailman/listinfo/amd-gfx" moz-do-not-send="true">https://lists.freedesktop.org/mailman/listinfo/amd-gfx</a><o:p></o:p></p>
</div>
</div>
</div>
</div>
</blockquote>
</div>
<p class="MsoNormal"> <o:p></o:p></p>
</div>
</div>
<div>
<p class="MsoNormal"> <o:p></o:p></p>
<div>
<p class="MsoNormal">On 14.04.2020 at 16:35,
"Deucher, Alexander" <<a href="mailto:Alexander.Deucher@amd.com" moz-do-not-send="true">Alexander.Deucher@amd.com</a>> wrote:<o:p></o:p></p>
<blockquote style="border:none;border-left:solid
#CCCCCC 1.0pt;padding:0in 0in 0in
6.0pt;margin-left:4.8pt;margin-top:5.0pt;margin-right:0in;margin-bottom:5.0pt">
<div>
<p style="margin:15.0pt"><span style="font-size:10.0pt;font-family:"Arial",sans-serif;color:#317100">[AMD
Public Use]</span><o:p></o:p></p>
<p class="MsoNormal"> <o:p></o:p></p>
<div>
<div>
<p class="MsoNormal"><span style="font-size:12.0pt">If this causes an
issue, any access to vram via the BAR
could cause an issue.</span><o:p></o:p></p>
</div>
<div>
<p class="MsoNormal"><span style="font-size:12.0pt"> </span><o:p></o:p></p>
</div>
<div>
<p class="MsoNormal"><span style="font-size:12.0pt">Alex</span><o:p></o:p></p>
</div>
<div class="MsoNormal" style="text-align:center" align="center">
<hr width="98%" size="2" align="center">
</div>
<div>
<p class="MsoNormal"><b>From:</b> amd-gfx <<a href="mailto:amd-gfx-bounces@lists.freedesktop.org" moz-do-not-send="true">amd-gfx-bounces@lists.freedesktop.org</a>>
on behalf of Russell, Kent <<a href="mailto:Kent.Russell@amd.com" moz-do-not-send="true">Kent.Russell@amd.com</a>><br>
<b>Sent:</b> Tuesday, April 14, 2020 10:19
AM<br>
<b>To:</b> Koenig, Christian <<a href="mailto:Christian.Koenig@amd.com" moz-do-not-send="true">Christian.Koenig@amd.com</a>>;
<a href="mailto:amd-gfx@lists.freedesktop.org" moz-do-not-send="true">amd-gfx@lists.freedesktop.org</a>
<<a href="mailto:amd-gfx@lists.freedesktop.org" moz-do-not-send="true">amd-gfx@lists.freedesktop.org</a>><br>
<b>Cc:</b> Kuehling, Felix <<a href="mailto:Felix.Kuehling@amd.com" moz-do-not-send="true">Felix.Kuehling@amd.com</a>>;
Kim, Jonathan <<a href="mailto:Jonathan.Kim@amd.com" moz-do-not-send="true">Jonathan.Kim@amd.com</a>><br>
<b>Subject:</b> RE: [PATCH] Revert
"drm/amdgpu: use the BAR if possible in
amdgpu_device_vram_access v2"
<o:p></o:p></p>
<div>
<p class="MsoNormal"> <o:p></o:p></p>
</div>
</div>
<div>
<div>
<p class="MsoNormal">[AMD Official Use Only
- Internal Distribution Only]<br>
<br>
On VG20 or MI100, as soon as we run the
subtest, we get the dmesg output below,
and then the kernel ends up hanging. I
don't know enough about the test itself to
know why this is occurring, but Jon Kim
and Felix were discussing it on a separate
thread when the issue was first reported,
so they can hopefully provide some
additional information.<br>
<br>
Kent<br>
<br>
> -----Original Message-----<br>
> From: Christian König <<a href="mailto:ckoenig.leichtzumerken@gmail.com" moz-do-not-send="true">ckoenig.leichtzumerken@gmail.com</a>><br>
> Sent: Tuesday, April 14, 2020 9:52 AM<br>
> To: Russell, Kent <<a href="mailto:Kent.Russell@amd.com" moz-do-not-send="true">Kent.Russell@amd.com</a>>;
<a href="mailto:amd-gfx@lists.freedesktop.org" moz-do-not-send="true">amd-gfx@lists.freedesktop.org</a><br>
> Subject: Re: [PATCH] Revert
"drm/amdgpu: use the BAR if possible in<br>
> amdgpu_device_vram_access v2"<br>
> <br>
> On 13.04.20 at 20:20, Kent
Russell wrote:<br>
> > This reverts commit
c12b84d6e0d70f1185e6daddfd12afb671791b6e.<br>
> > The original patch causes a RAS
event and subsequent kernel hard-hang<br>
> > when running the
KFDMemoryTest.PtraceAccessInvisibleVram on
VG20 and<br>
> > Arcturus<br>
> ><br>
> > dmesg output at hang time:<br>
> > [drm] RAS event of type
ERREVENT_ATHUB_INTERRUPT detected!<br>
> > amdgpu 0000:67:00.0: GPU reset
begin!<br>
> > Evicting PASID 0x8000 queues<br>
> > Started evicting pasid 0x8000<br>
> > qcm fence wait loop timeout
expired<br>
> > The cp might be in an
unrecoverable state due to an unsuccessful<br>
> > queues preemption Failed to
evict process queues Failed to suspend<br>
> > process 0x8000 Finished evicting
pasid 0x8000 Started restoring pasid<br>
> > 0x8000 Finished restoring pasid
0x8000 [drm] UVD VCPU state may lost<br>
> > due to RAS
ERREVENT_ATHUB_INTERRUPT<br>
> > amdgpu: [powerplay] Failed to
send message 0x26, response 0x0<br>
> > amdgpu: [powerplay] Failed to
set soft min gfxclk !<br>
> > amdgpu: [powerplay] Failed to
upload DPM Bootup Levels!<br>
> > amdgpu: [powerplay] Failed to
send message 0x7, response 0x0<br>
> > amdgpu: [powerplay]
[DisableAllSMUFeatures] Failed to disable
all smu<br>
> features!<br>
> > amdgpu: [powerplay]
[DisableDpmTasks] Failed to disable all
smu features!<br>
> > amdgpu: [powerplay]
[PowerOffAsic] Failed to disable DPM!<br>
> >
[drm:amdgpu_device_ip_suspend_phase2
[amdgpu]] *ERROR* suspend of IP<br>
> > block <powerplay> failed
-5<br>
> <br>
> Do you have more information on
what's going wrong here? This is a
really<br>
> important patch for KFD debugging.<br>
> <br>
> ><br>
> > Signed-off-by: Kent Russell <<a href="mailto:kent.russell@amd.com" moz-do-not-send="true">kent.russell@amd.com</a>><br>
> <br>
> Reviewed-by: Christian König <<a href="mailto:christian.koenig@amd.com" moz-do-not-send="true">christian.koenig@amd.com</a>><br>
> <br>
> > ---<br>
> >
drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
| 26 ----------------------<br>
> > 1 file changed, 26
deletions(-)<br>
> ><br>
> > diff --git
a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c<br>
> >
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c<br>
> > index cf5d6e585634..a3f997f84020
100644<br>
> > ---
a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c<br>
> > +++
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c<br>
> > @@ -254,32 +254,6 @@ void
amdgpu_device_vram_access(struct<br>
> amdgpu_device *adev, loff_t pos,<br>
> > uint32_t hi = ~0;<br>
> > uint64_t last;<br>
> ><br>
> > -<br>
> > -#ifdef CONFIG_64BIT<br>
> > - last = min(pos + size,
adev->gmc.visible_vram_size);<br>
> > - if (last > pos) {<br>
> > - void __iomem *addr =
adev->mman.aper_base_kaddr + pos;<br>
> > - size_t count = last
- pos;<br>
> > -<br>
> > - if (write) {<br>
> > -
memcpy_toio(addr, buf, count);<br>
> > - mb();<br>
> > -
amdgpu_asic_flush_hdp(adev, NULL);<br>
> > - } else {<br>
> > -
amdgpu_asic_invalidate_hdp(adev, NULL);<br>
> > - mb();<br>
> > -
memcpy_fromio(buf, addr, count);<br>
> > - }<br>
> > -<br>
> > - if (count == size)<br>
> > - return;<br>
> > -<br>
> > - pos += count;<br>
> > - buf += count / 4;<br>
> > - size -= count;<br>
> > - }<br>
> > -#endif<br>
> > -<br>
> >
spin_lock_irqsave(&adev->mmio_idx_lock,
flags);<br>
> > for (last = pos + size; pos
< last; pos += 4) {<br>
> > uint32_t tmp = pos
>> 31;<br>
_______________________________________________<br>
amd-gfx mailing list<br>
<a href="mailto:amd-gfx@lists.freedesktop.org" moz-do-not-send="true">amd-gfx@lists.freedesktop.org</a><br>
<a href="https://lists.freedesktop.org/mailman/listinfo/amd-gfx" moz-do-not-send="true">https://lists.freedesktop.org/mailman/listinfo/amd-gfx</a><o:p></o:p></p>
</div>
</div>
</div>
</div>
</blockquote>
</div>
<p class="MsoNormal"> <o:p></o:p></p>
</div>
</div>
</blockquote>
</blockquote>
<p class="MsoNormal"><o:p> </o:p></p>
</div>
</blockquote>
<br>
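For reference, the arithmetic in the removed hunk quoted above (last = min(pos + size, visible_vram_size), count = last - pos) can be sketched in user space as follows. This is only an illustration under stated assumptions, not the driver code: struct vram_split and split_vram_access() are hypothetical names, and the sketch only computes how many bytes of a request would have gone through the BAR mapping and how many would be left for the indexed MMIO loop that follows the hunk.<br>

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical result type for this sketch; not part of amdgpu. */
struct vram_split {
    uint64_t bar_bytes;  /* bytes inside the CPU-visible window */
    uint64_t mmio_bytes; /* bytes left for the indexed MMIO loop */
};

/*
 * Hypothetical helper mirroring the arithmetic of the removed hunk:
 * last = min(pos + size, visible_vram_size), count = last - pos.
 */
static struct vram_split split_vram_access(uint64_t pos, uint64_t size,
                                           uint64_t visible_vram_size)
{
    struct vram_split s = { 0, size };
    uint64_t end, last;

    assert(size <= UINT64_MAX - pos); /* pos + size must not wrap */
    end = pos + size;
    last = end < visible_vram_size ? end : visible_vram_size;

    /* Only the part of the range below the visible limit uses the BAR. */
    if (last > pos) {
        s.bar_bytes = last - pos;
        s.mmio_bytes = size - s.bar_bytes;
    }
    return s;
}
```

With an assumed 256 MiB visible window, a 16-byte access that starts 8 bytes below the limit is split into 8 bytes for each path, while an access that starts at or above the limit never touches the BAR path.<br>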
</body>
</html>