<html><head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
</head>
<body text="#000000" bgcolor="#FFFFFF">
<div class="moz-cite-prefix">Hi Jon,<br>
<br>
<blockquote type="cite">Also cwsr tests fail on Vega20 with or
without the revert with the same RAS error.</blockquote>
<br>
That sounds like the system/setup has a more general problem.<br>
<br>
Could it be that we are seeing RAS errors because there really is
some hardware failure, but with the MM path we don't trigger a RAS
interrupt?<br>
<br>
Thanks,<br>
Christian.<br>
<br>
On 14.04.20 at 22:30, Kim, Jonathan wrote:<br>
</div>
<blockquote type="cite" cite="mid:MN2PR12MB4518A3D9746674DA688AD34885DA0@MN2PR12MB4518.namprd12.prod.outlook.com">
<style><!--
/* Font Definitions */
@font-face
{font-family:"Cambria Math";
panose-1:2 4 5 3 5 4 6 3 2 4;}
@font-face
{font-family:Calibri;
panose-1:2 15 5 2 2 2 4 3 2 4;}
/* Style Definitions */
p.MsoNormal, li.MsoNormal, div.MsoNormal
{margin:0in;
margin-bottom:.0001pt;
font-size:11.0pt;
font-family:"Calibri",sans-serif;}
a:link, span.MsoHyperlink
{mso-style-priority:99;
color:blue;
text-decoration:underline;}
p.msipheader4d0fcdd7, li.msipheader4d0fcdd7, div.msipheader4d0fcdd7
{mso-style-name:msipheader4d0fcdd7;
mso-margin-top-alt:auto;
margin-right:0in;
mso-margin-bottom-alt:auto;
margin-left:0in;
font-size:11.0pt;
font-family:"Calibri",sans-serif;}
span.EmailStyle20
{mso-style-type:personal-compose;
font-family:"Arial",sans-serif;
color:#0078D7;}
.MsoChpDefault
{mso-style-type:export-only;
font-size:10.0pt;}
@page WordSection1
{size:8.5in 11.0in;
margin:1.0in 1.0in 1.0in 1.0in;}
div.WordSection1
{page:WordSection1;}
--></style>
<div class="WordSection1">
<p class="msipheader4d0fcdd7" style="margin:0in;margin-bottom:.0001pt"><span style="font-size:10.0pt;font-family:'Arial',sans-serif;color:#0078D7">[AMD
Official Use Only - Internal Distribution Only]</span><o:p></o:p></p>
<p class="MsoNormal"><o:p> </o:p></p>
<p class="MsoNormal">If the test passes with the revert, then
the only difference is that we no longer invalidate HDP and
copy to the host in amdgpu_device_vram_access, since the
function is still called from ttm access_memory with the BAR.<o:p></o:p></p>
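<p class="MsoNormal">A minimal userspace sketch of how the reverted hunk decides which part of an access goes through the BAR, based only on the diff quoted further down in this thread. The names <code>split_vram_access</code> and <code>struct vram_split</code> are hypothetical and exist only for illustration; the real function also performs the copies and the HDP flush or invalidate, which this sketch leaves out.</p>

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper type, not part of amdgpu: how one access is split. */
struct vram_split {
    uint64_t bar_bytes; /* bytes inside CPU-visible VRAM, served via the BAR */
    uint64_t mm_bytes;  /* bytes left over for the MM register loop */
};

/*
 * Mirrors the arithmetic of the removed hunk:
 *   last = min(pos + size, visible_vram_size);
 *   if (last > pos) count = last - pos;
 * Everything in [pos, last) can use the BAR; the remainder cannot.
 */
static struct vram_split split_vram_access(uint64_t pos, uint64_t size,
                                           uint64_t visible_vram_size)
{
    struct vram_split s = { 0, size };
    uint64_t last = pos + size;

    if (last > visible_vram_size)
        last = visible_vram_size;

    if (last > pos) {
        s.bar_bytes = last - pos;
        s.mm_bytes = size - s.bar_bytes;
    }
    return s;
}
```

<p class="MsoNormal">For example, with a 256 MiB visible aperture, a 16-byte access that starts 8 bytes below the end of the aperture would be split into 8 bytes through the BAR and 8 bytes through the MM path. An access that lies entirely in invisible VRAM never touches the BAR in this sketch.</p>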
<p class="MsoNormal"><o:p> </o:p></p>
<p class="MsoNormal">Also cwsr tests fail on Vega20 with or
without the revert with the same RAS error.<o:p></o:p></p>
<p class="MsoNormal"><o:p> </o:p></p>
<p class="MsoNormal">Thanks,<o:p></o:p></p>
<p class="MsoNormal"><o:p> </o:p></p>
<p class="MsoNormal">Jon<o:p></o:p></p>
<p class="MsoNormal"><o:p> </o:p></p>
<div>
<div style="border:none;border-top:solid #E1E1E1
1.0pt;padding:3.0pt 0in 0in 0in">
<p class="MsoNormal"><b>From:</b> Kuehling, Felix
<a class="moz-txt-link-rfc2396E" href="mailto:Felix.Kuehling@amd.com">&lt;Felix.Kuehling@amd.com&gt;</a> <br>
<b>Sent:</b> Tuesday, April 14, 2020 2:32 PM<br>
<b>To:</b> Kim, Jonathan <a class="moz-txt-link-rfc2396E" href="mailto:Jonathan.Kim@amd.com">&lt;Jonathan.Kim@amd.com&gt;</a>;
Koenig, Christian <a class="moz-txt-link-rfc2396E" href="mailto:Christian.Koenig@amd.com">&lt;Christian.Koenig@amd.com&gt;</a>;
Deucher, Alexander <a class="moz-txt-link-rfc2396E" href="mailto:Alexander.Deucher@amd.com">&lt;Alexander.Deucher@amd.com&gt;</a><br>
<b>Cc:</b> Russell, Kent <a class="moz-txt-link-rfc2396E" href="mailto:Kent.Russell@amd.com">&lt;Kent.Russell@amd.com&gt;</a>;
<a class="moz-txt-link-abbreviated" href="mailto:amd-gfx@lists.freedesktop.org">amd-gfx@lists.freedesktop.org</a><br>
<b>Subject:</b> Re: [PATCH] Revert "drm/amdgpu: use the
BAR if possible in amdgpu_device_vram_access v2"<o:p></o:p></p>
</div>
</div>
<p class="MsoNormal"><o:p> </o:p></p>
<p>I wouldn't call it premature. Reverting is the usual practice
when there is a serious regression that isn't fully understood
or root-caused. As far as I can tell, the problem has been
reproduced on multiple systems and different GPUs, and it has
clearly been traced to Christian's commit. I think that
justifies reverting it for now.<o:p></o:p></p>
<p>I agree with Christian that a general HDP memory access
problem causing RAS errors would potentially cause problems in
other tests as well. For example, common operations such as
GART table updates, GPUVM page table updates, and PCIe
peer-to-peer accesses in ROCm applications use HDP, but we're
not seeing obvious problems from those. So we need to
understand what's special about this test. I asked questions
to that effect on our other email thread.<o:p></o:p></p>
<p>Regards,<br>
Felix<o:p></o:p></p>
<div>
<p class="MsoNormal">On 2020-04-14 at 10:51 a.m., Kim,
Jonathan wrote:<o:p></o:p></p>
</div>
<blockquote style="margin-top:5.0pt;margin-bottom:5.0pt">
<p class="msipheader4d0fcdd7" style="margin:0in;margin-bottom:.0001pt"><span style="font-size:10.0pt;font-family:'Arial',sans-serif;color:#0078D7">[AMD
Official Use Only - Internal Distribution Only]</span><o:p></o:p></p>
<p class="MsoNormal"> <o:p></o:p></p>
<p class="MsoNormal">I think it’s premature to push this
revert.<o:p></o:p></p>
<p class="MsoNormal"> <o:p></o:p></p>
<p class="MsoNormal">With more testing, I’m getting failures
from different tests or sometimes none at all on my machine.<o:p></o:p></p>
<p class="MsoNormal"> <o:p></o:p></p>
<p class="MsoNormal">Kent, let’s continue the discussion on
the original thread.<o:p></o:p></p>
<p class="MsoNormal"> <o:p></o:p></p>
<p class="MsoNormal">Thanks,<o:p></o:p></p>
<p class="MsoNormal"> <o:p></o:p></p>
<p class="MsoNormal">Jon<o:p></o:p></p>
<p class="MsoNormal"> <o:p></o:p></p>
<div>
<div style="border:none;border-top:solid #E1E1E1
1.0pt;padding:3.0pt 0in 0in 0in">
<p class="MsoNormal"><b>From:</b> Koenig, Christian <a href="mailto:Christian.Koenig@amd.com" moz-do-not-send="true">
&lt;Christian.Koenig@amd.com&gt;</a> <br>
<b>Sent:</b> Tuesday, April 14, 2020 10:47 AM<br>
<b>To:</b> Deucher, Alexander <a href="mailto:Alexander.Deucher@amd.com" moz-do-not-send="true">&lt;Alexander.Deucher@amd.com&gt;</a><br>
<b>Cc:</b> Russell, Kent <a href="mailto:Kent.Russell@amd.com" moz-do-not-send="true">&lt;Kent.Russell@amd.com&gt;</a>;
<a href="mailto:amd-gfx@lists.freedesktop.org" moz-do-not-send="true">amd-gfx@lists.freedesktop.org</a>;
Kuehling, Felix
<a href="mailto:Felix.Kuehling@amd.com" moz-do-not-send="true">&lt;Felix.Kuehling@amd.com&gt;</a>;
Kim, Jonathan
<a href="mailto:Jonathan.Kim@amd.com" moz-do-not-send="true">&lt;Jonathan.Kim@amd.com&gt;</a><br>
<b>Subject:</b> Re: [PATCH] Revert "drm/amdgpu: use the
BAR if possible in amdgpu_device_vram_access v2"<o:p></o:p></p>
</div>
</div>
<p class="MsoNormal"> <o:p></o:p></p>
<div>
<div>
<div>
<div>
<div>
<p class="MsoNormal">That's exactly my concern as
well. <o:p></o:p></p>
<div>
<p class="MsoNormal"> <o:p></o:p></p>
</div>
<div>
<p class="MsoNormal">This looks a bit like the
test creates erroneous data somehow, but there
doesn't seem to be a RAS check in the MM data
path.<o:p></o:p></p>
</div>
<div>
<p class="MsoNormal"> <o:p></o:p></p>
</div>
<div>
<p class="MsoNormal">And now that we use the BAR
path it goes up in flames.<o:p></o:p></p>
</div>
<div>
<p class="MsoNormal"> <o:p></o:p></p>
</div>
<div>
<p class="MsoNormal">I just don't see how we can
create erroneous data in a test case?<o:p></o:p></p>
</div>
<div>
<p class="MsoNormal"> <o:p></o:p></p>
</div>
<div>
<p class="MsoNormal">Christian.<o:p></o:p></p>
</div>
</div>
<div>
<p class="MsoNormal"> <o:p></o:p></p>
<div>
<p class="MsoNormal">On 14.04.2020 at 16:35,
"Deucher, Alexander" &lt;<a href="mailto:Alexander.Deucher@amd.com" moz-do-not-send="true">Alexander.Deucher@amd.com</a>&gt; wrote:<o:p></o:p></p>
<blockquote style="border:none;border-left:solid
#CCCCCC 1.0pt;padding:0in 0in 0in
6.0pt;margin-left:4.8pt;margin-top:5.0pt;margin-right:0in;margin-bottom:5.0pt">
<div>
<p style="margin:15.0pt"><span style="font-size:10.0pt;font-family:'Arial',sans-serif;color:#317100">[AMD
Public Use]</span><o:p></o:p></p>
<p class="MsoNormal"> <o:p></o:p></p>
<div>
<div>
<p class="MsoNormal"><span style="font-size:12.0pt;color:black">If
this causes an issue, any access to
vram via the BAR could cause an issue.</span><o:p></o:p></p>
</div>
<div>
<p class="MsoNormal"><span style="font-size:12.0pt;color:black"> </span><o:p></o:p></p>
</div>
<div>
<p class="MsoNormal"><span style="font-size:12.0pt;color:black">Alex</span><o:p></o:p></p>
</div>
<div class="MsoNormal" style="text-align:center" align="center">
<hr width="98%" size="2" align="center">
</div>
<div>
<p class="MsoNormal"><b><span style="color:black">From:</span></b><span style="color:black"> amd-gfx &lt;<a href="mailto:amd-gfx-bounces@lists.freedesktop.org" moz-do-not-send="true">amd-gfx-bounces@lists.freedesktop.org</a>&gt;
on behalf of Russell, Kent &lt;<a href="mailto:Kent.Russell@amd.com" moz-do-not-send="true">Kent.Russell@amd.com</a>&gt;<br>
<b>Sent:</b> Tuesday, April 14, 2020
10:19 AM<br>
<b>To:</b> Koenig, Christian &lt;<a href="mailto:Christian.Koenig@amd.com" moz-do-not-send="true">Christian.Koenig@amd.com</a>&gt;;
<a href="mailto:amd-gfx@lists.freedesktop.org" moz-do-not-send="true">amd-gfx@lists.freedesktop.org</a>
&lt;<a href="mailto:amd-gfx@lists.freedesktop.org" moz-do-not-send="true">amd-gfx@lists.freedesktop.org</a>&gt;<br>
<b>Cc:</b> Kuehling, Felix &lt;<a href="mailto:Felix.Kuehling@amd.com" moz-do-not-send="true">Felix.Kuehling@amd.com</a>&gt;;
Kim, Jonathan &lt;<a href="mailto:Jonathan.Kim@amd.com" moz-do-not-send="true">Jonathan.Kim@amd.com</a>&gt;<br>
<b>Subject:</b> RE: [PATCH] Revert
"drm/amdgpu: use the BAR if possible
in amdgpu_device_vram_access v2"</span>
<o:p></o:p></p>
<div>
<p class="MsoNormal"> <o:p></o:p></p>
</div>
</div>
<div>
<div>
<p class="MsoNormal">[AMD Official Use
Only - Internal Distribution Only]<br>
<br>
On VG20 or MI100, as soon as we run
the subtest, we get the dmesg output
below, and then the kernel ends up
hanging. I don't know enough about the
test itself to know why this is
occurring, but Jon Kim and Felix were
discussing it on a separate thread
when the issue was first reported, so
they can hopefully provide some
additional information.<br>
<br>
Kent<br>
<br>
> -----Original Message-----<br>
> From: Christian König &lt;<a href="mailto:ckoenig.leichtzumerken@gmail.com" moz-do-not-send="true">ckoenig.leichtzumerken@gmail.com</a>&gt;<br>
> Sent: Tuesday, April 14, 2020
9:52 AM<br>
> To: Russell, Kent &lt;<a href="mailto:Kent.Russell@amd.com" moz-do-not-send="true">Kent.Russell@amd.com</a>&gt;;
<a href="mailto:amd-gfx@lists.freedesktop.org" moz-do-not-send="true">amd-gfx@lists.freedesktop.org</a><br>
> Subject: Re: [PATCH] Revert
"drm/amdgpu: use the BAR if possible
in<br>
> amdgpu_device_vram_access v2"<br>
> <br>
> On 13.04.20 at 20:20, Kent Russell
wrote:<br>
> > This reverts commit
c12b84d6e0d70f1185e6daddfd12afb671791b6e.<br>
> > The original patch causes a
RAS event and subsequent kernel
hard-hang<br>
> > when running the
KFDMemoryTest.PtraceAccessInvisibleVram
on VG20 and<br>
> > Arcturus<br>
> ><br>
> > dmesg output at hang time:<br>
> > [drm] RAS event of type
ERREVENT_ATHUB_INTERRUPT detected!<br>
> > amdgpu 0000:67:00.0: GPU
reset begin!<br>
> > Evicting PASID 0x8000 queues<br>
> > Started evicting pasid
0x8000<br>
> > qcm fence wait loop timeout
expired<br>
> > The cp might be in an
unrecoverable state due to an
unsuccessful<br>
> > queues preemption Failed to
evict process queues Failed to suspend<br>
> > process 0x8000 Finished
evicting pasid 0x8000 Started
restoring pasid<br>
> > 0x8000 Finished restoring
pasid 0x8000 [drm] UVD VCPU state may
lost<br>
> > due to RAS
ERREVENT_ATHUB_INTERRUPT<br>
> > amdgpu: [powerplay] Failed
to send message 0x26, response 0x0<br>
> > amdgpu: [powerplay] Failed
to set soft min gfxclk !<br>
> > amdgpu: [powerplay] Failed
to upload DPM Bootup Levels!<br>
> > amdgpu: [powerplay] Failed
to send message 0x7, response 0x0<br>
> > amdgpu: [powerplay]
[DisableAllSMUFeatures] Failed to
disable all smu<br>
> features!<br>
> > amdgpu: [powerplay]
[DisableDpmTasks] Failed to disable
all smu features!<br>
> > amdgpu: [powerplay]
[PowerOffAsic] Failed to disable DPM!<br>
> >
[drm:amdgpu_device_ip_suspend_phase2
[amdgpu]] *ERROR* suspend of IP<br>
> > block &lt;powerplay&gt;
failed -5<br>
> <br>
> Do you have more information on
what's going wrong here? This is
a really<br>
> important patch for KFD
debugging.<br>
> <br>
> ><br>
> > Signed-off-by: Kent Russell
&lt;<a href="mailto:kent.russell@amd.com" moz-do-not-send="true">kent.russell@amd.com</a>&gt;<br>
> <br>
> Reviewed-by: Christian König &lt;<a href="mailto:christian.koenig@amd.com" moz-do-not-send="true">christian.koenig@amd.com</a>&gt;<br>
> <br>
> > ---<br>
> >
drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
| 26 ----------------------<br>
> > 1 file changed, 26
deletions(-)<br>
> ><br>
> > diff --git
a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c<br>
> >
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c<br>
> > index
cf5d6e585634..a3f997f84020 100644<br>
> > ---
a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c<br>
> > +++
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c<br>
> > @@ -254,32 +254,6 @@ void
amdgpu_device_vram_access(struct<br>
> amdgpu_device *adev, loff_t pos,<br>
> > uint32_t hi = ~0;<br>
> > uint64_t last;<br>
> ><br>
> > -<br>
> > -#ifdef CONFIG_64BIT<br>
> > - last = min(pos + size,
adev->gmc.visible_vram_size);<br>
> > - if (last > pos) {<br>
> > - void __iomem
*addr = adev->mman.aper_base_kaddr
+ pos;<br>
> > - size_t count =
last - pos;<br>
> > -<br>
> > - if (write) {<br>
> > -
memcpy_toio(addr, buf, count);<br>
> > - mb();<br>
> > -
amdgpu_asic_flush_hdp(adev, NULL);<br>
> > - } else {<br>
> > -
amdgpu_asic_invalidate_hdp(adev,
NULL);<br>
> > - mb();<br>
> > -
memcpy_fromio(buf, addr, count);<br>
> > - }<br>
> > -<br>
> > - if (count ==
size)<br>
> > - return;<br>
> > -<br>
> > - pos += count;<br>
> > - buf += count /
4;<br>
> > - size -= count;<br>
> > - }<br>
> > -#endif<br>
> > -<br>
> >
spin_lock_irqsave(&adev->mmio_idx_lock,
flags);<br>
> > for (last = pos + size;
pos < last; pos += 4) {<br>
> > uint32_t tmp =
pos >> 31;<br>
_______________________________________________<br>
amd-gfx mailing list<br>
<a href="mailto:amd-gfx@lists.freedesktop.org" moz-do-not-send="true">amd-gfx@lists.freedesktop.org</a><br>
<a href="https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flists.freedesktop.org%2Fmailman%2Flistinfo%2Famd-gfx&data=02%7C01%7Calexander.deucher%40amd.com%7C68e0bfea2a5f4a909ab108d7e07ed164%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637224707637289768&sdata=ttNOHJt0IwywpOIWahKjjuC6OkT1jxduc6iMzYzndpg%3D&reserved=0" moz-do-not-send="true">https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flists.freedesktop.org%2Fmailman%2Flistinfo%2Famd-gfx&data=02%7C01%7Calexander.deucher%40amd.com%7C68e0bfea2a5f4a909ab108d7e07ed164%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637224707637289768&sdata=ttNOHJt0IwywpOIWahKjjuC6OkT1jxduc6iMzYzndpg%3D&reserved=0</a><o:p></o:p></p>
</div>
</div>
</div>
</div>
</blockquote>
</div>
<p class="MsoNormal"> <o:p></o:p></p>
</div>
</div>
</div>
</div>
<div>
<p class="MsoNormal"> <o:p></o:p></p>
<div>
<p class="MsoNormal">On 14.04.2020 at 16:35,
"Deucher, Alexander" &lt;<a href="mailto:Alexander.Deucher@amd.com" moz-do-not-send="true">Alexander.Deucher@amd.com</a>&gt; wrote:<o:p></o:p></p>
<blockquote style="border:none;border-left:solid #CCCCCC
1.0pt;padding:0in 0in 0in
6.0pt;margin-left:4.8pt;margin-top:5.0pt;margin-right:0in;margin-bottom:5.0pt">
<div>
<p style="margin:15.0pt"><span style="font-size:10.0pt;font-family:'Arial',sans-serif;color:#317100">[AMD
Public Use]</span><o:p></o:p></p>
<p class="MsoNormal"> <o:p></o:p></p>
<div>
<div>
<p class="MsoNormal"><span style="font-size:12.0pt;color:black">If this
causes an issue, any access to vram via the
BAR could cause an issue.</span><o:p></o:p></p>
</div>
<div>
<p class="MsoNormal"><span style="font-size:12.0pt;color:black"> </span><o:p></o:p></p>
</div>
<div>
<p class="MsoNormal"><span style="font-size:12.0pt;color:black">Alex</span><o:p></o:p></p>
</div>
<div class="MsoNormal" style="text-align:center" align="center">
<hr width="98%" size="2" align="center">
</div>
<div>
<p class="MsoNormal"><b><span style="color:black">From:</span></b><span style="color:black"> amd-gfx <<a href="mailto:amd-gfx-bounces@lists.freedesktop.org" moz-do-not-send="true">amd-gfx-bounces@lists.freedesktop.org</a>>
on behalf of Russell, Kent <<a href="mailto:Kent.Russell@amd.com" moz-do-not-send="true">Kent.Russell@amd.com</a>><br>
<b>Sent:</b> Tuesday, April 14, 2020 10:19
AM<br>
<b>To:</b> Koenig, Christian <<a href="mailto:Christian.Koenig@amd.com" moz-do-not-send="true">Christian.Koenig@amd.com</a>>;
<a href="mailto:amd-gfx@lists.freedesktop.org" moz-do-not-send="true">amd-gfx@lists.freedesktop.org</a>
<<a href="mailto:amd-gfx@lists.freedesktop.org" moz-do-not-send="true">amd-gfx@lists.freedesktop.org</a>><br>
<b>Cc:</b> Kuehling, Felix <<a href="mailto:Felix.Kuehling@amd.com" moz-do-not-send="true">Felix.Kuehling@amd.com</a>>;
Kim, Jonathan <<a href="mailto:Jonathan.Kim@amd.com" moz-do-not-send="true">Jonathan.Kim@amd.com</a>><br>
<b>Subject:</b> RE: [PATCH] Revert
"drm/amdgpu: use the BAR if possible in
amdgpu_device_vram_access v2"</span>
<o:p></o:p></p>
<div>
<p class="MsoNormal"> <o:p></o:p></p>
</div>
</div>
<div>
<div>
<p class="MsoNormal">[AMD Official Use Only -
Internal Distribution Only]<br>
<br>
On VG20 or MI100, as soon as we run the
subtest, we get the dmesg output below, and
then the kernel ends up hanging. I don't
know enough about the test itself to know
why this is occurring, but Jon Kim and Felix
were discussing it on a separate thread when
the issue was first reported, so they can
hopefully provide some additional
information.<br>
<br>
Kent<br>
<br>
> -----Original Message-----<br>
> From: Christian König <<a href="mailto:ckoenig.leichtzumerken@gmail.com" moz-do-not-send="true">ckoenig.leichtzumerken@gmail.com</a>><br>
> Sent: Tuesday, April 14, 2020 9:52 AM<br>
> To: Russell, Kent <<a href="mailto:Kent.Russell@amd.com" moz-do-not-send="true">Kent.Russell@amd.com</a>>;
<a href="mailto:amd-gfx@lists.freedesktop.org" moz-do-not-send="true">amd-gfx@lists.freedesktop.org</a><br>
> Subject: Re: [PATCH] Revert
"drm/amdgpu: use the BAR if possible in<br>
> amdgpu_device_vram_access v2"<br>
> <br>
> On 13.04.20 at 20:20, Kent Russell wrote:<br>
> > This reverts commit
c12b84d6e0d70f1185e6daddfd12afb671791b6e.<br>
> > The original patch causes a RAS event and a subsequent kernel hard hang<br>
> > when running the KFDMemoryTest.PtraceAccessInvisibleVram test on VG20 and<br>
> > Arcturus.<br>
> ><br>
> > dmesg output at hang time:<br>
> > [drm] RAS event of type
ERREVENT_ATHUB_INTERRUPT detected!<br>
> > amdgpu 0000:67:00.0: GPU reset
begin!<br>
> > Evicting PASID 0x8000 queues<br>
> > Started evicting pasid 0x8000<br>
> > qcm fence wait loop timeout
expired<br>
> > The cp might be in an unrecoverable state due to an unsuccessful queues preemption<br>
> > Failed to evict process queues<br>
> > Failed to suspend process 0x8000<br>
> > Finished evicting pasid 0x8000<br>
> > Started restoring pasid 0x8000<br>
> > Finished restoring pasid 0x8000<br>
> > [drm] UVD VCPU state may lost due to RAS ERREVENT_ATHUB_INTERRUPT<br>
> > amdgpu: [powerplay] Failed to send
message 0x26, response 0x0<br>
> > amdgpu: [powerplay] Failed to set
soft min gfxclk !<br>
> > amdgpu: [powerplay] Failed to
upload DPM Bootup Levels!<br>
> > amdgpu: [powerplay] Failed to send
message 0x7, response 0x0<br>
> > amdgpu: [powerplay] [DisableAllSMUFeatures] Failed to disable all smu features!<br>
> > amdgpu: [powerplay]
[DisableDpmTasks] Failed to disable all smu
features!<br>
> > amdgpu: [powerplay] [PowerOffAsic]
Failed to disable DPM!<br>
> > [drm:amdgpu_device_ip_suspend_phase2 [amdgpu]] *ERROR* suspend of IP block &lt;powerplay&gt; failed -5<br>
> <br>
> Do you have more information on what's going wrong here? This is a really<br>
> important patch for KFD debugging.<br>
> <br>
> ><br>
> > Signed-off-by: Kent Russell <<a href="mailto:kent.russell@amd.com" moz-do-not-send="true">kent.russell@amd.com</a>><br>
> <br>
> Reviewed-by: Christian König <<a href="mailto:christian.koenig@amd.com" moz-do-not-send="true">christian.koenig@amd.com</a>><br>
> <br>
> > ---<br>
> >
drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |
26 ----------------------<br>
> > 1 file changed, 26 deletions(-)<br>
> ><br>
> > diff --git
a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c<br>
> >
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c<br>
> > index cf5d6e585634..a3f997f84020
100644<br>
> > ---
a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c<br>
> > +++
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c<br>
> > @@ -254,32 +254,6 @@ void amdgpu_device_vram_access(struct amdgpu_device *adev, loff_t pos,<br>
> > uint32_t hi = ~0;<br>
> > uint64_t last;<br>
> ><br>
> > -<br>
> > -#ifdef CONFIG_64BIT<br>
> > - last = min(pos + size,
adev->gmc.visible_vram_size);<br>
> > - if (last > pos) {<br>
> > - void __iomem *addr =
adev->mman.aper_base_kaddr + pos;<br>
> > - size_t count = last -
pos;<br>
> > -<br>
> > - if (write) {<br>
> > -
memcpy_toio(addr, buf, count);<br>
> > - mb();<br>
> > -
amdgpu_asic_flush_hdp(adev, NULL);<br>
> > - } else {<br>
> > -
amdgpu_asic_invalidate_hdp(adev, NULL);<br>
> > - mb();<br>
> > -
memcpy_fromio(buf, addr, count);<br>
> > - }<br>
> > -<br>
> > - if (count == size)<br>
> > - return;<br>
> > -<br>
> > - pos += count;<br>
> > - buf += count / 4;<br>
> > - size -= count;<br>
> > - }<br>
> > -#endif<br>
> > -<br>
> >
spin_lock_irqsave(&adev->mmio_idx_lock,
flags);<br>
> > for (last = pos + size; pos
< last; pos += 4) {<br>
> > uint32_t tmp = pos
>> 31;<br>
<o:p></o:p></p>
</div>
</div>
</div>
</div>
</blockquote>
</div>
<p class="MsoNormal"> <o:p></o:p></p>
</div>
</div>
</blockquote>
</div>
</blockquote>
<br>
</body>
</html>