<html> <head> <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1"> <meta content="text/html; charset=iso-8859-1"> <style type="text/css" style="display:none">  </style> </head> <body dir="ltr"> <div dir="auto"> <div dir="auto"> <div dir="auto"> <div dir="auto"> <div dir="auto">That's exactly my concern as well. <div dir="auto"><br> </div> <div dir="auto">This looks a bit like the test creates erroneous data somehow, but there doesn't seem to be a RAS check in the MM data path.</div> <div dir="auto"><br> </div> <div dir="auto">And now that we use the BAR path it goes up in flames.<br> </div> <div dir="auto"><br> </div> <div dir="auto">I just don't see how a test case could create erroneous data.</div> <div dir="auto"><br> </div> <div dir="auto">Christian.</div> </div> <div><br> <div class="elided-text">On 14.04.2020 16:35, "Deucher, Alexander" <Alexander.Deucher@amd.com> wrote:<br type="attribution"> <blockquote style="margin:0 0 0 0.8ex; border-left:1px #ccc solid; padding-left:1ex"> <div dir="ltr"> <p align="Left" style="font-family:'arial'; font-size:10pt; color:#317100; margin:15pt"> [AMD Public Use]<br> </p> <br> <div> <div style="font-family:'calibri' ,'arial' ,'helvetica' ,sans-serif; font-size:12pt; color:rgb(0 ,0 ,0)"> If this causes an issue, any access to vram via the BAR could cause an issue.</div> <div style="font-family:'calibri' ,'arial' ,'helvetica' ,sans-serif; font-size:12pt; color:rgb(0 ,0 ,0)"> <br> </div> <div style="font-family:'calibri' ,'arial' ,'helvetica' ,sans-serif; font-size:12pt; color:rgb(0 ,0 ,0)"> Alex<br> </div> <div></div> <hr style="display:inline-block; width:98%"> <div dir="ltr"><font face="Calibri, sans-serif" color="#000000" style="font-size:11pt"><b>From:</b> amd-gfx <amd-gfx-bounces@lists.freedesktop.org> on behalf of Russell, Kent <Kent.Russell@amd.com><br> <b>Sent:</b> Tuesday, April 14, 2020 10:19 AM<br> <b>To:</b> Koenig, Christian <Christian.Koenig@amd.com>; amd-gfx@lists.freedesktop.org 
<amd-gfx@lists.freedesktop.org><br> <b>Cc:</b> Kuehling, Felix <Felix.Kuehling@amd.com>; Kim, Jonathan <Jonathan.Kim@amd.com><br> <b>Subject:</b> RE: [PATCH] Revert "drm/amdgpu: use the BAR if possible in amdgpu_device_vram_access v2"</font> <div> </div> </div> <div><font size="2"><span style="font-size:11pt"> <div>[AMD Official Use Only - Internal Distribution Only]<br> <br> On VG20 or MI100, as soon as we run the subtest, we get the dmesg output below, and then the kernel ends up hanging. I don't know enough about the test itself to know why this is occurring, but Jon Kim and Felix were discussing it on a separate thread when the issue was first reported, so they can hopefully provide some additional information.<br> <br> Kent<br> <br> > -----Original Message-----<br> > From: Christian König <ckoenig.leichtzumerken@gmail.com><br> > Sent: Tuesday, April 14, 2020 9:52 AM<br> > To: Russell, Kent <Kent.Russell@amd.com>; amd-gfx@lists.freedesktop.org<br> > Subject: Re: [PATCH] Revert "drm/amdgpu: use the BAR if possible in<br> > amdgpu_device_vram_access v2"<br> > <br> > On 13.04.20 at 20:20, Kent Russell wrote:<br> > > This reverts commit c12b84d6e0d70f1185e6daddfd12afb671791b6e.<br> > > The original patch causes a RAS event and subsequent kernel hard-hang<br> > > when running the KFDMemoryTest.PtraceAccessInvisibleVram on VG20 and<br> > > Arcturus<br> > ><br> > > dmesg output at hang time:<br> > > [drm] RAS event of type ERREVENT_ATHUB_INTERRUPT detected!<br> > > amdgpu 0000:67:00.0: GPU reset begin!<br> > > Evicting PASID 0x8000 queues<br> > > Started evicting pasid 0x8000<br> > > qcm fence wait loop timeout expired<br> > > The cp might be in an unrecoverable state due to an unsuccessful<br> > > queues preemption Failed to evict process queues Failed to suspend<br> > > process 0x8000 Finished evicting pasid 0x8000 Started restoring pasid<br> > > 0x8000 Finished restoring pasid 0x8000 [drm] UVD VCPU state may lost<br> > > due to RAS ERREVENT_ATHUB_INTERRUPT<br> > > 
amdgpu: [powerplay] Failed to send message 0x26, response 0x0<br> > > amdgpu: [powerplay] Failed to set soft min gfxclk !<br> > > amdgpu: [powerplay] Failed to upload DPM Bootup Levels!<br> > > amdgpu: [powerplay] Failed to send message 0x7, response 0x0<br> > > amdgpu: [powerplay] [DisableAllSMUFeatures] Failed to disable all smu<br> > features!<br> > > amdgpu: [powerplay] [DisableDpmTasks] Failed to disable all smu features!<br> > > amdgpu: [powerplay] [PowerOffAsic] Failed to disable DPM!<br> > > [drm:amdgpu_device_ip_suspend_phase2 [amdgpu]] *ERROR* suspend of IP<br> > > block <powerplay> failed -5<br> > <br> > Do you have more information on what's going wrong here since this is a really<br> > important patch for KFD debugging.<br> > <br> > ><br> > > Signed-off-by: Kent Russell <kent.russell@amd.com><br> > <br> > Reviewed-by: Christian König <christian.koenig@amd.com><br> > <br> > > ---<br> > > drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 26 ----------------------<br> > > 1 file changed, 26 deletions(-)<br> > ><br> > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c<br> > > b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c<br> > > index cf5d6e585634..a3f997f84020 100644<br> > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c<br> > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c<br> > > @@ -254,32 +254,6 @@ void amdgpu_device_vram_access(struct<br> > amdgpu_device *adev, loff_t pos,<br> > > uint32_t hi = ~0;<br> > > uint64_t last;<br> > ><br> > > -<br> > > -#ifdef CONFIG_64BIT<br> > > - last = min(pos + size, adev->gmc.visible_vram_size);<br> > > - if (last > pos) {<br> > > - void __iomem *addr = adev->mman.aper_base_kaddr + pos;<br> > > - size_t count = last - pos;<br> > > -<br> > > - if (write) {<br> > > - memcpy_toio(addr, buf, count);<br> > > - mb();<br> > > - amdgpu_asic_flush_hdp(adev, NULL);<br> > > - } else {<br> > > - amdgpu_asic_invalidate_hdp(adev, NULL);<br> > > - mb();<br> > > - memcpy_fromio(buf, addr, count);<br> > > - }<br> > > -<br> > > 
- if (count == size)<br> > > - return;<br> > > -<br> > > - pos += count;<br> > > - buf += count / 4;<br> > > - size -= count;<br> > > - }<br> > > -#endif<br> > > -<br> > > spin_lock_irqsave(&adev->mmio_idx_lock, flags);<br> > > for (last = pos + size; pos < last; pos += 4) {<br> > > uint32_t tmp = pos >> 31;<br> _______________________________________________<br> amd-gfx mailing list<br> amd-gfx@lists.freedesktop.org<br> <a href="https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flists.freedesktop.org%2Fmailman%2Flistinfo%2Famd-gfx&data=02%7C01%7Calexander.deucher%40amd.com%7C68e0bfea2a5f4a909ab108d7e07ed164%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637224707637289768&sdata=ttNOHJt0IwywpOIWahKjjuC6OkT1jxduc6iMzYzndpg%3D&reserved=0">https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flists.freedesktop.org%2Fmailman%2Flistinfo%2Famd-gfx&data=02%7C01%7Calexander.deucher%40amd.com%7C68e0bfea2a5f4a909ab108d7e07ed164%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637224707637289768&sdata=ttNOHJt0IwywpOIWahKjjuC6OkT1jxduc6iMzYzndpg%3D&reserved=0</a><br> </div> </span></font></div> </div> </div> </blockquote> </div> <br> </div> </div> </div> </div> </div> </body> </html>