I'll be away for some days. Thanks very much for your detailed answer.

Best Regards,
Yanhua


------------------ Original Mail ------------------
From: "Koenig, Christian" <Christian.Koenig@amd.com>
Date: Friday, September 6, 2019, 7:23 PM
To: "yanhua" <78666679@qq.com>; "amd-gfx" <amd-gfx@lists.freedesktop.org>
Cc: "Deucher, Alexander" <Alexander.Deucher@amd.com>
Subject: Re: Re: Re: Re: Bug: amdgpu drm driver cause process into Disk sleep state
<div class="moz-cite-prefix">
<blockquote type="cite">Are there anything I have missed ?</blockquote>
<br>
Yeah, unfortunately quite a bunch of things. The fact that arm64 doesn't support the PCIe NoSnoop TLP attribute is only the tip of the iceberg.<br>
<br>
You need a full "recent" driver stack, e.g. not older than a few month till a year, for this to work. And not only the kernel, but also recent userspace components.<br>
<br>
Maybe that's something you could first, e.g. install a recent version of Mesa and/or tell Mesa to not use the SDMA at all. But since you are running into an SDMA lockup with a kernel triggered page table update I see little chance that this work.<br>
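
(For radeonsi that is usually done through the driver's debug flags; if I remember the option name correctly, starting the application with R600_DEBUG=nodma in the environment — or AMD_DEBUG=nodma on newer Mesa — disables SDMA use. Please double-check the exact option against your Mesa version, this is from memory.)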

The only other alternative I can see is the DKMS package of the pro driver. With that one you might be able to compile the recent driver for an older kernel version.

But I can't guarantee at all that this actually works on ARM64.

Sorry that I don't have better news for you,
Christian.

On 05.09.19 03:36, yanhua wrote:
<blockquote type="cite" cite="mid:tencent_20683D4D4999B2E0A746EA7D01D677D6070A@qq.com">
<div>Hi, Christian,</div>
<div> I noticed that you said 'amdgpu is known to not work on arm64 until very recently'. I found the CPU related commit with drm is "drm: disable uncached DMA optimization for ARM and arm64".
<br>
</div>

@@ -47,6 +47,24 @@ static inline bool drm_arch_can_wc_memory(void)
 	return false;
 #elif defined(CONFIG_MIPS) && defined(CONFIG_CPU_LOONGSON3)
 	return false;
+#elif defined(CONFIG_ARM) || defined(CONFIG_ARM64)
+	/*
+	 * The DRM driver stack is designed to work with cache coherent devices
+	 * only, but permits an optimization to be enabled in some cases, where
+	 * for some buffers, both the CPU and the GPU use uncached mappings,
+	 * removing the need for DMA snooping and allocation in the CPU caches.
+	 *
+	 * The use of uncached GPU mappings relies on the correct implementation
+	 * of the PCIe NoSnoop TLP attribute by the platform, otherwise the GPU
+	 * will use cached mappings nonetheless. On x86 platforms, this does not
+	 * seem to matter, as uncached CPU mappings will snoop the caches in any
+	 * case. However, on ARM and arm64, enabling this optimization on a
+	 * platform where NoSnoop is ignored results in loss of coherency, which
+	 * breaks correct operation of the device. Since we have no way of
+	 * detecting whether NoSnoop works or not, just disable this
+	 * optimization entirely for ARM and arm64.
+	 */
+	return false;
 #else
 	return true;
 #endif
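
For reference, the complete helper after this patch reads roughly as follows (my sketch of include/drm/drm_cache.h with the long comment elided; the PPC branch comes from context not quoted above, so treat it as an assumption):

	static inline bool drm_arch_can_wc_memory(void)
	{
	#if defined(CONFIG_PPC) && !defined(CONFIG_NOT_COHERENT_CACHE)
		return false;
	#elif defined(CONFIG_MIPS) && defined(CONFIG_CPU_LOONGSON3)
		return false;
	#elif defined(CONFIG_ARM) || defined(CONFIG_ARM64)
		/* NoSnoop may be silently ignored, so write-combining is unsafe. */
		return false;
	#else
		return true;
	#endif
	}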

The real effect is in amdgpu_object.c:

	if (!drm_arch_can_wc_memory())
		bo->flags &= ~AMDGPU_GEM_CREATE_CPU_GTT_USWC;
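
As far as I can tell, that flag then matters when amdgpu builds the TTM placement for a GTT buffer; a simplified sketch of the relevant branch of the placement helper (amdgpu_bo_placement_from_domain() in recent kernels, from my reading of the 4.19-era code, not verbatim):

	if (domain & AMDGPU_GEM_DOMAIN_GTT) {
		places[c].flags = TTM_PL_FLAG_TT;
		if (flags & AMDGPU_GEM_CREATE_CPU_GTT_USWC)
			/* write-combined, uncached CPU mapping: needs NoSnoop to work */
			places[c].flags |= TTM_PL_FLAG_WC | TTM_PL_FLAG_UNCACHED;
		else
			/* cached CPU mapping: stays coherent via snooping */
			places[c].flags |= TTM_PL_FLAG_CACHED;
		c++;
	}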

And we have AMDGPU_GEM_CREATE_CPU_GTT_USWC turned off in our 4.19.36 kernel, so I think this is not the cause of my bug. Is there anything I have missed?

I suggested that the machine supplier use a newer kernel such as 5.2.2, but they failed to do so after some tries. We also backported a series of patches from newer kernels, but we still get the bad ring timeout.

We have been digging into the amdgpu drm driver for a long time, but it is really difficult for me, especially the hardware-related ring timeout.

------------------
Yanhua


------------------ Original Mail ------------------
From: "Koenig, Christian" <Christian.Koenig@amd.com>
Date: Tuesday, September 3, 2019, 9:19 PM
To: "yanhua" <78666679@qq.com>; "amd-gfx" <amd-gfx@lists.freedesktop.org>
Cc: "Deucher, Alexander" <Alexander.Deucher@amd.com>
Subject: Re: Re: Re: Bug: amdgpu drm driver cause process into Disk sleep state
<div class="moz-cite-prefix">This is just a GPU lock, please open up a bug report on freedesktop.org and attach the full dmesg and which version of Mesa you are using.<br>
<br>
Regards,<br>
Christian.<br>
<br>
Am 03.09.19 um 15:16 schrieb 78666679:<br>
</div>
<blockquote type="cite" cite="mid:tencent_DFCD5A0853FDA639F81F91375F8DF55AF508@qq.com">
Yes, with dmesg | grep drm I get the following:

[348571.880718] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma1 timeout, signaled seq=24423862, emitted seq=24423865
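
The gap between the two numbers suggests three SDMA jobs were submitted but never completed. If I remember the 4.19-era code correctly, the message is printed by the scheduler timeout handler in amdgpu_job.c, roughly (a sketch from memory, not verbatim):

	static void amdgpu_job_timedout(struct drm_sched_job *s_job)
	{
		struct amdgpu_ring *ring = to_amdgpu_ring(s_job->sched);
		struct amdgpu_job *job = to_amdgpu_job(s_job);

		/* "signaled" = last fence the ring completed,
		 * "emitted" = last fence submitted to the ring. */
		DRM_ERROR("ring %s timeout, signaled seq=%u, emitted seq=%u\n",
			  job->base.sched->name,
			  atomic_read(&ring->fence_drv.last_seq),
			  ring->fence_drv.sync_seq);

		amdgpu_device_gpu_recover(ring->adev, job, false);
	}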

------------------ Original Mail ------------------
From: "Koenig, Christian" <Christian.Koenig@amd.com>
Date: Tuesday, September 3, 2019, 9:07 PM
To: "" <78666679@qq.com>; "amd-gfx" <amd-gfx@lists.freedesktop.org>
Cc: "Deucher, Alexander" <Alexander.Deucher@amd.com>
Subject: Re: Re: Bug: amdgpu drm driver cause process into Disk sleep state
<div class="moz-cite-prefix">Well that looks like the hardware got stuck.<br>
<br>
Do you get something in the locks about a timeout on the SDMA ring?<br>
<br>
Regards,<br>
Christian.<br>
<br>
Am 03.09.19 um 14:50 schrieb 78666679:<br>
</div>
<blockquote type="cite" cite="mid:tencent_7DC9F5195A4D538FA626F85991875FC5F508@qq.com">
Hi Christian,
    Sometimes the thread is blocked in disk sleep in a call to amdgpu_sa_bo_new. Following is the stack trace. It seems the sa bo pool is used up, so the caller blocks waiting for someone to free sa resources.

D  206833  227656 [surfaceflinger] <defunct> Binder:45_5
cat /proc/206833/task/227656/stack
[<0>] __switch_to+0x94/0xe8
[<0>] dma_fence_wait_any_timeout+0x234/0x2d0
[<0>] amdgpu_sa_bo_new+0x468/0x540 [amdgpu]
[<0>] amdgpu_ib_get+0x60/0xc8 [amdgpu]
[<0>] amdgpu_job_alloc_with_ib+0x70/0xb0 [amdgpu]
[<0>] amdgpu_vm_bo_update_mapping+0x2e0/0x3d8 [amdgpu]
[<0>] amdgpu_vm_bo_update+0x2a0/0x710 [amdgpu]
[<0>] amdgpu_gem_va_ioctl+0x46c/0x4c8 [amdgpu]
[<0>] drm_ioctl_kernel+0x94/0x118 [drm]
[<0>] drm_ioctl+0x1f0/0x438 [drm]
[<0>] amdgpu_drm_ioctl+0x58/0x90 [amdgpu]
[<0>] do_vfs_ioctl+0xc4/0x8c0
[<0>] ksys_ioctl+0x8c/0xa0
[<0>] __arm64_sys_ioctl+0x28/0x38
[<0>] el0_svc_common+0xa0/0x180
[<0>] el0_svc_handler+0x38/0x78
[<0>] el0_svc+0x8/0xc
[<0>] 0xffffffffffffffff
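
From my reading of amdgpu_sa.c, the wait at the top of that trace is the suballocator's slow path, roughly like this (a heavily simplified sketch, not the verbatim kernel code; the fence-collection helper name is hypothetical):

	/* amdgpu_sa_bo_new(), simplified: when the IB suballocator pool is
	 * full, collect the fences protecting the oldest in-flight
	 * allocations and sleep until one of them signals.  Those fences
	 * are signaled by the GPU rings, so a hung SDMA ring never frees
	 * space and the caller sits in uninterruptible (D) sleep forever. */
	for (;;) {
		if (amdgpu_sa_bo_try_alloc(sa_manager, *sa_bo, size, align))
			return 0;

		/* hypothetical name for the fence-gathering step */
		count = collect_blocking_fences(sa_manager, fences);

		t = dma_fence_wait_any_timeout(fences, count, false,
					       MAX_SCHEDULE_TIMEOUT, NULL);
		if (t < 0)
			return t;
	}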

--------------------
YanHua


------------------ Original Mail ------------------
From: "Koenig, Christian" <Christian.Koenig@amd.com>
Date: Tuesday, September 3, 2019, 4:21 PM
To: "" <78666679@qq.com>; "amd-gfx" <amd-gfx@lists.freedesktop.org>
Cc: "Deucher, Alexander" <Alexander.Deucher@amd.com>
Subject: Re: Bug: amdgpu drm driver cause process into Disk sleep state

Hi Yanhua,

please update your kernel first, because that looks like a known issue which was recently fixed by the patch "drm/scheduler: use job count instead of peek".

Probably best to try the latest bleeding-edge kernel, and if that doesn't help please open up a bug report on https://bugs.freedesktop.org/.

Regards,
Christian.

On 03.09.19 09:35, 78666679 wrote:
> Hi, Sirs:
>     I have a WX5100 amdgpu card. It randomly runs into failures; sometimes it puts processes into uninterruptible wait state.
>
> cps-new-ondemand-0587:~ # ps aux | grep -w D
> root 11268 0.0 0.0 260628 3516 ? Ssl Aug26 0:00 /usr/sbin/gssproxy -D
> root 136482 0.0 0.0 212500 572 pts/0 S+ 15:25 0:00 grep --color=auto -w D
> root 370684 0.0 0.0 17972 7428 ? Ss Sep02 0:04 /usr/sbin/sshd -D
> 10066 432951 0.0 0.0 0 0 ? D Sep02 0:00 [FakeFinalizerDa]
> root 496774 0.0 0.0 0 0 ? D Sep02 0:17 [kworker/8:1+eve]
> cps-new-ondemand-0587:~ # cat /proc/496774/stack
> [<0>] __switch_to+0x94/0xe8
> [<0>] drm_sched_entity_flush+0xf8/0x248 [gpu_sched]
> [<0>] amdgpu_ctx_mgr_entity_flush+0xac/0x148 [amdgpu]
> [<0>] amdgpu_flush+0x2c/0x50 [amdgpu]
> [<0>] filp_close+0x40/0xa0
> [<0>] put_files_struct+0x118/0x120
> [<0>] put_files_struct+0x30/0x68 [binder_linux]
> [<0>] binder_deferred_func+0x4d4/0x658 [binder_linux]
> [<0>] process_one_work+0x1b4/0x3f8
> [<0>] worker_thread+0x54/0x470
> [<0>] kthread+0x134/0x138
> [<0>] ret_from_fork+0x10/0x18
> [<0>] 0xffffffffffffffff
>
> This issue has troubled me for a long time. I'm eagerly looking for help from you!
>
> -----
> Yanhua