<html> <head> <meta http-equiv="Content-Type" content="text/html; charset=us-ascii"> <style type="text/css" style="display:none;"> P {margin-top:0;margin-bottom:0;} </style> </head> <body dir="ltr"> <p style="font-family:Arial;font-size:10pt;color:#0000FF;margin:5pt;" align="Left"> [AMD Official Use Only]<br> </p> <br> <div> <div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);"> The error message comes from the HIQ dequeue procedure, not from an HCQ, so no doorbell write is involved.</div> <div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);"> <br> </div> <div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);"> <span style="color: rgb(23, 78, 134); font-size: 10pt;">Jan 25 16:10:58 lnx-ci-node kernel: [18161.477067] Call Trace:</span> <div><span style="color: rgb(23, 78, 134); font-size: 10pt;">Jan 25 16:10:58 lnx-ci-node kernel: [18161.477072] dump_stack+0x7d/0x9c</span></div> <span style="color: rgb(23, 78, 134); font-size: 10pt;">Jan 25 16:10:58 lnx-ci-node kernel: [18161.477651] hqd_destroy_v10_3+0x58/0x254 [amdgpu]</span> <div><span style="color: rgb(23, 78, 134); font-size: 10pt;">Jan 25 16:10:58 lnx-ci-node kernel: [18161.477778] destroy_mqd+0x1e/0x30 [amdgpu]</span></div> <div><span style="color: rgb(23, 78, 134); font-size: 10pt;">Jan 25 16:10:58 lnx-ci-node kernel: [18161.477884] kernel_queue_uninit+0xcf/0x100 [amdgpu]</span></div> <div><span style="color: rgb(23, 78, 134); font-size: 10pt;">Jan 25 16:10:58 lnx-ci-node kernel: [18161.477985] pm_uninit+0x1a/0x30 [amdgpu] #kernel_queue_uninit(pm->priv_queue, hanging); this priv_queue == HIQ</span></div> <div><span style="color: rgb(23, 78, 134); font-size: 10pt;">Jan 25 16:10:58 lnx-ci-node kernel: [18161.478127] stop_cpsch+0x98/0x100 [amdgpu] </span></div> <div><span style="color: rgb(23, 78, 134); font-size: 10pt;">Jan 25 16:10:58 lnx-ci-node kernel: [18161.478242] 
kgd2kfd_suspend.part.0+0x32/0x50 [amdgpu]</span></div> <div><span style="color: rgb(23, 78, 134); font-size: 10pt;">Jan 25 16:10:58 lnx-ci-node kernel: [18161.478338] kgd2kfd_suspend+0x1b/0x20 [amdgpu]</span></div> <div><span style="color: rgb(23, 78, 134); font-size: 10pt;">Jan 25 16:10:58 lnx-ci-node kernel: [18161.478433] amdgpu_amdkfd_suspend+0x1e/0x30 [amdgpu]</span></div> <div><span style="color: rgb(23, 78, 134); font-size: 10pt;">Jan 25 16:10:58 lnx-ci-node kernel: [18161.478529] amdgpu_device_fini_hw+0x182/0x335 [amdgpu]</span></div> <div><span style="color: rgb(23, 78, 134); font-size: 10pt;">Jan 25 16:10:58 lnx-ci-node kernel: [18161.478655] amdgpu_driver_unload_kms+0x5c/0x80 [amdgpu]</span></div> <div><span style="color: rgb(23, 78, 134); font-size: 10pt;">Jan 25 16:10:58 lnx-ci-node kernel: [18161.478732] amdgpu_pci_remove+0x27/0x40 [amdgpu]</span></div> <div><span style="color: rgb(23, 78, 134); font-size: 10pt;">Jan 25 16:10:58 lnx-ci-node kernel: [18161.478806] pci_device_remove+0x3e/0xb0</span></div> <div><span style="color: rgb(23, 78, 134); font-size: 10pt;">Jan 25 16:10:58 lnx-ci-node kernel: [18161.478809] device_release_driver_internal+0x103/0x1d0</span></div> <div><span style="color: rgb(23, 78, 134); font-size: 10pt;">Jan 25 16:10:58 lnx-ci-node kernel: [18161.478813] driver_detach+0x4c/0x90</span></div> <div><span style="color: rgb(23, 78, 134); font-size: 10pt;">Jan 25 16:10:58 lnx-ci-node kernel: [18161.478814] bus_remove_driver+0x5c/0xd0</span></div> <div><span style="color: rgb(23, 78, 134); font-size: 10pt;">Jan 25 16:10:58 lnx-ci-node kernel: [18161.478815] driver_unregister+0x31/0x50</span></div> <div><span style="color: rgb(23, 78, 134); font-size: 10pt;">Jan 25 16:10:58 lnx-ci-node kernel: [18161.478817] pci_unregister_driver+0x40/0x90</span></div> <div><span style="color: rgb(23, 78, 134); font-size: 10pt;">Jan 25 16:10:58 lnx-ci-node kernel: [18161.478818] amdgpu_exit+0x15/0x2d1 [amdgpu]</span></div> <div><span style="color: 
rgb(23, 78, 134); font-size: 10pt;">Jan 25 16:10:58 lnx-ci-node kernel: [18161.478942] __x64_sys_delete_module+0x147/0x260</span></div> <div><span style="color: rgb(23, 78, 134); font-size: 10pt;">Jan 25 16:10:58 lnx-ci-node kernel: [18161.478944] ? exit_to_user_mode_prepare+0x41/0x1d0</span></div> <div><span style="color: rgb(23, 78, 134); font-size: 10pt;">Jan 25 16:10:58 lnx-ci-node kernel: [18161.478946] ? ksys_write+0x67/0xe0</span></div> <div><span style="color: rgb(23, 78, 134); font-size: 10pt;">Jan 25 16:10:58 lnx-ci-node kernel: [18161.478948] do_syscall_64+0x40/0xb0</span></div> <span style="color: rgb(23, 78, 134); font-size: 10pt;">Jan 25 16:10:58 lnx-ci-node kernel: [18161.478951] entry_SYSCALL_64_after_hwframe+0x44/0xae</span><br> </div> <div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);"> <br> </div> <div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);"> Regards,</div> <div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);"> Rico<br> </div> <div id="appendonsend"></div> <hr style="display:inline-block;width:98%" tabindex="-1"> <div id="divRplyFwdMsg" dir="ltr"><font face="Calibri, sans-serif" style="font-size:11pt" color="#000000"><b>From:</b> Kuehling, Felix <Felix.Kuehling@amd.com><br> <b>Sent:</b> Thursday, January 27, 2022 23:28<br> <b>To:</b> Yin, Tianci (Rico) <Tianci.Yin@amd.com>; Wang, Yang(Kevin) <KevinYang.Wang@amd.com>; amd-gfx@lists.freedesktop.org <amd-gfx@lists.freedesktop.org><br> <b>Cc:</b> Grodzovsky, Andrey <Andrey.Grodzovsky@amd.com>; Chen, Guchun <Guchun.Chen@amd.com><br> <b>Subject:</b> Re: [PATCH] drm/amdgpu: Fix an error message in rmmod</font> <div> </div> </div> <div class="BodyFragment"><font size="2"><span style="font-size:11pt;"> <div class="PlainText">The hang you're seeing is the result of a command submission of an <br> UNMAP_QUEUES and QUERY_STATUS command to the HIQ. 
This is done using a <br> doorbell. KFD writes commands to the HIQ and rings a doorbell to wake up <br> the HWS (see kq_submit_packet in kfd_kernel_queue.c). Why does this <br> doorbell not trigger gfxoff exit during rmmod?<br> <br> <br> Regards,<br> Felix<br> <br> <br> <br> On 2022-01-26 at 22:38, Yin, Tianci (Rico) wrote:<br> ><br> > [AMD Official Use Only]<br> ><br> ><br> > The rmmod operation has two prerequisites: the multi-user target and a blacklisted amdgpu,<br> > both of which IGT requires so that it can make itself DRM master to <br> > test KMS.<br> > igt-gpu-tools/build/tests/amdgpu/amd_module_load --run-subtest reload<br> ><br> > From my understanding, a KFD process exits gfxoff in the regular way, <br> > where a doorbell write triggers the gfxoff exit. For example, <br> > KFD maps an HCQ through a command on the HIQ or KIQ ring, or the UMD commits jobs on an HCQ; <br> > both trigger a doorbell write (please refer to <br> > gfx_v10_0_ring_set_wptr_compute()).<br> ><br> > In the IGT reload test, the dequeue request does not go through a command on a <br> > ring; it writes CP registers directly, so the GFX core remains in gfxoff.<br> ><br> > Thanks,<br> > Rico<br> ><br> > ------------------------------------------------------------------------<br> > *From:* Kuehling, Felix <Felix.Kuehling@amd.com><br> > *Sent:* Wednesday, January 26, 2022 23:08<br> > *To:* Yin, Tianci (Rico) <Tianci.Yin@amd.com>; Wang, Yang(Kevin) <br> > <KevinYang.Wang@amd.com>; amd-gfx@lists.freedesktop.org <br> > <amd-gfx@lists.freedesktop.org><br> > *Cc:* Grodzovsky, Andrey <Andrey.Grodzovsky@amd.com>; Chen, Guchun <br> > <Guchun.Chen@amd.com><br> > *Subject:* Re: [PATCH] drm/amdgpu: Fix an error message in rmmod<br> > My question is, why is this problem only seen during module unload? Why<br> > aren't we seeing HWS hangs due to GFX_OFF all the time in normal<br> > operations? For example when the GPU is idle and a new KFD process is<br> > started, creating a new runlist. 
Are we just getting lucky because the<br> > process first has to allocate some memory, which maybe makes some HW<br> > access (flushing TLBs etc.) that wakes up the GPU?<br> ><br> ><br> > Regards,<br> > Felix<br> ><br> ><br> ><br> > On 2022-01-26 at 01:43, Yin, Tianci (Rico) wrote:<br> > ><br> > > [AMD Official Use Only]<br> > ><br> > ><br> > > Thanks Kevin and Felix!<br> > ><br> > > In the gfxoff state, the dequeue request (made by writing CP registers) can't<br> > > trigger a gfxoff exit; the CP is actually powered off, so the CP register<br> > > write is invalid. Writing the doorbell registers (the regular way) or<br> > > directly requesting the SMU to disable GFX powergating (by invoking<br> > > amdgpu_gfx_off_ctrl) can trigger a gfxoff exit.<br> > ><br> > > I have also tried<br> > > <br> > amdgpu_dpm_switch_power_profile(adev,PP_SMC_POWER_PROFILE_COMPUTE,false),<br> > > but it had no effect.<br> > ><br> > > [10386.162273] amdgpu: cp queue pipe 4 queue 0 preemption failed<br> > > [10671.225065] amdgpu: mmCP_HQD_ACTIVE : 0xffffffff<br> > > [10386.162290] amdgpu: mmCP_HQD_HQ_STATUS0 : 0xffffffff<br> > > [10386.162297] amdgpu: mmCP_STAT : 0xffffffff<br> > > [10386.162303] amdgpu: mmCP_BUSY_STAT : 0xffffffff<br> > > [10386.162308] amdgpu: mmRLC_STAT : 0xffffffff<br> > > [10386.162314] amdgpu: mmGRBM_STATUS : 0xffffffff<br> > > [10386.162320] amdgpu: mmGRBM_STATUS2: 0xffffffff<br> > ><br> > > Thanks again!<br> > > Rico<br> > > ------------------------------------------------------------------------<br> > > *From:* Kuehling, Felix <Felix.Kuehling@amd.com><br> > > *Sent:* Tuesday, January 25, 2022 23:31<br> > > *To:* Wang, Yang(Kevin) <KevinYang.Wang@amd.com>; Yin, Tianci (Rico)<br> > > <Tianci.Yin@amd.com>; amd-gfx@lists.freedesktop.org<br> > > <amd-gfx@lists.freedesktop.org><br> > > *Cc:* Grodzovsky, Andrey <Andrey.Grodzovsky@amd.com>; Chen, Guchun<br> > > <Guchun.Chen@amd.com><br> > > *Subject:* Re: [PATCH] drm/amdgpu: Fix an error message in rmmod<br> > > I have no objection to the change. 
It restores the sequence that was<br> > > used before e9669fb78262. But I don't understand why GFX_OFF is causing<br> > > a preemption error during module unload, but not when KFD is in normal<br> > > use. Maybe it's because of the compute power profile that's normally set<br> > > by amdgpu_amdkfd_set_compute_idle before we interact with the HWS.<br> > ><br> > ><br> > > Either way, the patch is<br> > ><br> > > Acked-by: Felix Kuehling <Felix.Kuehling@amd.com><br> > ><br> > ><br> > ><br> > > On 2022-01-25 at 05:48, Wang, Yang(Kevin) wrote:<br> > > ><br> > > > [AMD Official Use Only]<br> > > ><br> > > ><br> > > > The issue was introduced by the following patch, so it would be better<br> > > > to add the following information.<br> > > > /fixes: (e9669fb78262) drm/amdgpu: Add early fini callback/<br> > > > /<br> > > > /<br> > > > Reviewed-by: Yang Wang <kevinyang.wang@amd.com><br> > > > /<br> > > > /<br> > > > Best Regards,<br> > > > Kevin<br> > > ><br> > > > <br> > ------------------------------------------------------------------------<br> > > > *From:* amd-gfx <amd-gfx-bounces@lists.freedesktop.org> on behalf of<br> > > > Tianci Yin <tianci.yin@amd.com><br> > > > *Sent:* Tuesday, January 25, 2022 6:03 PM<br> > > > *To:* amd-gfx@lists.freedesktop.org <amd-gfx@lists.freedesktop.org><br> > > > *Cc:* Grodzovsky, Andrey <Andrey.Grodzovsky@amd.com>; Yin, Tianci<br> > > > (Rico) <Tianci.Yin@amd.com>; Chen, Guchun <Guchun.Chen@amd.com><br> > > > *Subject:* [PATCH] drm/amdgpu: Fix an error message in rmmod<br> > > > From: "Tianci.Yin" <tianci.yin@amd.com><br> > > ><br> > > > [why]<br> > > > During rmmod, KFD sends the CP a dequeue request, but the<br> > > > request gets no response, so the error message "cp<br> > > > queue pipe 4 queue 0 preemption failed" is printed.<br> > > ><br> > > > [how]<br> > > > Suspending KFD after disabling gfxoff fixes it.<br> > > ><br> > > > Change-Id: 
I0453f28820542d4a5ab26e38fb5b87ed76ce6930<br> > > > Signed-off-by: Tianci.Yin <tianci.yin@amd.com><br> > > > ---<br> > > > drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 4 ++--<br> > > > 1 file changed, 2 insertions(+), 2 deletions(-)<br> > > ><br> > > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c<br> > > > b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c<br> > > > index b75d67f644e5..77e9837ba342 100644<br> > > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c<br> > > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c<br> > > > @@ -2720,11 +2720,11 @@ static int amdgpu_device_ip_fini_early(struct<br> > > > amdgpu_device *adev)<br> > > > }<br> > > > }<br> > > ><br> > > > - amdgpu_amdkfd_suspend(adev, false);<br> > > > -<br> > > > amdgpu_device_set_pg_state(adev, AMD_PG_STATE_UNGATE);<br> > > > amdgpu_device_set_cg_state(adev, AMD_CG_STATE_UNGATE);<br> > > ><br> > > > + amdgpu_amdkfd_suspend(adev, false);<br> > > > +<br> > > > /* Workaroud for ASICs need to disable SMC first */<br> > > > amdgpu_device_smu_fini_early(adev);<br> > > ><br> > > > --<br> > > > 2.25.1<br> > > ><br> </div> </span></font></div> </div> </body> </html>