<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=us-ascii">
<style type="text/css" style="display:none;"> P {margin-top:0;margin-bottom:0;} </style>
</head>
<body dir="ltr">
<p style="font-family:Arial;font-size:10pt;color:#0000FF;margin:5pt;" align="Left">
[AMD Official Use Only]<br>
</p>
<br>
<div>
<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">
The error message comes from the HIQ dequeue procedure, not from an HCQ, so no doorbell write is involved.</div>
<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">
<br>
</div>
<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">
<span style="color: rgb(23, 78, 134); font-size: 10pt;">Jan 25 16:10:58 lnx-ci-node kernel: [18161.477067] Call Trace:</span>
<div><span style="color: rgb(23, 78, 134); font-size: 10pt;">Jan 25 16:10:58 lnx-ci-node kernel: [18161.477072] dump_stack+0x7d/0x9c</span></div>
<span style="color: rgb(23, 78, 134); font-size: 10pt;">Jan 25 16:10:58 lnx-ci-node kernel: [18161.477651] hqd_destroy_v10_3+0x58/0x254 [amdgpu]</span>
<div><span style="color: rgb(23, 78, 134); font-size: 10pt;">Jan 25 16:10:58 lnx-ci-node kernel: [18161.477778] destroy_mqd+0x1e/0x30 [amdgpu]</span></div>
<div><span style="color: rgb(23, 78, 134); font-size: 10pt;">Jan 25 16:10:58 lnx-ci-node kernel: [18161.477884] kernel_queue_uninit+0xcf/0x100 [amdgpu]</span></div>
<div><span style="color: rgb(23, 78, 134); font-size: 10pt;">Jan 25 16:10:58 lnx-ci-node kernel: [18161.477985] pm_uninit+0x1a/0x30 [amdgpu] #kernel_queue_uninit(pm->priv_queue, hanging); this priv_queue == HIQ</span></div>
<div><span style="color: rgb(23, 78, 134); font-size: 10pt;">Jan 25 16:10:58 lnx-ci-node kernel: [18161.478127] stop_cpsch+0x98/0x100 [amdgpu] </span></div>
<div><span style="color: rgb(23, 78, 134); font-size: 10pt;">Jan 25 16:10:58 lnx-ci-node kernel: [18161.478242] kgd2kfd_suspend.part.0+0x32/0x50 [amdgpu]</span></div>
<div><span style="color: rgb(23, 78, 134); font-size: 10pt;">Jan 25 16:10:58 lnx-ci-node kernel: [18161.478338] kgd2kfd_suspend+0x1b/0x20 [amdgpu]</span></div>
<div><span style="color: rgb(23, 78, 134); font-size: 10pt;">Jan 25 16:10:58 lnx-ci-node kernel: [18161.478433] amdgpu_amdkfd_suspend+0x1e/0x30 [amdgpu]</span></div>
<div><span style="color: rgb(23, 78, 134); font-size: 10pt;">Jan 25 16:10:58 lnx-ci-node kernel: [18161.478529] amdgpu_device_fini_hw+0x182/0x335 [amdgpu]</span></div>
<div><span style="color: rgb(23, 78, 134); font-size: 10pt;">Jan 25 16:10:58 lnx-ci-node kernel: [18161.478655] amdgpu_driver_unload_kms+0x5c/0x80 [amdgpu]</span></div>
<div><span style="color: rgb(23, 78, 134); font-size: 10pt;">Jan 25 16:10:58 lnx-ci-node kernel: [18161.478732] amdgpu_pci_remove+0x27/0x40 [amdgpu]</span></div>
<div><span style="color: rgb(23, 78, 134); font-size: 10pt;">Jan 25 16:10:58 lnx-ci-node kernel: [18161.478806] pci_device_remove+0x3e/0xb0</span></div>
<div><span style="color: rgb(23, 78, 134); font-size: 10pt;">Jan 25 16:10:58 lnx-ci-node kernel: [18161.478809] device_release_driver_internal+0x103/0x1d0</span></div>
<div><span style="color: rgb(23, 78, 134); font-size: 10pt;">Jan 25 16:10:58 lnx-ci-node kernel: [18161.478813] driver_detach+0x4c/0x90</span></div>
<div><span style="color: rgb(23, 78, 134); font-size: 10pt;">Jan 25 16:10:58 lnx-ci-node kernel: [18161.478814] bus_remove_driver+0x5c/0xd0</span></div>
<div><span style="color: rgb(23, 78, 134); font-size: 10pt;">Jan 25 16:10:58 lnx-ci-node kernel: [18161.478815] driver_unregister+0x31/0x50</span></div>
<div><span style="color: rgb(23, 78, 134); font-size: 10pt;">Jan 25 16:10:58 lnx-ci-node kernel: [18161.478817] pci_unregister_driver+0x40/0x90</span></div>
<div><span style="color: rgb(23, 78, 134); font-size: 10pt;">Jan 25 16:10:58 lnx-ci-node kernel: [18161.478818] amdgpu_exit+0x15/0x2d1 [amdgpu]</span></div>
<div><span style="color: rgb(23, 78, 134); font-size: 10pt;">Jan 25 16:10:58 lnx-ci-node kernel: [18161.478942] __x64_sys_delete_module+0x147/0x260</span></div>
<div><span style="color: rgb(23, 78, 134); font-size: 10pt;">Jan 25 16:10:58 lnx-ci-node kernel: [18161.478944] ? exit_to_user_mode_prepare+0x41/0x1d0</span></div>
<div><span style="color: rgb(23, 78, 134); font-size: 10pt;">Jan 25 16:10:58 lnx-ci-node kernel: [18161.478946] ? ksys_write+0x67/0xe0</span></div>
<div><span style="color: rgb(23, 78, 134); font-size: 10pt;">Jan 25 16:10:58 lnx-ci-node kernel: [18161.478948] do_syscall_64+0x40/0xb0</span></div>
<span style="color: rgb(23, 78, 134); font-size: 10pt;">Jan 25 16:10:58 lnx-ci-node kernel: [18161.478951] entry_SYSCALL_64_after_hwframe+0x44/0xae</span><br>
</div>
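<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">
<br>
The failure mode behind this trace can be sketched with a toy C model. It is not driver code, and every name in it (toy_gfx, toy_hqd_destroy and so on) is hypothetical. It only assumes what the thread describes: registers of a power-gated block read back as 0xffffffff, a direct register write does not wake the block, and disallowing gfxoff beforehand does.</div>

```c
/* Toy model, not driver code: all names below are hypothetical. */
struct toy_gfx {
	int powered;             /* 0 while the block sits in gfxoff */
	unsigned int hqd_active; /* stand-in for CP_HQD_ACTIVE */
};

/* A register read from a gated block returns all ones, as in the quoted log. */
static unsigned int toy_read_hqd_active(const struct toy_gfx *g)
{
	return g->powered ? g->hqd_active : 0xffffffffu;
}

/* A direct register write is dropped while gated and does not wake the block. */
static void toy_write_dequeue_request(struct toy_gfx *g)
{
	if (g->powered)
		g->hqd_active = 0;
}

/* Assumption: disallowing gfxoff (or ungating PG) powers the block up. */
static void toy_disallow_gfxoff(struct toy_gfx *g)
{
	g->powered = 1;
}

/* Returns 0 when the queue goes inactive, -1 for "preemption failed". */
static int toy_hqd_destroy(struct toy_gfx *g, int retries)
{
	toy_write_dequeue_request(g);
	while (retries-- > 0) {
		if (toy_read_hqd_active(g) == 0)
			return 0;
	}
	return -1;
}
```

<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">
In this model, calling toy_hqd_destroy on a gated block fails however many times it polls, while calling toy_disallow_gfxoff first lets the same request succeed. That is the ordering the patch restores.</div>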
<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">
<br>
</div>
<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">
Regards,</div>
<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">
Rico<br>
</div>
<div id="appendonsend"></div>
<hr style="display:inline-block;width:98%" tabindex="-1">
<div id="divRplyFwdMsg" dir="ltr"><font face="Calibri, sans-serif" style="font-size:11pt" color="#000000"><b>From:</b> Kuehling, Felix <Felix.Kuehling@amd.com><br>
<b>Sent:</b> Thursday, January 27, 2022 23:28<br>
<b>To:</b> Yin, Tianci (Rico) <Tianci.Yin@amd.com>; Wang, Yang(Kevin) <KevinYang.Wang@amd.com>; amd-gfx@lists.freedesktop.org <amd-gfx@lists.freedesktop.org><br>
<b>Cc:</b> Grodzovsky, Andrey <Andrey.Grodzovsky@amd.com>; Chen, Guchun <Guchun.Chen@amd.com><br>
<b>Subject:</b> Re: [PATCH] drm/amdgpu: Fix an error message in rmmod</font>
<div> </div>
</div>
<div class="BodyFragment"><font size="2"><span style="font-size:11pt;">
<div class="PlainText">The hang you're seeing is the result of a command submission of an
<br>
UNMAP_QUEUES and QUERY_STATUS command to the HIQ. This is done using a <br>
doorbell. KFD writes commands to the HIQ and rings a doorbell to wake up <br>
the HWS (see kq_submit_packet in kfd_kernel_queue.c). Why does this <br>
doorbell not trigger gfxoff exit during rmmod?<br>
<br>
<br>
Regards,<br>
Felix<br>
<br>
<br>
<br>
Am 2022-01-26 um 22:38 schrieb Yin, Tianci (Rico):<br>
><br>
> [AMD Official Use Only]<br>
><br>
><br>
> The rmmod operation has two prerequisites: booting to the multi-user target and<br>
> blacklisting amdgpu. IGT requires both so that it can make itself DRM master to <br>
> test KMS.<br>
> igt-gpu-tools/build/tests/amdgpu/amd_module_load --run-subtest reload<br>
><br>
> From my understanding, KFD normally exits gfxoff the regular way, <br>
> where a doorbell write triggers the gfxoff exit. For example, <br>
> KFD maps an HCQ through a command on the HIQ or KIQ ring, or the UMD commits jobs to an HCQ; <br>
> both trigger a doorbell write (please refer to <br>
> gfx_v10_0_ring_set_wptr_compute()).<br>
><br>
> In the IGT reload test, the dequeue request does not go through a command on a <br>
> ring; it writes CP registers directly, so the GFX core remains in gfxoff.<br>
><br>
> Thanks,<br>
> Rico<br>
><br>
> ------------------------------------------------------------------------<br>
> *From:* Kuehling, Felix <Felix.Kuehling@amd.com><br>
> *Sent:* Wednesday, January 26, 2022 23:08<br>
> *To:* Yin, Tianci (Rico) <Tianci.Yin@amd.com>; Wang, Yang(Kevin) <br>
> <KevinYang.Wang@amd.com>; amd-gfx@lists.freedesktop.org <br>
> <amd-gfx@lists.freedesktop.org><br>
> *Cc:* Grodzovsky, Andrey <Andrey.Grodzovsky@amd.com>; Chen, Guchun <br>
> <Guchun.Chen@amd.com><br>
> *Subject:* Re: [PATCH] drm/amdgpu: Fix an error message in rmmod<br>
> My question is, why is this problem only seen during module unload? Why<br>
> aren't we seeing HWS hangs due to GFX_OFF all the time in normal<br>
> operations? For example when the GPU is idle and a new KFD process is<br>
> started, creating a new runlist. Are we just getting lucky because the<br>
> process first has to allocate some memory, which maybe makes some HW<br>
> access (flushing TLBs etc.) that wakes up the GPU?<br>
><br>
><br>
> Regards,<br>
> Felix<br>
><br>
><br>
><br>
> Am 2022-01-26 um 01:43 schrieb Yin, Tianci (Rico):<br>
> ><br>
> > [AMD Official Use Only]<br>
> ><br>
> ><br>
> > Thanks Kevin and Felix!<br>
> ><br>
> > In gfxoff state, the dequeue request (issued by writing CP registers) cannot<br>
> > trigger a gfxoff exit. The CP is actually powered off, so the CP register<br>
> > write has no effect. Writing the doorbell registers (the regular way) or<br>
> > directly requesting the SMU to disable GFX power gating (by invoking<br>
> > amdgpu_gfx_off_ctrl) can trigger a gfxoff exit.<br>
> ><br>
> > I have also tried<br>
> > amdgpu_dpm_switch_power_profile(adev,PP_SMC_POWER_PROFILE_COMPUTE,false),<br>
> > but it had no effect.<br>
> ><br>
> > [10386.162273] amdgpu: cp queue pipe 4 queue 0 preemption failed<br>
> > [10671.225065] amdgpu: mmCP_HQD_ACTIVE : 0xffffffff<br>
> > [10386.162290] amdgpu: mmCP_HQD_HQ_STATUS0 : 0xffffffff<br>
> > [10386.162297] amdgpu: mmCP_STAT : 0xffffffff<br>
> > [10386.162303] amdgpu: mmCP_BUSY_STAT : 0xffffffff<br>
> > [10386.162308] amdgpu: mmRLC_STAT : 0xffffffff<br>
> > [10386.162314] amdgpu: mmGRBM_STATUS : 0xffffffff<br>
> > [10386.162320] amdgpu: mmGRBM_STATUS2: 0xffffffff<br>
> ><br>
> > Thanks again!<br>
> > Rico<br>
> > ------------------------------------------------------------------------<br>
> > *From:* Kuehling, Felix <Felix.Kuehling@amd.com><br>
> > *Sent:* Tuesday, January 25, 2022 23:31<br>
> > *To:* Wang, Yang(Kevin) <KevinYang.Wang@amd.com>; Yin, Tianci (Rico)<br>
> > <Tianci.Yin@amd.com>; amd-gfx@lists.freedesktop.org<br>
> > <amd-gfx@lists.freedesktop.org><br>
> > *Cc:* Grodzovsky, Andrey <Andrey.Grodzovsky@amd.com>; Chen, Guchun<br>
> > <Guchun.Chen@amd.com><br>
> > *Subject:* Re: [PATCH] drm/amdgpu: Fix an error message in rmmod<br>
> > I have no objection to the change. It restores the sequence that was<br>
> > used before e9669fb78262. But I don't understand why GFX_OFF is causing<br>
> > a preemption error during module unload, but not when KFD is in normal<br>
> > use. Maybe it's because of the compute power profile that's normally set<br>
> > by amdgpu_amdkfd_set_compute_idle before we interact with the HWS.<br>
> ><br>
> ><br>
> > Either way, the patch is<br>
> ><br>
> > Acked-by: Felix Kuehling <Felix.Kuehling@amd.com><br>
> ><br>
> ><br>
> ><br>
> > Am 2022-01-25 um 05:48 schrieb Wang, Yang(Kevin):<br>
> > ><br>
> > > [AMD Official Use Only]<br>
> > ><br>
> > ><br>
> > > The issue was introduced by the following patch, so it would be better to add<br>
> > > this tag:<br>
> > > Fixes: e9669fb78262 ("drm/amdgpu: Add early fini callback")<br>
> > ><br>
> > > Reviewed-by: Yang Wang <kevinyang.wang@amd.com><br>
> > ><br>
> > > Best Regards,<br>
> > > Kevin<br>
> > ><br>
> > > ------------------------------------------------------------------------<br>
> > > *From:* amd-gfx <amd-gfx-bounces@lists.freedesktop.org> on behalf of<br>
> > > Tianci Yin <tianci.yin@amd.com><br>
> > > *Sent:* Tuesday, January 25, 2022 6:03 PM<br>
> > > *To:* amd-gfx@lists.freedesktop.org <amd-gfx@lists.freedesktop.org><br>
> > > *Cc:* Grodzovsky, Andrey <Andrey.Grodzovsky@amd.com>; Yin, Tianci<br>
> > > (Rico) <Tianci.Yin@amd.com>; Chen, Guchun <Guchun.Chen@amd.com><br>
> > > *Subject:* [PATCH] drm/amdgpu: Fix an error message in rmmod<br>
> > > From: "Tianci.Yin" <tianci.yin@amd.com><br>
> > ><br>
> > > [why]<br>
> > > During rmmod, KFD sends the CP a dequeue request, but the<br>
> > > request gets no response, and the error message "cp<br>
> > > queue pipe 4 queue 0 preemption failed" is printed.<br>
> > ><br>
> > > [how]<br>
> > > Suspending KFD after disabling gfxoff fixes it.<br>
> > ><br>
> > > Change-Id: I0453f28820542d4a5ab26e38fb5b87ed76ce6930<br>
> > > Signed-off-by: Tianci.Yin <tianci.yin@amd.com><br>
> > > ---<br>
> > > drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 4 ++--<br>
> > > 1 file changed, 2 insertions(+), 2 deletions(-)<br>
> > ><br>
> > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c<br>
> > > b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c<br>
> > > index b75d67f644e5..77e9837ba342 100644<br>
> > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c<br>
> > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c<br>
> > > @@ -2720,11 +2720,11 @@ static int amdgpu_device_ip_fini_early(struct<br>
> > > amdgpu_device *adev)<br>
> > > }<br>
> > > }<br>
> > ><br>
> > > - amdgpu_amdkfd_suspend(adev, false);<br>
> > > -<br>
> > > amdgpu_device_set_pg_state(adev, AMD_PG_STATE_UNGATE);<br>
> > > amdgpu_device_set_cg_state(adev, AMD_CG_STATE_UNGATE);<br>
> > ><br>
> > > + amdgpu_amdkfd_suspend(adev, false);<br>
> > > +<br>
> > > /* Workaroud for ASICs need to disable SMC first */<br>
> > > amdgpu_device_smu_fini_early(adev);<br>
> > ><br>
> > > --<br>
> > > 2.25.1<br>
> > ><br>
</div>
</span></font></div>
</div>
</body>
</html>