[PATCH] drm/amdgpu: Fix an error message in rmmod

Felix Kuehling felix.kuehling at amd.com
Fri Jan 28 14:50:06 UTC 2022


I see, thanks for clarifying. So this is happening because we unmap the 
HIQ with direct MMIO register writes instead of using the KIQ.


I'm OK with this patch as a workaround, but as a proper fix, we should 
probably add a hiq_hqd_destroy function that uses KIQ, similar to how we 
have hiq_mqd_load functions that use KIQ to map the HIQ.
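A very rough sketch of what such a hook could look like, modeled on the
existing KIQ-based HIQ load path (purely illustrative: the kfd2kgd callback
hiq_mqd_destroy and the wiring below do not exist today and would have to
be added):

/* mqd_manager_v10.c (sketch): destroy the HIQ through KIQ instead of
 * direct MMIO, mirroring how the HIQ map already goes through KIQ. */
static int destroy_hiq_mqd_kiq(struct mqd_manager *mm, void *mqd,
                               enum kfd_preempt_type type,
                               unsigned int timeout,
                               uint32_t pipe_id, uint32_t queue_id)
{
        /* Hypothetical kfd2kgd callback that submits UNMAP_QUEUES on the
         * KIQ ring, so the ring/doorbell path wakes the GFX core out of
         * gfxoff before the dequeue is attempted. */
        return mm->dev->kfd2kgd->hiq_mqd_destroy(mm->dev->adev, mqd,
                                                 pipe_id, queue_id, timeout);
}

The HIQ mqd manager would then point its destroy_mqd at a function like
this instead of the generic path that ends in hqd_destroy_v10_3().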


Regards,
   Felix



Am 2022-01-27 um 21:34 schrieb Yin, Tianci (Rico):
>
> [AMD Official Use Only]
>
>
> The error message is from the HIQ dequeue procedure, not from an HCQ, so
> no doorbell write is involved.
>
> Jan 25 16:10:58 lnx-ci-node kernel: [18161.477067] Call Trace:
> Jan 25 16:10:58 lnx-ci-node kernel: [18161.477072]  dump_stack+0x7d/0x9c
> Jan 25 16:10:58 lnx-ci-node kernel: [18161.477651]  hqd_destroy_v10_3+0x58/0x254 [amdgpu]
> Jan 25 16:10:58 lnx-ci-node kernel: [18161.477778]  destroy_mqd+0x1e/0x30 [amdgpu]
> Jan 25 16:10:58 lnx-ci-node kernel: [18161.477884]  kernel_queue_uninit+0xcf/0x100 [amdgpu]
> Jan 25 16:10:58 lnx-ci-node kernel: [18161.477985]  pm_uninit+0x1a/0x30 [amdgpu] #kernel_queue_uninit(pm->priv_queue, hanging); this priv_queue == HIQ
> Jan 25 16:10:58 lnx-ci-node kernel: [18161.478127]  stop_cpsch+0x98/0x100 [amdgpu]
> Jan 25 16:10:58 lnx-ci-node kernel: [18161.478242]  kgd2kfd_suspend.part.0+0x32/0x50 [amdgpu]
> Jan 25 16:10:58 lnx-ci-node kernel: [18161.478338]  kgd2kfd_suspend+0x1b/0x20 [amdgpu]
> Jan 25 16:10:58 lnx-ci-node kernel: [18161.478433]  amdgpu_amdkfd_suspend+0x1e/0x30 [amdgpu]
> Jan 25 16:10:58 lnx-ci-node kernel: [18161.478529]  amdgpu_device_fini_hw+0x182/0x335 [amdgpu]
> Jan 25 16:10:58 lnx-ci-node kernel: [18161.478655]  amdgpu_driver_unload_kms+0x5c/0x80 [amdgpu]
> Jan 25 16:10:58 lnx-ci-node kernel: [18161.478732]  amdgpu_pci_remove+0x27/0x40 [amdgpu]
> Jan 25 16:10:58 lnx-ci-node kernel: [18161.478806]  pci_device_remove+0x3e/0xb0
> Jan 25 16:10:58 lnx-ci-node kernel: [18161.478809]  device_release_driver_internal+0x103/0x1d0
> Jan 25 16:10:58 lnx-ci-node kernel: [18161.478813]  driver_detach+0x4c/0x90
> Jan 25 16:10:58 lnx-ci-node kernel: [18161.478814]  bus_remove_driver+0x5c/0xd0
> Jan 25 16:10:58 lnx-ci-node kernel: [18161.478815]  driver_unregister+0x31/0x50
> Jan 25 16:10:58 lnx-ci-node kernel: [18161.478817]  pci_unregister_driver+0x40/0x90
> Jan 25 16:10:58 lnx-ci-node kernel: [18161.478818]  amdgpu_exit+0x15/0x2d1 [amdgpu]
> Jan 25 16:10:58 lnx-ci-node kernel: [18161.478942]  __x64_sys_delete_module+0x147/0x260
> Jan 25 16:10:58 lnx-ci-node kernel: [18161.478944]  ? exit_to_user_mode_prepare+0x41/0x1d0
> Jan 25 16:10:58 lnx-ci-node kernel: [18161.478946]  ? ksys_write+0x67/0xe0
> Jan 25 16:10:58 lnx-ci-node kernel: [18161.478948]  do_syscall_64+0x40/0xb0
> Jan 25 16:10:58 lnx-ci-node kernel: [18161.478951]  entry_SYSCALL_64_after_hwframe+0x44/0xae
>
> Regards,
> Rico
> ------------------------------------------------------------------------
> *From:* Kuehling, Felix <Felix.Kuehling at amd.com>
> *Sent:* Thursday, January 27, 2022 23:28
> *To:* Yin, Tianci (Rico) <Tianci.Yin at amd.com>; Wang, Yang(Kevin) 
> <KevinYang.Wang at amd.com>; amd-gfx at lists.freedesktop.org 
> <amd-gfx at lists.freedesktop.org>
> *Cc:* Grodzovsky, Andrey <Andrey.Grodzovsky at amd.com>; Chen, Guchun 
> <Guchun.Chen at amd.com>
> *Subject:* Re: [PATCH] drm/amdgpu: Fix an error message in rmmod
> The hang you're seeing is the result of submitting an UNMAP_QUEUES and a
> QUERY_STATUS command to the HIQ. This is done using a doorbell: KFD writes
> the commands to the HIQ and rings a doorbell to wake up the HWS (see
> kq_submit_packet in kfd_kernel_queue.c). Why does this doorbell not
> trigger gfxoff exit during rmmod?
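For reference, that submission path boils down to roughly the following
(paraphrased from kq_submit_packet(); the exact shape varies by kernel
version and ASIC):

/* kfd_kernel_queue.c (paraphrased): publish the new write pointer and
 * ring the queue's doorbell; the doorbell write is what normally kicks
 * the GFX core out of gfxoff so the HWS can fetch the packet. */
void kq_submit_packet(struct kernel_queue *kq)
{
        /* make the new write pointer visible to the firmware */
        *kq->wptr_kernel = kq->pending_wptr;
        /* ring the doorbell assigned to this kernel queue */
        write_kernel_doorbell(kq->queue->properties.doorbell_ptr,
                              kq->pending_wptr);
}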
>
>
> Regards,
>    Felix
>
>
>
> Am 2022-01-26 um 22:38 schrieb Yin, Tianci (Rico):
> >
> > [AMD Official Use Only]
> >
> >
> > The rmmod operation has the prerequisites of the multi-user target and
> > blacklisting amdgpu; this is an IGT requirement so that IGT can make
> > itself DRM master to test KMS.
> > igt-gpu-tools/build/tests/amdgpu/amd_module_load --run-subtest reload
> >
> > From my understanding, a KFD process follows the regular way of gfxoff
> > exit, in which a doorbell write triggers the exit. For example, KFD maps
> > an HCQ through a command on the HIQ or KIQ ring, or UMD commits jobs on
> > an HCQ; both of these trigger a doorbell write (please refer to
> > gfx_v10_0_ring_set_wptr_compute()).
> >
> > As to the IGT reload test, the dequeue request does not go through a
> > command on a ring; it writes CP registers directly, so the GFX core
> > remains in gfxoff.
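Roughly, the two paths contrast like this (register and macro names are
taken from the gfx v10 code but shown for illustration only, not as exact
code):

/* (a) Regular path: a compute ring submission ends in a doorbell write,
 *     which the hardware treats as a gfxoff-exit event
 *     (cf. gfx_v10_0_ring_set_wptr_compute()). */
WDOORBELL64(ring->doorbell_index, ring->wptr);

/* (b) rmmod path: hqd_destroy_v10_3() pokes the CP registers directly.
 *     With the GFX core in gfxoff the write never reaches the CP and the
 *     status reads come back as 0xffffffff, so the dequeue times out and
 *     "preemption failed" is printed. */
WREG32_SOC15(GC, 0, mmCP_HQD_DEQUEUE_REQUEST, DRAIN_PIPE);
while (RREG32_SOC15(GC, 0, mmCP_HQD_ACTIVE) & CP_HQD_ACTIVE__ACTIVE_MASK) {
        if (time_after(jiffies, end_jiffies))
                return -ETIME;  /* "cp queue pipe 4 queue 0 preemption failed" */
        usleep_range(500, 1000);
}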
> >
> > Thanks,
> > Rico
> >
> > ------------------------------------------------------------------------
> > *From:* Kuehling, Felix <Felix.Kuehling at amd.com>
> > *Sent:* Wednesday, January 26, 2022 23:08
> > *To:* Yin, Tianci (Rico) <Tianci.Yin at amd.com>; Wang, Yang(Kevin)
> > <KevinYang.Wang at amd.com>; amd-gfx at lists.freedesktop.org
> > <amd-gfx at lists.freedesktop.org>
> > *Cc:* Grodzovsky, Andrey <Andrey.Grodzovsky at amd.com>; Chen, Guchun
> > <Guchun.Chen at amd.com>
> > *Subject:* Re: [PATCH] drm/amdgpu: Fix an error message in rmmod
> > My question is, why is this problem only seen during module unload? Why
> > aren't we seeing HWS hangs due to GFX_OFF all the time in normal
> > operations? For example when the GPU is idle and a new KFD process is
> > started, creating a new runlist. Are we just getting lucky because the
> > process first has to allocate some memory, which maybe makes some HW
> > access (flushing TLBs etc.) that wakes up the GPU?
> >
> >
> > Regards,
> >    Felix
> >
> >
> >
> > Am 2022-01-26 um 01:43 schrieb Yin, Tianci (Rico):
> > >
> > > [AMD Official Use Only]
> > >
> > >
> > > Thanks Kevin and Felix!
> > >
> > > In the gfxoff state, the dequeue request (done by CP register writes)
> > > cannot make gfxoff exit; the CP is actually powered off, so the CP
> > > register writes have no effect. A doorbell register write (the regular
> > > way) or directly requesting the SMU to disable gfx powergating (by
> > > invoking amdgpu_gfx_off_ctrl) can trigger gfxoff exit.
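As a minimal sketch of that amdgpu_gfx_off_ctrl() route (the usual
ref-counted disallow/allow pairing around direct register access; this is
not what the patch below does, which reorders the teardown instead):

        /* ask the SMU to keep the GFX core out of gfxoff (ref-counted) */
        amdgpu_gfx_off_ctrl(adev, false);

        /* ... direct MMIO access to the CP_HQD_* registers is safe here ... */

        /* drop the request; gfxoff may engage again once the count is zero */
        amdgpu_gfx_off_ctrl(adev, true);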
> > >
> > > I have also tried
> > > amdgpu_dpm_switch_power_profile(adev, PP_SMC_POWER_PROFILE_COMPUTE, false),
> > > but it has no effect.
> > >
> > > [10386.162273] amdgpu: cp queue pipe 4 queue 0 preemption failed
> > > [10671.225065] amdgpu: mmCP_HQD_ACTIVE : 0xffffffff
> > > [10386.162290] amdgpu: mmCP_HQD_HQ_STATUS0 : 0xffffffff
> > > [10386.162297] amdgpu: mmCP_STAT : 0xffffffff
> > > [10386.162303] amdgpu: mmCP_BUSY_STAT : 0xffffffff
> > > [10386.162308] amdgpu: mmRLC_STAT : 0xffffffff
> > > [10386.162314] amdgpu: mmGRBM_STATUS : 0xffffffff
> > > [10386.162320] amdgpu: mmGRBM_STATUS2: 0xffffffff
> > >
> > > Thanks again!
> > > Rico
> > > ------------------------------------------------------------------------
> > > *From:* Kuehling, Felix <Felix.Kuehling at amd.com>
> > > *Sent:* Tuesday, January 25, 2022 23:31
> > > *To:* Wang, Yang(Kevin) <KevinYang.Wang at amd.com>; Yin, Tianci (Rico)
> > > <Tianci.Yin at amd.com>; amd-gfx at lists.freedesktop.org
> > > <amd-gfx at lists.freedesktop.org>
> > > *Cc:* Grodzovsky, Andrey <Andrey.Grodzovsky at amd.com>; Chen, Guchun
> > > <Guchun.Chen at amd.com>
> > > *Subject:* Re: [PATCH] drm/amdgpu: Fix an error message in rmmod
> > > I have no objection to the change. It restores the sequence that was
> > > used before e9669fb78262. But I don't understand why GFX_OFF is causing
> > > a preemption error during module unload, but not when KFD is in normal
> > > use. Maybe it's because of the compute power profile that's normally set
> > > by amdgpu_amdkfd_set_compute_idle before we interact with the HWS.
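For context, amdgpu_amdkfd_set_compute_idle() is essentially a thin wrapper
that switches the power profile, approximately like this (ignoring newer
per-ASIC special cases):

void amdgpu_amdkfd_set_compute_idle(struct amdgpu_device *adev, bool idle)
{
        /* entering "busy" selects the compute profile, "idle" drops it */
        amdgpu_dpm_switch_power_profile(adev,
                                        PP_SMC_POWER_PROFILE_COMPUTE,
                                        !idle);
}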
> > >
> > >
> > > Either way, the patch is
> > >
> > > Acked-by: Felix Kuehling <Felix.Kuehling at amd.com>
> > >
> > >
> > >
> > > Am 2022-01-25 um 05:48 schrieb Wang, Yang(Kevin):
> > > >
> > > > [AMD Official Use Only]
> > > >
> > > >
> > > > The issue is introduced by the patch below, so it would be better to
> > > > add the corresponding tag:
> > > > Fixes: e9669fb78262 ("drm/amdgpu: Add early fini callback")
> > > >
> > > > Reviewed-by: Yang Wang <kevinyang.wang at amd.com>
> > > >
> > > > Best Regards,
> > > > Kevin
> > > >
> > > >
> > > > ------------------------------------------------------------------------
> > > > *From:* amd-gfx <amd-gfx-bounces at lists.freedesktop.org> on behalf of
> > > > Tianci Yin <tianci.yin at amd.com>
> > > > *Sent:* Tuesday, January 25, 2022 6:03 PM
> > > > *To:* amd-gfx at lists.freedesktop.org <amd-gfx at lists.freedesktop.org>
> > > > *Cc:* Grodzovsky, Andrey <Andrey.Grodzovsky at amd.com>; Yin, Tianci
> > > > (Rico) <Tianci.Yin at amd.com>; Chen, Guchun <Guchun.Chen at amd.com>
> > > > *Subject:* [PATCH] drm/amdgpu: Fix an error message in rmmod
> > > > From: "Tianci.Yin" <tianci.yin at amd.com>
> > > >
> > > > [why]
> > > > In the rmmod procedure, KFD sends CP a dequeue request, but the
> > > > request gets no response, and then the error message "cp
> > > > queue pipe 4 queue 0 preemption failed" is printed.
> > > >
> > > > [how]
> > > > Performing the KFD suspend after disabling gfxoff fixes it.
> > > >
> > > > Change-Id: I0453f28820542d4a5ab26e38fb5b87ed76ce6930
> > > > Signed-off-by: Tianci.Yin <tianci.yin at amd.com>
> > > > ---
> > > >  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 4 ++--
> > > >  1 file changed, 2 insertions(+), 2 deletions(-)
> > > >
> > > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > > > b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > > > index b75d67f644e5..77e9837ba342 100644
> > > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > > > @@ -2720,11 +2720,11 @@ static int amdgpu_device_ip_fini_early(struct amdgpu_device *adev)
> > > >                  }
> > > >          }
> > > >
> > > > -       amdgpu_amdkfd_suspend(adev, false);
> > > > -
> > > >          amdgpu_device_set_pg_state(adev, AMD_PG_STATE_UNGATE);
> > > >          amdgpu_device_set_cg_state(adev, AMD_CG_STATE_UNGATE);
> > > >
> > > > +       amdgpu_amdkfd_suspend(adev, false);
> > > > +
> > > >          /* Workaroud for ASICs need to disable SMC first */
> > > >          amdgpu_device_smu_fini_early(adev);
> > > >
> > > > --
> > > > 2.25.1
> > > >

