[PATCH] drm/amdgpu: Fix an error message in rmmod

Felix Kuehling felix.kuehling at amd.com
Thu Jan 27 15:28:41 UTC 2022


The hang you're seeing is the result of a command submission of an 
UNMAP_QUEUES and QUERY_STATUS command to the HIQ. This is done using a 
doorbell. KFD writes commands to the HIQ and rings a doorbell to wake up 
the HWS (see kq_submit_packet in kfd_kernel_queue.c). Why does this 
doorbell not trigger gfxoff exit during rmmod?
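
For illustration only, the submission pattern described above (write
packets into the queue, then ring a doorbell) can be sketched as a
user-space toy model. Every name below (toy_kq, toy_write_packet,
toy_submit, toy_demo), the ring size and the packet payloads are
hypothetical. This is not the code in kfd_kernel_queue.c, it does no
free-space checking, and it says nothing about how the hardware reacts
to the doorbell.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define TOY_RING_DWORDS 64u /* hypothetical ring size in dwords */

/* Toy stand-in for a kernel queue such as the HIQ; not a KFD structure. */
struct toy_kq {
	uint32_t ring[TOY_RING_DWORDS]; /* packet buffer read by the scheduler */
	uint32_t pending_wptr;          /* write pointer staged by the CPU, in dwords */
	uint32_t doorbell;              /* stand-in for the doorbell location */
};

/* Copy one packet into the ring and stage the new write pointer. */
static int toy_write_packet(struct toy_kq *kq, const uint32_t *pkt, uint32_t ndw)
{
	if (ndw == 0 || ndw > TOY_RING_DWORDS)
		return -1;
	for (uint32_t i = 0; i < ndw; i++)
		kq->ring[(kq->pending_wptr + i) % TOY_RING_DWORDS] = pkt[i];
	kq->pending_wptr += ndw;
	return 0;
}

/* Publish the staged write pointer by "ringing" the doorbell. */
static void toy_submit(struct toy_kq *kq)
{
	kq->doorbell = kq->pending_wptr;
}

/* Queue two placeholder packets, then ring the doorbell once. */
static uint32_t toy_demo(void)
{
	struct toy_kq kq;
	const uint32_t unmap_queues[4] = { 0x1, 0, 0, 0 }; /* placeholder payload */
	const uint32_t query_status[4] = { 0x2, 0, 0, 0 }; /* placeholder payload */

	memset(&kq, 0, sizeof(kq));
	if (toy_write_packet(&kq, unmap_queues, 4) != 0)
		return 0;
	if (toy_write_packet(&kq, query_status, 4) != 0)
		return 0;
	assert(kq.doorbell == 0); /* nothing has been published yet */
	toy_submit(&kq);
	return kq.doorbell;       /* published write pointer, in dwords */
}
```

In this model the packets only become visible to the consumer once
toy_submit() stores the write pointer, which is the role the doorbell
write plays in the description above.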


Regards,
   Felix



On 2022-01-26 at 22:38, Yin, Tianci (Rico) wrote:
>
> [AMD Official Use Only]
>
>
> The rmmod test has two prerequisites: booting to the multi-user target
> and blacklisting amdgpu. IGT requires this so that it can make itself
> DRM master to test KMS.
> igt-gpu-tools/build/tests/amdgpu/amd_module_load --run-subtest reload
>
> From my understanding, a KFD process uses the regular gfxoff exit
> path, in which a doorbell write triggers gfxoff exit. For example, when
> KFD maps an HCQ through a command on the HIQ or KIQ ring, or when the
> UMD commits jobs on an HCQ, both result in a doorbell write (please
> refer to gfx_v10_0_ring_set_wptr_compute()).
>
> In the IGT reload test, the dequeue request does not go through a
> command on a ring; it writes CP registers directly, so the GFX core
> remains in gfxoff.
>
> Thanks,
> Rico
>
> ------------------------------------------------------------------------
> *From:* Kuehling, Felix <Felix.Kuehling at amd.com>
> *Sent:* Wednesday, January 26, 2022 23:08
> *To:* Yin, Tianci (Rico) <Tianci.Yin at amd.com>; Wang, Yang(Kevin) 
> <KevinYang.Wang at amd.com>; amd-gfx at lists.freedesktop.org 
> <amd-gfx at lists.freedesktop.org>
> *Cc:* Grodzovsky, Andrey <Andrey.Grodzovsky at amd.com>; Chen, Guchun 
> <Guchun.Chen at amd.com>
> *Subject:* Re: [PATCH] drm/amdgpu: Fix an error message in rmmod
> My question is, why is this problem only seen during module unload? Why
> aren't we seeing HWS hangs due to GFX_OFF all the time in normal
> operations? For example when the GPU is idle and a new KFD process is
> started, creating a new runlist. Are we just getting lucky because the
> process first has to allocate some memory, which maybe makes some HW
> access (flushing TLBs etc.) that wakes up the GPU?
>
>
> Regards,
>    Felix
>
>
>
> On 2022-01-26 at 01:43, Yin, Tianci (Rico) wrote:
> >
> > [AMD Official Use Only]
> >
> >
> > Thanks Kevin and Felix!
> >
> > In the gfxoff state, the dequeue request (made by writing CP
> > registers) cannot trigger gfxoff exit. The CP is actually powered
> > off, so the CP register write is invalid. Writing a doorbell register
> > (the regular way) or directly requesting the SMU to disable GFX
> > powergating (by invoking amdgpu_gfx_off_ctrl) can trigger gfxoff exit.
> >
> > I have also tried
> > amdgpu_dpm_switch_power_profile(adev,PP_SMC_POWER_PROFILE_COMPUTE,false),
> > but it has no effect.
> >
> > [10386.162273] amdgpu: cp queue pipe 4 queue 0 preemption failed
> > [10671.225065] amdgpu: mmCP_HQD_ACTIVE : 0xffffffff
> > [10386.162290] amdgpu: mmCP_HQD_HQ_STATUS0 : 0xffffffff
> > [10386.162297] amdgpu: mmCP_STAT : 0xffffffff
> > [10386.162303] amdgpu: mmCP_BUSY_STAT : 0xffffffff
> > [10386.162308] amdgpu: mmRLC_STAT : 0xffffffff
> > [10386.162314] amdgpu: mmGRBM_STATUS : 0xffffffff
> > [10386.162320] amdgpu: mmGRBM_STATUS2: 0xffffffff
> >
> > Thanks again!
> > Rico
> > ------------------------------------------------------------------------
> > *From:* Kuehling, Felix <Felix.Kuehling at amd.com>
> > *Sent:* Tuesday, January 25, 2022 23:31
> > *To:* Wang, Yang(Kevin) <KevinYang.Wang at amd.com>; Yin, Tianci (Rico)
> > <Tianci.Yin at amd.com>; amd-gfx at lists.freedesktop.org
> > <amd-gfx at lists.freedesktop.org>
> > *Cc:* Grodzovsky, Andrey <Andrey.Grodzovsky at amd.com>; Chen, Guchun
> > <Guchun.Chen at amd.com>
> > *Subject:* Re: [PATCH] drm/amdgpu: Fix an error message in rmmod
> > I have no objection to the change. It restores the sequence that was
> > used before e9669fb78262. But I don't understand why GFX_OFF is causing
> > a preemption error during module unload, but not when KFD is in normal
> > use. Maybe it's because of the compute power profile that's normally set
> > by amdgpu_amdkfd_set_compute_idle before we interact with the HWS.
> >
> >
> > Either way, the patch is
> >
> > Acked-by: Felix Kuehling <Felix.Kuehling at amd.com>
> >
> >
> >
> > On 2022-01-25 at 05:48, Wang, Yang(Kevin) wrote:
> > >
> > > [AMD Official Use Only]
> > >
> > >
> > > The issue was introduced by the following patch, so it would be
> > > better to add this tag:
> > > Fixes: e9669fb78262 ("drm/amdgpu: Add early fini callback")
> > >
> > > Reviewed-by: Yang Wang <kevinyang.wang at amd.com>
> > >
> > > Best Regards,
> > > Kevin
> > >
> > > 
> ------------------------------------------------------------------------
> > > *From:* amd-gfx <amd-gfx-bounces at lists.freedesktop.org> on behalf of
> > > Tianci Yin <tianci.yin at amd.com>
> > > *Sent:* Tuesday, January 25, 2022 6:03 PM
> > > *To:* amd-gfx at lists.freedesktop.org <amd-gfx at lists.freedesktop.org>
> > > *Cc:* Grodzovsky, Andrey <Andrey.Grodzovsky at amd.com>; Yin, Tianci
> > > (Rico) <Tianci.Yin at amd.com>; Chen, Guchun <Guchun.Chen at amd.com>
> > > *Subject:* [PATCH] drm/amdgpu: Fix an error message in rmmod
> > > From: "Tianci.Yin" <tianci.yin at amd.com>
> > >
> > > [why]
> > > During rmmod, KFD sends the CP a dequeue request, but the
> > > request gets no response, so the error message "cp
> > > queue pipe 4 queue 0 preemption failed" is printed.
> > >
> > > [how]
> > > Suspending KFD after disabling gfxoff fixes it.
> > >
> > > Change-Id: I0453f28820542d4a5ab26e38fb5b87ed76ce6930
> > > Signed-off-by: Tianci.Yin <tianci.yin at amd.com>
> > > ---
> > >  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 4 ++--
> > >  1 file changed, 2 insertions(+), 2 deletions(-)
> > >
> > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > > b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > > index b75d67f644e5..77e9837ba342 100644
> > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > > @@ -2720,11 +2720,11 @@ static int amdgpu_device_ip_fini_early(struct
> > > amdgpu_device *adev)
> > >                  }
> > >          }
> > >
> > > -       amdgpu_amdkfd_suspend(adev, false);
> > > -
> > >          amdgpu_device_set_pg_state(adev, AMD_PG_STATE_UNGATE);
> > >          amdgpu_device_set_cg_state(adev, AMD_CG_STATE_UNGATE);
> > >
> > > +       amdgpu_amdkfd_suspend(adev, false);
> > > +
> > >          /* Workaroud for ASICs need to disable SMC first */
> > >          amdgpu_device_smu_fini_early(adev);
> > >
> > > --
> > > 2.25.1
> > >
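
As a rough illustration of why the reordering in the patch above
matters, here is a user-space toy model of the two call orders. It
assumes, as described earlier in the thread, that register writes are
lost and reads return 0xffffffff while GFX is gated, and that the
device is in gfxoff when teardown starts. All names (toy_dev,
toy_ungate, toy_kfd_suspend and so on) are hypothetical stand-ins, not
amdgpu functions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy device state; illustrative only, not an amdgpu structure. */
struct toy_dev {
	bool gfx_gated;    /* true while GFX is in gfxoff */
	bool queue_active; /* a compute queue is still mapped on the CP */
};

/* Assumed behaviour: a gated block reads back as all ones. */
static uint32_t toy_read_hqd_active(const struct toy_dev *d)
{
	if (d->gfx_gated)
		return 0xffffffffu;
	return d->queue_active ? 1u : 0u;
}

/* Assumed behaviour: a register write to a gated block is lost. */
static void toy_write_dequeue_request(struct toy_dev *d)
{
	if (!d->gfx_gated)
		d->queue_active = false;
}

/* Stand-in for the ungate step that the patch moves ahead of the suspend. */
static void toy_ungate(struct toy_dev *d)
{
	d->gfx_gated = false;
}

/* Stand-in for the KFD suspend: 0 on success, -1 for "preemption failed". */
static int toy_kfd_suspend(struct toy_dev *d)
{
	toy_write_dequeue_request(d);
	return toy_read_hqd_active(d) == 0 ? 0 : -1;
}

/* Order before the patch: suspend first, then ungate. */
static int toy_fini_old_order(void)
{
	struct toy_dev d = { .gfx_gated = true, .queue_active = true };
	int rc = toy_kfd_suspend(&d);

	toy_ungate(&d);
	return rc;
}

/* Order after the patch: ungate first, then suspend. */
static int toy_fini_new_order(void)
{
	struct toy_dev d = { .gfx_gated = true, .queue_active = true };

	toy_ungate(&d);
	return toy_kfd_suspend(&d);
}
```

Under these assumptions toy_fini_old_order() reports the failed
preemption and toy_fini_new_order() succeeds. The model does not cover
the doorbell path discussed in the most recent reply.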

