[PATCH] drm/amdgpu: Fix an error message in rmmod

Thu Jan 27 03:38:59 UTC 2022

[AMD Official Use Only]

The rmmod operation has two prerequisites: the multi-user target and a
blacklisted amdgpu. Both are IGT requirements, so that IGT can make itself
DRM master to test KMS. The test is run with:
igt-gpu-tools/build/tests/amdgpu/amd_module_load --run-subtest reload

From my understanding, a KFD process exits gfxoff in the regular way, in which a doorbell write triggers the gfxoff exit. For example, when KFD maps an HCQ through a command on the HIQ or KIQ ring, or when the UMD commits jobs on an HCQ, both trigger a doorbell write (please refer to gfx_v10_0_ring_set_wptr_compute()).

As for the IGT reload test, the dequeue request is not sent through a command on a ring. It writes the CP registers directly, so the GFX core remains in gfxoff.
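
A minimal sketch of the behaviour described above, as a toy model rather than real amdgpu code. Every name in it (struct gfx_model and the model_* functions) is hypothetical and exists only for illustration; the all-ones register value is taken from the log quoted below, and treating it as "block powered off" is an assumption of the model. It shows that a dequeue done by a direct register write fails while the core is in gfxoff, and succeeds once gfxoff has been left, either through a doorbell write or by disabling gfxoff first.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model, not amdgpu code: two flags stand in for the hardware. */
struct gfx_model {
    bool gfx_off;      /* true while the GFX core (and the CP) is powered down */
    bool queue_active; /* true while a compute queue is still mapped */
};

/* Assumption: a read from the powered-off block returns all ones,
 * which is what the register dump in the log looks like. */
static uint32_t model_reg_read(const struct gfx_model *m)
{
    if (m->gfx_off)
        return 0xffffffffu;
    return m->queue_active ? 1u : 0u;
}

/* Regular path: a doorbell write makes the core leave gfxoff. */
static void model_doorbell_write(struct gfx_model *m)
{
    m->gfx_off = false;
}

/* Stand-in for asking the SMU to disable GFX power gating. */
static void model_disable_gfxoff(struct gfx_model *m)
{
    m->gfx_off = false;
}

/* Dequeue by direct CP register write: has no effect while the CP is off. */
static int model_dequeue_by_reg_write(struct gfx_model *m)
{
    if (m->gfx_off)
        return -1; /* models the "preemption failed" case */
    m->queue_active = false;
    return 0;
}

/* Teardown in either order; returns the result of the dequeue request. */
static int model_fini_early(struct gfx_model *m, bool ungate_first)
{
    int ret;

    if (ungate_first)
        model_disable_gfxoff(m);
    ret = model_dequeue_by_reg_write(m);
    if (!ungate_first)
        model_disable_gfxoff(m);
    return ret;
}
```

In this model, model_fini_early(&m, false) corresponds to suspending KFD before ungating and returns -1 when the core starts in gfxoff, while model_fini_early(&m, true) corresponds to the order in the patch and returns 0.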

Thanks,
Rico

________________________________
From: Kuehling, Felix <Felix.Kuehling at amd.com>
Sent: Wednesday, January 26, 2022 23:08
To: Yin, Tianci (Rico) <Tianci.Yin at amd.com>; Wang, Yang(Kevin) <KevinYang.Wang at amd.com>; amd-gfx at lists.freedesktop.org <amd-gfx at lists.freedesktop.org>
Cc: Grodzovsky, Andrey <Andrey.Grodzovsky at amd.com>; Chen, Guchun <Guchun.Chen at amd.com>
Subject: Re: [PATCH] drm/amdgpu: Fix an error message in rmmod

My question is, why is this problem only seen during module unload? Why
aren't we seeing HWS hangs due to GFX_OFF all the time in normal
operations? For example when the GPU is idle and a new KFD process is
started, creating a new runlist. Are we just getting lucky because the
process first has to allocate some memory, which maybe makes some HW
access (flushing TLBs etc.) that wakes up the GPU?

Regards,
   Felix

On 2022-01-26 at 01:43, Yin, Tianci (Rico) wrote:
>
> [AMD Official Use Only]
>
>
> Thanks Kevin and Felix!
>
> In the gfxoff state, the dequeue request (made by writing CP registers)
> cannot trigger a gfxoff exit. The CP is actually powered off, so the CP
> register write is invalid. Either a doorbell register write (the regular
> way) or a direct request to the SMU to disable GFX power gating (by
> invoking amdgpu_gfx_off_ctrl) can trigger a gfxoff exit.
>
> I have also tried
> amdgpu_dpm_switch_power_profile(adev,PP_SMC_POWER_PROFILE_COMPUTE,false),
> but it has no effect.
>
> [10386.162273] amdgpu: cp queue pipe 4 queue 0 preemption failed
> [10671.225065] amdgpu: mmCP_HQD_ACTIVE : 0xffffffff
> [10386.162290] amdgpu: mmCP_HQD_HQ_STATUS0 : 0xffffffff
> [10386.162297] amdgpu: mmCP_STAT : 0xffffffff
> [10386.162303] amdgpu: mmCP_BUSY_STAT : 0xffffffff
> [10386.162308] amdgpu: mmRLC_STAT : 0xffffffff
> [10386.162314] amdgpu: mmGRBM_STATUS : 0xffffffff
> [10386.162320] amdgpu: mmGRBM_STATUS2: 0xffffffff
>
> Thanks again!
> Rico
> ------------------------------------------------------------------------
> *From:* Kuehling, Felix <Felix.Kuehling at amd.com>
> *Sent:* Tuesday, January 25, 2022 23:31
> *To:* Wang, Yang(Kevin) <KevinYang.Wang at amd.com>; Yin, Tianci (Rico)
> <Tianci.Yin at amd.com>; amd-gfx at lists.freedesktop.org
> <amd-gfx at lists.freedesktop.org>
> *Cc:* Grodzovsky, Andrey <Andrey.Grodzovsky at amd.com>; Chen, Guchun
> <Guchun.Chen at amd.com>
> *Subject:* Re: [PATCH] drm/amdgpu: Fix an error message in rmmod
> I have no objection to the change. It restores the sequence that was
> used before e9669fb78262. But I don't understand why GFX_OFF is causing
> a preemption error during module unload, but not when KFD is in normal
> use. Maybe it's because of the compute power profile that's normally set
> by amdgpu_amdkfd_set_compute_idle before we interact with the HWS.
>
>
> Either way, the patch is
>
> Acked-by: Felix Kuehling <Felix.Kuehling at amd.com>
>
>
>
> On 2022-01-25 at 05:48, Wang, Yang(Kevin) wrote:
> >
> > [AMD Official Use Only]
> >
> >
> > The issue was introduced by the following patch, so it would be better
> > to add this information:
> > Fixes: e9669fb78262 ("drm/amdgpu: Add early fini callback")
> >
> > Reviewed-by: Yang Wang <kevinyang.wang at amd.com>
> >
> > Best Regards,
> > Kevin
> >
> > ------------------------------------------------------------------------
> > *From:* amd-gfx <amd-gfx-bounces at lists.freedesktop.org> on behalf of
> > Tianci Yin <tianci.yin at amd.com>
> > *Sent:* Tuesday, January 25, 2022 6:03 PM
> > *To:* amd-gfx at lists.freedesktop.org <amd-gfx at lists.freedesktop.org>
> > *Cc:* Grodzovsky, Andrey <Andrey.Grodzovsky at amd.com>; Yin, Tianci
> > (Rico) <Tianci.Yin at amd.com>; Chen, Guchun <Guchun.Chen at amd.com>
> > *Subject:* [PATCH] drm/amdgpu: Fix an error message in rmmod
> > From: "Tianci.Yin" <tianci.yin at amd.com>
> >
> > [why]
> > During rmmod, kfd sends the cp a dequeue request, but the
> > request gets no response, so the error message "cp
> > queue pipe 4 queue 0 preemption failed" is printed.
> >
> > [how]
> > Suspending kfd after disabling gfxoff fixes it.
> >
> > Change-Id: I0453f28820542d4a5ab26e38fb5b87ed76ce6930
> > Signed-off-by: Tianci.Yin <tianci.yin at amd.com>
> > ---
> >  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 4 ++--
> >  1 file changed, 2 insertions(+), 2 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > index b75d67f644e5..77e9837ba342 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > @@ -2720,11 +2720,11 @@ static int amdgpu_device_ip_fini_early(struct amdgpu_device *adev)
> >                  }
> >          }
> >
> > -       amdgpu_amdkfd_suspend(adev, false);
> > -
> >          amdgpu_device_set_pg_state(adev, AMD_PG_STATE_UNGATE);
> >          amdgpu_device_set_cg_state(adev, AMD_CG_STATE_UNGATE);
> >
> > +       amdgpu_amdkfd_suspend(adev, false);
> > +
> >          /* Workaroud for ASICs need to disable SMC first */
> >          amdgpu_device_smu_fini_early(adev);
> >
> > --
> > 2.25.1
> >