[PATCH] drm/amdgpu/fence: Fix oops due to non-matching drm_sched init/fini

Alex Deucher alexdeucher at gmail.com
Tue Jan 31 17:51:46 UTC 2023


On Tue, Jan 31, 2023 at 5:58 AM Christian König
<christian.koenig at amd.com> wrote:
>
> Am 31.01.23 um 10:17 schrieb Chen, Guchun:
> > Hi Piccoli,
> >
> > Please ignore my request of full dmesg log. I can reproduce the issue and get the same failure callstack by returning early with an error code prior to amdgpu_device_init_schedulers.
> >
> > Regards,
> > Guchun
> >
> > -----Original Message-----
> > From: Chen, Guchun
> > Sent: Tuesday, January 31, 2023 2:37 PM
> > To: Alex Deucher <alexdeucher at gmail.com>; Guilherme G. Piccoli <gpiccoli at igalia.com>
> > Cc: amd-gfx at lists.freedesktop.org; kernel at gpiccoli.net; Pan, Xinhui <Xinhui.Pan at amd.com>; dri-devel at lists.freedesktop.org; Tuikov, Luben <Luben.Tuikov at amd.com>; Limonciello, Mario <Mario.Limonciello at amd.com>; kernel-dev at igalia.com; Deucher, Alexander <Alexander.Deucher at amd.com>; Koenig, Christian <Christian.Koenig at amd.com>
> > Subject: RE: [PATCH] drm/amdgpu/fence: Fix oops due to non-matching drm_sched init/fini
> >
> > Hi Piccoli,
> >
> > I agree with Alex's point, using ring->sched.name for such a check is not a good way. BTW, can you please attach a full dmesg log in the bad case to help me understand more?
> >
> > Regards,
> > Guchun
> >
> > -----Original Message-----
> > From: Alex Deucher <alexdeucher at gmail.com>
> > Sent: Tuesday, January 31, 2023 6:30 AM
> > To: Guilherme G. Piccoli <gpiccoli at igalia.com>
> > Cc: amd-gfx at lists.freedesktop.org; kernel at gpiccoli.net; Chen, Guchun <Guchun.Chen at amd.com>; Pan, Xinhui <Xinhui.Pan at amd.com>; dri-devel at lists.freedesktop.org; Tuikov, Luben <Luben.Tuikov at amd.com>; Limonciello, Mario <Mario.Limonciello at amd.com>; kernel-dev at igalia.com; Deucher, Alexander <Alexander.Deucher at amd.com>; Koenig, Christian <Christian.Koenig at amd.com>
> > Subject: Re: [PATCH] drm/amdgpu/fence: Fix oops due to non-matching drm_sched init/fini
> >
> > On Mon, Jan 30, 2023 at 4:51 PM Guilherme G. Piccoli <gpiccoli at igalia.com> wrote:
> >> + Luben
> >>
> >> (sorry, missed that in the first submission).
> >>
> >> On 30/01/2023 18:45, Guilherme G. Piccoli wrote:
> >>> Currently amdgpu calls drm_sched_fini() from the fence driver sw
> >>> fini routine - this function is expected to be called only after
> >>> its respective init function - drm_sched_init() - has executed
> >>> successfully.
> >>>
> >>> It happens that we recently faced a driver probe failure on the
> >>> Steam Deck, and drm_sched_fini() was called even though its
> >>> counterpart had never been called, causing the following oops:
> >>>
> >>> amdgpu: probe of 0000:04:00.0 failed with error -110
> >>> BUG: kernel NULL pointer dereference, address: 0000000000000090
> >>> PGD 0 P4D 0
> >>> Oops: 0002 [#1] PREEMPT SMP NOPTI
> >>> CPU: 0 PID: 609 Comm: systemd-udevd Not tainted 6.2.0-rc3-gpiccoli #338
> >>> Hardware name: Valve Jupiter/Jupiter, BIOS F7A0113 11/04/2022
> >>> RIP: 0010:drm_sched_fini+0x84/0xa0 [gpu_sched]
> >>> [...]
> >>> Call Trace:
> >>>   <TASK>
> >>>   amdgpu_fence_driver_sw_fini+0xc8/0xd0 [amdgpu]
> >>>   amdgpu_device_fini_sw+0x2b/0x3b0 [amdgpu]
> >>>   amdgpu_driver_release_kms+0x16/0x30 [amdgpu]
> >>>   devm_drm_dev_init_release+0x49/0x70
> >>>   [...]
> >>>
> >>> To prevent that, check if the drm_sched was properly initialized
> >>> for a given ring before calling its fini counterpart.
> >>>
> >>> Notice that ideally we'd use sched.ready for that; this field is
> >>> the last thing set by drm_sched_init(). But amdgpu seems to
> >>> "override" the meaning of this field - in the above oops, for
> >>> example, it was a GFX ring causing the crash, and sched.ready had
> >>> been set to true in the ring init routine, regardless of the state
> >>> of the DRM scheduler. Hence, we ended up using another sched field.
> >>>
> >>> Fixes: 067f44c8b459 ("drm/amdgpu: avoid over-handle of fence
> >>> driver fini in s3 test (v2)")
> >>> Cc: Andrey Grodzovsky <andrey.grodzovsky at amd.com>
> >>> Cc: Guchun Chen <guchun.chen at amd.com>
> >>> Cc: Mario Limonciello <mario.limonciello at amd.com>
> >>> Signed-off-by: Guilherme G. Piccoli <gpiccoli at igalia.com>
> >>> ---
> >>>
> >>>
> >>> Hi folks, first of all thanks in advance for reviews / comments!
> >>> Notice that I've used the Fixes tag more in the sense of bringing
> >>> it to stable; I didn't find a good candidate patch that added the
> >>> call to drm_sched_fini() without reaching way too old commits... so
> >>> 067f44c8b459 seems a good candidate - or maybe not?
> >>>
> >>> Now, with regards to sched.ready, I spent a bit of time figuring
> >>> out what was happening... would it be feasible to stop using that
> >>> field to mark some kind of ring status? I think it should be
> >>> possible to add a flag to the ring structure for that, and free
> >>> sched.ready from being manipulated by the amdgpu driver. What are
> >>> your thoughts on that?
> > It's been a while, but IIRC, we used to have a ring->ready field in the driver which at some point got migrated out of the driver into the GPU scheduler, and the driver-side code never got cleaned up.  I think we should probably just drop the driver messing with that field and leave it up to the drm scheduler.
>
> Using ring->ready is not a good idea since it doesn't indicate whether
> the scheduler was initialized, but rather whether bringup of the engine
> worked.

ring->ready no longer exists.  It was moved into the scheduler and the
driver side never got cleaned up.  Also, we set ring->sched.ready =
true once the rings are set up, but before we actually test them, so
it's currently a proxy for the ring being ready to test/use.
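
To make that ordering concrete, here is a rough sketch (illustrative
function names, not actual amdgpu code) of how the flag ends up set
independently of drm_sched_init():

/* Sketch only: the driver's own bringup path writes sched.ready
 * directly, so the flag stops tracking whether drm_sched_init()
 * ever ran for this ring.
 */
static int example_ring_hw_init(struct amdgpu_ring *ring)
{
	/* ... engine bringup ... */

	/* Set before any IB test, and regardless of whether
	 * drm_sched_init() was ever reached during probe.
	 */
	ring->sched.ready = true;
	return 0;
}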

Alex

>
> It's perfectly possible that the scheduler was initialized, but then the
> IB test failed later on.
>
> I strongly suggest using scheduler->ops instead, since those are set
> during init and are mandatory for correct operation.
>
> Christian.
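
Applied to the hunk quoted further down, that suggestion would look
roughly like the sketch below (untested; the point being that sched.ops
is only assigned inside drm_sched_init() and nothing in the driver
writes it):

	if (!ring->no_scheduler && ring->sched.ops)
		drm_sched_fini(&ring->sched);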
>
> >
> > Alex
> >
> >
> >>> I could try it myself, but first of course I'd like to raise the
> >>> "temperature" on this topic and check whether somebody is already
> >>> working on that.
> >>>
> >>> Cheers,
> >>>
> >>> Guilherme
> >>>
> >>>
> >>>   drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c | 8 +++++++-
> >>>   1 file changed, 7 insertions(+), 1 deletion(-)
> >>>
> >>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> >>> index 00444203220d..e154eb8241fb 100644
> >>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> >>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> >>> @@ -618,7 +618,13 @@ void amdgpu_fence_driver_sw_fini(struct amdgpu_device *adev)
> >>>                if (!ring || !ring->fence_drv.initialized)
> >>>                        continue;
> >>>
> >>> -             if (!ring->no_scheduler)
> >>> +             /*
> >>> +              * Notice we check sched.name since amdgpu overrides
> >>> +              * the meaning of sched.ready. The natural check
> >>> +              * would be sched.ready, which is only set as
> >>> +              * drm_sched_init() finishes...
> >>> +              */
> >>> +             if (!ring->no_scheduler && ring->sched.name)
> >>>                        drm_sched_fini(&ring->sched);
> >>>
> >>>                for (j = 0; j <= ring->fence_drv.num_fences_mask; ++j)
>
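
For reference, the reason fields like sched.name or sched.ops can serve
as an "init happened" marker while sched.ready cannot is the assignment
order inside drm_sched_init() - paraphrased here, not the verbatim
kernel source:

int drm_sched_init(struct drm_gpu_scheduler *sched,
		   const struct drm_sched_backend_ops *ops,
		   /* ... */ const char *name, struct device *dev)
{
	sched->ops = ops;	/* assigned immediately ... */
	sched->name = name;

	/* ... set up run queues and spawn the scheduler thread;
	 * any failure returns before the line below ...
	 */
	sched->ready = true;	/* last thing set, on success only */
	return 0;
}

Since amdgpu additionally writes sched.ready from its own ring code,
only fields that nothing but drm_sched_init() touches are trustworthy
for the check in amdgpu_fence_driver_sw_fini().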

