[PATCH] drm/amdgpu/fence: Fix oops due to non-matching drm_sched init/fini
Chen, Guchun
Guchun.Chen at amd.com
Tue Jan 31 13:58:42 UTC 2023
Hi Christian,
Do you think it would make sense to set 'ring->sched.ready' to true in each ring init, even before drm_sched_init is executed in amdgpu_device_init_schedulers? I ask because 'ready' is a member of the GPU scheduler structure.
Regards,
Guchun
-----Original Message-----
From: Koenig, Christian <Christian.Koenig at amd.com>
Sent: Tuesday, January 31, 2023 6:59 PM
To: Chen, Guchun <Guchun.Chen at amd.com>; Alex Deucher <alexdeucher at gmail.com>; Guilherme G. Piccoli <gpiccoli at igalia.com>
Cc: amd-gfx at lists.freedesktop.org; kernel at gpiccoli.net; Pan, Xinhui <Xinhui.Pan at amd.com>; dri-devel at lists.freedesktop.org; Tuikov, Luben <Luben.Tuikov at amd.com>; Limonciello, Mario <Mario.Limonciello at amd.com>; kernel-dev at igalia.com; Deucher, Alexander <Alexander.Deucher at amd.com>
Subject: Re: [PATCH] drm/amdgpu/fence: Fix oops due to non-matching drm_sched init/fini
Am 31.01.23 um 10:17 schrieb Chen, Guchun:
> Hi Piccoli,
>
> Please ignore my request for the full dmesg log. I can reproduce the issue and get the same failure call stack by returning early with an error code prior to amdgpu_device_init_schedulers.
>
> Regards,
> Guchun
>
> -----Original Message-----
> From: Chen, Guchun
> Sent: Tuesday, January 31, 2023 2:37 PM
> To: Alex Deucher <alexdeucher at gmail.com>; Guilherme G. Piccoli
> <gpiccoli at igalia.com>
> Cc: amd-gfx at lists.freedesktop.org; kernel at gpiccoli.net; Pan, Xinhui
> <Xinhui.Pan at amd.com>; dri-devel at lists.freedesktop.org; Tuikov, Luben
> <Luben.Tuikov at amd.com>; Limonciello, Mario
> <Mario.Limonciello at amd.com>; kernel-dev at igalia.com; Deucher, Alexander
> <Alexander.Deucher at amd.com>; Koenig, Christian
> <Christian.Koenig at amd.com>
> Subject: RE: [PATCH] drm/amdgpu/fence: Fix oops due to non-matching
> drm_sched init/fini
>
> Hi Piccoli,
>
> I agree with Alex's point: using ring->sched.name for such a check is not a good approach. BTW, could you please attach a full dmesg log from the bad case to help me understand more?
>
> Regards,
> Guchun
>
> -----Original Message-----
> From: Alex Deucher <alexdeucher at gmail.com>
> Sent: Tuesday, January 31, 2023 6:30 AM
> To: Guilherme G. Piccoli <gpiccoli at igalia.com>
> Cc: amd-gfx at lists.freedesktop.org; kernel at gpiccoli.net; Chen, Guchun
> <Guchun.Chen at amd.com>; Pan, Xinhui <Xinhui.Pan at amd.com>;
> dri-devel at lists.freedesktop.org; Tuikov, Luben <Luben.Tuikov at amd.com>;
> Limonciello, Mario <Mario.Limonciello at amd.com>; kernel-dev at igalia.com;
> Deucher, Alexander <Alexander.Deucher at amd.com>; Koenig, Christian
> <Christian.Koenig at amd.com>
> Subject: Re: [PATCH] drm/amdgpu/fence: Fix oops due to non-matching
> drm_sched init/fini
>
> On Mon, Jan 30, 2023 at 4:51 PM Guilherme G. Piccoli <gpiccoli at igalia.com> wrote:
>> + Luben
>>
>> (sorry, missed that in the first submission).
>>
>> On 30/01/2023 18:45, Guilherme G. Piccoli wrote:
>>> Currently amdgpu calls drm_sched_fini() from the fence driver sw
>>> fini routine - such function is expected to be called only after the
>>> respective init function - drm_sched_init() - was executed successfully.
>>>
>>> It happens that we faced a driver probe failure on the Steam Deck
>>> recently, and drm_sched_fini() was called even though its
>>> counterpart had not been called beforehand, causing the following oops:
>>>
>>> amdgpu: probe of 0000:04:00.0 failed with error -110
>>> BUG: kernel NULL pointer dereference, address: 0000000000000090
>>> PGD 0 P4D 0
>>> Oops: 0002 [#1] PREEMPT SMP NOPTI
>>> CPU: 0 PID: 609 Comm: systemd-udevd Not tainted 6.2.0-rc3-gpiccoli #338
>>> Hardware name: Valve Jupiter/Jupiter, BIOS F7A0113 11/04/2022
>>> RIP: 0010:drm_sched_fini+0x84/0xa0 [gpu_sched]
>>> [...]
>>> Call Trace:
>>> <TASK>
>>> amdgpu_fence_driver_sw_fini+0xc8/0xd0 [amdgpu]
>>> amdgpu_device_fini_sw+0x2b/0x3b0 [amdgpu]
>>> amdgpu_driver_release_kms+0x16/0x30 [amdgpu]
>>> devm_drm_dev_init_release+0x49/0x70
>>> [...]
>>>
>>> To prevent that, check if the drm_sched was properly initialized for
>>> a given ring before calling its fini counter-part.
>>>
>>> Notice that ideally we'd use sched.ready for that; this field is set
>>> as the last step of drm_sched_init(). But amdgpu seems to "override"
>>> the meaning of that field - in the above oops, for example, it was a
>>> GFX ring causing the crash, and the sched.ready field was set to
>>> true in the ring init routine, regardless of the state of the DRM
>>> scheduler. Hence, we ended up using another sched field.
>>>
>>> Fixes: 067f44c8b459 ("drm/amdgpu: avoid over-handle of fence driver fini in s3 test (v2)")
>>> Cc: Andrey Grodzovsky <andrey.grodzovsky at amd.com>
>>> Cc: Guchun Chen <guchun.chen at amd.com>
>>> Cc: Mario Limonciello <mario.limonciello at amd.com>
>>> Signed-off-by: Guilherme G. Piccoli <gpiccoli at igalia.com>
>>> ---
>>>
>>>
>>> Hi folks, first of all thanks in advance for reviews / comments!
>>> Notice that I've used the Fixes tag more in the sense to bring it to
>>> stable, I didn't find a good patch candidate that added the call to
>>> drm_sched_fini(), was reaching way too old commits...so
>>> 067f44c8b459 seems a good candidate - or maybe not?
>>>
>>> Now, with regard to sched.ready, I spent a bit of time figuring out
>>> what was happening... would it be feasible to stop using it to mark
>>> some kind of ring status? I think it should be possible to add a flag
>>> to the ring structure for that, and free sched.ready from being
>>> manipulated by the amdgpu driver. What are your thoughts on that?
> It's been a while, but IIRC, we used to have a ring->ready field in the driver which at some point got migrated out of the driver into the GPU scheduler and the driver side code never got cleaned up. I think we should probably just drop the driver messing with that field and leave it up to the drm scheduler.
Using ring->ready is not a good idea, since it doesn't indicate whether the scheduler was initialized, but rather whether bring-up of the engine worked.
It's perfectly possible that the scheduler was initialized, but then the IB test failed later on.
I strongly suggest using scheduler->ops instead, since those are set during init and are mandatory for correct operation.
Christian.
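
To illustrate the suggestion above, here is a minimal user-space sketch of guarding the fini call on the ops pointer rather than on the ready flag. Everything prefixed with mock_ is a hypothetical stand-in invented for this example, not the kernel API; the only assumption taken from the discussion is that a successful scheduler init sets ops, while ready may be set by the driver independently of the scheduler state.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel structures; only the fields
 * relevant to this discussion are modelled. */
struct mock_sched_ops { int unused; };

struct mock_sched {
	const struct mock_sched_ops *ops; /* NULL until init has run */
	bool ready;                       /* may be set by the driver on its own */
	int fini_calls;                   /* bookkeeping for the example only */
};

struct mock_ring {
	bool no_scheduler;
	struct mock_sched sched;
};

static const struct mock_sched_ops mock_ops = { 0 };

/* Models a successful scheduler init: ops is assigned, ready set last. */
static void mock_sched_init(struct mock_sched *sched)
{
	sched->ops = &mock_ops;
	sched->ready = true;
}

/* Models fini: calling it on a scheduler that was never initialized is
 * the bug from the report, represented here by an assertion. */
static void mock_sched_fini(struct mock_sched *sched)
{
	assert(sched->ops != NULL);
	sched->fini_calls++;
	sched->ready = false;
}

/* Guarded teardown: returns true only if fini was actually invoked. */
static bool mock_ring_sched_teardown(struct mock_ring *ring)
{
	if (ring->no_scheduler || !ring->sched.ops)
		return false;

	mock_sched_fini(&ring->sched);
	return true;
}
```

In this sketch, a ring whose ready flag was set by the ring init routine but whose scheduler never went through init is skipped by the teardown, which is the case the early probe failure runs into.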
>
> Alex
>
>
>>> I could try myself, but first of course I'd like to raise the
>>> "temperature" on this topic and check if somebody is already working
>>> on that.
>>>
>>> Cheers,
>>>
>>> Guilherme
>>>
>>>
>>> drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c | 8 +++++++-
>>> 1 file changed, 7 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>>> index 00444203220d..e154eb8241fb 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>>> @@ -618,7 +618,13 @@ void amdgpu_fence_driver_sw_fini(struct amdgpu_device *adev)
>>> if (!ring || !ring->fence_drv.initialized)
>>> continue;
>>>
>>> - if (!ring->no_scheduler)
>>> + /*
>>> + * Notice we check for sched.name since there's some
>>> + * override on the meaning of sched.ready by amdgpu.
>>> + * The natural check would be sched.ready, which is
>>> + * set as drm_sched_init() finishes...
>>> + */
>>> + if (!ring->no_scheduler && ring->sched.name)
>>> drm_sched_fini(&ring->sched);
>>>
>>> for (j = 0; j <= ring->fence_drv.num_fences_mask; ++j)