[PATCH] drm/amdgpu/fence: Fix oops due to non-matching drm_sched init/fini

Christian König christian.koenig at amd.com
Wed Feb 1 14:35:52 UTC 2023


Am 01.02.23 um 15:24 schrieb Alex Deucher:
> On Wed, Feb 1, 2023 at 2:18 AM Christian König
> <ckoenig.leichtzumerken at gmail.com> wrote:
>> Hi Guchun,
>>
>> no, that doesn't make any sense at all.
>>
>> The ready flag indicates that the scheduler is fully prepared for hw
>> submissions from userspace and is unrelated to the initialization
>> status. It's set to true after IB testing was successful and only set to
>> false only when a GPU reset fails and we can't get the hardware to work
>> any more.
> That might have been the original intention, but right now sched.ready
> gets set to true when we finish setting up the ring, but before we do
> ring or IB tests.

WHAT? Please not again.

I'm really tired of fixing this over and over again, the meaning of 
ring->sched.ready is to block submissions when a GPU reset fails. AND 
NOTHING ELSE!

The problem is people seem to abuse it and I have to fix it for the 
fourth or fives time now.

I'm going to send out patches,
Christian.

>
> Alex
>
>> Please use sched.ops instead as I suggested before.
>>
>> Regards,
>> Christian.
>>
>> Am 31.01.23 um 14:58 schrieb Chen, Guchun:
>>> Hi Christian,
>>>
>>> Do you think if it makes sense that we can set 'ring->sched.ready' to be true in each ring init, even if before executing/setting up drm_sched_init in amdgpu_device_init_schedulers? As 'ready' is a member of gpu scheduler structure.
>>>
>>> Regards,
>>> Guchun
>>>
>>> -----Original Message-----
>>> From: Koenig, Christian <Christian.Koenig at amd.com>
>>> Sent: Tuesday, January 31, 2023 6:59 PM
>>> To: Chen, Guchun <Guchun.Chen at amd.com>; Alex Deucher <alexdeucher at gmail.com>; Guilherme G. Piccoli <gpiccoli at igalia.com>
>>> Cc: amd-gfx at lists.freedesktop.org; kernel at gpiccoli.net; Pan, Xinhui <Xinhui.Pan at amd.com>; dri-devel at lists.freedesktop.org; Tuikov, Luben <Luben.Tuikov at amd.com>; Limonciello, Mario <Mario.Limonciello at amd.com>; kernel-dev at igalia.com; Deucher, Alexander <Alexander.Deucher at amd.com>
>>> Subject: Re: [PATCH] drm/amdgpu/fence: Fix oops due to non-matching drm_sched init/fini
>>>
>>> Am 31.01.23 um 10:17 schrieb Chen, Guchun:
>>>> Hi Piccoli,
>>>>
>>>> Please ignore my request of full dmesg log. I can reproduce the issue and get the same failure callstack by returning early with an error code prior to amdgpu_device_init_schedulers.
>>>>
>>>> Regards,
>>>> Guchun
>>>>
>>>> -----Original Message-----
>>>> From: Chen, Guchun
>>>> Sent: Tuesday, January 31, 2023 2:37 PM
>>>> To: Alex Deucher <alexdeucher at gmail.com>; Guilherme G. Piccoli
>>>> <gpiccoli at igalia.com>
>>>> Cc: amd-gfx at lists.freedesktop.org; kernel at gpiccoli.net; Pan, Xinhui
>>>> <Xinhui.Pan at amd.com>; dri-devel at lists.freedesktop.org; Tuikov, Luben
>>>> <Luben.Tuikov at amd.com>; Limonciello, Mario
>>>> <Mario.Limonciello at amd.com>; kernel-dev at igalia.com; Deucher, Alexander
>>>> <Alexander.Deucher at amd.com>; Koenig, Christian
>>>> <Christian.Koenig at amd.com>
>>>> Subject: RE: [PATCH] drm/amdgpu/fence: Fix oops due to non-matching
>>>> drm_sched init/fini
>>>>
>>>> Hi Piccoli,
>>>>
>>>> I agree with Alex's point, using ring->sched.name for such check is not a good way. BTW, can you please attach a full dmesg long in bad case to help me understand more?
>>>>
>>>> Regards,
>>>> Guchun
>>>>
>>>> -----Original Message-----
>>>> From: Alex Deucher <alexdeucher at gmail.com>
>>>> Sent: Tuesday, January 31, 2023 6:30 AM
>>>> To: Guilherme G. Piccoli <gpiccoli at igalia.com>
>>>> Cc: amd-gfx at lists.freedesktop.org; kernel at gpiccoli.net; Chen, Guchun
>>>> <Guchun.Chen at amd.com>; Pan, Xinhui <Xinhui.Pan at amd.com>;
>>>> dri-devel at lists.freedesktop.org; Tuikov, Luben <Luben.Tuikov at amd.com>;
>>>> Limonciello, Mario <Mario.Limonciello at amd.com>; kernel-dev at igalia.com;
>>>> Deucher, Alexander <Alexander.Deucher at amd.com>; Koenig, Christian
>>>> <Christian.Koenig at amd.com>
>>>> Subject: Re: [PATCH] drm/amdgpu/fence: Fix oops due to non-matching
>>>> drm_sched init/fini
>>>>
>>>> On Mon, Jan 30, 2023 at 4:51 PM Guilherme G. Piccoli <gpiccoli at igalia.com> wrote:
>>>>> + Luben
>>>>>
>>>>> (sorry, missed that in the first submission).
>>>>>
>>>>> On 30/01/2023 18:45, Guilherme G. Piccoli wrote:
>>>>>> Currently amdgpu calls drm_sched_fini() from the fence driver sw
>>>>>> fini routine - such function is expected to be called only after the
>>>>>> respective init function - drm_sched_init() - was executed successfully.
>>>>>>
>>>>>> Happens that we faced a driver probe failure in the Steam Deck
>>>>>> recently, and the function drm_sched_fini() was called even without
>>>>>> its counter-part had been previously called, causing the following oops:
>>>>>>
>>>>>> amdgpu: probe of 0000:04:00.0 failed with error -110
>>>>>> BUG: kernel NULL pointer dereference, address: 0000000000000090 PGD
>>>>>> 0 P4D 0
>>>>>> Oops: 0002 [#1] PREEMPT SMP NOPTI
>>>>>> CPU: 0 PID: 609 Comm: systemd-udevd Not tainted 6.2.0-rc3-gpiccoli
>>>>>> #338 Hardware name: Valve Jupiter/Jupiter, BIOS F7A0113 11/04/2022
>>>>>> RIP: 0010:drm_sched_fini+0x84/0xa0 [gpu_sched] [...] Call Trace:
>>>>>>     <TASK>
>>>>>>     amdgpu_fence_driver_sw_fini+0xc8/0xd0 [amdgpu]
>>>>>>     amdgpu_device_fini_sw+0x2b/0x3b0 [amdgpu]
>>>>>>     amdgpu_driver_release_kms+0x16/0x30 [amdgpu]
>>>>>>     devm_drm_dev_init_release+0x49/0x70
>>>>>>     [...]
>>>>>>
>>>>>> To prevent that, check if the drm_sched was properly initialized for
>>>>>> a given ring before calling its fini counter-part.
>>>>>>
>>>>>> Notice ideally we'd use sched.ready for that; such field is set as
>>>>>> the latest thing on drm_sched_init(). But amdgpu seems to "override"
>>>>>> the meaning of such field - in the above oops for example, it was a
>>>>>> GFX ring causing the crash, and the sched.ready field was set to
>>>>>> true in the ring init routine, regardless of the state of the DRM scheduler. Hence, we ended-up using another sched field.
>>>>>>>> Fixes: 067f44c8b459 ("drm/amdgpu: avoid over-handle of fence
>>>>>>>> driver fini in s3 test (v2)")
>>>>>> Cc: Andrey Grodzovsky <andrey.grodzovsky at amd.com>
>>>>>> Cc: Guchun Chen <guchun.chen at amd.com>
>>>>>> Cc: Mario Limonciello <mario.limonciello at amd.com>
>>>>>> Signed-off-by: Guilherme G. Piccoli <gpiccoli at igalia.com>
>>>>>> ---
>>>>>>
>>>>>>
>>>>>> Hi folks, first of all thanks in advance for reviews / comments!
>>>>>> Notice that I've used the Fixes tag more in the sense to bring it to
>>>>>> stable, I didn't find a good patch candidate that added the call to
>>>>>> drm_sched_fini(), was reaching way too old commits...so
>>>>>> 067f44c8b459 seems a good candidate - or maybe not?
>>>>>>
>>>>>> Now, with regards sched.ready, spent a bit of time to figure what
>>>>>> was happening...would be feasible maybe to stop using that to mark
>>>>>> some kind ring status? I think it should be possible to add a flag
>>>>>> to the ring structure for that, and free sched.ready from being
>>>>>> manipulate by the amdgpu driver, what's your thoughts on that?
>>>> It's been a while, but IIRC, we used to have a ring->ready field in the driver which at some point got migrated out of the driver into the GPU scheduler and the driver side code never got cleaned up.  I think we should probably just drop the driver messing with that field and leave it up to the drm scheduler.
>>> To use ring->ready is not a good idea since this doesn't specify if the scheduler was initialized, but rather if bringup of the engine worked.
>>>
>>> It's perfectly possible that the scheduler was initialized, but then the IB test failed later on.
>>>
>>> I strongly suggest to use scheduler->ops instead since those are set during init and mandatory for correct operation.
>>>
>>> Christian.
>>>
>>>> Alex
>>>>
>>>>
>>>>>> I could try myself, but first of course I'd like to raise the
>>>>>> "temperature" on this topic and check if somebody is already working
>>>>>> on that.
>>>>>>
>>>>>> Cheers,
>>>>>>
>>>>>> Guilherme
>>>>>>
>>>>>>
>>>>>>     drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c | 8 +++++++-
>>>>>>     1 file changed, 7 insertions(+), 1 deletion(-)
>>>>>>
>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>>>>>> index 00444203220d..e154eb8241fb 100644
>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>>>>>> @@ -618,7 +618,13 @@ void amdgpu_fence_driver_sw_fini(struct amdgpu_device *adev)
>>>>>>                  if (!ring || !ring->fence_drv.initialized)
>>>>>>                          continue;
>>>>>>
>>>>>> -             if (!ring->no_scheduler)
>>>>>> +             /*
>>>>>> +              * Notice we check for sched.name since there's some
>>>>>> +              * override on the meaning of sched.ready by amdgpu.
>>>>>> +              * The natural check would be sched.ready, which is
>>>>>> +              * set as drm_sched_init() finishes...
>>>>>> +              */
>>>>>> +             if (!ring->no_scheduler && ring->sched.name)
>>>>>>                          drm_sched_fini(&ring->sched);
>>>>>>
>>>>>>                  for (j = 0; j <= ring->fence_drv.num_fences_mask;
>>>>>> ++j)



More information about the amd-gfx mailing list