[PATCH] drm/amdgpu/fence: Fix oops due to non-matching drm_sched init/fini

Guilherme G. Piccoli gpiccoli at igalia.com
Mon Jan 30 21:51:09 UTC 2023


+ Luben

(sorry, missed that in the first submission).

On 30/01/2023 18:45, Guilherme G. Piccoli wrote:
> Currently amdgpu calls drm_sched_fini() from the fence driver sw fini
> routine - such function is expected to be called only after the
> respective init function - drm_sched_init() - was executed successfully.
> 
> Happens that we faced a driver probe failure in the Steam Deck
> recently, and the function drm_sched_fini() was called even without
> its counter-part had been previously called, causing the following oops:
> 
> amdgpu: probe of 0000:04:00.0 failed with error -110
> BUG: kernel NULL pointer dereference, address: 0000000000000090
> PGD 0 P4D 0
> Oops: 0002 [#1] PREEMPT SMP NOPTI
> CPU: 0 PID: 609 Comm: systemd-udevd Not tainted 6.2.0-rc3-gpiccoli #338
> Hardware name: Valve Jupiter/Jupiter, BIOS F7A0113 11/04/2022
> RIP: 0010:drm_sched_fini+0x84/0xa0 [gpu_sched]
> [...]
> Call Trace:
>  <TASK>
>  amdgpu_fence_driver_sw_fini+0xc8/0xd0 [amdgpu]
>  amdgpu_device_fini_sw+0x2b/0x3b0 [amdgpu]
>  amdgpu_driver_release_kms+0x16/0x30 [amdgpu]
>  devm_drm_dev_init_release+0x49/0x70
>  [...]
> 
> To prevent that, check if the drm_sched was properly initialized for a
> given ring before calling its fini counter-part.
> 
> Notice ideally we'd use sched.ready for that; such field is set as the latest
> thing on drm_sched_init(). But amdgpu seems to "override" the meaning of such
> field - in the above oops for example, it was a GFX ring causing the crash, and
> the sched.ready field was set to true in the ring init routine, regardless of
> the state of the DRM scheduler. Hence, we ended-up using another sched field.
> 
> Fixes: 067f44c8b459 ("drm/amdgpu: avoid over-handle of fence driver fini in s3 test (v2)")
> Cc: Andrey Grodzovsky <andrey.grodzovsky at amd.com>
> Cc: Guchun Chen <guchun.chen at amd.com>
> Cc: Mario Limonciello <mario.limonciello at amd.com>
> Signed-off-by: Guilherme G. Piccoli <gpiccoli at igalia.com>
> ---
> 
> 
> Hi folks, first of all thanks in advance for reviews / comments!
> Notice that I've used the Fixes tag more in the sense to bring it
> to stable, I didn't find a good patch candidate that added the
> call to drm_sched_fini(), was reaching way too old commits...so
> 067f44c8b459 seems a good candidate - or maybe not?
> 
> Now, with regards sched.ready, spent a bit of time to figure what
> was happening...would be feasible maybe to stop using that to
> mark some kind ring status? I think it should be possible to add a
> flag to the ring structure for that, and free sched.ready from
> being manipulate by the amdgpu driver, what's your thoughts on that?
> 
> I could try myself, but first of course I'd like to raise the
> "temperature" on this topic and check if somebody is already working
> on that.
> 
> Cheers,
> 
> Guilherme
> 
> 
>  drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c | 8 +++++++-
>  1 file changed, 7 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> index 00444203220d..e154eb8241fb 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> @@ -618,7 +618,13 @@ void amdgpu_fence_driver_sw_fini(struct amdgpu_device *adev)
>  		if (!ring || !ring->fence_drv.initialized)
>  			continue;
>  
> -		if (!ring->no_scheduler)
> +		/*
> +		 * Notice we check for sched.name since there's some
> +		 * override on the meaning of sched.ready by amdgpu.
> +		 * The natural check would be sched.ready, which is
> +		 * set as drm_sched_init() finishes...
> +		 */
> +		if (!ring->no_scheduler && ring->sched.name)
>  			drm_sched_fini(&ring->sched);
>  
>  		for (j = 0; j <= ring->fence_drv.num_fences_mask; ++j)


More information about the dri-devel mailing list