[PATCH] drm/sched: Check scheduler work queue before calling timeout handling
Alex Deucher
alexdeucher at gmail.com
Tue May 9 21:56:35 UTC 2023
+ dri-devel
On Tue, May 9, 2023 at 5:43 PM <vitaly.prosyak at amd.com> wrote:
>
> From: Vitaly Prosyak <vitaly.prosyak at amd.com>
>
> During an IGT GPU reset test we see again oops despite of
> commit 0c8c901aaaebc9bf8bf189ffc116e678f7a2dc16
> drm/sched: Check scheduler ready before calling timeout handling.
>
> It uses ready condition whether to call drm_sched_fault which unwind
> the TDR leads to GPU reset.
> However it looks the ready condition is overloaded with other meanings,
> for example, for the following stack is related GPU reset :
>
> 0 gfx_v9_0_cp_gfx_start
> 1 gfx_v9_0_cp_gfx_resume
> 2 gfx_v9_0_cp_resume
> 3 gfx_v9_0_hw_init
> 4 gfx_v9_0_resume
> 5 amdgpu_device_ip_resume_phase2
>
> does the following:
> /* start the ring */
> gfx_v9_0_cp_gfx_start(adev);
> ring->sched.ready = true;
>
> The same approach is for other ASICs as well :
> gfx_v8_0_cp_gfx_resume
> gfx_v10_0_kiq_resume, etc...
>
> As a result, our GPU reset test causes GPU fault which calls unconditionally gfx_v9_0_fault
> and then drm_sched_fault. However now it depends on whether the interrupt service routine
> drm_sched_fault is executed after gfx_v9_0_cp_gfx_start is completed which sets the ready
> field of the scheduler to true even for not initialized schedulers and causes oops vs
> no fault or when ISR drm_sched_fault is completed prior gfx_v9_0_cp_gfx_start and
> NULL pointer dereference does not occur.
>
> Use the field timeout_wq to prevent oops for uninitialized schedulers.
> The field could be initialized by the work queue of resetting the domain.
>
> Signed-off-by: Vitaly Prosyak <vitaly.prosyak at amd.com>
> ---
> drivers/gpu/drm/scheduler/sched_main.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
> index 649fac2e1ccb..670b7997f389 100644
> --- a/drivers/gpu/drm/scheduler/sched_main.c
> +++ b/drivers/gpu/drm/scheduler/sched_main.c
> @@ -308,7 +308,7 @@ static void drm_sched_start_timeout(struct drm_gpu_scheduler *sched)
> */
> void drm_sched_fault(struct drm_gpu_scheduler *sched)
> {
> - if (sched->ready)
> + if (sched->timeout_wq)
> mod_delayed_work(sched->timeout_wq, &sched->work_tdr, 0);
> }
> EXPORT_SYMBOL(drm_sched_fault);
> --
> 2.25.1
>
More information about the amd-gfx
mailing list