[PATCH] drm/sched: Check scheduler work queue before calling timeout handling

Luben Tuikov luben.tuikov at amd.com
Wed May 10 14:19:12 UTC 2023


On 2023-05-10 09:51, vitaly.prosyak at amd.com wrote:
> From: Vitaly Prosyak <vitaly.prosyak at amd.com>
> 
> During an IGT GPU reset test we see again oops despite of
> commit 0c8c901aaaebc9 (drm/sched: Check scheduler ready before calling
> timeout handling).
> 
> It uses ready condition whether to call drm_sched_fault which unwind
> the TDR leads to GPU reset.
> However it looks the ready condition is overloaded with other meanings,
> for example, for the following stack is related GPU reset :
> 
> 0  gfx_v9_0_cp_gfx_start
> 1  gfx_v9_0_cp_gfx_resume
> 2  gfx_v9_0_cp_resume
> 3  gfx_v9_0_hw_init
> 4  gfx_v9_0_resume
> 5  amdgpu_device_ip_resume_phase2
> 
> does the following:
> 	/* start the ring */
> 	gfx_v9_0_cp_gfx_start(adev);
> 	ring->sched.ready = true;
> 
> The same approach is for other ASICs as well :
> gfx_v8_0_cp_gfx_resume
> gfx_v10_0_kiq_resume, etc...
> 
> As a result, our GPU reset test causes GPU fault which calls unconditionally gfx_v9_0_fault
> and then drm_sched_fault. However now it depends on whether the interrupt service routine
> drm_sched_fault is executed after gfx_v9_0_cp_gfx_start is completed which sets the ready
> field of the scheduler to true even  for uninitialized schedulers and causes oops vs
> no fault or when ISR  drm_sched_fault is completed prior  gfx_v9_0_cp_gfx_start and
> NULL pointer dereference does not occur.
> 
> Use the field timeout_wq  to prevent oops for uninitialized schedulers.
> The field could be initialized by the work queue of resetting the domain.
> 
> Fixes: 0c8c901aaaebc9 ("drm/sched: Check scheduler ready before calling timeout handling")
> 
> v1: Corrections to commit message (Luben)
> Signed-off-by: Vitaly Prosyak <vitaly.prosyak at amd.com>
> Reviewed-by: Luben Tuikov <luben.tuikov at amd.com>

I didn't give my RB to this patch so I'm not sure what it is doing here.

The fixes tag should be before the SOB tag, and the v1 line should be separated
by a line before the Git tags.

Since this is a good patch and I want it in both drm-misc-fixed and amd-staging-drm-next,
I'll submit it to drm-misc-fixed with a Link: and RB/SOB tag there and then cherry-pick
that into amd-staging-drm-next.

Don't push it to amd-staging-drm-next.

I'll fix this and submit to amd-staging-drm-next and to drm-misc-fixed with
a Link: tag.

Regards,
Luben


> ---
>  drivers/gpu/drm/scheduler/sched_main.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
> index 649fac2e1ccb..670b7997f389 100644
> --- a/drivers/gpu/drm/scheduler/sched_main.c
> +++ b/drivers/gpu/drm/scheduler/sched_main.c
> @@ -308,7 +308,7 @@ static void drm_sched_start_timeout(struct drm_gpu_scheduler *sched)
>   */
>  void drm_sched_fault(struct drm_gpu_scheduler *sched)
>  {
> -	if (sched->ready)
> +	if (sched->timeout_wq)
>  		mod_delayed_work(sched->timeout_wq, &sched->work_tdr, 0);
>  }
>  EXPORT_SYMBOL(drm_sched_fault);



More information about the amd-gfx mailing list