[PATCH] drm/sched: Check scheduler work queue before calling timeout handling
Luben Tuikov
luben.tuikov at amd.com
Wed May 10 10:40:49 UTC 2023
On 2023-05-09 17:43, vitaly.prosyak at amd.com wrote:
> From: Vitaly Prosyak <vitaly.prosyak at amd.com>
>
> During an IGT GPU reset test we see again oops despite of
> commit 0c8c901aaaebc9bf8bf189ffc116e678f7a2dc16
> drm/sched: Check scheduler ready before calling timeout handling.
You can probably use the more succinct fixes line:
0c8c901aaaebc9 ("drm/sched: Check scheduler ready before calling timeout handling")
>
> It uses ready condition whether to call drm_sched_fault which unwind
> the TDR leads to GPU reset.
> However it looks the ready condition is overloaded with other meanings,
> for example, for the following stack is related GPU reset :
>
> 0 gfx_v9_0_cp_gfx_start
> 1 gfx_v9_0_cp_gfx_resume
> 2 gfx_v9_0_cp_resume
> 3 gfx_v9_0_hw_init
> 4 gfx_v9_0_resume
> 5 amdgpu_device_ip_resume_phase2
>
> does the following:
> /* start the ring */
> gfx_v9_0_cp_gfx_start(adev);
> ring->sched.ready = true;
>
> The same approach is for other ASICs as well :
> gfx_v8_0_cp_gfx_resume
> gfx_v10_0_kiq_resume, etc...
>
> As a result, our GPU reset test causes GPU fault which calls unconditionally gfx_v9_0_fault
> and then drm_sched_fault. However now it depends on whether the interrupt service routine
> drm_sched_fault is executed after gfx_v9_0_cp_gfx_start is completed which sets the ready
> field of the scheduler to true even for not initialized schedulers and causes oops vs
"not initialized" --> "uninitialized" reads better.
> no fault or when ISR drm_sched_fault is completed prior gfx_v9_0_cp_gfx_start and
> NULL pointer dereference does not occur.
>
> Use the field timeout_wq to prevent oops for uninitialized schedulers.
> The field could be initialized by the work queue of resetting the domain.
>
> Signed-off-by: Vitaly Prosyak <vitaly.prosyak at amd.com>
Add, a fixes tag,
Fixes: 0c8c901aaaebc9 ("drm/sched: Check scheduler ready before calling timeout handling")
Before the SOB tag.
> ---
> drivers/gpu/drm/scheduler/sched_main.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
> index 649fac2e1ccb..670b7997f389 100644
> --- a/drivers/gpu/drm/scheduler/sched_main.c
> +++ b/drivers/gpu/drm/scheduler/sched_main.c
> @@ -308,7 +308,7 @@ static void drm_sched_start_timeout(struct drm_gpu_scheduler *sched)
> */
> void drm_sched_fault(struct drm_gpu_scheduler *sched)
> {
> - if (sched->ready)
> + if (sched->timeout_wq)
> mod_delayed_work(sched->timeout_wq, &sched->work_tdr, 0);
> }
> EXPORT_SYMBOL(drm_sched_fault);
Yes, this does indeed seem more correct.
Apply the comments above and repost the patch to amd-gfx and dri-devel and
I'll push it to drm-misc-fixes and amd-staging-drm-next.
--
Regards,
Luben
More information about the amd-gfx
mailing list