[PATCH] drm/sched: Check scheduler work queue before calling timeout handling

Wed May 10 14:24:12 UTC 2023

On 2023-05-10 10:19, Luben Tuikov wrote:
> On 2023-05-10 09:51, vitaly.prosyak at amd.com wrote:
>> From: Vitaly Prosyak <vitaly.prosyak at amd.com>
>>
>> During an IGT GPU reset test we see again oops despite of
>> commit 0c8c901aaaebc9 (drm/sched: Check scheduler ready before calling
>> timeout handling).
>>
>> It uses ready condition whether to call drm_sched_fault which unwind
>> the TDR leads to GPU reset.
>> However it looks the ready condition is overloaded with other meanings,
>> for example, for the following stack is related GPU reset :
>>
>> 0  gfx_v9_0_cp_gfx_start
>> 1  gfx_v9_0_cp_gfx_resume
>> 2  gfx_v9_0_cp_resume
>> 3  gfx_v9_0_hw_init
>> 4  gfx_v9_0_resume
>> 5  amdgpu_device_ip_resume_phase2
>>
>> does the following:
>> 	/* start the ring */
>> 	gfx_v9_0_cp_gfx_start(adev);
>> 	ring->sched.ready = true;
>>
>> The same approach is for other ASICs as well :
>> gfx_v8_0_cp_gfx_resume
>> gfx_v10_0_kiq_resume, etc...
>>
>> As a result, our GPU reset test causes GPU fault which calls unconditionally gfx_v9_0_fault
>> and then drm_sched_fault. However now it depends on whether the interrupt service routine
>> drm_sched_fault is executed after gfx_v9_0_cp_gfx_start is completed which sets the ready
>> field of the scheduler to true even  for uninitialized schedulers and causes oops vs
>> no fault or when ISR  drm_sched_fault is completed prior  gfx_v9_0_cp_gfx_start and
>> NULL pointer dereference does not occur.
>>
>> Use the field timeout_wq  to prevent oops for uninitialized schedulers.
>> The field could be initialized by the work queue of resetting the domain.
>>
>> Fixes: 0c8c901aaaebc9 ("drm/sched: Check scheduler ready before calling timeout handling")
>>
>> v1: Corrections to commit message (Luben)
>> Signed-off-by: Vitaly Prosyak <vitaly.prosyak at amd.com>
>> Reviewed-by: Luben Tuikov <luben.tuikov at amd.com>
> I didn't give my RB to this patch so I'm not sure what it is doing here.
I removed your rb, also if you do not know what is doing here why do you want to push this to amd-staging-drm-next and to  drm-misc-fixed?
>
> The fixes tag should be before the SOB tag, and the v1 line should be separated
> by a line before the Git tags.
>
> Since this is a good patch and I want it in both drm-misc-fixed and amd-staging-drm-next,
> I'll submit it to drm-misc-fixed with a Link: and RB/SOB tag there and then cherry-pick
> that into amd-staging-drm-next.
>
> Don't push it to amd-staging-drm-next.
>
> I'll fix this and submit to amd-staging-drm-next and to drm-misc-fixed with
> a Link: tag.
>
> Regards,
> Luben
>
>
>> ---
>>  drivers/gpu/drm/scheduler/sched_main.c | 2 +-
>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
>> index 649fac2e1ccb..670b7997f389 100644
>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>> @@ -308,7 +308,7 @@ static void drm_sched_start_timeout(struct drm_gpu_scheduler *sched)
>>   */
>>  void drm_sched_fault(struct drm_gpu_scheduler *sched)
>>  {
>> -	if (sched->ready)
>> +	if (sched->timeout_wq)
>>  		mod_delayed_work(sched->timeout_wq, &sched->work_tdr, 0);
>>  }
>>  EXPORT_SYMBOL(drm_sched_fault);