[PATCH] drm/sched: Check scheduler work queue before calling timeout handling
Luben Tuikov
luben.tuikov at amd.com
Wed May 10 14:31:36 UTC 2023
On 2023-05-10 10:24, vitaly prosyak wrote:
>
> On 2023-05-10 10:19, Luben Tuikov wrote:
>> On 2023-05-10 09:51, vitaly.prosyak at amd.com wrote:
>>> From: Vitaly Prosyak <vitaly.prosyak at amd.com>
>>>
>>> During an IGT GPU reset test we see again oops despite of
>>> commit 0c8c901aaaebc9 (drm/sched: Check scheduler ready before calling
>>> timeout handling).
>>>
>>> It uses ready condition whether to call drm_sched_fault which unwind
>>> the TDR leads to GPU reset.
>>> However it looks the ready condition is overloaded with other meanings,
>>> for example, for the following stack is related GPU reset :
>>>
>>> 0 gfx_v9_0_cp_gfx_start
>>> 1 gfx_v9_0_cp_gfx_resume
>>> 2 gfx_v9_0_cp_resume
>>> 3 gfx_v9_0_hw_init
>>> 4 gfx_v9_0_resume
>>> 5 amdgpu_device_ip_resume_phase2
>>>
>>> does the following:
>>> /* start the ring */
>>> gfx_v9_0_cp_gfx_start(adev);
>>> ring->sched.ready = true;
>>>
>>> The same approach is for other ASICs as well :
>>> gfx_v8_0_cp_gfx_resume
>>> gfx_v10_0_kiq_resume, etc...
>>>
>>> As a result, our GPU reset test causes GPU fault which calls unconditionally gfx_v9_0_fault
>>> and then drm_sched_fault. However now it depends on whether the interrupt service routine
>>> drm_sched_fault is executed after gfx_v9_0_cp_gfx_start is completed which sets the ready
>>> field of the scheduler to true even for uninitialized schedulers and causes oops vs
>>> no fault or when ISR drm_sched_fault is completed prior gfx_v9_0_cp_gfx_start and
>>> NULL pointer dereference does not occur.
>>>
>>> Use the field timeout_wq to prevent oops for uninitialized schedulers.
>>> The field could be initialized by the work queue of resetting the domain.
>>>
>>> Fixes: 0c8c901aaaebc9 ("drm/sched: Check scheduler ready before calling timeout handling")
>>>
>>> v1: Corrections to commit message (Luben)
>>> Signed-off-by: Vitaly Prosyak <vitaly.prosyak at amd.com>
>>> Reviewed-by: Luben Tuikov <luben.tuikov at amd.com>
>> I didn't give my RB to this patch so I'm not sure what it is doing here.
> I removed your rb, also if you do not know what is doing here why do you want to push this to amd-staging-drm-next and to drm-misc-fixed?
I'll add my RB as I push it to those two branches.
I'll also add a Link tag and fix the commit SHA for the Fixes tag to
one which is found in drm-misc-fixes.
Thanks for the patch fixing this long-standing bug.
Regards,
Luben
>>
>> The fixes tag should be before the SOB tag, and the v1 line should be separated
>> by a line before the Git tags.
>>
>> Since this is a good patch and I want it in both drm-misc-fixed and amd-staging-drm-next,
>> I'll submit it to drm-misc-fixed with a Link: and RB/SOB tag there and then cherry-pick
>> that into amd-staging-drm-next.
>>
>> Don't push it to amd-staging-drm-next.
>>
>> I'll fix this and submit to amd-staging-drm-next and to drm-misc-fixed with
>> a Link: tag.
>>
>> Regards,
>> Luben
>>
>>
>>> ---
>>> drivers/gpu/drm/scheduler/sched_main.c | 2 +-
>>> 1 file changed, 1 insertion(+), 1 deletion(-)
>>>
>>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
>>> index 649fac2e1ccb..670b7997f389 100644
>>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>>> @@ -308,7 +308,7 @@ static void drm_sched_start_timeout(struct drm_gpu_scheduler *sched)
>>> */
>>> void drm_sched_fault(struct drm_gpu_scheduler *sched)
>>> {
>>> - if (sched->ready)
>>> + if (sched->timeout_wq)
>>> mod_delayed_work(sched->timeout_wq, &sched->work_tdr, 0);
>>> }
>>> EXPORT_SYMBOL(drm_sched_fault);
More information about the dri-devel
mailing list