[PATCH] drm/sched: Fix a race in DRM_GPU_SCHED_STAT_NO_HANG test

Tvrtko Ursulin tvrtko.ursulin at igalia.com
Wed Jul 16 14:13:58 UTC 2025


On 16/07/2025 15:05, Maíra Canal wrote:
> Hi Tvrtko,
> 
> On 16/07/25 10:41, Tvrtko Ursulin wrote:
>>
>> On 16/07/2025 13:47, Maíra Canal wrote:
>>> Hi Tvrtko,
>>>
>>> On 16/07/25 05:48, Tvrtko Ursulin wrote:
>>>> The "skip reset" test waits for the timeout handler to run for the
>>>> duration of 2 * MOCK_TIMEOUT, and because the mock scheduler opted to
>>>
>>> Would it make any sense to wait for 1.5 * MOCK_TIMEOUT? This way we
>>> would guarantee that only one timeout happened. I'm fine with the
>>> current solution as well.
>>
>> 1.5 * MOCK_TIMEOUT would work as well. I considered it, and even 
>> though I thought it would be safe, I concluded that it is better to 
>> have fewer dependencies on timings given these are two threads in this 
>> story.
> 
> Why not both? Just to make sure we won't run the timedout function
> twice, but still fixing the timing dependency by using
> DRM_MOCK_SCHED_JOB_RESET_SKIPPED.

Just because I think probably does not guarantee timedout worker does 
not run twice. It still in _theory_ could if the unit test thread would 
be starved by some other system activity. In practice however it does 
work, because nothing much is running in parallel to the unit test, for 
vast majority of use cases. So I thought someone would be bound to 
complain if I just did the 1.5x approach. If no one objects I can add 
that tweak, no problem.

>> Making the DRM_MOCK_SCHED_JOB_DONT_RESET persist allows for not having 
>> to think about timings. So slight preference to that. At least until 
>> some more advanced tests are attempted to be added.
>>
>>>> remove the "skip reset" flag once it fires, this gives opportunity 
>>>> for the
>>>> timeout handler to run twice. Second time the job will be removed 
>>>> from the
>>>> mock scheduler job list and the drm_mock_sched_advance() call in the 
>>>> test
>>>> will fail.
>>>>
>>>> Fix it by making the "don't reset" flag persist for the lifetime of the
>>>> job and add a new flag to verify that the code path had executed as
>>>> expected.
>>>>
>>>> Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin at igalia.com>
>>>> Fixes: 1472e7549f84 ("drm/sched: Add new test for 
>>>> DRM_GPU_SCHED_STAT_NO_HANG")
>>>  > Cc: Maíra Canal <mcanal at igalia.com>> Cc: Philipp Stanner 
>>> <phasta at kernel.org>
>>>> ---
>>>>   drivers/gpu/drm/scheduler/tests/mock_scheduler.c | 2 +-
>>>>   drivers/gpu/drm/scheduler/tests/sched_tests.h    | 7 ++++---
>>>>   drivers/gpu/drm/scheduler/tests/tests_basic.c    | 4 ++--
>>>>   3 files changed, 7 insertions(+), 6 deletions(-)
>>>>
>>>> diff --git a/drivers/gpu/drm/scheduler/tests/mock_scheduler.c b/ 
>>>> drivers/gpu/drm/scheduler/tests/mock_scheduler.c
>>>> index 65acffc3fea8..8e9ae7d980eb 100644
>>>> --- a/drivers/gpu/drm/scheduler/tests/mock_scheduler.c
>>>> +++ b/drivers/gpu/drm/scheduler/tests/mock_scheduler.c
>>>> @@ -219,7 +219,7 @@ mock_sched_timedout_job(struct drm_sched_job 
>>>> *sched_job)
>>>>       unsigned long flags;
>>>>       if (job->flags & DRM_MOCK_SCHED_JOB_DONT_RESET) {
>>>> -        job->flags &= ~DRM_MOCK_SCHED_JOB_DONT_RESET;
>>>> +        job->flags |= DRM_MOCK_SCHED_JOB_RESET_SKIPPED;
>>>>           return DRM_GPU_SCHED_STAT_NO_HANG;
>>>>       }
>>>> diff --git a/drivers/gpu/drm/scheduler/tests/sched_tests.h b/ 
>>>> drivers/ gpu/drm/scheduler/tests/sched_tests.h
>>>> index 63d4f2ac7074..5b262126b776 100644
>>>> --- a/drivers/gpu/drm/scheduler/tests/sched_tests.h
>>>> +++ b/drivers/gpu/drm/scheduler/tests/sched_tests.h
>>>> @@ -95,9 +95,10 @@ struct drm_mock_sched_job {
>>>>       struct completion    done;
>>>> -#define DRM_MOCK_SCHED_JOB_DONE        0x1
>>>> -#define DRM_MOCK_SCHED_JOB_TIMEDOUT    0x2
>>>> -#define DRM_MOCK_SCHED_JOB_DONT_RESET    0x4
>>>> +#define DRM_MOCK_SCHED_JOB_DONE            0x1
>>>> +#define DRM_MOCK_SCHED_JOB_TIMEDOUT        0x2
>>>> +#define DRM_MOCK_SCHED_JOB_DONT_RESET        0x4
>>>> +#define DRM_MOCK_SCHED_JOB_RESET_SKIPPED    0x8
>>>>       unsigned long        flags;
>>>>       struct list_head    link;
>>>> diff --git a/drivers/gpu/drm/scheduler/tests/tests_basic.c b/ 
>>>> drivers/ gpu/drm/scheduler/tests/tests_basic.c
>>>> index 55eb142bd7c5..82a41a456b0a 100644
>>>> --- a/drivers/gpu/drm/scheduler/tests/tests_basic.c
>>>> +++ b/drivers/gpu/drm/scheduler/tests/tests_basic.c
>>>> @@ -317,8 +317,8 @@ static void drm_sched_skip_reset(struct kunit 
>>>> *test)
>>>>       KUNIT_ASSERT_FALSE(test, done);
>>>>       KUNIT_ASSERT_EQ(test,
>>>> -            job->flags & DRM_MOCK_SCHED_JOB_DONT_RESET,
>>>> -            0);
>>>> +            job->flags & DRM_MOCK_SCHED_JOB_RESET_SKIPPED,
>>>> +            DRM_MOCK_SCHED_JOB_RESET_SKIPPED);
>>>
>>> Maybe we could assert that job->flags & DRM_MOCK_SCHED_JOB_TIMEDOUT == 0
>>
>> Could but I am not sure it is needed.
> 
> Np.
> 
>>
>>> Anyway, thanks for the fix!
>>>
>>> Reviewed-by: Maíra Canal <mcanal at igalia.com>
>>
>> Thank you!
>>
>> Btw the failure for the record:
>>
>> [09:07:20] # drm_sched_skip_reset: ASSERTION FAILED at drivers/gpu/ 
>> drm/ scheduler/tests/tests_basic.c:324
>> [09:07:20] Expected i == 1, but
>> [09:07:20]     i == 0 (0x0)
>> [09:07:20] [FAILED] drm_sched_skip_reset
>> [09:07:20]     # module: drm_sched_tests
>> [09:07:20] # drm_sched_basic_timeout_tests: pass:1 fail:1 skip:0 total:2
>> [09:07:20] # Totals: pass:1 fail:1 skip:0 total:2
> 
> Unfortunately, I didn't get this error during my test run. On the other
> hand, I only ran it once before pushing the series, so that's on me.
> Thanks for catching it!

No worries.

Regards,

Tvrtko

>>>>       i = drm_mock_sched_advance(sched, 1);
>>>>       KUNIT_ASSERT_EQ(test, i, 1);
>>>
>>
> 



More information about the dri-devel mailing list