[PATCH] drm/sched: Fix a race in DRM_GPU_SCHED_STAT_NO_HANG test

Maíra Canal mcanal at igalia.com
Wed Jul 16 14:19:52 UTC 2025


Hi Tvrtko,

On 16/07/25 11:13, Tvrtko Ursulin wrote:
> 
> On 16/07/2025 15:05, Maíra Canal wrote:
>> Hi Tvrtko,
>>
>> On 16/07/25 10:41, Tvrtko Ursulin wrote:
>>>
>>> On 16/07/2025 13:47, Maíra Canal wrote:
>>>> Hi Tvrtko,
>>>>
>>>> On 16/07/25 05:48, Tvrtko Ursulin wrote:
>>>>> The "skip reset" test waits for the timeout handler to run for the
>>>>> duration of 2 * MOCK_TIMEOUT, and because the mock scheduler opted to
>>>>
>>>> Would it make any sense to wait for 1.5 * MOCK_TIMEOUT? This way we
>>>> would guarantee that only one timeout happened. I'm fine with the
>>>> current solution as well.
>>>
>>> 1.5 * MOCK_TIMEOUT would work as well. I considered it, and even 
>>> though I thought it would be safe, I concluded that it is better to 
>>> have fewer dependencies on timings given these are two threads in 
>>> this story.
>>
>> Why not both? Just to make sure we won't run the timedout function
>> twice, but still fixing the timing dependency by using
>> DRM_MOCK_SCHED_JOB_RESET_SKIPPED.
> 
> Just because I think probably does not guarantee timedout worker does 
> not run twice. It still in _theory_ could if the unit test thread would 
> be starved by some other system activity. In practice however it does 
> work, because nothing much is running in parallel to the unit test, for 
> vast majority of use cases. So I thought someone would be bound to 
> complain if I just did the 1.5x approach. If no one objects I can add 
> that tweak, no problem.

Fair, I'm okay in keeping it as it is.

Best Regards,
- Maíra

> 
>>> Making the DRM_MOCK_SCHED_JOB_DONT_RESET persist allows for not 
>>> having to think about timings. So slight preference to that. At least 
>>> until some more advanced tests are attempted to be added.
>>>
>>>>> remove the "skip reset" flag once it fires, this gives opportunity 
>>>>> for the
>>>>> timeout handler to run twice. Second time the job will be removed 
>>>>> from the
>>>>> mock scheduler job list and the drm_mock_sched_advance() call in 
>>>>> the test
>>>>> will fail.
>>>>>
>>>>> Fix it by making the "don't reset" flag persist for the lifetime of 
>>>>> the
>>>>> job and add a new flag to verify that the code path had executed as
>>>>> expected.
>>>>>
>>>>> Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin at igalia.com>
>>>>> Fixes: 1472e7549f84 ("drm/sched: Add new test for 
>>>>> DRM_GPU_SCHED_STAT_NO_HANG")
>>>>  > Cc: Maíra Canal <mcanal at igalia.com>> Cc: Philipp Stanner 
>>>> <phasta at kernel.org>
>>>>> ---
>>>>>   drivers/gpu/drm/scheduler/tests/mock_scheduler.c | 2 +-
>>>>>   drivers/gpu/drm/scheduler/tests/sched_tests.h    | 7 ++++---
>>>>>   drivers/gpu/drm/scheduler/tests/tests_basic.c    | 4 ++--
>>>>>   3 files changed, 7 insertions(+), 6 deletions(-)
>>>>>
>>>>> diff --git a/drivers/gpu/drm/scheduler/tests/mock_scheduler.c b/ 
>>>>> drivers/gpu/drm/scheduler/tests/mock_scheduler.c
>>>>> index 65acffc3fea8..8e9ae7d980eb 100644
>>>>> --- a/drivers/gpu/drm/scheduler/tests/mock_scheduler.c
>>>>> +++ b/drivers/gpu/drm/scheduler/tests/mock_scheduler.c
>>>>> @@ -219,7 +219,7 @@ mock_sched_timedout_job(struct drm_sched_job 
>>>>> *sched_job)
>>>>>       unsigned long flags;
>>>>>       if (job->flags & DRM_MOCK_SCHED_JOB_DONT_RESET) {
>>>>> -        job->flags &= ~DRM_MOCK_SCHED_JOB_DONT_RESET;
>>>>> +        job->flags |= DRM_MOCK_SCHED_JOB_RESET_SKIPPED;
>>>>>           return DRM_GPU_SCHED_STAT_NO_HANG;
>>>>>       }
>>>>> diff --git a/drivers/gpu/drm/scheduler/tests/sched_tests.h b/ 
>>>>> drivers/ gpu/drm/scheduler/tests/sched_tests.h
>>>>> index 63d4f2ac7074..5b262126b776 100644
>>>>> --- a/drivers/gpu/drm/scheduler/tests/sched_tests.h
>>>>> +++ b/drivers/gpu/drm/scheduler/tests/sched_tests.h
>>>>> @@ -95,9 +95,10 @@ struct drm_mock_sched_job {
>>>>>       struct completion    done;
>>>>> -#define DRM_MOCK_SCHED_JOB_DONE        0x1
>>>>> -#define DRM_MOCK_SCHED_JOB_TIMEDOUT    0x2
>>>>> -#define DRM_MOCK_SCHED_JOB_DONT_RESET    0x4
>>>>> +#define DRM_MOCK_SCHED_JOB_DONE            0x1
>>>>> +#define DRM_MOCK_SCHED_JOB_TIMEDOUT        0x2
>>>>> +#define DRM_MOCK_SCHED_JOB_DONT_RESET        0x4
>>>>> +#define DRM_MOCK_SCHED_JOB_RESET_SKIPPED    0x8
>>>>>       unsigned long        flags;
>>>>>       struct list_head    link;
>>>>> diff --git a/drivers/gpu/drm/scheduler/tests/tests_basic.c b/ 
>>>>> drivers/ gpu/drm/scheduler/tests/tests_basic.c
>>>>> index 55eb142bd7c5..82a41a456b0a 100644
>>>>> --- a/drivers/gpu/drm/scheduler/tests/tests_basic.c
>>>>> +++ b/drivers/gpu/drm/scheduler/tests/tests_basic.c
>>>>> @@ -317,8 +317,8 @@ static void drm_sched_skip_reset(struct kunit 
>>>>> *test)
>>>>>       KUNIT_ASSERT_FALSE(test, done);
>>>>>       KUNIT_ASSERT_EQ(test,
>>>>> -            job->flags & DRM_MOCK_SCHED_JOB_DONT_RESET,
>>>>> -            0);
>>>>> +            job->flags & DRM_MOCK_SCHED_JOB_RESET_SKIPPED,
>>>>> +            DRM_MOCK_SCHED_JOB_RESET_SKIPPED);
>>>>
>>>> Maybe we could assert that job->flags & DRM_MOCK_SCHED_JOB_TIMEDOUT 
>>>> == 0
>>>
>>> Could but I am not sure it is needed.
>>
>> Np.
>>
>>>
>>>> Anyway, thanks for the fix!
>>>>
>>>> Reviewed-by: Maíra Canal <mcanal at igalia.com>
>>>
>>> Thank you!
>>>
>>> Btw the failure for the record:
>>>
>>> [09:07:20] # drm_sched_skip_reset: ASSERTION FAILED at drivers/gpu/ 
>>> drm/ scheduler/tests/tests_basic.c:324
>>> [09:07:20] Expected i == 1, but
>>> [09:07:20]     i == 0 (0x0)
>>> [09:07:20] [FAILED] drm_sched_skip_reset
>>> [09:07:20]     # module: drm_sched_tests
>>> [09:07:20] # drm_sched_basic_timeout_tests: pass:1 fail:1 skip:0 total:2
>>> [09:07:20] # Totals: pass:1 fail:1 skip:0 total:2
>>
>> Unfortunately, I didn't get this error during my test run. On the other
>> hand, I only ran it once before pushing the series, so that's on me.
>> Thanks for catching it!
> 
> No worries.
> 
> Regards,
> 
> Tvrtko
> 
>>>>>       i = drm_mock_sched_advance(sched, 1);
>>>>>       KUNIT_ASSERT_EQ(test, i, 1);
>>>>
>>>
>>
> 



More information about the dri-devel mailing list