[PATCH] drm/sched: Fix a race in DRM_GPU_SCHED_STAT_NO_HANG test
Maíra Canal
mcanal at igalia.com
Wed Jul 16 14:05:44 UTC 2025
Hi Tvrtko,
On 16/07/25 10:41, Tvrtko Ursulin wrote:
>
> On 16/07/2025 13:47, Maíra Canal wrote:
>> Hi Tvrtko,
>>
>> On 16/07/25 05:48, Tvrtko Ursulin wrote:
>>> The "skip reset" test waits for the timeout handler to run for the
>>> duration of 2 * MOCK_TIMEOUT, and because the mock scheduler opted to
>>
>> Would it make any sense to wait for 1.5 * MOCK_TIMEOUT? This way we
>> would guarantee that only one timeout happened. I'm fine with the
>> current solution as well.
>
> 1.5 * MOCK_TIMEOUT would work as well. I considered it, and even though
> I thought it would be safe, I concluded that it is better to have fewer
> dependencies on timings given these are two threads in this story.
Why not both? Just to make sure we won't run the timedout function
twice, but still fixing the timing dependency by using
DRM_MOCK_SCHED_JOB_RESET_SKIPPED.
>
> Making the DRM_MOCK_SCHED_JOB_DONT_RESET persist allows for not having
> to think about timings. So slight preference to that. At least until
> some more advanced tests are attempted to be added.
>
>>> remove the "skip reset" flag once it fires, this gives opportunity
>>> for the
>>> timeout handler to run twice. Second time the job will be removed
>>> from the
>>> mock scheduler job list and the drm_mock_sched_advance() call in the
>>> test
>>> will fail.
>>>
>>> Fix it by making the "don't reset" flag persist for the lifetime of the
>>> job and add a new flag to verify that the code path had executed as
>>> expected.
>>>
>>> Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin at igalia.com>
>>> Fixes: 1472e7549f84 ("drm/sched: Add new test for
>>> DRM_GPU_SCHED_STAT_NO_HANG")
>> > Cc: Maíra Canal <mcanal at igalia.com>> Cc: Philipp Stanner
>> <phasta at kernel.org>
>>> ---
>>> drivers/gpu/drm/scheduler/tests/mock_scheduler.c | 2 +-
>>> drivers/gpu/drm/scheduler/tests/sched_tests.h | 7 ++++---
>>> drivers/gpu/drm/scheduler/tests/tests_basic.c | 4 ++--
>>> 3 files changed, 7 insertions(+), 6 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/scheduler/tests/mock_scheduler.c b/
>>> drivers/gpu/drm/scheduler/tests/mock_scheduler.c
>>> index 65acffc3fea8..8e9ae7d980eb 100644
>>> --- a/drivers/gpu/drm/scheduler/tests/mock_scheduler.c
>>> +++ b/drivers/gpu/drm/scheduler/tests/mock_scheduler.c
>>> @@ -219,7 +219,7 @@ mock_sched_timedout_job(struct drm_sched_job
>>> *sched_job)
>>> unsigned long flags;
>>> if (job->flags & DRM_MOCK_SCHED_JOB_DONT_RESET) {
>>> - job->flags &= ~DRM_MOCK_SCHED_JOB_DONT_RESET;
>>> + job->flags |= DRM_MOCK_SCHED_JOB_RESET_SKIPPED;
>>> return DRM_GPU_SCHED_STAT_NO_HANG;
>>> }
>>> diff --git a/drivers/gpu/drm/scheduler/tests/sched_tests.h b/drivers/
>>> gpu/drm/scheduler/tests/sched_tests.h
>>> index 63d4f2ac7074..5b262126b776 100644
>>> --- a/drivers/gpu/drm/scheduler/tests/sched_tests.h
>>> +++ b/drivers/gpu/drm/scheduler/tests/sched_tests.h
>>> @@ -95,9 +95,10 @@ struct drm_mock_sched_job {
>>> struct completion done;
>>> -#define DRM_MOCK_SCHED_JOB_DONE 0x1
>>> -#define DRM_MOCK_SCHED_JOB_TIMEDOUT 0x2
>>> -#define DRM_MOCK_SCHED_JOB_DONT_RESET 0x4
>>> +#define DRM_MOCK_SCHED_JOB_DONE 0x1
>>> +#define DRM_MOCK_SCHED_JOB_TIMEDOUT 0x2
>>> +#define DRM_MOCK_SCHED_JOB_DONT_RESET 0x4
>>> +#define DRM_MOCK_SCHED_JOB_RESET_SKIPPED 0x8
>>> unsigned long flags;
>>> struct list_head link;
>>> diff --git a/drivers/gpu/drm/scheduler/tests/tests_basic.c b/drivers/
>>> gpu/drm/scheduler/tests/tests_basic.c
>>> index 55eb142bd7c5..82a41a456b0a 100644
>>> --- a/drivers/gpu/drm/scheduler/tests/tests_basic.c
>>> +++ b/drivers/gpu/drm/scheduler/tests/tests_basic.c
>>> @@ -317,8 +317,8 @@ static void drm_sched_skip_reset(struct kunit *test)
>>> KUNIT_ASSERT_FALSE(test, done);
>>> KUNIT_ASSERT_EQ(test,
>>> - job->flags & DRM_MOCK_SCHED_JOB_DONT_RESET,
>>> - 0);
>>> + job->flags & DRM_MOCK_SCHED_JOB_RESET_SKIPPED,
>>> + DRM_MOCK_SCHED_JOB_RESET_SKIPPED);
>>
>> Maybe we could assert that job->flags & DRM_MOCK_SCHED_JOB_TIMEDOUT == 0
>
> Could but I am not sure it is needed.
Np.
>
>> Anyway, thanks for the fix!
>>
>> Reviewed-by: Maíra Canal <mcanal at igalia.com>
>
> Thank you!
>
> Btw the failure for the record:
>
> [09:07:20] # drm_sched_skip_reset: ASSERTION FAILED at drivers/gpu/drm/
> scheduler/tests/tests_basic.c:324
> [09:07:20] Expected i == 1, but
> [09:07:20] i == 0 (0x0)
> [09:07:20] [FAILED] drm_sched_skip_reset
> [09:07:20] # module: drm_sched_tests
> [09:07:20] # drm_sched_basic_timeout_tests: pass:1 fail:1 skip:0 total:2
> [09:07:20] # Totals: pass:1 fail:1 skip:0 total:2
Unfortunately, I didn't get this error during my test run. On the other
hand, I only ran it once before pushing the series, so that's on me.
Thanks for catching it!
Best Regards,
- Maíra
>
> Regards,
>
> Tvrtko
>
>>> i = drm_mock_sched_advance(sched, 1);
>>> KUNIT_ASSERT_EQ(test, i, 1);
>>
>
More information about the dri-devel
mailing list