[PATCH 2/8] drm/sched: Always free the job after the timeout
Tvrtko Ursulin
tvrtko.ursulin at igalia.com
Tue May 6 13:28:15 UTC 2025
On 06/05/2025 13:46, Maíra Canal wrote:
> Hi Tvrtko,
>
> Thanks for your review!
>
> On 06/05/25 08:49, Tvrtko Ursulin wrote:
>>
>> On 03/05/2025 21:59, Maíra Canal wrote:
>>> Currently, if we add the assertions presented in this commit to the mock
>>> scheduler, we will see the following output:
>>>
>>> [15:47:08] ============== [PASSED] drm_sched_basic_tests ==============
>>> [15:47:08] ======== drm_sched_basic_timeout_tests (1 subtest) =========
>>> [15:47:08] # drm_sched_basic_timeout: ASSERTION FAILED at drivers/
>>> gpu/ drm/scheduler/tests/tests_basic.c:246
>>> [15:47:08] Expected list_empty(&sched->job_list) to be true, but is
>>> false
>>> [15:47:08] [FAILED] drm_sched_basic_timeout
>>> [15:47:08] # module: drm_sched_tests
>>>
>>> This occurs because `mock_sched_timedout_job()` doesn't properly handle
>>> the hang. From the DRM sched documentation, `drm_sched_stop()` and
>>> `drm_sched_start()` are typically used for reset recovery. If these
>>> functions are not used, the offending job won't be freed and should be
>>> freed by the caller.
>>>
>>> Currently, the mock scheduler doesn't use the functions provided by the
>>> API, nor does it handle the freeing of the job. As a result, the job
>>> isn't
>>> removed from the job list.
>>
>> For the record the job does gets freed via the kunit managed allocation.
>
> Sorry, I didn't express myself correctly. Indeed, it is. I meant that
> the DRM scheduler didn't free the job.
>
>>
>> It was a design choice for this test to be a *strict* unit test which
>> tests only a _single_ thing. And that is that the timedout_job() hook
>> gets called. As such the hook was implemented to satisfy that single
>> requirement only.
>>
>
> What do you think about checking that `sched->job_list` won't be empty?
>
> I wanted to add such assertion to make sure that the behavior of the
> timeout won't change in future (e.g. a patch makes a change that calls
> `free_job()` for the guilty job at timeout). Does it make sense to you?
Where would that assert be?
>> But I also do not oppose making it test multiple things in one test
>> per se.
>>
>>> This commit mocks a GPU reset by stopping the scheduler affected by the
>>> reset, waiting a couple of microseconds to mimic a hardware reset, and
>>> then restart the affected scheduler.
>>>
>>> Signed-off-by: Maíra Canal <mcanal at igalia.com>
>>> ---
>>> drivers/gpu/drm/scheduler/tests/mock_scheduler.c | 10 ++++++++++
>>> drivers/gpu/drm/scheduler/tests/tests_basic.c | 3 +++
>>> 2 files changed, 13 insertions(+)
>>>
>
> [...]
>
>>> diff --git a/drivers/gpu/drm/scheduler/tests/tests_basic.c b/drivers/
>>> gpu/drm/scheduler/tests/tests_basic.c
>>> index
>>> 7230057e0594c6246f02608f07fcb1f8d738ac75..8f960f0fd31d0af7873f410ceba2d636f58a5474 100644
>>> --- a/drivers/gpu/drm/scheduler/tests/tests_basic.c
>>> +++ b/drivers/gpu/drm/scheduler/tests/tests_basic.c
>>> @@ -241,6 +241,9 @@ static void drm_sched_basic_timeout(struct kunit
>>> *test)
>>> job->flags & DRM_MOCK_SCHED_JOB_TIMEDOUT,
>>> DRM_MOCK_SCHED_JOB_TIMEDOUT);
>>> + KUNIT_ASSERT_TRUE(test, list_empty(&sched->job_list));
>>
>> Hmm I think this assert could be racy because it appears to rely on
>> the free worker to run and cleanup the "finished" job in the window
>> between drm_mock_sched_job_wait_finished() (or drm_sched_start(),
>> depends how you look at it) and here. Am I missing something?
>
> From what I understand, the job is freed by the timeout worker [1] after
> `drm_sched_stop()` marked the job as guilty.
>
> Therefore, if the timeout was called (and we asserted that through
> `job->flags`), we can be sure that the job was freed.
>
> [1] https://gitlab.freedesktop.org/drm/misc/kernel/-/blob/drm-misc-next/
> drivers/gpu/drm/scheduler/sched_main.c#L568
Hm I thought it would end up on the dma_fence_remove_callback() == true
branch in drm_sched_stop().
I gave it a quick spin locally and that indeed appears to be the case.
So AFAICT it does rely on the free worker to have had executed before
the assert.
Regards,
Tvrtko
>
> Best Regards,
> - Maíra
>
>>
>> Regards,
>>
>> Tvrtko
>>
>>> + KUNIT_ASSERT_TRUE(test, list_empty(&sched->done_list));
>> > +> drm_mock_sched_entity_free(entity);
>>> }
>>>
>>
>
More information about the etnaviv
mailing list