[PATCH 2/8] drm/sched: Always free the job after the timeout

Tue May 6 12:46:06 UTC 2025

Hi Tvrtko,

Thanks for your review!

On 06/05/25 08:49, Tvrtko Ursulin wrote:
> 
> On 03/05/2025 21:59, Maíra Canal wrote:
>> Currently, if we add the assertions presented in this commit to the mock
>> scheduler, we will see the following output:
>>
>> [15:47:08] ============== [PASSED] drm_sched_basic_tests ==============
>> [15:47:08] ======== drm_sched_basic_timeout_tests (1 subtest) =========
>> [15:47:08] # drm_sched_basic_timeout: ASSERTION FAILED at drivers/gpu/ 
>> drm/scheduler/tests/tests_basic.c:246
>> [15:47:08] Expected list_empty(&sched->job_list) to be true, but is false
>> [15:47:08] [FAILED] drm_sched_basic_timeout
>> [15:47:08] # module: drm_sched_tests
>>
>> This occurs because `mock_sched_timedout_job()` doesn't properly handle
>> the hang. From the DRM sched documentation, `drm_sched_stop()` and
>> `drm_sched_start()` are typically used for reset recovery. If these
>> functions are not used, the offending job won't be freed and should be
>> freed by the caller.
>>
>> Currently, the mock scheduler doesn't use the functions provided by the
>> API, nor does it handle the freeing of the job. As a result, the job 
>> isn't
>> removed from the job list.
> 
> For the record the job does gets freed via the kunit managed allocation.

Sorry, I didn't express myself correctly. Indeed, it is. I meant that
the DRM scheduler didn't free the job.

> 
> It was a design choice for this test to be a *strict* unit test which 
> tests only a _single_ thing. And that is that the timedout_job() hook 
> gets called. As such the hook was implemented to satisfy that single 
> requirement only.
> 

What do you think about checking that `sched->job_list` won't be empty?

I wanted to add such assertion to make sure that the behavior of the
timeout won't change in future (e.g. a patch makes a change that calls
`free_job()` for the guilty job at timeout). Does it make sense to you?

> But I also do not oppose making it test multiple things in one test per se.
> 
>> This commit mocks a GPU reset by stopping the scheduler affected by the
>> reset, waiting a couple of microseconds to mimic a hardware reset, and
>> then restart the affected scheduler.
>>
>> Signed-off-by: Maíra Canal <mcanal at igalia.com>
>> ---
>>   drivers/gpu/drm/scheduler/tests/mock_scheduler.c | 10 ++++++++++
>>   drivers/gpu/drm/scheduler/tests/tests_basic.c    |  3 +++
>>   2 files changed, 13 insertions(+)
>>

[...]

>> diff --git a/drivers/gpu/drm/scheduler/tests/tests_basic.c b/drivers/ 
>> gpu/drm/scheduler/tests/tests_basic.c
>> index 
>> 7230057e0594c6246f02608f07fcb1f8d738ac75..8f960f0fd31d0af7873f410ceba2d636f58a5474 100644
>> --- a/drivers/gpu/drm/scheduler/tests/tests_basic.c
>> +++ b/drivers/gpu/drm/scheduler/tests/tests_basic.c
>> @@ -241,6 +241,9 @@ static void drm_sched_basic_timeout(struct kunit 
>> *test)
>>               job->flags & DRM_MOCK_SCHED_JOB_TIMEDOUT,
>>               DRM_MOCK_SCHED_JOB_TIMEDOUT);
>> +    KUNIT_ASSERT_TRUE(test, list_empty(&sched->job_list));
> 
> Hmm I think this assert could be racy because it appears to rely on the 
> free worker to run and cleanup the "finished" job in the window between 
> drm_mock_sched_job_wait_finished() (or drm_sched_start(), depends how 
> you look at it) and here. Am I missing something?

 From what I understand, the job is freed by the timeout worker [1] after
`drm_sched_stop()` marked the job as guilty.

Therefore, if the timeout was called (and we asserted that through
`job->flags`), we can be sure that the job was freed.

[1] 
https://gitlab.freedesktop.org/drm/misc/kernel/-/blob/drm-misc-next/drivers/gpu/drm/scheduler/sched_main.c#L568

Best Regards,
- Maíra

> 
> Regards,
> 
> Tvrtko
> 
>> +    KUNIT_ASSERT_TRUE(test, list_empty(&sched->done_list));
>  > +>       drm_mock_sched_entity_free(entity);
>>   }
>>
>