[PATCH 2/8] drm/sched: Always free the job after the timeout

Tvrtko Ursulin tvrtko.ursulin at igalia.com
Tue May 6 13:28:15 UTC 2025


On 06/05/2025 13:46, Maíra Canal wrote:
> Hi Tvrtko,
> 
> Thanks for your review!
> 
> On 06/05/25 08:49, Tvrtko Ursulin wrote:
>>
>> On 03/05/2025 21:59, Maíra Canal wrote:
>>> Currently, if we add the assertions presented in this commit to the mock
>>> scheduler, we will see the following output:
>>>
>>> [15:47:08] ============== [PASSED] drm_sched_basic_tests ==============
>>> [15:47:08] ======== drm_sched_basic_timeout_tests (1 subtest) =========
>>> [15:47:08] # drm_sched_basic_timeout: ASSERTION FAILED at drivers/ 
>>> gpu/ drm/scheduler/tests/tests_basic.c:246
>>> [15:47:08] Expected list_empty(&sched->job_list) to be true, but is 
>>> false
>>> [15:47:08] [FAILED] drm_sched_basic_timeout
>>> [15:47:08] # module: drm_sched_tests
>>>
>>> This occurs because `mock_sched_timedout_job()` doesn't properly handle
>>> the hang. From the DRM sched documentation, `drm_sched_stop()` and
>>> `drm_sched_start()` are typically used for reset recovery. If these
>>> functions are not used, the offending job won't be freed and should be
>>> freed by the caller.
>>>
>>> Currently, the mock scheduler doesn't use the functions provided by the
>>> API, nor does it handle the freeing of the job. As a result, the job 
>>> isn't
>>> removed from the job list.
>>
>> For the record the job does gets freed via the kunit managed allocation.
> 
> Sorry, I didn't express myself correctly. Indeed, it is. I meant that
> the DRM scheduler didn't free the job.
> 
>>
>> It was a design choice for this test to be a *strict* unit test which 
>> tests only a _single_ thing. And that is that the timedout_job() hook 
>> gets called. As such the hook was implemented to satisfy that single 
>> requirement only.
>>
> 
> What do you think about checking that `sched->job_list` won't be empty?
> 
> I wanted to add such assertion to make sure that the behavior of the
> timeout won't change in future (e.g. a patch makes a change that calls
> `free_job()` for the guilty job at timeout). Does it make sense to you?

Where would that assert be?

>> But I also do not oppose making it test multiple things in one test 
>> per se.
>>
>>> This commit mocks a GPU reset by stopping the scheduler affected by the
>>> reset, waiting a couple of microseconds to mimic a hardware reset, and
>>> then restart the affected scheduler.
>>>
>>> Signed-off-by: Maíra Canal <mcanal at igalia.com>
>>> ---
>>>   drivers/gpu/drm/scheduler/tests/mock_scheduler.c | 10 ++++++++++
>>>   drivers/gpu/drm/scheduler/tests/tests_basic.c    |  3 +++
>>>   2 files changed, 13 insertions(+)
>>>
> 
> [...]
> 
>>> diff --git a/drivers/gpu/drm/scheduler/tests/tests_basic.c b/drivers/ 
>>> gpu/drm/scheduler/tests/tests_basic.c
>>> index 
>>> 7230057e0594c6246f02608f07fcb1f8d738ac75..8f960f0fd31d0af7873f410ceba2d636f58a5474 100644
>>> --- a/drivers/gpu/drm/scheduler/tests/tests_basic.c
>>> +++ b/drivers/gpu/drm/scheduler/tests/tests_basic.c
>>> @@ -241,6 +241,9 @@ static void drm_sched_basic_timeout(struct kunit 
>>> *test)
>>>               job->flags & DRM_MOCK_SCHED_JOB_TIMEDOUT,
>>>               DRM_MOCK_SCHED_JOB_TIMEDOUT);
>>> +    KUNIT_ASSERT_TRUE(test, list_empty(&sched->job_list));
>>
>> Hmm I think this assert could be racy because it appears to rely on 
>> the free worker to run and cleanup the "finished" job in the window 
>> between drm_mock_sched_job_wait_finished() (or drm_sched_start(), 
>> depends how you look at it) and here. Am I missing something?
> 
>  From what I understand, the job is freed by the timeout worker [1] after
> `drm_sched_stop()` marked the job as guilty.
> 
> Therefore, if the timeout was called (and we asserted that through
> `job->flags`), we can be sure that the job was freed.
> 
> [1] https://gitlab.freedesktop.org/drm/misc/kernel/-/blob/drm-misc-next/ 
> drivers/gpu/drm/scheduler/sched_main.c#L568

Hm I thought it would end up on the dma_fence_remove_callback() == true 
branch in drm_sched_stop().

I gave it a quick spin locally and that indeed appears to be the case. 
So AFAICT it does rely on the free worker to have had executed before 
the assert.

Regards,

Tvrtko

> 
> Best Regards,
> - Maíra
> 
>>
>> Regards,
>>
>> Tvrtko
>>
>>> +    KUNIT_ASSERT_TRUE(test, list_empty(&sched->done_list));
>>  > +>       drm_mock_sched_entity_free(entity);
>>>   }
>>>
>>
> 



More information about the Intel-xe mailing list