[PATCH v3 0/5] DRM scheduler kunit tests
Tvrtko Ursulin
tvrtko.ursulin at igalia.com
Fri Mar 7 11:20:54 UTC 2025
On 07/03/2025 07:21, Philipp Stanner wrote:
> On Thu, 2025-03-06 at 17:10 +0000, Tvrtko Ursulin wrote:
>>
>> On 06/03/2025 15:18, Tvrtko Ursulin wrote:
>>>
>>> On 06/03/2025 14:56, Tvrtko Ursulin wrote:
>>>>
>>>> On 06/03/2025 12:37, Philipp Stanner wrote:
>>>>> On Tue, 2025-03-04 at 13:10 +0000, Tvrtko Ursulin wrote:
>>>>>> There has repeatedly been quite a bit of apprehension when any change
>>>>>> to the DRM scheduler is proposed, with the two main reasons being that
>>>>>> the code base is considered fragile, not well understood and not very
>>>>>> well documented, and secondly the lack of systematic testing outside
>>>>>> the vendor-specific test suites and/or test farms.
>>>>>>
>>>>>> This series is an attempt to dislodge this status quo by
>>>>>> adding some
>>>>>> unit tests
>>>>>> using the kunit framework.
>>>>>>
>>>>>> General approach is that there is a mock "hardware" backend
>>>>>> which can
>>>>>> be
>>>>>> controlled from tests, which in turn allows exercising
>>>>>> various
>>>>>> scheduler code
>>>>>> paths.
>>>>>>
>>>>>> Only some simple basic tests get added in the series and
>>>>>> hopefully it
>>>>>> is easy to
>>>>>> understand what tests are doing.
>>>>>>
>>>>>> An obligatory "screenshot" for reference:
>>>>>>
>>>>>> [14:29:37] ============ drm_sched_basic_tests (3 subtests) ============
>>>>>> [14:29:38] [PASSED] drm_sched_basic_submit
>>>>>> [14:29:38] ================== drm_sched_basic_test ===================
>>>>>> [14:29:38] [PASSED] A queue of jobs in a single entity
>>>>>> [14:29:38] [PASSED] A chain of dependent jobs across multiple entities
>>>>>> [14:29:38] [PASSED] Multiple independent job queues
>>>>>> [14:29:38] [PASSED] Multiple inter-dependent job queues
>>>>>> [14:29:38] ============== [PASSED] drm_sched_basic_test ===============
>>>>>> [14:29:38] [PASSED] drm_sched_basic_entity_cleanup
>>>>>> [14:29:38] ============== [PASSED] drm_sched_basic_tests ==============
>>>>>> [14:29:38] ======== drm_sched_basic_timeout_tests (1 subtest) =========
>>>>>> [14:29:40] [PASSED] drm_sched_basic_timeout
>>>>>> [14:29:40] ========== [PASSED] drm_sched_basic_timeout_tests ==========
>>>>>> [14:29:40] ======= drm_sched_basic_priority_tests (2 subtests) ========
>>>>>> [14:29:42] [PASSED] drm_sched_priorities
>>>>>> [14:29:42] [PASSED] drm_sched_change_priority
>>>>>> [14:29:42] ========= [PASSED] drm_sched_basic_priority_tests ==========
>>>>>> [14:29:42] ====== drm_sched_basic_modify_sched_tests (1 subtest) ======
>>>>>> [14:29:43] [PASSED] drm_sched_test_modify_sched
>>>>>> [14:29:43] ======= [PASSED] drm_sched_basic_modify_sched_tests ========
>>>>>> [14:29:43] ============================================================
>>>>>> [14:29:43] Testing complete. Ran 10 tests: passed: 10
>>>>>> [14:29:43] Elapsed time: 13.330s total, 0.001s configuring, 4.005s building, 9.276s running
>>>>>
>>>>> Yo,
>>>>>
>>>>> so I tried to test all this in QEMU and I am encountering some
>>>>> explosions when I activate the scheduler tests. Just the DRM tests
>>>>> boot fine.
>>>>>
>>>>> I'm using a kernel on relatively current drm-misc-next:
>>>>> 44d2f310f008
>>>>>
>>>>> I apply your series, then
>>>>> make defconfig
>>>>> make menuconfig # switch on kunit framework and scheduler tests
>>>>> install everything + initramfs
>>>>>
>>>>> Boot then causes errors as below. Just using the DRM kunit
>>>>> tests works
>>>>> fine.
>>>>>
>>>>> Excerpt of the first fault:
>>>>>
>>>>> [ 1.040513] # kunit_device: pass:3 fail:0 skip:0 total:3
>>>>> [ 1.040867] # Totals: pass:3 fail:0 skip:0 total:3
>>>>> [ 1.041296] ok 7 kunit_device
>>>>> [ 1.041936] KTAP version 1
>>>>> [ 1.042186] # Subtest: kunit_fault
>>>>> [ 1.042517] # module: kunit_test
>>>>> [ 1.042517] 1..1
>>>>> [ 1.043147] BUG: kernel NULL pointer dereference, address: 0000000000000000
>>>>> [ 1.043765] #PF: supervisor write access in kernel mode
>>>>> [ 1.044189] #PF: error_code(0x0002) - not-present page
>>>>> [ 1.044617] PGD 0 P4D 0
>>>>> [ 1.044818] Oops: Oops: 0002 [#1] PREEMPT SMP PTI
>>>>> [ 1.045380] CPU: 7 UID: 0 PID: 214 Comm: kunit_try_catch Tainted: G N 6.14.0-rc4-00387-g33e4632926a0 #8
>>>>> [ 1.046262] Tainted: [N]=TEST
>>>>> [ 1.046521] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-2.fc40 04/01/2014
>>>>> [ 1.047224] RIP: 0010:kunit_test_null_dereference+0x37/0x80
>>>>> [ 1.047706] Code: 80 b5 49 c7 c0 50 7f 56 b4 ba 01 00 00 00 65 48 8b 04 25 28 00 00 00 48 89 44 24 08 31 c0 48 8d 4c 24 07 48 c7 c6 80 8a 26 b5 <c7> 04 25 00 00 00 00 00 00 00 00 48 c7 87 70 01 00 00 a6 e9 8c b5
>>>>> [ 1.049204] RSP: 0000:ffffa609807c7ec8 EFLAGS: 00010246
>>>>> [ 1.049642] RAX: 0000000000000000 RBX: ffff91d982623000 RCX: ffffa609807c7ecf
>>>>> [ 1.050213] RDX: 0000000000000001 RSI: ffffffffb5268a80 RDI: ffffa60980013c68
>>>>> [ 1.050799] RBP: ffff91d98105afc0 R08: ffffffffb4567f50 R09: ffffffffb5807ce8
>>>>> [ 1.051375] R10: 0000000000000000 R11: 0000000000000001 R12: ffff91d98105afc0
>>>>> [ 1.051941] R13: ffff91d983c749c0 R14: ffffffffb45685e0 R15: ffff91d982623000
>>>>> [ 1.052543] FS: 0000000000000000(0000) GS:ffff91e48f9c0000(0000) knlGS:0000000000000000
>>>>> [ 1.053187] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>>>> [ 1.053649] CR2: 0000000000000000 CR3: 00000004cee30000 CR4: 00000000000006f0
>>>>> [ 1.054214] Call Trace:
>>>>> [ 1.054427] <TASK>
>>>>> [ 1.054597] ? __die+0x1e/0x60
>>>>> [ 1.054844] ? page_fault_oops+0x17b/0x4a0
>>>>> [ 1.055174] ? search_extable+0x26/0x30
>>>>> [ 1.055482] ? kunit_test_null_dereference+0x37/0x80
>>>>> [ 1.055888] ? search_module_extables+0x14/0x50
>>>>> [ 1.056255] ? exc_page_fault+0x6b/0x150
>>>>> [ 1.056571] ? asm_exc_page_fault+0x26/0x30
>>>>> [ 1.056898] ? __pfx_kunit_generic_run_threadfn_adapter+0x10/0x10
>>>>> [ 1.057387] ? __pfx_kunit_fail_assert_format+0x10/0x10
>>>>> [ 1.057799] ? kunit_test_null_dereference+0x37/0x80
>>>>> [ 1.058195] ? __kthread_parkme+0x33/0x80
>>>>> [ 1.058523] kunit_generic_run_threadfn_adapter+0x1c/0x40
>>>>> [ 1.058949] kthread+0xe9/0x1f0
>>>>> [ 1.059206] ? __pfx_kthread+0x10/0x10
>>>>> [ 1.059513] ret_from_fork+0x2f/0x50
>>>>> [ 1.059798] ? __pfx_kthread+0x10/0x10
>>>>> [ 1.060095] ret_from_fork_asm+0x1a/0x30
>>>>> [ 1.060421] </TASK>
>>>>> [ 1.060597] Modules linked in:
>>>>> [ 1.060841] CR2: 0000000000000000
>>>>> [ 1.061104] ---[ end trace 0000000000000000 ]---
>>>>> [ 1.061481] RIP: 0010:kunit_test_null_dereference+0x37/0x80
>>>>>
>>>>>
>>>>> I attach my kernel config and the full log file.
>>>>>
>>>>> What's awkward is that it does not seem to be related directly
>>>>> to
>>>>> sched, but only faults with sched.
>>>>>
>>>>>
>>>>> Could you try to reproduce this, Tvrtko?
>>>>
>>>> Any chance that between the two runs you somehow managed to enable
>>>> CONFIG_KUNIT_FAULT_TEST?
>>>
>>> Hmm in the sea of kunit_test_null_dereference there was a drm sched
>>> related fail too. Investigating.
>>
>> Well this was quite embarrassing - I had a use-after-free due to
>> relying on scheduler fences for querying job status. That's what I get
>> for over-relying on KASAN...
>
> Interesting, so KASAN had a false negative with those?
No, just that it cannot see a UAF if it doesn't happen due to timing. I was
referring to the false sense of security of "I am running with KASAN
therefore there cannot possibly be bugs." :)
>> I've fixed it by tracking the job completion status in the mock job
>> object directly and sent v4 out.
>>
>> Also interestingly, for me testing under qemu failed to catch it.
>> Only
>> one out of two real hw test machines hit it.
>
> That's very awkward. How can my QEMU be that different from your QEMU?
> Did you use the commit I provided above as the base of your kernel,
> too?
It was just about timing. The scheduler's free-job worker had to run before
the test dereferenced job->s_fence one last time. I replaced that with a
direct status check on the container object in the new version.
Regards,
Tvrtko
>> Excellent that you gave it
>> a spin and caught it before merge, thanks for that!
>
> No worries, that's our mission. Cool that you found it so quickly!
>
> P.
>
>
>>
>> Regards,
>>
>> Tvrtko
>>
>>>
>>> Regards,
>>>
>>> Tvrtko
>>>
>>>>>
>>>>>
>>>>> Thanks
>>>>> P.
>>>>>
>>>>>
>>>>>>
>>>>>> v2:
>>>>>> * Parameterize a bunch of similar tests.
>>>>>> * Improve test commentary.
>>>>>> * Rename TDR test to timeout. (Christian)
>>>>>> * Improve quality and consistency of naming. (Philipp)
>>>>>>
>>>>>> RFC v2 -> series v1:
>>>>>> * Rebased for drm_sched_init changes.
>>>>>> * Fixed modular build.
>>>>>> * Added some comments.
>>>>>> * Filename renames. (Philipp)
>>>>>>
>>>>>> v2:
>>>>>> * Dealt with a bunch of checkpatch warnings.
>>>>>>
>>>>>> v3:
>>>>>> * Some mock API renames, kerneldoc grammar fixes and
>>>>>> indentation
>>>>>> fixes.
>>>>>>
>>>>>> Cc: Christian König <christian.koenig at amd.com>
>>>>>> Cc: Danilo Krummrich <dakr at kernel.org>
>>>>>> Cc: Matthew Brost <matthew.brost at intel.com>
>>>>>> Cc: Philipp Stanner <phasta at kernel.org>
>>>>>>
>>>>>> Tvrtko Ursulin (5):
>>>>>> drm: Move some options to separate new Kconfig
>>>>>> drm/scheduler: Add scheduler unit testing infrastructure
>>>>>> and some
>>>>>> basic tests
>>>>>> drm/scheduler: Add a simple timeout test
>>>>>> drm/scheduler: Add basic priority tests
>>>>>> drm/scheduler: Add a basic test for modifying entities
>>>>>> scheduler
>>>>>> list
>>>>>>
>>>>>> drivers/gpu/drm/Kconfig | 109 +----
>>>>>> drivers/gpu/drm/Kconfig.debug | 115 +++++
>>>>>> drivers/gpu/drm/scheduler/.kunitconfig | 12 +
>>>>>> drivers/gpu/drm/scheduler/Makefile | 2 +
>>>>>> drivers/gpu/drm/scheduler/tests/Makefile | 7 +
>>>>>> .../gpu/drm/scheduler/tests/mock_scheduler.c | 323
>>>>>> +++++++++++++
>>>>>> drivers/gpu/drm/scheduler/tests/sched_tests.h | 223
>>>>>> +++++++++
>>>>>> drivers/gpu/drm/scheduler/tests/tests_basic.c | 426
>>>>>> ++++++++++++++++++
>>>>>> 8 files changed, 1113 insertions(+), 104 deletions(-)
>>>>>> create mode 100644 drivers/gpu/drm/Kconfig.debug
>>>>>> create mode 100644 drivers/gpu/drm/scheduler/.kunitconfig
>>>>>> create mode 100644 drivers/gpu/drm/scheduler/tests/Makefile
>>>>>> create mode 100644
>>>>>> drivers/gpu/drm/scheduler/tests/mock_scheduler.c
>>>>>> create mode 100644
>>>>>> drivers/gpu/drm/scheduler/tests/sched_tests.h
>>>>>> create mode 100644
>>>>>> drivers/gpu/drm/scheduler/tests/tests_basic.c
>>>>>>
>>>>>
>>>>
>>>
>>
>