[PATCH v3 0/5] DRM scheduler kunit tests

Tvrtko Ursulin tvrtko.ursulin at igalia.com
Thu Mar 6 17:10:11 UTC 2025


On 06/03/2025 15:18, Tvrtko Ursulin wrote:
> 
> On 06/03/2025 14:56, Tvrtko Ursulin wrote:
>>
>> On 06/03/2025 12:37, Philipp Stanner wrote:
>>> On Tue, 2025-03-04 at 13:10 +0000, Tvrtko Ursulin wrote:
>>>> There has repeatedly been quite a bit of apprehension when any change
>>>> to the DRM scheduler is proposed, the two main reasons being that the
>>>> code base is considered fragile, not well understood and not very well
>>>> documented, and secondly the lack of systematic testing outside the
>>>> vendor-specific test suites and/or test farms.
>>>>
>>>> This series is an attempt to dislodge this status quo by adding some
>>>> unit tests using the kunit framework.
>>>>
>>>> The general approach is that there is a mock "hardware" backend which
>>>> can be controlled from tests, which in turn allows exercising various
>>>> scheduler code paths.
>>>>
>>>> Only some simple basic tests get added in the series and hopefully it
>>>> is easy to understand what the tests are doing.
>>>>
>>>> An obligatory "screenshot" for reference:
>>>>
>>>> [14:29:37] ============ drm_sched_basic_tests (3 subtests) ============
>>>> [14:29:38] [PASSED] drm_sched_basic_submit
>>>> [14:29:38] ================== drm_sched_basic_test ===================
>>>> [14:29:38] [PASSED] A queue of jobs in a single entity
>>>> [14:29:38] [PASSED] A chain of dependent jobs across multiple entities
>>>> [14:29:38] [PASSED] Multiple independent job queues
>>>> [14:29:38] [PASSED] Multiple inter-dependent job queues
>>>> [14:29:38] ============== [PASSED] drm_sched_basic_test ===============
>>>> [14:29:38] [PASSED] drm_sched_basic_entity_cleanup
>>>> [14:29:38] ============== [PASSED] drm_sched_basic_tests ==============
>>>> [14:29:38] ======== drm_sched_basic_timeout_tests (1 subtest) =========
>>>> [14:29:40] [PASSED] drm_sched_basic_timeout
>>>> [14:29:40] ========== [PASSED] drm_sched_basic_timeout_tests ==========
>>>> [14:29:40] ======= drm_sched_basic_priority_tests (2 subtests) ========
>>>> [14:29:42] [PASSED] drm_sched_priorities
>>>> [14:29:42] [PASSED] drm_sched_change_priority
>>>> [14:29:42] ========= [PASSED] drm_sched_basic_priority_tests ==========
>>>> [14:29:42] ====== drm_sched_basic_modify_sched_tests (1 subtest) ======
>>>> [14:29:43] [PASSED] drm_sched_test_modify_sched
>>>> [14:29:43] ======= [PASSED] drm_sched_basic_modify_sched_tests ========
>>>> [14:29:43] ============================================================
>>>> [14:29:43] Testing complete. Ran 10 tests: passed: 10
>>>> [14:29:43] Elapsed time: 13.330s total, 0.001s configuring, 4.005s building, 9.276s running
>>>
>>> Yo,
>>>
>>> so I tried to test all this in QEMU and I am encountering some
>>> explosions when I activate the scheduler tests. Just the DRM tests
>>> boot fine.
>>>
>>> I'm using a kernel on relatively current drm-misc-next: 44d2f310f008
>>>
>>> I apply your series, then
>>> make defconfig
>>> make menuconfig # switch on kunit framework and scheduler tests
>>> install everything + initramfs
>>>
>>> Boot then causes errors as below. Just using the DRM kunit tests works
>>> fine.
>>>
>>> Excerpt of the first fault:
>>>
>>> [    1.040513] # kunit_device: pass:3 fail:0 skip:0 total:3
>>> [    1.040867] # Totals: pass:3 fail:0 skip:0 total:3
>>> [    1.041296] ok 7 kunit_device
>>> [    1.041936]     KTAP version 1
>>> [    1.042186]     # Subtest: kunit_fault
>>> [    1.042517]     # module: kunit_test
>>> [    1.042517]     1..1
>>> [    1.043147] BUG: kernel NULL pointer dereference, address: 0000000000000000
>>> [    1.043765] #PF: supervisor write access in kernel mode
>>> [    1.044189] #PF: error_code(0x0002) - not-present page
>>> [    1.044617] PGD 0 P4D 0
>>> [    1.044818] Oops: Oops: 0002 [#1] PREEMPT SMP PTI
>>> [    1.045380] CPU: 7 UID: 0 PID: 214 Comm: kunit_try_catch Tainted: G                 N 6.14.0-rc4-00387-g33e4632926a0 #8
>>> [    1.046262] Tainted: [N]=TEST
>>> [    1.046521] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-2.fc40 04/01/2014
>>> [    1.047224] RIP: 0010:kunit_test_null_dereference+0x37/0x80
>>> [    1.047706] Code: 80 b5 49 c7 c0 50 7f 56 b4 ba 01 00 00 00 65 48 8b 04 25 28 00 00 00 48 89 44 24 08 31 c0 48 8d 4c 24 07 48 c7 c6 80 8a 26 b5 <c7> 04 25 00 00 00 00 00 00 00 00 48 c7 87 70 01 00 00 a6 e9 8c b5
>>> [    1.049204] RSP: 0000:ffffa609807c7ec8 EFLAGS: 00010246
>>> [    1.049642] RAX: 0000000000000000 RBX: ffff91d982623000 RCX: ffffa609807c7ecf
>>> [    1.050213] RDX: 0000000000000001 RSI: ffffffffb5268a80 RDI: ffffa60980013c68
>>> [    1.050799] RBP: ffff91d98105afc0 R08: ffffffffb4567f50 R09: ffffffffb5807ce8
>>> [    1.051375] R10: 0000000000000000 R11: 0000000000000001 R12: ffff91d98105afc0
>>> [    1.051941] R13: ffff91d983c749c0 R14: ffffffffb45685e0 R15: ffff91d982623000
>>> [    1.052543] FS:  0000000000000000(0000) GS:ffff91e48f9c0000(0000) knlGS:0000000000000000
>>> [    1.053187] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>> [    1.053649] CR2: 0000000000000000 CR3: 00000004cee30000 CR4: 00000000000006f0
>>> [    1.054214] Call Trace:
>>> [    1.054427]  <TASK>
>>> [    1.054597]  ? __die+0x1e/0x60
>>> [    1.054844]  ? page_fault_oops+0x17b/0x4a0
>>> [    1.055174]  ? search_extable+0x26/0x30
>>> [    1.055482]  ? kunit_test_null_dereference+0x37/0x80
>>> [    1.055888]  ? search_module_extables+0x14/0x50
>>> [    1.056255]  ? exc_page_fault+0x6b/0x150
>>> [    1.056571]  ? asm_exc_page_fault+0x26/0x30
>>> [    1.056898]  ? __pfx_kunit_generic_run_threadfn_adapter+0x10/0x10
>>> [    1.057387]  ? __pfx_kunit_fail_assert_format+0x10/0x10
>>> [    1.057799]  ? kunit_test_null_dereference+0x37/0x80
>>> [    1.058195]  ? __kthread_parkme+0x33/0x80
>>> [    1.058523]  kunit_generic_run_threadfn_adapter+0x1c/0x40
>>> [    1.058949]  kthread+0xe9/0x1f0
>>> [    1.059206]  ? __pfx_kthread+0x10/0x10
>>> [    1.059513]  ret_from_fork+0x2f/0x50
>>> [    1.059798]  ? __pfx_kthread+0x10/0x10
>>> [    1.060095]  ret_from_fork_asm+0x1a/0x30
>>> [    1.060421]  </TASK>
>>> [    1.060597] Modules linked in:
>>> [    1.060841] CR2: 0000000000000000
>>> [    1.061104] ---[ end trace 0000000000000000 ]---
>>> [    1.061481] RIP: 0010:kunit_test_null_dereference+0x37/0x80
>>>
>>>
>>> I attach my kernel config and the full log file.
>>>
>>> What's awkward is that it does not seem to be related directly to
>>> sched, but only faults with sched.
>>>
>>>
>>> Could you try to reproduce this, Tvrtko?
>>
>> Any chance that between the two runs you somehow managed to enable 
>> CONFIG_KUNIT_FAULT_TEST?
> 
> Hmm, in the sea of kunit_test_null_dereference there was a drm sched 
> related failure too. Investigating.

Well, this was quite embarrassing - I had a use-after-free due to relying 
on scheduler fences for querying job status. That's what I get for 
over-relying on KASAN...

I've fixed it by tracking the job completion status in the mock job 
object directly and sent v4 out.
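
To illustrate the idea, here is a minimal userspace sketch, not the actual 
v4 code. All names in it (mock_fence, mock_job, mock_job_complete and so 
on) are hypothetical and it is single threaded, so it ignores the locking 
the real mock scheduler needs. It only shows the pattern: the completion 
status is recorded in the job object, which the test owns, so a status 
query never touches a fence whose last reference may already be gone.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for a refcounted fence owned by the scheduler. */
struct mock_fence {
	int refcount;
	bool signaled;
};

/* Hypothetical mock job; its lifetime is controlled by the test. */
struct mock_job {
	struct mock_fence *fence;	/* may be released by the scheduler */
	bool done;			/* status tracked in the job itself */
};

static struct mock_fence *mock_fence_create(void)
{
	struct mock_fence *f = calloc(1, sizeof(*f));

	if (f)
		f->refcount = 1;
	return f;
}

static void mock_fence_put(struct mock_fence *f)
{
	assert(f->refcount > 0);
	if (--f->refcount == 0)
		free(f);
}

static bool mock_job_init(struct mock_job *job)
{
	job->done = false;
	job->fence = mock_fence_create();
	return job->fence != NULL;
}

/* Called by the mock "hardware" backend when the job completes. */
static void mock_job_complete(struct mock_job *job)
{
	/* Record completion in the job before the fence can go away. */
	job->done = true;

	job->fence->signaled = true;
	mock_fence_put(job->fence);	/* last reference: fence is freed */
	job->fence = NULL;
}

/* Status query which never dereferences the fence. */
static bool mock_job_is_done(const struct mock_job *job)
{
	return job->done;
}
```

The broken variant would have mock_job_is_done() return 
job->fence->signaled, which reads freed memory once the last fence 
reference has been dropped.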

Also, interestingly, testing under QEMU failed to catch it for me. Only 
one out of two real hw test machines hit it. Excellent that you gave it 
a spin and caught it before merge, thanks for that!

Regards,

Tvrtko

> 
> Regards,
> 
> Tvrtko
> 
>>>
>>>
>>> Thanks
>>> P.
>>>
>>>
>>>>
>>>> v2:
>>>>   * Parameterize a bunch of similar tests.
>>>>   * Improve test commentary.
>>>>   * Rename TDR test to timeout. (Christian)
>>>>   * Improve quality and consistency of naming. (Philipp)
>>>>
>>>> RFC v2 -> series v1:
>>>>   * Rebased for drm_sched_init changes.
>>>>   * Fixed modular build.
>>>>   * Added some comments.
>>>>   * Filename renames. (Philipp)
>>>>
>>>> v2:
>>>>   * Dealt with a bunch of checkpatch warnings.
>>>>
>>>> v3:
>>>>   * Some mock API renames, kerneldoc grammar fixes and indentation
>>>> fixes.
>>>>
>>>> Cc: Christian König <christian.koenig at amd.com>
>>>> Cc: Danilo Krummrich <dakr at kernel.org>
>>>> Cc: Matthew Brost <matthew.brost at intel.com>
>>>> Cc: Philipp Stanner <phasta at kernel.org>
>>>>
>>>> Tvrtko Ursulin (5):
>>>>    drm: Move some options to separate new Kconfig
>>>>    drm/scheduler: Add scheduler unit testing infrastructure and some
>>>>      basic tests
>>>>    drm/scheduler: Add a simple timeout test
>>>>    drm/scheduler: Add basic priority tests
>>>>    drm/scheduler: Add a basic test for modifying entities scheduler list
>>>>
>>>>   drivers/gpu/drm/Kconfig                       | 109 +----
>>>>   drivers/gpu/drm/Kconfig.debug                 | 115 +++++
>>>>   drivers/gpu/drm/scheduler/.kunitconfig        |  12 +
>>>>   drivers/gpu/drm/scheduler/Makefile            |   2 +
>>>>   drivers/gpu/drm/scheduler/tests/Makefile      |   7 +
>>>>   .../gpu/drm/scheduler/tests/mock_scheduler.c  | 323 +++++++++++++
>>>>   drivers/gpu/drm/scheduler/tests/sched_tests.h | 223 +++++++++
>>>>   drivers/gpu/drm/scheduler/tests/tests_basic.c | 426 ++++++++++++++++++
>>>>   8 files changed, 1113 insertions(+), 104 deletions(-)
>>>>   create mode 100644 drivers/gpu/drm/Kconfig.debug
>>>>   create mode 100644 drivers/gpu/drm/scheduler/.kunitconfig
>>>>   create mode 100644 drivers/gpu/drm/scheduler/tests/Makefile
>>>>   create mode 100644 drivers/gpu/drm/scheduler/tests/mock_scheduler.c
>>>>   create mode 100644 drivers/gpu/drm/scheduler/tests/sched_tests.h
>>>>   create mode 100644 drivers/gpu/drm/scheduler/tests/tests_basic.c
>>>>
>>>
>>
> 


