[PATCH v7 0/7] DRM scheduler kunit tests

Thu Mar 13 10:15:24 UTC 2025

On 13/03/2025 08:34, Philipp Stanner wrote:
> On Wed, 2025-03-12 at 14:33 +0000, Tvrtko Ursulin wrote:
>>
>> On 12/03/2025 13:49, Philipp Stanner wrote:
>>
>> [snip]
>>
>>>>>> There I see a UAF. Do you have an idea what that might be? I
>>>>>> would only
>>>>>> expect memory leaks and the test assertions failing.
>>>>>
>>>>> It is expected all hell to break loose in aspirational mode.
>>>>> There
>>>>> the
>>>>> mock scheduler shutdown relies solely on drm_sched_fini. So all
>>>>> tests
>>>>> which do not fully idle the mock scheduler themselves, whether
>>>>> implicity
>>>>> or explicitly, will explode in various ways.
>>>>
>>>> To clarify I made it like that because that is what I understood
>>>> was
>>>> your vision for drm_sched_fini. That it would be able to clean
>>>> everything up etc.
>>>>
>>>> If you meant something different we need to adjust it. Although
>>>> it
>>>> could
>>>> be done later as well.
>>>
>>> My primary intention was to use those tests to see whether my
>>> memory
>>> leak fix works or not.
>>
>> Okay here I don't know exactly what are you changing and how you are
>> testing to comment.
>>
>>> Thus I'd need something like in v4, where the tests run as intended
>>> but
>>> leak the jobs in the pending_list. Would indeed be neat if they
>>> also
>>> check for list_len(pending_list) so that you don't have to run
>>> kmemleak
>>> manually.
>>>
>>> And since we shouldn't have such leaks in production tests the idea
>>> of
>>> optionally enabling them arose.
>>>
>>> More generally speaking, I think that the tests should behave as we
>>> expect drivers to behave *currently*, while we have the option for
>>> the
>>> tests to purposefully misbehave. In my case this means: just leak
>>> the
>>> jobs and optionally tell about the leaks.
>>>
>>> When used as intended, the scheduler currently AFAIK doesn't have
>>> UAF
>>> or nullptrderefs, so they shouldn't occur in tests that use it as
>>> intended. Now the leaks are special because we never intended them.
>>>
>>> So what surprised me is that, when compared to v4, this v7 here
>>> actually now also has UAF. Besides not cleaning up leaks, what are
>>> you
>>> doing in aspiritional mode?
>>>
>>> Let's talk it through before you invest your time into v8
>>
>> 1)
>>
>> "Normal" mode is fine, do you agree? No leaks, no UAF. Ie. it is
>> exercising the scheduler how drivers are supposed to use it today.
> 
> Yes, it is fine.
> 
>>
>> 2)
>>
>> Aspirational mode I added on your request and you can define how you
>> want it to work.
> 
> My personal intention is simple: have the jobs leaked like in v4,
> without UAF or faults and the like.
> 
>>
>> Lets have a look in drm_mock_sched_fini() and see that it is pretty
>> straightforward. First I'll pre-process and annotate the normal
>> version:
>>
>> void drm_mock_sched_fini(struct drm_mock_scheduler *sched)
>> {
>> 	struct drm_mock_sched_job *job, *next;
>> 	struct kunit  *test = sched->test;
>> 	unsigned long flags;
>> 	LIST_HEAD(signal);
>>
>> 	// Stop the scheduler workqueues so nothing further gets
>> scheduled
>> 	drm_sched_wqueue_stop(&sched->base);
>>
>> 	// Idle the mock scheduler in-flight jobs:
>>
>> 	// First move them to a local list so any in parallel
>> "hardware"
>> activity does not see them any more
>>
>> 	spin_lock_irqsave(&sched->lock, flags);
>> 	list_for_each_entry_safe(job, next, &sched->job_list, link)
>> 		list_move_tail(&job->link, &signal);
>> 	spin_unlock_irqrestore(&sched->lock, flags);
>>
>> 	// Now for jobs which "hardware" hasn't processed yet we do:
>> 	//
>> 	// 1. Cancel the completion timer
>> 	// 2. Mark the job as complete (signal the hw fence, wake up
>> waiters)
>> 	// 3. Release the hardware fence (list had a reference)
>> 	// 4. Free the job itself
>>
>> 	list_for_each_entry(job, &signal, link) {
>> 		hrtimer_cancel(&job->timer);
>> 		drm_mock_sched_job_complete(job);
>> 		dma_fence_put(&job->hw_fence);
>> 		drm_sched_job_cleanup(&job->base);
>> 	}
>>
>> 	// Finally drm_sched_fini
>> 	drm_sched_fini(&sched->base);
>> }
>>
>> In aspirational mode it becomes this:
>>
>> void drm_mock_sched_fini(struct drm_mock_scheduler *sched)
>> {
>> 	struct kunit  *test = sched->test;
>>
>> 	drm_sched_fini(&sched->base);
>>
>> 	KUNIT_ASSERT_TRUE(test, list_empty(&sched->job_list));
>> 	KUNIT_ASSERT_TRUE(test, list_empty(&sched-
>>> base.pending_list));
>> }
>>
>> So relies on drm_sched_fini to cleanup _everything_. Including the
>> driver (mock scheduler) state. Which is what I understood you wanted.
> 
> Thx for the detailed explanation. _Everything_ seems a lot, though.
> Even in this version, you still stop submission to your entities, don't
> you?
> 
> IOW: is drm_sched_fini(), in aspirational mode, responsible for more
> than preventing the leaks (and for what it is already doing in mainline
> linux)?

Yes it is, it would need to idle and free all driver state. Again, this 
is I understood you wanted to work towards, but as this thread will 
conclude as dropping it for now it doesn't matter.

>> For example drm_sched_basic_entity_cleanup test case will exit (kill
>> the
>> entities and tear down the scheduler) with half of the submitted jobs
>> still in flight. That's one example of what will be caught here by
>> the
>> asserts and also UAF since kunit test will just exit and free up
>> memory
>> regardless.
> 
> Hmm. OK, so kunit does the free. But who does the access?
> drm_sched_fini() still stops all the work items.

In-flight jobs. Completion timers will fire, simulating hardware 
interrupts firing with a real driver.

>> If this isn't what you had in mind you can either a) tell me what you
>> want maybe I can tweak it*, or b) we can drop the aspirational patch
>> for
>> now.
> 
> 
> So, here's the bad news. I had a chat with Danilo yesterday who
> reminded me of the fact that there are testers out there (like the
> Intel bots) which do random kconfigs - so avoiding that with another
> kconfig option does not solve the problem.
> 
> Thus, I think the way to go is to indeed drop aspirational mode and
> whoever works on Schrödinger-Problems (like myself) has to tweak the
> tests locally.
> 
> That's probably even more productive process-wise, otherwise we'd have
> to remove broken aspects of (or the entire) aspirational mode every
> time that we solved a known problem.
> 
> Unless you have a better proposal, let's drop that mode.
> 
> 
> Sorry for the zig-zag, but that wasn't on my / our radar

No worries. Can you review the v7 as is or you want me to send v8? I am 
pretty sure aspirational mode patch is standalone and could be simply 
dropped when merging.

Regards,

Tvrtko

>> *)
>> For example it would be possible to avoid UAFs by moving away from
>> kunit
>> managed allocations and just leak memory. Plus, it would also be
>> required to cancel the timers or so.
>>
>> For running in the VM or UML cases, which I thought were typical, it
>> wouldn't be a benefit, but if you are worried about running on the
>> host
>> kernel and expect no crashes, then that aspect would need to change.
>>
>> Regards,
>>
>> Tvrtko
>>
>