[PATCH v4 1/5] drm: Move some options to separate new Kconfig
Philipp Stanner
phasta at mailbox.org
Fri Mar 7 18:06:09 UTC 2025
On Fri, 2025-03-07 at 16:59 +0000, Tvrtko Ursulin wrote:
>
> On 07/03/2025 13:41, Philipp Stanner wrote:
> > Hi,
> >
> > You forgot to put folks in CC as recipients for the cover letter :(
> >
> >
> > On Thu, 2025-03-06 at 17:05 +0000, Tvrtko Ursulin wrote:
> > > Move some options out into a new debug specific kconfig file in
> > > order
> > > to
> > > make things a bit cleaner.
> > >
> > > Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin at igalia.com>
> > > Cc: Christian König <christian.koenig at amd.com>
> > > Cc: Danilo Krummrich <dakr at kernel.org>
> > > Cc: Matthew Brost <matthew.brost at intel.com>
> > > Cc: Philipp Stanner <phasta at kernel.org>
> >
> > We all have our individual workflows, so don't take this as
> > lecturing or anything – I just suspect that I was forgotten in the
> > cover letter because you Cc people by hand in the individual
> > patches.
> >
> > What I do is run get_maintainer and then put the individuals
> > listed there into the --to= field. That sends the entire series to
> > all of them; see the sketch below.
> >
> > Only sometimes, when there's a huge list of recipients or when the
> > patches of a series are very independent, do I deviate from that
> > rule.
> >
> > JFYI
>
> Note that it was there in v3; I just omitted to paste it this time.
>
> > Anyway, we have a bigger problem with the entire series. I now
> > tested again with the same setup as yesterday and the faults are
> > indeed gone, so that's good.
> >
> > But to be sure, I then ran kmemleak and got a list of leaks that is
> > more than 2000 lines long.
>
> There is this comment on drm_sched_fini() which ends with:
>
> """
> ...
> * This stops submission of new jobs to the hardware through
> * drm_sched_backend_ops.run_job(). Consequently,
> drm_sched_backend_ops.free_job()
> * will not be called for all jobs still in
> drm_gpu_scheduler.pending_list.
> * There is no solution for this currently. Thus, it is up to the
> driver to make
> * sure that:
> *
> * a) drm_sched_fini() is only called after for all submitted jobs
> * drm_sched_backend_ops.free_job() has been called or that
> * b) the jobs for which drm_sched_backend_ops.free_job() has not
> been
> called
> * after drm_sched_fini() ran are freed manually.
> *
>
> * FIXME: Take care of the above problem and prevent this function
> from
> leaking
> * the jobs in drm_gpu_scheduler.pending_list under any
> circumstances.
> """
>
> I got bitten by that. Keep forgetting how fragile the thing is.. :(
argh damn, those are *all* from the pending_list?!
OK. Well.
Now we've got a philosophical problem:
We still have to fix those leaks (I'm still working on it, but my
current attempt has failed and I'll probably fall back to a refcount
solution).
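
Very roughly, the refcount idea would be something along these lines
(a pure sketch with made-up names, not the actual drm_sched API – the
point is just that pending jobs pin the scheduler via a kref):

#include <linux/container_of.h>
#include <linux/kref.h>
#include <linux/slab.h>

struct sched_stub {
	struct kref refcount;
	/* ... pending_list, submit / free work items, etc. ... */
};

static void sched_stub_release(struct kref *kref)
{
	struct sched_stub *s = container_of(kref, struct sched_stub,
					    refcount);

	/* Last reference gone – only now is it safe to free. */
	kfree(s);
}

static struct sched_stub *sched_stub_create(void)
{
	struct sched_stub *s = kzalloc(sizeof(*s), GFP_KERNEL);

	if (s)
		kref_init(&s->refcount); /* initial ref, dropped in fini() */
	return s;
}

/* Each job takes a reference when it is queued... */
static void sched_stub_job_queued(struct sched_stub *s)
{
	kref_get(&s->refcount);
}

/* ...and drops it again from free_job(). */
static void sched_stub_job_freed(struct sched_stub *s)
{
	kref_put(&s->refcount, sched_stub_release);
}

/* fini() merely drops the initial reference, so leftover jobs on
 * pending_list can still be freed (and the scheduler with them)
 * whenever the last job goes away. */
static void sched_stub_fini(struct sched_stub *s)
{
	kref_put(&s->refcount, sched_stub_release);
}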
And to see whether the fix actually fixes the leaks, it would be
handy to use the kunit tests directly.

After all, this is what the kunit tests are there for: showing what is
broken within the scheduler. And those leaks definitely qualify. Or
should kunit tests follow the same rules we demand from drivers?
I'd like to hear more opinions about that.
@Danilo, @Dave, @Sima
would it be OK if we add kunit tests for the scheduler to DRM that
cause leaks until we can fix them?
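
Just to make the question concrete, such a test could look roughly
like this (everything except the kunit boilerplate is made up; a real
test would drive an actual scheduler instance and let kmemleak flag
the jobs left on pending_list):

#include <kunit/test.h>

static void drm_sched_fini_leaks_pending_jobs(struct kunit *test)
{
	/*
	 * Sketch: 1) create a scheduler and an entity, 2) submit a few
	 * jobs whose fences never signal, 3) call drm_sched_fini()
	 * while they still sit on pending_list. Until the leak is
	 * fixed, kmemleak reports the leftover jobs – which is exactly
	 * what the test is meant to show.
	 */
	kunit_skip(test, "placeholder until the pending_list leak is fixed");
}

static struct kunit_case drm_sched_fini_cases[] = {
	KUNIT_CASE(drm_sched_fini_leaks_pending_jobs),
	{}
};

static struct kunit_suite drm_sched_fini_suite = {
	.name = "drm_sched_fini_leaks",
	.test_cases = drm_sched_fini_cases,
};
kunit_test_suite(drm_sched_fini_suite);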
P.
>
> Regards,
>
> Tvrtko
>