[PATCH 0/6] Fix crash after reloading a driver using ttm

Karol Herbst kherbst at redhat.com
Tue Apr 16 11:20:59 UTC 2019


On Tue, Apr 16, 2019 at 1:07 PM Koenig, Christian
<Christian.Koenig at amd.com> wrote:
>
> Am 16.04.19 um 12:54 schrieb Karol Herbst:
> > On Tue, Apr 16, 2019 at 11:12 AM Koenig, Christian
> > <Christian.Koenig at amd.com> wrote:
> >> Am 16.04.19 um 11:10 schrieb Karol Herbst:
> >>> On Tue, Apr 16, 2019 at 8:38 AM Christian König
> >>> <ckoenig.leichtzumerken at gmail.com> wrote:
> >>>> Am 16.04.19 um 02:35 schrieb Karol Herbst:
> >>>>> Kobjects are supposed to be dynamically allocated, but with recent changes
> >>>>> this rule was violated. Reverting those commits fixes crashes when a drm
> >>>>> driver using TTM gets loaded again.
> >>>>>
> >>>>> The object in question is "ttm_mem_glob" declared inside
> >>>>> "include/drm/ttm/ttm_memory.h" and instatiated inside
> >>>>> "drivers/gpu/drm/ttm/ttm_memory.c".
> >>>>>
> >>>>> from "Documentation/kobject.txt":
> >>>>> "Because kobjects are dynamic, they must not be declared statically or on
> >>>>> the stack, but instead, always allocated dynamically.  Future versions of
> >>>>> the kernel will contain a run-time check for kobjects that are created
> >>>>> statically and will warn the developer of this improper usage."
> >>>>>
> >>>>> Unloading ttm before reloading the driver workarounds that crash, because
> >>>>> the memory backing the kobject member "kobj" is cleaned up. The kobject_del
> >>>>> and kobject_put function never free or clean up the kobject object leaving
> >>>>> it in an undefined state.
> >>>>>
> >>>>> I reverted a few more commits to make it less painful for me to rever this
> >>>>> rather big change.
> >>>> Well, NAK. By reverting those change you also re-introduced the problems
> >>>> we originally fixed with those patches.
> >>>>
> >>>> Please work on a proper fix instead,
> >>>> Christian.
> >>> And which problem was that besides duplicated code (or maybe even a
> >>> bit of memory consumption if multiple ttm driver were used)? If I had
> >>> to choose between duplicated code and a crash, I choose the former.
> >>>
> >>> Maybe I missed the real reason why those changes are made, but the
> >>> commit messages don't really seem to tell me.
> >> The old implementation crashed because different drivers tried to
> >> allocate the same kobj.
> >>
> >> Crashing in one way is not better than crashing in another way.
> >>
> >> Christian.
> >>
> > how can that old crash be triggered? By loading two ttm based drivers
> > at the same time or by other means?
>
> Yes, exactly. Even worse it could be triggered by one driver
> instantiating multiple times at the same time, e.g two AMD GPUs in the
> same system.
>
> On the other hand I completely agree that using kobj static is
> completely nuts. The problem is that using a kobj was the wrong approach
> in the first place.
>
> In other words when you have something which controls a global behavior
> of a module, what do you use? Right, a module parameter.
>
> Point is that we can't get away from those kobj without breaking the
> uapi, so that is something which can't be done easily.
>
> Regards,
> Christian.

okay, thanks for the explanation!

Currently I will be testing your first patch you sent out on top of
5.0.7, so that we could propose that for stable. The second didn't
apply, but it didn't look relevant for the actual fix. Will reply to
the patch with my results then.

Thanks


More information about the dri-devel mailing list