[PATCH v10 08/10] dyndbg: add print-to-tracefs, selftest with it - RFC

Tue Nov 23 09:32:28 UTC 2021

First off, let me reiterate that this feature would be invaluable as user-space
developers. It's often pretty difficult to figure out the cause of an EINVAL,
we have to ask users to follow complicated instructions [1] to grab DRM logs.
Then have to skim through several megabytes of logs to find the error.

I have a hack [2] which just calls system("sudo dmesg") after a failed atomic
commit, it's been pretty handy. But it's really just a hack, a proper solution
would be awesome.

[1]: https://gitlab.freedesktop.org/wlroots/wlroots/-/wikis/DRM-Debugging
[2]: https://gitlab.freedesktop.org/emersion/libliftoff/-/merge_requests/61

> > > > Having a subsystem specific trace buffer would allow subsystem specific
> > > > trace log permissions depending on the sensitivity of the data. But
> > > > doesn't drm output today go to the system log which is typically world
> > > > readable today?

dmesg isn't world-readable these days, it's been changed recently-ish (last
year?) at least on my distribution (Arch). I need root to grab dmesg.

(Maybe we can we just let the DRM master process grab the logs?)

> > > Yes, and that is exactly the problem. The DRM debug output is so high
> > > traffic it would make the system log both unusable due to cruft and
> > > slow down the whole machine. The debug output is only useful when
> > > something went wrong, and at that point it is too late to enable
> > > debugging. That's why a flight recorder with an over-written circular
> > > in-memory buffer is needed.
> >
> > Seans patch reuses enum drm_debug_category to split the tracing
> > stream into 10 sub-streams
> > - how much traffic from each ?
> > - are some sub-streams more valuable for post-mortem ?
> > - any value from further refinement of categories ?
> > - drop irrelevant callsites individually to reduce clutter, extend
> > buffer time/space ?
>
> I think it's hard to predict which sub-streams you are going to need
> before you have a bug to debug. Hence I would err on the side of
> enabling too much. This also means that better or more refined
> categorisation might not be that much of help - or if it is, then are
> the excluded debug messages worth having in the kernel to begin with.
> Well, we're probably not that interested in GPU debugs but just
> everything related to the KMS side, which on the existing categories
> is... everything except half of CORE and DRIVER, maybe? Not sure.

We've been recommending drm.debug=0x19F so far (see wiki linked above).
KMS + PRIME + ATOMIC + LEASE is definitely something we want in, and
CORE + DRIVER contains other useful info. We definitely don't want VBL.

> My feeling is that that could mean in the order of hundreds of log
> events at framerate (e.g. 60 times per second) per each enabled output
> individually. And per DRM device, of course. This is with the
> uninteresting GPU debugs already excluded.

Indeed, successful KMS atomic commits already generate a lot of noise. On my
machine, setting drm.debug=0x19F and running the following command:

    sudo dmesg -w | pv >/dev/null

I get 400KiB/s when idling, and 850KiB/s when wiggling the cursor.

> Still, I don't think the flight recorder buffer would need to be
> massive. I suspect it would be enough to hold a few frames' worth which
> is just a split second under active operation. When something happens,
> the userspace stack is likely going to stop on its tracks immediately
> to collect the debug information, which means the flooding should pause
> and the relevant messages don't get overwritten before we get them. In
> a multi-seat system where each device is controlled by a separate
> display server instance, per-device logs would help with this. OTOH,
> multi-seat is not a very common use case I suppose.

There's also the case of multi-GPU where GPU B's logs could clutter GPU A's,
making it harder to understand the cause of an atomic commit failure on GPU A.
So per-device logs would be useful, but not a hard requirement for me, having
*anything* at all would already be a big win.

In my experiments linked above [2], system("sudo dmesg") after atomic commit
failure worked pretty well, and the bottom of the log contained the cause of
the failure. It was pretty useful to system("sudo dmesg -C") before performing
an atomic commit, to be able to only collect the extract of the log relevant to
the atomic commit.

Having some kind of "marker" mechanism could be pretty cool. "Mark" the log
stream before performing an atomic commit (ideally that'd just return e.g. an
uint64 offset), then on failure request the logs collected after that mark.