[PATCH] drm: Funnel drm logs to tracepoints

Daniel Vetter daniel.vetter at ffwll.ch
Tue Oct 15 22:35:39 UTC 2019


On Tue, Oct 15, 2019 at 10:02 PM Ville Syrjälä
<ville.syrjala at linux.intel.com> wrote:
>
> On Tue, Oct 15, 2019 at 03:11:56PM -0400, Sean Paul wrote:
> > On Tue, Oct 15, 2019 at 02:07:00PM +0200, Thomas Zimmermann wrote:
> > > Hi
> > >
> >
> > Hi Thomas (and Daniel/Joonas/Rob),
> > Thanks for your feedback
> >
> > > Am 10.10.19 um 22:48 schrieb Sean Paul:
> > > > From: Sean Paul <seanpaul at chromium.org>
> > > >
> > > > *Record scratch* You read that subject correctly, I bet you're wondering
> > > > how we got here. At least hear me out before you flame :-)
> > > >
> > > > For a long while now, we (ChromeOS) have been struggling getting any
> > > > value out of user feedback reports of display failures (notably external
> > > > displays not working). The problem is that all logging, even fatal
> > > > errors (well, fatal in the sense that a display won't light up) are
> > > > logged at DEBUG log level. So in order to extract these logs, you need
> > > > to be able to turn on logging, and reproduce the issue with debug
> > > > enabled. Unfortunately, this isn't really something we can ask CrOS users
> > > > I spoke with airlied about this and RHEL has similar issues.
> > > >
> > > > This is the point where you ask me, "So Sean, why don't you just enable
> > > > DRM_UT_BLAH?". Good question! Here are the reasons in ascending order of
> > > > severity:
> > > >  1- People aren't consistent with their categories, so we'd have to
> > > >     enable a bunch to get proper coverage
> > > >  2- We don't want to overwhelm syslog with drm spam, others have to use
> > > >     it too
> > > >  3- Console logging is slow
> > > >
> > > > Hopefully you're with me so far. I have a problem that has no existing
> > > > solution. What I really want is a ringbuffer of the most recent logs
> > > > (in the categories I'm interested in) exposed via debugfs so I can
> > > > extract it when users file feedback.
> > >
> > > For bug reports, I don't want categories or anything else that users can
> > > switch on/off. All I'd want is a simple way of retrieving the last ~100
> > > messages from DRM (ala: "please attach the content of the file at
> > > /sys/debug...").
> > >
> >
> > Yeah, this is basically what I'm looking for (although I'd filter out a few
> > categories that are particularly noisy and bump N to as big as I can get away
> > with).
> >
> > > > It just so happens that there is something which does _exactly_ this! I
> > > > can dump the most recent logs into tracepoints, turn them on and off
> > > > depending on which category I want, and pull them from debugfs on demand.
> > > >
> > > > "What about trace_printk()?" You'll say. It doesn't give us the control we
> > > > get from using tracepoints and it's not meant to be left sprinkled around
> > > > in code.
> > > >
> > > > So that is how we got here, now it's time for you to tell me why this is
> > > > a horrible idea :-)
> > >
> > > Tracepoints are considered stable uapi, right? As a distro person (SUSE)
> > > I don't want us to have to maintain debugging messages forever.
> > >
> >
> > This is something I wasn't aware of (that's what reviews are for, right?!). So
> > this idea is perhaps dead in the water. I could probably integrate with
> > ringbuffer at a lower level and expose a drm-specific debugfs node without _too_
> > much extra goo, but [ab]using tracepoints seemed pretty natural. The
> > other benefit to this approach is that backporting it into distro kernels is
> > relatively easy.
> >
> > >
> > > The problem itself doesn't seem related to DRM. Do other subsystems have
> > > similar requirements?
> >
> > I think other subsystems are chattier than ours (ie: when an error occurs they
> > actually log something). Some of our drivers in drm are more helpful than others
> > as far as complaining when something unexpected happens. The core in particular
> > is really conservative about issuing errors (which makes sense since a lot of
> > the failures are expected (atomic testing is messy), only the compositor really
> > knows which ones are noteworthy).
>
> Apart from the "let's not spam the logs if userspace is fuzzing our
> uapi" issue I think the other main reason for conservatism is hotplug,
> or rather unplug. When the other end of the cable is flapping around
> in the breeze it's hard to talk to whoever you think you're talking
> to. And it doesn't help that often the other end decides to hang up
> and go for a nap even if the cable is superglued on.
>
> If we didn't have hotpluggable displays it would totally make sense
> to turn all commit phase debugs into errors. As it stands we've had
> to do the exact oppposite or CI in particular would drown in noise.
> If only those DP connector hooks were motorized and we could lock
> them from the driver...

Yeah I don't think tuning the spam level will ever work. What we need
is some external input (most likely from the user clicking the "my
external screen doesn't work" button, or maybe the compositor
realizing something that should work didn't, or some other thing that
indicates trouble), and then retroactively capture all
debug/informational message leading up to doom.

But without that external "houston we have a problem" input all the
debug spam is really just spam and unwanted. btw even if we don't spam
dmesg if we enable too much we might have simply trouble with all the
printk formatting work we do for nothing. So maybe we need something
like trace_printk which iirc delays the formatting until the stuff
actually gets read from the log buffer. Plus trace_printk might make
it clear enough that it's not stable uapi ... so maybe we do want
trace_printk in the end?

Just not really looking forward to reimplementing half the tracing
infrastructure just for this ...
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch


More information about the dri-devel mailing list