[PATCH v7 03/10] drm/xe/devcoredump: Add ASCII85 dump helper function

Wed Sep 11 19:54:56 UTC 2024

On Wed, 2024-09-11 at 12:35 -0700, John Harrison wrote:
> On 9/11/2024 12:30, Souza, Jose wrote:
> > On Wed, 2024-09-11 at 14:12 -0500, Lucas De Marchi wrote:
> > > On Tue, Sep 10, 2024 at 01:17:11PM GMT, John Harrison wrote:
> > > > On 9/10/2024 12:43, Lucas De Marchi wrote:
> > > > > On Mon, Sep 09, 2024 at 06:31:41PM GMT, John Harrison wrote:
> > > > > > On 9/6/2024 19:06, John Harrison wrote:
> > > > > > > On 9/5/2024 20:04, Lucas De Marchi wrote:
> > > > > > > > On Thu, Sep 05, 2024 at 07:01:33PM GMT, John Harrison wrote:
> > > > > > > > > On 9/5/2024 18:54, Lucas De Marchi wrote:
> > > > > > > > > > On Thu, Sep 05, 2024 at 01:50:58PM GMT,
> > > > > > > > > > John.C.Harrison at Intel.com wrote:
> > > > > > > > > > > From: John Harrison <John.C.Harrison at Intel.com>
> > > > > > > > > > > 
> > > > > > > > > > > There is a need to include the GuC log and other large
> > > > > > > > > > > binary objects
> > > > > > > > > > > in core dumps and via dmesg. So add a helper for dumping
> > > > > > > > > > > to a printer
> > > > > > > > > > > function via conversion to ASCII85 encoding.
> > > > > > > > > > why are we not dumping the binary data directly to devcoredump?
> > > > > > > > > As per earlier comments, there is a WiFi driver or some such
> > > > > > > > > that does exactly that. But all they are dumping is a binary
> > > > > > > > > blob.
> > > > > > > > In your v5 I see you mentioned
> > > > > > > > drivers/net/wireless/ath/ath10k/coredump.c, but that is a
> > > > > > > > precedence for
> > > > > > > > including it as is from the device rather converting it to ASCII85 or
> > > > > > > > something else. It seems odd to do that type of conversion in kernel
> > > > > > > > space when it could be perfectly done in userspace.
> > > > > > > It really can't. An end user could maybe be expected to zip or
> > > > > > > tar a coredump file before attaching it to a bug report but they
> > > > > > > are certainly not going to try to ASCII85 encode random bits of
> > > > > > > it. Whereas, putting that in the kernel means it is just there.
> > > > > > > It is done. And it is pretty trivial - just call a helper
> > > > > > > function and it does everything for you. Also, I very much doubt
> > > > > > > you can spew raw binary data via dmesg. Even if the kernel would
> > > > > > > print it for you (which I doubt), the user tools like syslogd
> > > > > > > and dmesg itself are going to filter it to make it ASCII safe.
> > > > > > > 
> > > > > > > The i915 error dumps have been ASCII85 encoded using the
> > > > > > > kernel's ASCII85 encoding helper function since forever. This
> > > > > > > patch is just a wrapper around the kernel's existing
> > > > > > > implementation in order to make it more compatible with printing
> > > > > > > to dmesg. This is not creating a new precedent. It already
> > > > > > > exists.
> > > > > > > 
> > > > > > > > $ git grep ascii85.h
> > > > > > > > drivers/gpu/drm/i915/i915_gpu_error.c:#include <linux/ascii85.h>
> > > > > > > > drivers/gpu/drm/msm/adreno/a6xx_gpu_state.c:#include <linux/ascii85.h>
> > > > > > > > drivers/gpu/drm/msm/adreno/adreno_gpu.c:#include <linux/ascii85.h>
> > > > > > > > drivers/gpu/drm/xe/xe_lrc.c:#include <linux/ascii85.h>
> > > > > > > > drivers/gpu/drm/xe/xe_vm.c:#include <linux/ascii85.h>
> > > > > > > And the list of drivers which dump raw binary data in a coredump
> > > > > > > file is... ath10k. ASCII85 wins 3 to 1.
> > > > > > > 
> > > > > > > 
> > > > > > > > > We want the devcoredump file to still be human readable.
> > > > > > > > > That won't be the case if you stuff binary data in the
> > > > > > > > > middle of it. Most obvious problem - the zeros in the data
> > > > > > > > > will terminate your text file at that point. Potentially
> > > > > > > > > bigger problem for end users - random fake ANSI codes will
> > > > > > > > > destroy your terminal window if you try to cat the file to
> > > > > > > > > read it.
> > > > > > > > Users don't get a coredump and cat it to the terminal.
> > > > > > > > =(lk%A8`T7AKYH#FD,6++EqOABHUhsG%5H2ARoq#E$/V$Bl7Q+@<5pmBe<q;Bk;0mCj at .3DIal2FD5Q-+E_RBART+X@VfTuGA2/4Dfp.E at 3BN0DfB9.+E1b0F(KAV+:8
> > > > > > > > 
> > > > > > > > 
> > > > > > > > 
> > > > > > > > Lucas De Marchi
> > > > > > > They might. Either intentionally or accidentally. I've certainly
> > > > > > > done it myself. And people will certainly want to look at it in
> > > > > > > any random choice of text editor, pager, etc. 'cos you know, it
> > > > > > > is meant to be read by humans. If it is full of binary data then
> > > > > > > that becomes even more difficult than simply being full of ASCII
> > > > > > > gibberish. No matter what you are doing, the ASCII version is
> > > > > > > safer and easier to look at the rest of the file around it.
> > > > > > > 
> > > > > > > I don't understand why you are so desperate to have raw binary
> > > > > > > data in the middle of a text file. The disadvantages are
> > > > > > > multiple but the only advantage is a slightly smaller file. And
> > > > > > > the true route to smaller files is to add compression like we
> > > > > > > have in i915.
> > > > > > > 
> > > > > > > John.
> > > > > > > 
> > > > > > PS: Also meant to add that one of the important uses cases for
> > > > > > dumping logs to dmesg is for the really hard to repro bugs that
> > > > > > show up in CI extremely rarely. We get the driver to dump an error
> > > > > > capture to dmesg and pull that out from the CI logs. Even if you
> > > > > > could get binary data through dmesg, pretty sure the CI tools
> > > > > > would also not be happy with it. Anything non-printable will get
> > > > > > munged for sure when turning it into a web page.
> > > > > I think that's the main source of confusion on what we are discussing.
> > > > > I was not talking about dmesg at all. I'm only complaining about feeding
> > > > > ascii85-encoded data into a *devcoredump* when apparently there isn't a
> > > > > good reason to do so. I'd rather copy the binary data to the
> > > > > devcoredump.
> > > > But the intent is to dump a devcoredump to dmesg. It makes much sense
> > > It seems like an awful idea to dump hundreds of MB to dmesg.  When we
> > > talked about printing to dmesg it was about **GuC log** and on very
> > > initial states of driver probe where we didn't actually have a good
> > > interface for that. And the log wouldn't be so big. If we can already
> > > capture the devcoredump, what would be the reason to dump to dmesg
> > > (other than the non-valid "our CI captures dmesg, and doesn't
> > > capture devcoredump", which should be fixed).
> > > 
> > > If any sysadmin have their serial console flooded by such garbage there
> > > are 2 reactions: 1) someone got in control of my machine; 2) something
> > > went really bad with this machine. It's not "fear not, wait for it to
> > > complete, it's just normal debug data I will attach to an issue in
> > > gitlab".  And I'm mentioning a serial console here due to that
> > > cond_resched() added, which is only needed because you are trying to do
> > > in kernel space what should be in userspace.
> > > 
> > > Oh well... looking at this the main reason to use ascii85 I can see is
> > > because we already have parts of *our* devcoredump using it, and
> > > userspace relying on that. That's new to me. Let's stop bringing dmesg
> > > into this discussion.
> > > 
> > > > to have a single implementation that can be used for multiple
> > > > purposes. Otherwise you are duplicating a lot of code unnecessarily.
> > > > 
> > > > And I still think it is a *very* bad idea to be including binary data
> > > > in a text file. The devcoredump is supposed to be human readable. It
> > > no, it's not. devcoredump doesn't dictate the format, it's up to the
> > > drivers to do that. See their documentation.
> > > 
> > > > is supposed to be obtained by end users and passed around. Having
> > > > binary data in there just makes everything more complex and error
> > > > prone. We want this to be as simple, easy and safe as possible.
> > > > 
> > > > > For dmesg, there's a reason to encode it as you pointed out... but
> > > > > no users shouldn't actually see it - we should be getting all of those
> > > > > cases in CI. For the escape scenarios, yeah... better having it
> > > > > ascii85-encoded.
> > > > > 
> > > > > What you are adding to devcoredump also doesn't even seem to be an
> > > > > ascii85 representation, but a multiple lines that should be concatenated
> > > > > to form the ascii85 representation. For dmesg it makes sense. Not for
> > > > > devcoredump.  We should also probabaly need a length field (correctly
> > > > > accounting for the additional characters for each line) so we don't
> > > > > have an implicit dependency on what's the next field to know how much to
> > > > > parse.
> > > > The decoding is pretty trivial given that line feeds are not part of
> > > > the ASCII85 character set and so can just be dropped. Besides The
> > > > output is already not 'pure' ASCII85 because the ASCII85 data is
> > > > embedded within a devcoredump. There is all sorts of other text about,
> > > > including on the start of the line. There are multiple ASCII85 blobs
> > > > in there that need to be decoded separately. This is nothing new to my
> > > > patch set. All of that is already there. And as per comments on the
> > > > previous devcoredump patches from Matthew B, the object data can many
> > > > hundreds of MBs in size. Yet no-one batted an eyelid when that was
> > > > added. So why the sudden paranoia about adding a couple of MB of GuC
> > > > log in the same form?
> > > I suppose you are talking about commit 4d5242a003bb ("drm/xe: Implement capture of
> > > HWSP and HWCTX").  Probably because I haven't seen that commit doing an
> > > ascii85 encoding before, otherwise I'd have similar review feedback.
> > > 
> > > Looking at this just now, so I will also have to balance the previous
> > > users and existing userspace consuming it.
> > > 
> > > +José, would it be ok from the userspace POV to start adding the \n?
> > > Then we can at least have all fields in our devcoredump to follow the
> > > same format. Are these the decoder parts on the mesa side?
> > > 
> > > 	src/intel/tools/aubinator_error_decode.c
> > > 	src/intel/tools/error2hangdump.c?
> > > 
> > >   From a quick look, read_xe_data_file() already continues the previous
> > > topic when it reads a newline, but the parsers for HWCTX and HWSP
> > > seems to expect to to have the entire topic in a single line. But I may
> > > be missing something.
> > Sorry I'm not following up this thread.
> > Add a '\n' where exactly?
> To break very long ASCII85 streams across multiple lines. That will 
> allow the devcoredump file to output via dmesg for the situations where 
> reading from sysfs is not possible.
> 
> This patch is adding an ASCII85 encoding helper that is more friendly to 
> output via dmesg. The patch does not currently change the existing 
> ASCII85 encodings of VMs and hw contexts. However, the intention would 
> be to update that code to use this helper eventually.

It will break the parser and I don't think we are allowed to break it at this point.

What I can suggest is add another sysfs with a coredump that skips the binary dumps so it is readable by humans.

> 
> John.
> 
> > 
> > > Lucas De Marchi
> > > 
> > > > And again, arbitrarily long lines (potentially many thousands of
> > > > characters wide) in a text file can cause problems. Having it line
> > > > wrapped gets rid of those potential problems and so is safer. Anything
> > > > that reduces the risk of an error report being broken is a good thing
> > > > IMHO. Robustness is worthwhile!
> > > > 
> > > > John.
> > > > 
> > > > > Lucas De Marchi
> > > > > 
> > > > > > John.
> > > > > > 
>