[PATCH v7 03/10] drm/xe/devcoredump: Add ASCII85 dump helper function

Wed Sep 11 19:30:49 UTC 2024

On Wed, 2024-09-11 at 14:12 -0500, Lucas De Marchi wrote:
> On Tue, Sep 10, 2024 at 01:17:11PM GMT, John Harrison wrote:
> > On 9/10/2024 12:43, Lucas De Marchi wrote:
> > > On Mon, Sep 09, 2024 at 06:31:41PM GMT, John Harrison wrote:
> > > > On 9/6/2024 19:06, John Harrison wrote:
> > > > > On 9/5/2024 20:04, Lucas De Marchi wrote:
> > > > > > On Thu, Sep 05, 2024 at 07:01:33PM GMT, John Harrison wrote:
> > > > > > > On 9/5/2024 18:54, Lucas De Marchi wrote:
> > > > > > > > On Thu, Sep 05, 2024 at 01:50:58PM GMT, 
> > > > > > > > John.C.Harrison at Intel.com wrote:
> > > > > > > > > From: John Harrison <John.C.Harrison at Intel.com>
> > > > > > > > > 
> > > > > > > > > There is a need to include the GuC log and other large 
> > > > > > > > > binary objects
> > > > > > > > > in core dumps and via dmesg. So add a helper for dumping 
> > > > > > > > > to a printer
> > > > > > > > > function via conversion to ASCII85 encoding.
> > > > > > > > 
> > > > > > > > why are we not dumping the binary data directly to devcoredump?
> > > > > > > As per earlier comments, there is a WiFi driver or some such 
> > > > > > > that does exactly that. But all they are dumping is a binary 
> > > > > > > blob.
> > > > > > 
> > > > > > In your v5 I see you mentioned
> > > > > > drivers/net/wireless/ath/ath10k/coredump.c, but that is a 
> > > > > > precedence for
> > > > > > including it as is from the device rather converting it to ASCII85 or
> > > > > > something else. It seems odd to do that type of conversion in kernel
> > > > > > space when it could be perfectly done in userspace.
> > > > > It really can't. An end user could maybe be expected to zip or 
> > > > > tar a coredump file before attaching it to a bug report but they 
> > > > > are certainly not going to try to ASCII85 encode random bits of 
> > > > > it. Whereas, putting that in the kernel means it is just there. 
> > > > > It is done. And it is pretty trivial - just call a helper 
> > > > > function and it does everything for you. Also, I very much doubt 
> > > > > you can spew raw binary data via dmesg. Even if the kernel would 
> > > > > print it for you (which I doubt), the user tools like syslogd 
> > > > > and dmesg itself are going to filter it to make it ASCII safe.
> > > > > 
> > > > > The i915 error dumps have been ASCII85 encoded using the 
> > > > > kernel's ASCII85 encoding helper function since forever. This 
> > > > > patch is just a wrapper around the kernel's existing 
> > > > > implementation in order to make it more compatible with printing 
> > > > > to dmesg. This is not creating a new precedent. It already 
> > > > > exists.
> > > > > 
> > > > > > 
> > > > > > $ git grep ascii85.h
> > > > > > drivers/gpu/drm/i915/i915_gpu_error.c:#include <linux/ascii85.h>
> > > > > > drivers/gpu/drm/msm/adreno/a6xx_gpu_state.c:#include <linux/ascii85.h>
> > > > > > drivers/gpu/drm/msm/adreno/adreno_gpu.c:#include <linux/ascii85.h>
> > > > > > drivers/gpu/drm/xe/xe_lrc.c:#include <linux/ascii85.h>
> > > > > > drivers/gpu/drm/xe/xe_vm.c:#include <linux/ascii85.h>
> > > > > And the list of drivers which dump raw binary data in a coredump 
> > > > > file is... ath10k. ASCII85 wins 3 to 1.
> > > > > 
> > > > > 
> > > > > > 
> > > > > > > 
> > > > > > > We want the devcoredump file to still be human readable. 
> > > > > > > That won't be the case if you stuff binary data in the 
> > > > > > > middle of it. Most obvious problem - the zeros in the data 
> > > > > > > will terminate your text file at that point. Potentially 
> > > > > > > bigger problem for end users - random fake ANSI codes will 
> > > > > > > destroy your terminal window if you try to cat the file to 
> > > > > > > read it.
> > > > > > 
> > > > > > Users don't get a coredump and cat it to the terminal.
> > > > > > =(lk%A8`T7AKYH#FD,6++EqOABHUhsG%5H2ARoq#E$/V$Bl7Q+@<5pmBe<q;Bk;0mCj at .3DIal2FD5Q-+E_RBART+X@VfTuGA2/4Dfp.E at 3BN0DfB9.+E1b0F(KAV+:8
> > > > > > 
> > > > > > 
> > > > > > 
> > > > > > Lucas De Marchi
> > > > > They might. Either intentionally or accidentally. I've certainly 
> > > > > done it myself. And people will certainly want to look at it in 
> > > > > any random choice of text editor, pager, etc. 'cos you know, it 
> > > > > is meant to be read by humans. If it is full of binary data then 
> > > > > that becomes even more difficult than simply being full of ASCII 
> > > > > gibberish. No matter what you are doing, the ASCII version is 
> > > > > safer and easier to look at the rest of the file around it.
> > > > > 
> > > > > I don't understand why you are so desperate to have raw binary 
> > > > > data in the middle of a text file. The disadvantages are 
> > > > > multiple but the only advantage is a slightly smaller file. And 
> > > > > the true route to smaller files is to add compression like we 
> > > > > have in i915.
> > > > > 
> > > > > John.
> > > > > 
> > > > PS: Also meant to add that one of the important uses cases for 
> > > > dumping logs to dmesg is for the really hard to repro bugs that 
> > > > show up in CI extremely rarely. We get the driver to dump an error 
> > > > capture to dmesg and pull that out from the CI logs. Even if you 
> > > > could get binary data through dmesg, pretty sure the CI tools 
> > > > would also not be happy with it. Anything non-printable will get 
> > > > munged for sure when turning it into a web page.
> > > 
> > > I think that's the main source of confusion on what we are discussing.
> > > I was not talking about dmesg at all. I'm only complaining about feeding
> > > ascii85-encoded data into a *devcoredump* when apparently there isn't a
> > > good reason to do so. I'd rather copy the binary data to the
> > > devcoredump.
> > But the intent is to dump a devcoredump to dmesg. It makes much sense 
> 
> It seems like an awful idea to dump hundreds of MB to dmesg.  When we
> talked about printing to dmesg it was about **GuC log** and on very
> initial states of driver probe where we didn't actually have a good
> interface for that. And the log wouldn't be so big. If we can already
> capture the devcoredump, what would be the reason to dump to dmesg
> (other than the non-valid "our CI captures dmesg, and doesn't
> capture devcoredump", which should be fixed).
> 
> If any sysadmin have their serial console flooded by such garbage there
> are 2 reactions: 1) someone got in control of my machine; 2) something
> went really bad with this machine. It's not "fear not, wait for it to
> complete, it's just normal debug data I will attach to an issue in
> gitlab".  And I'm mentioning a serial console here due to that
> cond_resched() added, which is only needed because you are trying to do
> in kernel space what should be in userspace.
> 
> Oh well... looking at this the main reason to use ascii85 I can see is
> because we already have parts of *our* devcoredump using it, and
> userspace relying on that. That's new to me. Let's stop bringing dmesg
> into this discussion.
> 
> > to have a single implementation that can be used for multiple 
> > purposes. Otherwise you are duplicating a lot of code unnecessarily.
> > 
> > And I still think it is a *very* bad idea to be including binary data 
> > in a text file. The devcoredump is supposed to be human readable. It 
> 
> no, it's not. devcoredump doesn't dictate the format, it's up to the
> drivers to do that. See their documentation.
> 
> > is supposed to be obtained by end users and passed around. Having 
> > binary data in there just makes everything more complex and error 
> > prone. We want this to be as simple, easy and safe as possible.
> > 
> > > 
> > > For dmesg, there's a reason to encode it as you pointed out... but
> > > no users shouldn't actually see it - we should be getting all of those
> > > cases in CI. For the escape scenarios, yeah... better having it
> > > ascii85-encoded.
> > > 
> > > What you are adding to devcoredump also doesn't even seem to be an
> > > ascii85 representation, but a multiple lines that should be concatenated
> > > to form the ascii85 representation. For dmesg it makes sense. Not for
> > > devcoredump.  We should also probabaly need a length field (correctly
> > > accounting for the additional characters for each line) so we don't
> > > have an implicit dependency on what's the next field to know how much to
> > > parse.
> > The decoding is pretty trivial given that line feeds are not part of 
> > the ASCII85 character set and so can just be dropped. Besides The 
> > output is already not 'pure' ASCII85 because the ASCII85 data is 
> > embedded within a devcoredump. There is all sorts of other text about, 
> > including on the start of the line. There are multiple ASCII85 blobs 
> > in there that need to be decoded separately. This is nothing new to my 
> > patch set. All of that is already there. And as per comments on the 
> > previous devcoredump patches from Matthew B, the object data can many 
> > hundreds of MBs in size. Yet no-one batted an eyelid when that was 
> > added. So why the sudden paranoia about adding a couple of MB of GuC 
> > log in the same form?
> 
> I suppose you are talking about commit 4d5242a003bb ("drm/xe: Implement capture of
> HWSP and HWCTX").  Probably because I haven't seen that commit doing an
> ascii85 encoding before, otherwise I'd have similar review feedback.
> 
> Looking at this just now, so I will also have to balance the previous
> users and existing userspace consuming it.
> 
> +José, would it be ok from the userspace POV to start adding the \n?
> Then we can at least have all fields in our devcoredump to follow the
> same format. Are these the decoder parts on the mesa side?
> 
> 	src/intel/tools/aubinator_error_decode.c
> 	src/intel/tools/error2hangdump.c?
> 
>  From a quick look, read_xe_data_file() already continues the previous
> topic when it reads a newline, but the parsers for HWCTX and HWSP
> seems to expect to to have the entire topic in a single line. But I may
> be missing something.

Sorry I'm not following up this thread.
Add a '\n' where exactly? 

> 
> Lucas De Marchi
> 
> > 
> > And again, arbitrarily long lines (potentially many thousands of 
> > characters wide) in a text file can cause problems. Having it line 
> > wrapped gets rid of those potential problems and so is safer. Anything 
> > that reduces the risk of an error report being broken is a good thing 
> > IMHO. Robustness is worthwhile!
> > 
> > John.
> > 
> > > 
> > > Lucas De Marchi
> > > 
> > > > 
> > > > John.
> > > > 
> >