[PATCH v7 03/10] drm/xe/devcoredump: Add ASCII85 dump helper function
John Harrison
john.c.harrison at intel.com
Wed Sep 11 19:31:20 UTC 2024
On 9/11/2024 12:12, Lucas De Marchi wrote:
> On Tue, Sep 10, 2024 at 01:17:11PM GMT, John Harrison wrote:
>> On 9/10/2024 12:43, Lucas De Marchi wrote:
>>> On Mon, Sep 09, 2024 at 06:31:41PM GMT, John Harrison wrote:
>>>> On 9/6/2024 19:06, John Harrison wrote:
>>>>> On 9/5/2024 20:04, Lucas De Marchi wrote:
>>>>>> On Thu, Sep 05, 2024 at 07:01:33PM GMT, John Harrison wrote:
>>>>>>> On 9/5/2024 18:54, Lucas De Marchi wrote:
>>>>>>>> On Thu, Sep 05, 2024 at 01:50:58PM GMT,
>>>>>>>> John.C.Harrison at Intel.com wrote:
>>>>>>>>> From: John Harrison <John.C.Harrison at Intel.com>
>>>>>>>>>
>>>>>>>>> There is a need to include the GuC log and other large binary
>>>>>>>>> objects in core dumps and via dmesg. So add a helper for dumping
>>>>>>>>> to a printer function via conversion to ASCII85 encoding.
>>>>>>>>
>>>>>>>> why are we not dumping the binary data directly to devcoredump?
>>>>>>> As per earlier comments, there is a WiFi driver or some such
>>>>>>> that does exactly that. But all they are dumping is a binary blob.
>>>>>>
>>>>>> In your v5 I see you mentioned
>>>>>> drivers/net/wireless/ath/ath10k/coredump.c, but that is a
>>>>>> precedent for including it as-is from the device rather than
>>>>>> converting it to ASCII85 or something else. It seems odd to do that
>>>>>> type of conversion in kernel space when it could perfectly well be
>>>>>> done in userspace.
>>>>> It really can't. An end user could maybe be expected to zip or tar
>>>>> a coredump file before attaching it to a bug report, but they are
>>>>> certainly not going to try to ASCII85 encode random bits of it.
>>>>> Whereas putting that in the kernel means it is just there. It is
>>>>> done. And it is pretty trivial - just call a helper function and
>>>>> it does everything for you. Also, I very much doubt you can spew
>>>>> raw binary data via dmesg. Even if the kernel were to print it for
>>>>> you (which I doubt), user tools like syslogd and dmesg itself
>>>>> are going to filter it to make it ASCII safe.
>>>>>
>>>>> The i915 error dumps have been ASCII85 encoded using the kernel's
>>>>> ASCII85 encoding helper function since forever. This patch is just
>>>>> a wrapper around the kernel's existing implementation in order to
>>>>> make it more compatible with printing to dmesg. This is not
>>>>> creating a new precedent. It already exists.
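[Editor's note: to make the above concrete, here is a minimal sketch of what such a wrapper might look like, built on the kernel's existing linux/ascii85.h helpers plus a drm_printer. The function name, line width and the padded-blob assumption are illustrative only, not the actual patch.]

#include <linux/ascii85.h>
#include <linux/string.h>
#include <drm/drm_print.h>

/*
 * Illustrative sketch only: encode a binary blob as ASCII85 and emit it
 * via a drm_printer, wrapped into short lines so the output stays safe
 * for dmesg, terminals and web-based CI log viewers. Assumes the blob
 * has been padded to a multiple of 4 bytes, since ascii85_encode()
 * consumes one u32 at a time.
 */
static void sketch_print_ascii85(struct drm_printer *p,
				 const void *blob, size_t size)
{
	const u32 *data = blob;
	char line[80], out[6];
	size_t pos = 0;
	long i, words = ascii85_encode_len(size);

	for (i = 0; i < words; i++) {
		/* ascii85_encode() returns "z" for a zero word, else 5 chars */
		const char *enc = ascii85_encode(data[i], out);

		/* flush before the next chunk would overflow the line */
		if (pos + strlen(enc) >= sizeof(line) - 1) {
			drm_printf(p, "%s\n", line);
			pos = 0;
		}
		pos += scnprintf(line + pos, sizeof(line) - pos, "%s", enc);
	}
	if (pos)
		drm_printf(p, "%s\n", line);
}

[The line wrapping is the property argued for later in this thread: the decode side can discard the line feeds for free, while every consumer between the kernel and a human (syslogd, dmesg, CI web viewers, text editors) only ever sees short, printable ASCII lines.]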
>>>>>
>>>>>>
>>>>>> $ git grep ascii85.h
>>>>>> drivers/gpu/drm/i915/i915_gpu_error.c:#include <linux/ascii85.h>
>>>>>> drivers/gpu/drm/msm/adreno/a6xx_gpu_state.c:#include <linux/ascii85.h>
>>>>>> drivers/gpu/drm/msm/adreno/adreno_gpu.c:#include <linux/ascii85.h>
>>>>>> drivers/gpu/drm/xe/xe_lrc.c:#include <linux/ascii85.h>
>>>>>> drivers/gpu/drm/xe/xe_vm.c:#include <linux/ascii85.h>
>>>>> And the list of drivers which dump raw binary data in a coredump
>>>>> file is... ath10k. ASCII85 wins 3 to 1.
>>>>>
>>>>>
>>>>>>
>>>>>>>
>>>>>>> We want the devcoredump file to still be human readable. That
>>>>>>> won't be the case if you stuff binary data in the middle of it.
>>>>>>> Most obvious problem - the zeros in the data will terminate your
>>>>>>> text file at that point. Potentially bigger problem for end
>>>>>>> users - random fake ANSI codes will destroy your terminal window
>>>>>>> if you try to cat the file to read it.
>>>>>>
>>>>>> Users don't get a coredump and cat it to the terminal.
>>>>>> =(lk%A8`T7AKYH#FD,6++EqOABHUhsG%5H2ARoq#E$/V$Bl7Q+@<5pmBe<q;Bk;0mCj@.3DIal2FD5Q-+E_RBART+X@VfTuGA2/4Dfp.E@3BN0DfB9.+E1b0F(KAV+:8
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> Lucas De Marchi
>>>>> They might. Either intentionally or accidentally. I've certainly
>>>>> done it myself. And people will certainly want to look at it in
>>>>> any random choice of text editor, pager, etc. 'cos you know, it is
>>>>> meant to be read by humans. If it is full of binary data then that
>>>>> becomes even more difficult than simply being full of ASCII
>>>>> gibberish. No matter what you are doing, the ASCII version is
>>>>> safer and makes it easier to read the rest of the file around it.
>>>>>
>>>>> I don't understand why you are so desperate to have raw binary
>>>>> data in the middle of a text file. The disadvantages are multiple
>>>>> but the only advantage is a slightly smaller file. And the true
>>>>> route to smaller files is to add compression like we have in i915.
>>>>>
>>>>> John.
>>>>>
>>>> PS: Also meant to add that one of the important use cases for
>>>> dumping logs to dmesg is for the really hard to repro bugs that
>>>> show up in CI extremely rarely. We get the driver to dump an error
>>>> capture to dmesg and pull that out from the CI logs. Even if you
>>>> could get binary data through dmesg, pretty sure the CI tools would
>>>> also not be happy with it. Anything non-printable will get munged
>>>> for sure when turning it into a web page.
>>>
>>> I think that's the main source of confusion on what we are discussing.
>>> I was not talking about dmesg at all. I'm only complaining about
>>> feeding ascii85-encoded data into a *devcoredump* when apparently
>>> there isn't a good reason to do so. I'd rather copy the binary data
>>> to the devcoredump.
>> But the intent is to dump a devcoredump to dmesg. It makes much sense
>
> It seems like an awful idea to dump hundreds of MB to dmesg. When we
> talked about printing to dmesg it was about **GuC log** and on very
> initial states of driver probe where we didn't actually have a good
> interface for that. And the log wouldn't be so big. If we can already
> capture the devcoredump, what would be the reason to dump to dmesg
> (other than the invalid "our CI captures dmesg, and doesn't
> capture devcoredump", which should be fixed)?
>
> If any sysadmin has their serial console flooded by such garbage there
> are two reactions: 1) someone has taken control of my machine; 2) something
> went really bad with this machine. It's not "fear not, wait for it to
> complete, it's just normal debug data I will attach to an issue in
> gitlab". And I'm mentioning a serial console here due to the
> cond_resched() that was added, which is only needed because you are
> trying to do in kernel space what should be done in userspace.
>
> Oh well... looking at this, the main reason I can see to use ascii85 is
> that we already have parts of *our* devcoredump using it, and
> userspace relying on that. That's new to me. Let's stop bringing dmesg
> into this discussion.
You are missing the point.
The construction of a devcoredump file is the best form of crash
analysis we have. There is no point trying to re-invent that with
partial versions that only include some of the information in random
different situations. No matter where the crash has happened or been
detected, a devcoredump file should be the correct way to debug it.
Conversely, that means making devcoredump as useful as possible by
adding in all the important things we might need to debug a problem.
In other words, we want everything in the devcoredump code path and we
want no other code paths that are duplicated sub-parts of devcoredump.
That is part 1.
As a follow up to that, there is the problem that not all hangs occur at
points where it is possible to get a devcoredump out via sysfs. Module
load time, kernel selftests, pre-silicon, etc. There are many valid reasons
why sysfs is not the best answer in all situations. For those
situations, dmesg is the simplest, most convenient and most reliable
option. Therefore, we want to be able to send a devcoredump capture to
dmesg.
That is part 2.
This patch set is addressing part 1 (add the GuC log and other useful
stuff to devcoredump) and is preparing the way for part 2 (there are
still problems with not having an xe scheduler job, meaning you can't
actually create a devcoredump in the first place).
I absolutely do not expect to ever dump a devcoredump to dmesg when
sysfs is available. A sysadmin should never see a devcoredump being
spewed to dmesg whether large or small. However, there are many reasons
why developers and/or CI will need that facility. And it is infinitely
preferable to have that facility available in the driver ready to be
used than to have to carry a bucket load of patches in private branches.
>
>> to have a single implementation that can be used for multiple
>> purposes. Otherwise you are duplicating a lot of code unnecessarily.
>>
>> And I still think it is a *very* bad idea to be including binary data
>> in a text file. The devcoredump is supposed to be human readable. It
>
> no, it's not. devcoredump doesn't dictate the format, it's up to the
> drivers to do that. See their documentation.
But *our* devcoredump is supposed to be readable by a human. We put lots of
things in there that developers want to quickly look at to get an idea
of what happened. So why would we want our driver to dictate a format
that mixes binary blobs with human readable text? That is the worst of
all worlds and a right pain for anyone trying to work with a devcoredump
file - developer or end user.
John.
>
>> is supposed to be obtained by end users and passed around. Having
>> binary data in there just makes everything more complex and error
>> prone. We want this to be as simple, easy and safe as possible.
>>
>>>
>>> For dmesg, there's a reason to encode it as you pointed out... but
>>> no, users shouldn't actually see it - we should be getting all of those
>>> cases in CI. For the escape scenarios, yeah... better having it
>>> ascii85-encoded.
>>>
>>> What you are adding to devcoredump also doesn't even seem to be an
>>> ascii85 representation, but multiple lines that must be concatenated
>>> to form the ascii85 representation. For dmesg it makes sense. Not for
>>> devcoredump. We would also probably need a length field (correctly
>>> accounting for the additional characters on each line) so we don't
>>> have an implicit dependency on the next field to know how much to
>>> parse.
>> The decoding is pretty trivial given that line feeds are not part of
>> the ASCII85 character set and so can just be dropped. Besides, the
>> output is already not 'pure' ASCII85 because the ASCII85 data is
>> embedded within a devcoredump. There is all sorts of other text
>> around, including at the start of each line. There are multiple ASCII85
>> blobs in there that need to be decoded separately. This is nothing
>> new to my patch set. All of that is already there. And as per
>> comments on the previous devcoredump patches from Matthew B, the
>> object data can be many hundreds of MBs in size. Yet no-one batted an
>> eyelid when that was added. So why the sudden paranoia about adding a
>> couple of MB of GuC log in the same form?
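[Editor's note: to illustrate the point about trivial decoding, a minimal userspace sketch. It handles the 'z' zero-word shorthand that linux/ascii85.h emits, skips all whitespace, does no input validation, and the byte order of the recovered 32-bit words is an assumption about how the producer packed them.]

#include <stdint.h>
#include <stdio.h>
#include <ctype.h>

/* Decode a wrapped ASCII85 stream back to binary. Line feeds are not
 * part of the ASCII85 character set, so whitespace can simply be
 * dropped while accumulating each 5-character group. */
static void ascii85_decode(const char *in, FILE *out)
{
	uint32_t v = 0;
	int n = 0;

	for (; *in; in++) {
		if (isspace((unsigned char)*in))
			continue;		/* line wrapping is transparent */
		if (*in == 'z' && n == 0) {	/* zero-word shorthand */
			fwrite("\0\0\0\0", 1, 4, out);
			continue;
		}
		v = v * 85 + (uint32_t)(*in - '!');
		if (++n == 5) {
			uint8_t b[4] = {
				(uint8_t)(v >> 24), (uint8_t)(v >> 16),
				(uint8_t)(v >> 8), (uint8_t)v,
			};
			fwrite(b, 1, 4, out);
			v = 0;
			n = 0;
		}
	}
}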
>
> I suppose you are talking about commit 4d5242a003bb ("drm/xe:
> Implement capture of HWSP and HWCTX"). That's probably because I hadn't
> seen that commit doing ascii85 encoding before, otherwise I'd have
> given similar review feedback.
>
> I'm looking at this just now, so I will also have to balance the
> previous users and the existing userspace consuming it.
>
> +José, would it be ok from the userspace POV to start adding the \n?
> Then we can at least have all fields in our devcoredump follow the
> same format. Are these the decoder parts on the mesa side?
>
> src/intel/tools/aubinator_error_decode.c
> src/intel/tools/error2hangdump.c?
>
> From a quick look, read_xe_data_file() already continues the previous
> topic when it reads a newline, but the parsers for HWCTX and HWSP
> seem to expect to have the entire topic in a single line. But I may
> be missing something.
>
> Lucas De Marchi
>
>>
>> And again, arbitrarily long lines (potentially many thousands of
>> characters wide) in a text file can cause problems. Having it line
>> wrapped gets rid of those potential problems and so is safer.
>> Anything that reduces the risk of an error report being broken is a
>> good thing IMHO. Robustness is worthwhile!
>>
>> John.
>>
>>>
>>> Lucas De Marchi
>>>
>>>>
>>>> John.
>>>>
>>