[PATCH v3 1/1] drm/xe/guc: Fix GuC log/ct output via debugfs

Thu Jan 9 15:39:35 UTC 2025

On Wed, Jan 08, 2025 at 03:59:58PM -0800, John Harrison wrote:
>On 1/8/2025 14:11, Lucas De Marchi wrote:
>>+Jose
>>+Rodrigo
>>
>>On Wed, Jan 08, 2025 at 12:14:49PM -0800, John Harrison wrote:
>>>On 1/7/2025 13:10, Lucas De Marchi wrote:
>>>>On Tue, Jan 07, 2025 at 12:22:52PM -0800, Julia Filipchuk wrote:
>>>>>Change to disable asci85 GuC logging only when output to devcoredump
>>>>>(was temporarily disabled for all code paths).
>>>>>
>>>>>v2: Ignore only for devcoredump case (not dmesg output).
>>>>>v3: Rebase to resolve parent tag mismatch.
>>>>>
>>>>>Signed-off-by: Julia Filipchuk <julia.filipchuk at intel.com>
>>>>>---
>>>>>drivers/gpu/drm/xe/xe_devcoredump.c | 8 +++++---
>>>>>include/drm/drm_print.h             | 2 ++
>>>>>2 files changed, 7 insertions(+), 3 deletions(-)
>>>>>
>>>>>diff --git a/drivers/gpu/drm/xe/xe_devcoredump.c 
>>>>>b/drivers/gpu/drm/xe/xe_devcoredump.c
>>>>>index 6980304c8903..8e5d1f9866a7 100644
>>>>>--- a/drivers/gpu/drm/xe/xe_devcoredump.c
>>>>>+++ b/drivers/gpu/drm/xe/xe_devcoredump.c
>>>>>@@ -424,10 +424,12 @@ void xe_print_blob_ascii85(struct 
>>>>>drm_printer *p, const char *prefix,
>>>>>     * Splitting blobs across multiple lines is not compatible 
>>>>>with the mesa
>>>>>     * debug decoder tool. Note that even dropping the 
>>>>>explicit '\n' below
>>>>>     * doesn't help because the GuC log is so big some 
>>>>>underlying implementation
>>>>>-     * still splits the lines at 512K characters. So just 
>>>>>bail completely for
>>>>>-     * the moment.
>>>>>+     * still splits the lines at 512K characters.
>>>>
>>>>did we investigate where this is done and how we can overcome it? I
>>>Yes. And the comment could be updated as part of this patch to 
>>>make it clearer...
>>>
>>>
>>>     * Splitting blobs across multiple lines is not compatible 
>>>with the mesa
>>>     * debug decoder tool. Note that even dropping the explicit 
>>>line wrapping
>>>     * below doesn't help because the GuC log can be so big it 
>>>needs to be split
>>>     * into multiple 2MB chunks, each of which must be printed 
>>>individually and
>>>     * therefore will be a separate line.
>>
>>so there's no "underlying implemenation that splits like at 512K chars"?
>>Nowe it's something else: a 2MB chunk because of what?
>The GuC log is captured in multiple 2MB chunks because we can't 
>kmalloc a full 16MB block at a single time irrespective of memory 
>fragmentation - the kmalloc function aborts with a page size error 
>above 8MB, I recall. Might even have been 4MB? The 2MB chunks become 
>approx 512K characters of ASCII85 output (when mostly full of zeros). 
>Because each chunk is a separate bo object the memory is not 
>contiguous so each must be printed with a separate call to 
>xe_print_blob_ascii85(). If xe_print_blob_ascii85() includes a 
>terminating line feed at the end of the print then that means multiple 
>lines for the GuC log.

now it makes sense... and contradicts what was said earlier about the
"underlying implementation still splits the lines at 512K characters".
It's this very function that is splitting it.

>
>>
>>I don't really want to keep updating this instead of fixing it.
>>
>>>
>>>The dump helper could be updated to never add any line feeds, not 
>>>even
>>
>>when you say "dump helper", what exactly are you talking about?
>>xe_print_blob_ascii85(), __xe_devcoredump_read(), __drm_puts_coredump()
>>or what?
>xe_print_blob_ascii85
>
>>
>>If it's about xe_print_blob_ascii85(), I think this would be the 
>>first step:
>>drop the \n you are adding. AFAICS calling the print function in chunks
>>of 800B would be fine, just don't add any \n. If that still adds \n
>>somewhere, we can check how to drop it or you then may have a pretty
>>solid argument to adapt the mesa tool with a newline
>>continuar char and send them the MR.
>As above, the chunking is at the next level out. So dropping the 800 
>byte \n won't help with the GuC log still being split into multiple 
>lines. You have to either move the terminating \n outside of the 
>helper and into the callers or add a bool parameter to enable/disable 
>it. Either of which is just messy.

If the caller knows it's calling the function multiple times to continue
previous data, it's not unreasonable to have an arg to terminate it or
not. It's for example what python does with its 'end' arg to the print()
function.

>
>Or extend the helper to take an array/list of pointers in a 
>scatter/gather fashion.
>
>And if you drop the 800B \n then you break the dmesg output facility. 
>Which means we can't get logs out from CI panic type situations.

Or that we should have another way to get a devcoredump. I don't think
we are going to print the devcoredump to dmesg when the kernel panics.
We may print it when we fail to probe, but the cpu should still be very
much alive - we tie the devcoredump lifecycle to the lifecycle of the
device, so we don't have an opportunity to get that data.

The workaround you add is to dump the whole thing to dmesg, but we could
dump it somewhere else like a debugfs that is tied to the module
lifecycle.

The dmesg (kmsg) format already contains other things you have to ignore
to extract the message:

  * /dev/kmsg exports the structured data in the following line format:           
  *   "<level>,<sequnum>,<timestamp>,<contflag>[,additional_values, ... ];<message text>\n"

so you will always have to preprocess it, ignore lines that don't belong
to the log, etc.

And if you print it correctly, you can even make use of the LOG_CONT to
mark the new line as continuation of the previous one.

>
>>>at the end. And then the burden would be on the callers to add 
>>>line feeds as appropriate. That seems extremely messy, though. And 
>>>it will break the dmesg output facility. That does need the line 
>>>wrapping at ~800 characters which can't be done from outside the 
>>>helper.
>>>
>>>>
>>>>understand having to split it into multiple calls, but not something
>>>>adding a \n. particularly for the functions dealing with seq_file and
>>>>devcoredump.
>>>>
>>>>>+     *
>>>>>+     * Only disable from devcoredump output.
>>>>>     */
>>>>>-    return;
>>>>>+    if (p->coredump)
>>>>
>>>>but we do want the guc log to be inside the devcoredump, so rather than
>>>>adding more workarounds, can we fix it ?
>>>The simplest fix is your suggestion of adding an extra white space 
>>>character at the end of the line in the helper function and 
>>>updating the mesa tool to support multi-line output. That seems 
>>>like it should be a trivial update to the mesa tool and it keeps 
>>>the KMD side simple and consistent across all potential output 
>>>methods. But Jose seemed absolutely dead set against any updates 
>>>to the mesa tool at all :(.
>>>
>>>The more complex fix is Joonas' idea of using an official 
>>>structured output format of some sort - TOML, YAML, JSON or 
>>>whatever. Something
>>
>>it actually doesn't fix this particular issue. It fixes others, but not
>>this one.
>If we do it properly then it fixes everything. The format library 
>would have helper functions that the devcoredump code can call to 

but we are *NOT* going to export a library for that. You are just
shifting the problem to "we have only one userspace that is our own
library, deal with it".

>print integers, bools, strings and BLOBs. The debugfs/sysfs output 
>would just stream the encoded output exactly as is. The dmesg output 
>layer would add the 800 character line wrapping in a way that the 
>dmesg decoder can undo before passing to the decoder library.
>
>>
>>    "Guc Log": "<encoded data .........."
>>
>>if you need to split lines, you still need something to do do it in
>>whatever format you choose. One json I remember working on in the past
>>that embeds the binary is the .apj format... example:
>>https://firmware.ardupilot.org/Copter/2024-12/2024-12-02-13:12/CubeBlack/arducopter.apj
>>
>>
>>    {
>>        "board_id": 9,
>>        "magic": "APJFWv1",
>>        "description": "Firmware for a STM32F427xx board",
>>        "image": "eNqUvAlcU1faMH5u7k1yCQGCICCgBoIaRW0EtVSxhg ...
>>
>>here "image" is base64-encoded  afair. But in your case, just printing
>>the json to dmesg wouldn't work because of needing to split the line.
>>It wouldn't fix the kind of bugs we had in the past too:  in json it
>>would be the equivalent of adding a field to the wrong object and then
>>move it elsewhere. An application trying to read that object in the
>>previous place would be broken regardless.
>>
>>toml is more for configuration and yaml isn't a good one for this case
>>as it further relies on indentation. I wouldn't oppose to use json, but
>>don't expect to have dmesg to magically work because of that.
>TOML basically says 'use base64 and print it as a string'. And it 
>supports multi-line strings, so the reader would have to drop newlines 
>within a base64 object before doing the base64 decode. Or just use a 
>base64 decoder that ignores whitespace.
>
>YAML seems to make even more explicit. From the 1.2.2 spec page:
>    picture: !!binary |
>     R0lGODlhDAAMAIQAAP//9/X
      ^ space here for line continuation, as I mentioned.

>     17unp5WZmZgAAAOfn515eXv
>     Pz7Y6OjuDg4J+fn5OTk6enp
>     56enmleECcgggoBADs=
>
>So the userland reader just gets binary data straight out of the 
>decoder library with no processing required.

A decoder library that we are *NOT* going to write.

>
>
>>
>>So let's fix the issue at hand and let the talk about xe's coredump
>>format for a later time.
>The issue at hand is that we have two fundamentally conflicting 
>requirements. DMesg output requires line wrapping but the mesa tool 
>has been coded in such a way that it breaks when given wrapped lines. 
>There is nothing we can do about dmesg but making the mesa tool's 
>parser more robust does not seem difficult.
>
>Switching to an externally defined encoding scheme that internally 
>manages binary objects and supports wrapped lines would fix all issues 
>and guarantee future compatibility.
>
>Note that line wrapping is not the only problem with the current mesa 
>tool. Unless it has been updated since? The tool breaks if the context 
>information is not part of the "GuC Communication Transport" section. 
>Which is totally incorrect. Context state is very definitely not part 

We reverted it because it's not ok to just move it to the "correct"
place and break the users. If it's in the incorrect place and we know
there's this parser already using it incorrectly, we can treat as a bug
in the kernel and adjust the userspace (it'd be considered a bug there
too and backported to their earlier releases).  What we don't do is
simply merge the fix and break userspace.

Send the fix to the mesa tool if you want to change this in the kernel.

And as I said, for this particular change it doesn't matter if it's a
custom format, json, toml or whatever... the bug would still be the
same.

>of the GuC CT layer. That absolutely needs to be fixed. And generally, 
>anything we can do to make the decoding more robust and future proof 
>would be a very good thing.
>
>Note that the devcoredump content is highly dependent upon the 
>internal workings of the driver. It is going to change over time. So 
>the more robust the decoder, the better chance it has of not breaking 
>when some new piece of data is added to the dump or an existing item 
>removed or modified.
>
>>
>>>that ideally supports binary data and compression natively and has 
>>>a helper library that can be included in the kernel source tree. 
>>>That
>>
>>oh no, we'd need to define the schema - there are plenty of json 
>>libraries
>>in multiple languages for users to choose from  and that would be 
>>basically
>>then main benefit of using it.
>Well, we can certainly define extra formats on top of a base format 
>but it seems like it would be simpler if the base format supported all 
>the data types we need already. Doesn't really matter either way as 
>long as the underlying file format supports all the uses we want of 
>it.

My point is that we are not going to add a decoder library that all
userspace consumers have to use. The kernel interface is not a library.
You are relying on having just one userspace to not break userspace.

The benefit of a common format like json, toml etc is exactly that it's
a format already understood by hundreds of libraries and we can even do
something like this instead of having to write our own parser:

cat devcoredump | jq '.[] | .foo, .bar'

Lucas De Marchi

>
>John.
>
>>
>>Lucas De Marchi
>>
>>>will then remove any and all confusion over interpretation of the 
>>>file forever more. But it would require updating all userland 
>>>tools to use the appropriate helper library.
>>>
>>>John.
>>>
>>>>
>>>>Lucas De Marchi
>>>
>