[PATCH v3 1/1] drm/xe/guc: Fix GuC log/ct output via debugfs

Thu Jan 9 22:43:28 UTC 2025

On Thu, Jan 09, 2025 at 12:40:51PM -0800, John Harrison wrote:
>On 1/9/2025 07:39, Lucas De Marchi wrote:
>>On Wed, Jan 08, 2025 at 03:59:58PM -0800, John Harrison wrote:
>>>On 1/8/2025 14:11, Lucas De Marchi wrote:
>>>>+Jose
>>>>+Rodrigo
>>>>
>>>>On Wed, Jan 08, 2025 at 12:14:49PM -0800, John Harrison wrote:
>>>>>On 1/7/2025 13:10, Lucas De Marchi wrote:
>>>>>>On Tue, Jan 07, 2025 at 12:22:52PM -0800, Julia Filipchuk wrote:
>>>>>>>Change to disable asci85 GuC logging only when output to devcoredump
>>>>>>>(was temporarily disabled for all code paths).
>>>>>>>
>>>>>>>v2: Ignore only for devcoredump case (not dmesg output).
>>>>>>>v3: Rebase to resolve parent tag mismatch.
>>>>>>>
>>>>>>>Signed-off-by: Julia Filipchuk <julia.filipchuk at intel.com>
>>>>>>>---
>>>>>>>drivers/gpu/drm/xe/xe_devcoredump.c | 8 +++++---
>>>>>>>include/drm/drm_print.h             | 2 ++
>>>>>>>2 files changed, 7 insertions(+), 3 deletions(-)
>>>>>>>
>>>>>>>diff --git a/drivers/gpu/drm/xe/xe_devcoredump.c 
>>>>>>>b/drivers/gpu/drm/xe/xe_devcoredump.c
>>>>>>>index 6980304c8903..8e5d1f9866a7 100644
>>>>>>>--- a/drivers/gpu/drm/xe/xe_devcoredump.c
>>>>>>>+++ b/drivers/gpu/drm/xe/xe_devcoredump.c
>>>>>>>@@ -424,10 +424,12 @@ void xe_print_blob_ascii85(struct 
>>>>>>>drm_printer *p, const char *prefix,
>>>>>>>     * Splitting blobs across multiple lines is not 
>>>>>>>compatible with the mesa
>>>>>>>     * debug decoder tool. Note that even dropping the 
>>>>>>>explicit '\n' below
>>>>>>>     * doesn't help because the GuC log is so big some 
>>>>>>>underlying implementation
>>>>>>>-     * still splits the lines at 512K characters. So just 
>>>>>>>bail completely for
>>>>>>>-     * the moment.
>>>>>>>+     * still splits the lines at 512K characters.
>>>>>>
>>>>>>did we investigate where this is done and how we can overcome it? I
>>>>>Yes. And the comment could be updated as part of this patch to 
>>>>>make it clearer...
>>>>>
>>>>>
>>>>>     * Splitting blobs across multiple lines is not compatible 
>>>>>with the mesa
>>>>>     * debug decoder tool. Note that even dropping the 
>>>>>explicit line wrapping
>>>>>     * below doesn't help because the GuC log can be so big it 
>>>>>needs to be split
>>>>>     * into multiple 2MB chunks, each of which must be printed 
>>>>>individually and
>>>>>     * therefore will be a separate line.
>>>>
>>>>so there's no "underlying implemenation that splits like at 512K 
>>>>chars"?
>>>>Nowe it's something else: a 2MB chunk because of what?
>>>The GuC log is captured in multiple 2MB chunks because we can't 
>>>kmalloc a full 16MB block at a single time irrespective of memory 
>>>fragmentation - the kmalloc function aborts with a page size error 
>>>above 8MB, I recall. Might even have been 4MB? The 2MB chunks 
>>>become approx 512K characters of ASCII85 output (when mostly full 
>>>of zeros). Because each chunk is a separate bo object the memory 
>>>is not contiguous so each must be printed with a separate call to 
>>>xe_print_blob_ascii85(). If xe_print_blob_ascii85() includes a 
>>>terminating line feed at the end of the print then that means 
>>>multiple lines for the GuC log.
>>
>>now it makes sense... and contradicts what was said earlier about the
>>"underlying implementation still splits the lines at 512K characters".
>>It's this very function that is splitting it.
>The initial comment was written before the situation was fully 
>understood because it was a quick hack to get the mesa debug tool 
>working again.
>
>The split is a combination of the helper function adding a linefeed to 
>the end of the output and the GuC log being so large it must be split 
>into multiple malloc buffers. But it is not related to the 800 
>character line wrapping within the helper's output.
>
>>
>>>
>>>>
>>>>I don't really want to keep updating this instead of fixing it.
>>>>
>>>>>
>>>>>The dump helper could be updated to never add any line feeds, not even
>>>>
>>>>when you say "dump helper", what exactly are you talking about?
>>>>xe_print_blob_ascii85(), __xe_devcoredump_read(), __drm_puts_coredump()
>>>>or what?
>>>xe_print_blob_ascii85
>>>
>>>>
>>>>If it's about xe_print_blob_ascii85(), I think this would be the 
>>>>first step:
>>>>drop the \n you are adding. AFAICS calling the print function in chunks
>>>>of 800B would be fine, just don't add any \n. If that still adds \n
>>>>somewhere, we can check how to drop it or you then may have a pretty
>>>>solid argument to adapt the mesa tool with a newline
>>>>continuar char and send them the MR.
>>>As above, the chunking is at the next level out. So dropping the 
>>>800 byte \n won't help with the GuC log still being split into 
>>>multiple lines. You have to either move the terminating \n outside 
>>>of the helper and into the callers or add a bool parameter to 
>>>enable/disable it. Either of which is just messy.
>>
>>If the caller knows it's calling the function multiple times to continue
>>previous data, it's not unreasonable to have an arg to terminate it or
>>not. It's for example what python does with its 'end' arg to the print()
>>function.
>Just because someone else does it does not make it clean or the best 
>solution.
>
>>
>>>
>>>Or extend the helper to take an array/list of pointers in a 
>>>scatter/gather fashion.
>>>
>>>And if you drop the 800B \n then you break the dmesg output 
>>>facility. Which means we can't get logs out from CI panic type 
>>>situations.
>>
>>Or that we should have another way to get a devcoredump. I don't think
>>we are going to print the devcoredump to dmesg when the kernel panics.
>>We may print it when we fail to probe, but the cpu should still be very
>>much alive - we tie the devcoredump lifecycle to the lifecycle of the
>>device, so we don't have an opportunity to get that data.
>>
>>The workaround you add is to dump the whole thing to dmesg, but we could
>>dump it somewhere else like a debugfs that is tied to the module
>>lifecycle.
>The point is to get usable logs out of CI. There are multiple 
>situations in which CI fails to save logs. By 'panic' I was more 
>meaning situations where we have very rare bugs that can't be hit 
>locally and for which CI is not providing useful logs. I.e. we are in 
>a panic about how to debug it and dumping via dmesg is the only viable 
>option.
>
>>
>>The dmesg (kmsg) format already contains other things you have to ignore
>>to extract the message:
>>
>> * /dev/kmsg exports the structured data in the following line 
>>format:            * 
>>"<level>,<sequnum>,<timestamp>,<contflag>[,additional_values, ... 
>>];<message text>\n"
>>
>>so you will always have to preprocess it, ignore lines that don't belong
>>to the log, etc.
>I never said this was not the case. The GuC log decoding tool already 
>copes with myriad different dmesg, syslog, serial console, etc. output 
>formats.
>
>>
>>And if you print it correctly, you can even make use of the LOG_CONT to
>>mark the new line as continuation of the previous one.
>Reconstructing split lines has never been a problem for GuC decoder 
>tool. It is only the mesa tool that has a problem with line feeds.
>
>>
>>>
>>>>>at the end. And then the burden would be on the callers to add 
>>>>>line feeds as appropriate. That seems extremely messy, though. 
>>>>>And it will break the dmesg output facility. That does need 
>>>>>the line wrapping at ~800 characters which can't be done from 
>>>>>outside the helper.
>>>>>
>>>>>>
>>>>>>understand having to split it into multiple calls, but not something
>>>>>>adding a \n. particularly for the functions dealing with seq_file and
>>>>>>devcoredump.
>>>>>>
>>>>>>>+     *
>>>>>>>+     * Only disable from devcoredump output.
>>>>>>>     */
>>>>>>>-    return;
>>>>>>>+    if (p->coredump)
>>>>>>
>>>>>>but we do want the guc log to be inside the devcoredump, so 
>>>>>>rather than
>>>>>>adding more workarounds, can we fix it ?
>>>>>The simplest fix is your suggestion of adding an extra white 
>>>>>space character at the end of the line in the helper function 
>>>>>and updating the mesa tool to support multi-line output. That 
>>>>>seems like it should be a trivial update to the mesa tool and 
>>>>>it keeps the KMD side simple and consistent across all 
>>>>>potential output methods. But Jose seemed absolutely dead set 
>>>>>against any updates to the mesa tool at all :(.
>>>>>
>>>>>The more complex fix is Joonas' idea of using an official 
>>>>>structured output format of some sort - TOML, YAML, JSON or 
>>>>>whatever. Something
>>>>
>>>>it actually doesn't fix this particular issue. It fixes others, but not
>>>>this one.
>>>If we do it properly then it fixes everything. The format library 
>>>would have helper functions that the devcoredump code can call to
>>
>>but we are *NOT* going to export a library for that. You are just
>>shifting the problem to "we have only one userspace that is our own
>>library, deal with it".
>No. By library, I mean libjson or libyaml or whatever. The whole point 
>of using an externally defined file format is that you can use an 
>externally written library to read that format. Whatever userspace 
>tool only has to worry about processing data/value pairs or whatever 
>you get from the library's read interface, it no longer has to worry 
>about file formats, character encodings, line wraps, etc. All that is 
>handled by the third party library / file format definition.
>
>>
>>>print integers, bools, strings and BLOBs. The debugfs/sysfs output 
>>>would just stream the encoded output exactly as is. The dmesg 
>>>output layer would add the 800 character line wrapping in a way 
>>>that the dmesg decoder can undo before passing to the decoder 
>>>library.
>>>
>>>>
>>>>    "Guc Log": "<encoded data .........."
>>>>
>>>>if you need to split lines, you still need something to do do it in
>>>>whatever format you choose. One json I remember working on in the past
>>>>that embeds the binary is the .apj format... example:
>>>>https://firmware.ardupilot.org/Copter/2024-12/2024-12-02-13:12/CubeBlack/arducopter.apj
>>>>
>>>>
>>>>
>>>>    {
>>>>        "board_id": 9,
>>>>        "magic": "APJFWv1",
>>>>        "description": "Firmware for a STM32F427xx board",
>>>>        "image": "eNqUvAlcU1faMH5u7k1yCQGCICCgBoIaRW0EtVSxhg ...
>>>>
>>>>here "image" is base64-encoded  afair. But in your case, just printing
>>>>the json to dmesg wouldn't work because of needing to split the line.
>>>>It wouldn't fix the kind of bugs we had in the past too:  in json it
>>>>would be the equivalent of adding a field to the wrong object and then
>>>>move it elsewhere. An application trying to read that object in the
>>>>previous place would be broken regardless.
>>>>
>>>>toml is more for configuration and yaml isn't a good one for this case
>>>>as it further relies on indentation. I wouldn't oppose to use json, but
>>>>don't expect to have dmesg to magically work because of that.
>>>TOML basically says 'use base64 and print it as a string'. And it 
>>>supports multi-line strings, so the reader would have to drop 
>>>newlines within a base64 object before doing the base64 decode. Or 
>>>just use a base64 decoder that ignores whitespace.
>>>
>>>YAML seems to make even more explicit. From the 1.2.2 spec page:
>>>    picture: !!binary |
>>>     R0lGODlhDAAMAIQAAP//9/X
>>     ^ space here for line continuation, as I mentioned.
>>
>>>     17unp5WZmZgAAAOfn515eXv
>>>     Pz7Y6OjuDg4J+fn5OTk6enp
>>>     56enmleECcgggoBADs=
>>>
>>>So the userland reader just gets binary data straight out of the 
>>>decoder library with no processing required.
>>
>>A decoder library that we are *NOT* going to write.
>As above, the decoder library is definitely not written by us. It is 
>libyaml or some such.
>
>>
>>
>>>
>>>
>>>>
>>>>So let's fix the issue at hand and let the talk about xe's coredump
>>>>format for a later time.
>>>The issue at hand is that we have two fundamentally conflicting 
>>>requirements. DMesg output requires line wrapping but the mesa 
>>>tool has been coded in such a way that it breaks when given 
>>>wrapped lines. There is nothing we can do about dmesg but making 
>>>the mesa tool's parser more robust does not seem difficult.
>>>
>>>Switching to an externally defined encoding scheme that internally 
>>>manages binary objects and supports wrapped lines would fix all 
>>>issues and guarantee future compatibility.
>>>
>>>Note that line wrapping is not the only problem with the current 
>>>mesa tool. Unless it has been updated since? The tool breaks if 
>>>the context information is not part of the "GuC Communication 
>>>Transport" section. Which is totally incorrect. Context state is 
>>>very definitely not part
>>
>>We reverted it because it's not ok to just move it to the "correct"
>>place and break the users. If it's in the incorrect place and we know
>>there's this parser already using it incorrectly, we can treat as a bug
>>in the kernel and adjust the userspace (it'd be considered a bug there
>>too and backported to their earlier releases).  What we don't do is
>>simply merge the fix and break userspace.
>Can we not repeat old arguments? No-one realised that that change 
>would break userspace. Or if they did, there was confusion over what 
>exactly had or had not been tested and was causing a problem. 
>Certainly it was not an intentional "merge and everyone be damned".
>
>My point was that there are multiple issues here. Line feeds is only 
>one. But using a structured file format would prevent both of the 
>problems we have currently hit.
>
>>
>>Send the fix to the mesa tool if you want to change this in the kernel.
>When I can finally work out how to build mesa (it seems to require 
>libraries that are newer than any Ubuntu release has access to!?) and 

https://docs.mesa3d.org/install.html has the docs for that.

For ubuntu, in our internal docs we document to install deps with:

	sudo apt build-dep mesa

You can disable stuff you don't use. Our internal docs have this useful
set of args:

	meson setup \
		-Dprefix=/usr \
		-Dgallium-rusticl=false \
		-Dgallium-drivers=iris \
		-Dvulkan-drivers=intel \
		build

I think you are probably talking about the llvm dep, that can be
workarounded by the setup flags above.

>have spent time learning yet another code base and how the tool is 
>implemented...
>
>>
>>And as I said, for this particular change it doesn't matter if it's a
>>custom format, json, toml or whatever... the bug would still be the
>>same.
>Ideally not. Part of moving to a structured file format would be 
>ensuring that the structure of the data is correct.
>
>>
>>>of the GuC CT layer. That absolutely needs to be fixed. And 
>>>generally, anything we can do to make the decoding more robust and 
>>>future proof would be a very good thing.
>>>
>>>Note that the devcoredump content is highly dependent upon the 
>>>internal workings of the driver. It is going to change over time. 
>>>So the more robust the decoder, the better chance it has of not 
>>>breaking when some new piece of data is added to the dump or an 
>>>existing item removed or modified.
>>>
>>>>
>>>>>that ideally supports binary data and compression natively and 
>>>>>has a helper library that can be included in the kernel source 
>>>>>tree. That
>>>>
>>>>oh no, we'd need to define the schema - there are plenty of json 
>>>>libraries
>>>>in multiple languages for users to choose from  and that would 
>>>>be basically
>>>>then main benefit of using it.
>>>Well, we can certainly define extra formats on top of a base 
>>>format but it seems like it would be simpler if the base format 
>>>supported all the data types we need already. Doesn't really 
>>>matter either way as long as the underlying file format supports 
>>>all the uses we want of it.
>>
>>My point is that we are not going to add a decoder library that all
>>userspace consumers have to use. The kernel interface is not a library.
>>You are relying on having just one userspace to not break userspace.
>That is not at all what I am saying.
>
>>
>>The benefit of a common format like json, toml etc is exactly that it's
>>a format already understood by hundreds of libraries and we can even do
>>something like this instead of having to write our own parser:
>>
>>cat devcoredump | jq '.[] | .foo, .bar'
>That is exactly what I am saying.
>
>The mesa debug tool becomes something like:
>    #include <libyaml>
>
>    info = yaml.read_file(file_name);
>    foreach_engine(info.engines, i)
>        printf( "Got hang on %s\n", info.engines[i].name );
>
>The external library handles all the file encoding issues. The tool 
>just has to put meaning to the fields returned from that library.

to reiterate my point that it doesn't fix all the problems:

if you move something from info.guc_ct.bla to info.contexts.bla
(exactly what happened in the other bug) you may still break stuff,
regardless of using a better well-known format.

Lucas De Marchi

>
>John.
>
>>
>>Lucas De Marchi
>>
>>>
>>>John.
>>>
>>>>
>>>>Lucas De Marchi
>>>>
>>>>>will then remove any and all confusion over interpretation of 
>>>>>the file forever more. But it would require updating all 
>>>>>userland tools to use the appropriate helper library.
>>>>>
>>>>>John.
>>>>>
>>>>>>
>>>>>>Lucas De Marchi
>>>>>
>>>
>