[PATCH v3 1/1] drm/xe/guc: Fix GuC log/ct output via debugfs
Lucas De Marchi
lucas.demarchi at intel.com
Thu Jan 9 22:43:28 UTC 2025
On Thu, Jan 09, 2025 at 12:40:51PM -0800, John Harrison wrote:
>On 1/9/2025 07:39, Lucas De Marchi wrote:
>>On Wed, Jan 08, 2025 at 03:59:58PM -0800, John Harrison wrote:
>>>On 1/8/2025 14:11, Lucas De Marchi wrote:
>>>>+Jose
>>>>+Rodrigo
>>>>
>>>>On Wed, Jan 08, 2025 at 12:14:49PM -0800, John Harrison wrote:
>>>>>On 1/7/2025 13:10, Lucas De Marchi wrote:
>>>>>>On Tue, Jan 07, 2025 at 12:22:52PM -0800, Julia Filipchuk wrote:
>>>>>>>Change to disable asci85 GuC logging only when output to devcoredump
>>>>>>>(was temporarily disabled for all code paths).
>>>>>>>
>>>>>>>v2: Ignore only for devcoredump case (not dmesg output).
>>>>>>>v3: Rebase to resolve parent tag mismatch.
>>>>>>>
>>>>>>>Signed-off-by: Julia Filipchuk <julia.filipchuk at intel.com>
>>>>>>>---
>>>>>>>drivers/gpu/drm/xe/xe_devcoredump.c | 8 +++++---
>>>>>>>include/drm/drm_print.h | 2 ++
>>>>>>>2 files changed, 7 insertions(+), 3 deletions(-)
>>>>>>>
>>>>>>>diff --git a/drivers/gpu/drm/xe/xe_devcoredump.c
>>>>>>>b/drivers/gpu/drm/xe/xe_devcoredump.c
>>>>>>>index 6980304c8903..8e5d1f9866a7 100644
>>>>>>>--- a/drivers/gpu/drm/xe/xe_devcoredump.c
>>>>>>>+++ b/drivers/gpu/drm/xe/xe_devcoredump.c
>>>>>>>@@ -424,10 +424,12 @@ void xe_print_blob_ascii85(struct
>>>>>>>drm_printer *p, const char *prefix,
>>>>>>> * Splitting blobs across multiple lines is not
>>>>>>>compatible with the mesa
>>>>>>> * debug decoder tool. Note that even dropping the
>>>>>>>explicit '\n' below
>>>>>>> * doesn't help because the GuC log is so big some
>>>>>>>underlying implementation
>>>>>>>- * still splits the lines at 512K characters. So just
>>>>>>>bail completely for
>>>>>>>- * the moment.
>>>>>>>+ * still splits the lines at 512K characters.
>>>>>>
>>>>>>did we investigate where this is done and how we can overcome it? I
>>>>>Yes. And the comment could be updated as part of this patch to
>>>>>make it clearer...
>>>>>
>>>>>
>>>>> * Splitting blobs across multiple lines is not compatible
>>>>>with the mesa
>>>>> * debug decoder tool. Note that even dropping the
>>>>>explicit line wrapping
>>>>> * below doesn't help because the GuC log can be so big it
>>>>>needs to be split
>>>>> * into multiple 2MB chunks, each of which must be printed
>>>>>individually and
>>>>> * therefore will be a separate line.
>>>>
>>>>so there's no "underlying implemenation that splits like at 512K
>>>>chars"?
>>>>Nowe it's something else: a 2MB chunk because of what?
>>>The GuC log is captured in multiple 2MB chunks because we can't
>>>kmalloc a full 16MB block at a single time irrespective of memory
>>>fragmentation - the kmalloc function aborts with a page size error
>>>above 8MB, I recall. Might even have been 4MB? The 2MB chunks
>>>become approx 512K characters of ASCII85 output (when mostly full
>>>of zeros). Because each chunk is a separate bo object the memory
>>>is not contiguous so each must be printed with a separate call to
>>>xe_print_blob_ascii85(). If xe_print_blob_ascii85() includes a
>>>terminating line feed at the end of the print then that means
>>>multiple lines for the GuC log.
>>
>>now it makes sense... and contradicts what was said earlier about the
>>"underlying implementation still splits the lines at 512K characters".
>>It's this very function that is splitting it.
>The initial comment was written before the situation was fully
>understood because it was a quick hack to get the mesa debug tool
>working again.
>
>The split is a combination of the helper function adding a linefeed to
>the end of the output and the GuC log being so large it must be split
>into multiple malloc buffers. But it is not related to the 800
>character line wrapping within the helper's output.
>
>>
>>>
>>>>
>>>>I don't really want to keep updating this instead of fixing it.
>>>>
>>>>>
>>>>>The dump helper could be updated to never add any line feeds, not even
>>>>
>>>>when you say "dump helper", what exactly are you talking about?
>>>>xe_print_blob_ascii85(), __xe_devcoredump_read(), __drm_puts_coredump()
>>>>or what?
>>>xe_print_blob_ascii85
>>>
>>>>
>>>>If it's about xe_print_blob_ascii85(), I think this would be the
>>>>first step:
>>>>drop the \n you are adding. AFAICS calling the print function in chunks
>>>>of 800B would be fine, just don't add any \n. If that still adds \n
>>>>somewhere, we can check how to drop it or you then may have a pretty
>>>>solid argument to adapt the mesa tool with a newline
>>>>continuar char and send them the MR.
>>>As above, the chunking is at the next level out. So dropping the
>>>800 byte \n won't help with the GuC log still being split into
>>>multiple lines. You have to either move the terminating \n outside
>>>of the helper and into the callers or add a bool parameter to
>>>enable/disable it. Either of which is just messy.
>>
>>If the caller knows it's calling the function multiple times to continue
>>previous data, it's not unreasonable to have an arg to terminate it or
>>not. It's for example what python does with its 'end' arg to the print()
>>function.
>Just because someone else does it does not make it clean or the best
>solution.
>
>>
>>>
>>>Or extend the helper to take an array/list of pointers in a
>>>scatter/gather fashion.
>>>
>>>And if you drop the 800B \n then you break the dmesg output
>>>facility. Which means we can't get logs out from CI panic type
>>>situations.
>>
>>Or that we should have another way to get a devcoredump. I don't think
>>we are going to print the devcoredump to dmesg when the kernel panics.
>>We may print it when we fail to probe, but the cpu should still be very
>>much alive - we tie the devcoredump lifecycle to the lifecycle of the
>>device, so we don't have an opportunity to get that data.
>>
>>The workaround you add is to dump the whole thing to dmesg, but we could
>>dump it somewhere else like a debugfs that is tied to the module
>>lifecycle.
>The point is to get usable logs out of CI. There are multiple
>situations in which CI fails to save logs. By 'panic' I was more
>meaning situations where we have very rare bugs that can't be hit
>locally and for which CI is not providing useful logs. I.e. we are in
>a panic about how to debug it and dumping via dmesg is the only viable
>option.
>
>>
>>The dmesg (kmsg) format already contains other things you have to ignore
>>to extract the message:
>>
>> * /dev/kmsg exports the structured data in the following line
>>format: *
>>"<level>,<sequnum>,<timestamp>,<contflag>[,additional_values, ...
>>];<message text>\n"
>>
>>so you will always have to preprocess it, ignore lines that don't belong
>>to the log, etc.
>I never said this was not the case. The GuC log decoding tool already
>copes with myriad different dmesg, syslog, serial console, etc. output
>formats.
>
>>
>>And if you print it correctly, you can even make use of the LOG_CONT to
>>mark the new line as continuation of the previous one.
>Reconstructing split lines has never been a problem for GuC decoder
>tool. It is only the mesa tool that has a problem with line feeds.
>
>>
>>>
>>>>>at the end. And then the burden would be on the callers to add
>>>>>line feeds as appropriate. That seems extremely messy, though.
>>>>>And it will break the dmesg output facility. That does need
>>>>>the line wrapping at ~800 characters which can't be done from
>>>>>outside the helper.
>>>>>
>>>>>>
>>>>>>understand having to split it into multiple calls, but not something
>>>>>>adding a \n. particularly for the functions dealing with seq_file and
>>>>>>devcoredump.
>>>>>>
>>>>>>>+ *
>>>>>>>+ * Only disable from devcoredump output.
>>>>>>> */
>>>>>>>- return;
>>>>>>>+ if (p->coredump)
>>>>>>
>>>>>>but we do want the guc log to be inside the devcoredump, so
>>>>>>rather than
>>>>>>adding more workarounds, can we fix it ?
>>>>>The simplest fix is your suggestion of adding an extra white
>>>>>space character at the end of the line in the helper function
>>>>>and updating the mesa tool to support multi-line output. That
>>>>>seems like it should be a trivial update to the mesa tool and
>>>>>it keeps the KMD side simple and consistent across all
>>>>>potential output methods. But Jose seemed absolutely dead set
>>>>>against any updates to the mesa tool at all :(.
>>>>>
>>>>>The more complex fix is Joonas' idea of using an official
>>>>>structured output format of some sort - TOML, YAML, JSON or
>>>>>whatever. Something
>>>>
>>>>it actually doesn't fix this particular issue. It fixes others, but not
>>>>this one.
>>>If we do it properly then it fixes everything. The format library
>>>would have helper functions that the devcoredump code can call to
>>
>>but we are *NOT* going to export a library for that. You are just
>>shifting the problem to "we have only one userspace that is our own
>>library, deal with it".
>No. By library, I mean libjson or libyaml or whatever. The whole point
>of using an externally defined file format is that you can use an
>externally written library to read that format. Whatever userspace
>tool only has to worry about processing data/value pairs or whatever
>you get from the library's read interface, it no longer has to worry
>about file formats, character encodings, line wraps, etc. All that is
>handled by the third party library / file format definition.
>
>>
>>>print integers, bools, strings and BLOBs. The debugfs/sysfs output
>>>would just stream the encoded output exactly as is. The dmesg
>>>output layer would add the 800 character line wrapping in a way
>>>that the dmesg decoder can undo before passing to the decoder
>>>library.
>>>
>>>>
>>>> "Guc Log": "<encoded data .........."
>>>>
>>>>if you need to split lines, you still need something to do do it in
>>>>whatever format you choose. One json I remember working on in the past
>>>>that embeds the binary is the .apj format... example:
>>>>https://firmware.ardupilot.org/Copter/2024-12/2024-12-02-13:12/CubeBlack/arducopter.apj
>>>>
>>>>
>>>>
>>>> {
>>>> "board_id": 9,
>>>> "magic": "APJFWv1",
>>>> "description": "Firmware for a STM32F427xx board",
>>>> "image": "eNqUvAlcU1faMH5u7k1yCQGCICCgBoIaRW0EtVSxhg ...
>>>>
>>>>here "image" is base64-encoded afair. But in your case, just printing
>>>>the json to dmesg wouldn't work because of needing to split the line.
>>>>It wouldn't fix the kind of bugs we had in the past too: in json it
>>>>would be the equivalent of adding a field to the wrong object and then
>>>>move it elsewhere. An application trying to read that object in the
>>>>previous place would be broken regardless.
>>>>
>>>>toml is more for configuration and yaml isn't a good one for this case
>>>>as it further relies on indentation. I wouldn't oppose to use json, but
>>>>don't expect to have dmesg to magically work because of that.
>>>TOML basically says 'use base64 and print it as a string'. And it
>>>supports multi-line strings, so the reader would have to drop
>>>newlines within a base64 object before doing the base64 decode. Or
>>>just use a base64 decoder that ignores whitespace.
>>>
>>>YAML seems to make even more explicit. From the 1.2.2 spec page:
>>> picture: !!binary |
>>> R0lGODlhDAAMAIQAAP//9/X
>> ^ space here for line continuation, as I mentioned.
>>
>>> 17unp5WZmZgAAAOfn515eXv
>>> Pz7Y6OjuDg4J+fn5OTk6enp
>>> 56enmleECcgggoBADs=
>>>
>>>So the userland reader just gets binary data straight out of the
>>>decoder library with no processing required.
>>
>>A decoder library that we are *NOT* going to write.
>As above, the decoder library is definitely not written by us. It is
>libyaml or some such.
>
>>
>>
>>>
>>>
>>>>
>>>>So let's fix the issue at hand and let the talk about xe's coredump
>>>>format for a later time.
>>>The issue at hand is that we have two fundamentally conflicting
>>>requirements. DMesg output requires line wrapping but the mesa
>>>tool has been coded in such a way that it breaks when given
>>>wrapped lines. There is nothing we can do about dmesg but making
>>>the mesa tool's parser more robust does not seem difficult.
>>>
>>>Switching to an externally defined encoding scheme that internally
>>>manages binary objects and supports wrapped lines would fix all
>>>issues and guarantee future compatibility.
>>>
>>>Note that line wrapping is not the only problem with the current
>>>mesa tool. Unless it has been updated since? The tool breaks if
>>>the context information is not part of the "GuC Communication
>>>Transport" section. Which is totally incorrect. Context state is
>>>very definitely not part
>>
>>We reverted it because it's not ok to just move it to the "correct"
>>place and break the users. If it's in the incorrect place and we know
>>there's this parser already using it incorrectly, we can treat as a bug
>>in the kernel and adjust the userspace (it'd be considered a bug there
>>too and backported to their earlier releases). What we don't do is
>>simply merge the fix and break userspace.
>Can we not repeat old arguments? No-one realised that that change
>would break userspace. Or if they did, there was confusion over what
>exactly had or had not been tested and was causing a problem.
>Certainly it was not an intentional "merge and everyone be damned".
>
>My point was that there are multiple issues here. Line feeds is only
>one. But using a structured file format would prevent both of the
>problems we have currently hit.
>
>>
>>Send the fix to the mesa tool if you want to change this in the kernel.
>When I can finally work out how to build mesa (it seems to require
>libraries that are newer than any Ubuntu release has access to!?) and
https://docs.mesa3d.org/install.html has the docs for that.
For ubuntu, in our internal docs we document to install deps with:
sudo apt build-dep mesa
You can disable stuff you don't use. Our internal docs have this useful
set of args:
meson setup \
-Dprefix=/usr \
-Dgallium-rusticl=false \
-Dgallium-drivers=iris \
-Dvulkan-drivers=intel \
build
I think you are probably talking about the llvm dep, that can be
workarounded by the setup flags above.
>have spent time learning yet another code base and how the tool is
>implemented...
>
>>
>>And as I said, for this particular change it doesn't matter if it's a
>>custom format, json, toml or whatever... the bug would still be the
>>same.
>Ideally not. Part of moving to a structured file format would be
>ensuring that the structure of the data is correct.
>
>>
>>>of the GuC CT layer. That absolutely needs to be fixed. And
>>>generally, anything we can do to make the decoding more robust and
>>>future proof would be a very good thing.
>>>
>>>Note that the devcoredump content is highly dependent upon the
>>>internal workings of the driver. It is going to change over time.
>>>So the more robust the decoder, the better chance it has of not
>>>breaking when some new piece of data is added to the dump or an
>>>existing item removed or modified.
>>>
>>>>
>>>>>that ideally supports binary data and compression natively and
>>>>>has a helper library that can be included in the kernel source
>>>>>tree. That
>>>>
>>>>oh no, we'd need to define the schema - there are plenty of json
>>>>libraries
>>>>in multiple languages for users to choose from and that would
>>>>be basically
>>>>then main benefit of using it.
>>>Well, we can certainly define extra formats on top of a base
>>>format but it seems like it would be simpler if the base format
>>>supported all the data types we need already. Doesn't really
>>>matter either way as long as the underlying file format supports
>>>all the uses we want of it.
>>
>>My point is that we are not going to add a decoder library that all
>>userspace consumers have to use. The kernel interface is not a library.
>>You are relying on having just one userspace to not break userspace.
>That is not at all what I am saying.
>
>>
>>The benefit of a common format like json, toml etc is exactly that it's
>>a format already understood by hundreds of libraries and we can even do
>>something like this instead of having to write our own parser:
>>
>>cat devcoredump | jq '.[] | .foo, .bar'
>That is exactly what I am saying.
>
>The mesa debug tool becomes something like:
> #include <libyaml>
>
> info = yaml.read_file(file_name);
> foreach_engine(info.engines, i)
> printf( "Got hang on %s\n", info.engines[i].name );
>
>The external library handles all the file encoding issues. The tool
>just has to put meaning to the fields returned from that library.
to reiterate my point that it doesn't fix all the problems:
if you move something from info.guc_ct.bla to info.contexts.bla
(exactly what happened in the other bug) you may still break stuff,
regardless of using a better well-known format.
Lucas De Marchi
>
>John.
>
>>
>>Lucas De Marchi
>>
>>>
>>>John.
>>>
>>>>
>>>>Lucas De Marchi
>>>>
>>>>>will then remove any and all confusion over interpretation of
>>>>>the file forever more. But it would require updating all
>>>>>userland tools to use the appropriate helper library.
>>>>>
>>>>>John.
>>>>>
>>>>>>
>>>>>>Lucas De Marchi
>>>>>
>>>
>
More information about the Intel-xe
mailing list