[PATCH v3 1/1] drm/xe/guc: Fix GuC log/ct output via debugfs
John Harrison
john.c.harrison at intel.com
Wed Jan 8 23:59:58 UTC 2025
On 1/8/2025 14:11, Lucas De Marchi wrote:
> +Jose
> +Rodrigo
>
> On Wed, Jan 08, 2025 at 12:14:49PM -0800, John Harrison wrote:
>> On 1/7/2025 13:10, Lucas De Marchi wrote:
>>> On Tue, Jan 07, 2025 at 12:22:52PM -0800, Julia Filipchuk wrote:
>>>> Change to disable asci85 GuC logging only when output to devcoredump
>>>> (was temporarily disabled for all code paths).
>>>>
>>>> v2: Ignore only for devcoredump case (not dmesg output).
>>>> v3: Rebase to resolve parent tag mismatch.
>>>>
>>>> Signed-off-by: Julia Filipchuk <julia.filipchuk at intel.com>
>>>> ---
>>>> drivers/gpu/drm/xe/xe_devcoredump.c | 8 +++++---
>>>> include/drm/drm_print.h | 2 ++
>>>> 2 files changed, 7 insertions(+), 3 deletions(-)
>>>>
>>>> diff --git a/drivers/gpu/drm/xe/xe_devcoredump.c
>>>> b/drivers/gpu/drm/xe/xe_devcoredump.c
>>>> index 6980304c8903..8e5d1f9866a7 100644
>>>> --- a/drivers/gpu/drm/xe/xe_devcoredump.c
>>>> +++ b/drivers/gpu/drm/xe/xe_devcoredump.c
>>>> @@ -424,10 +424,12 @@ void xe_print_blob_ascii85(struct drm_printer
>>>> *p, const char *prefix,
>>>> * Splitting blobs across multiple lines is not compatible with
>>>> the mesa
>>>> * debug decoder tool. Note that even dropping the explicit
>>>> '\n' below
>>>> * doesn't help because the GuC log is so big some underlying
>>>> implementation
>>>> - * still splits the lines at 512K characters. So just bail
>>>> completely for
>>>> - * the moment.
>>>> + * still splits the lines at 512K characters.
>>>
>>> did we investigate where this is done and how we can overcome it? I
>> Yes. And the comment could be updated as part of this patch to make
>> it clearer...
>>
>>
>> * Splitting blobs across multiple lines is not compatible with
>> the mesa
>> * debug decoder tool. Note that even dropping the explicit line
>> wrapping
>> * below doesn't help because the GuC log can be so big it needs
>> to be split
>> * into multiple 2MB chunks, each of which must be printed
>> individually and
>> * therefore will be a separate line.
>
> so there's no "underlying implemenation that splits like at 512K chars"?
> Nowe it's something else: a 2MB chunk because of what?
The GuC log is captured in multiple 2MB chunks because we can't kmalloc
a full 16MB block at a single time irrespective of memory fragmentation
- the kmalloc function aborts with a page size error above 8MB, I
recall. Might even have been 4MB? The 2MB chunks become approx 512K
characters of ASCII85 output (when mostly full of zeros). Because each
chunk is a separate bo object the memory is not contiguous so each must
be printed with a separate call to xe_print_blob_ascii85(). If
xe_print_blob_ascii85() includes a terminating line feed at the end of
the print then that means multiple lines for the GuC log.
>
> I don't really want to keep updating this instead of fixing it.
>
>>
>> The dump helper could be updated to never add any line feeds, not even
>
> when you say "dump helper", what exactly are you talking about?
> xe_print_blob_ascii85(), __xe_devcoredump_read(), __drm_puts_coredump()
> or what?
xe_print_blob_ascii85
>
> If it's about xe_print_blob_ascii85(), I think this would be the first
> step:
> drop the \n you are adding. AFAICS calling the print function in chunks
> of 800B would be fine, just don't add any \n. If that still adds \n
> somewhere, we can check how to drop it or you then may have a pretty
> solid argument to adapt the mesa tool with a newline
> continuar char and send them the MR.
As above, the chunking is at the next level out. So dropping the 800
byte \n won't help with the GuC log still being split into multiple
lines. You have to either move the terminating \n outside of the helper
and into the callers or add a bool parameter to enable/disable it.
Either of which is just messy.
Or extend the helper to take an array/list of pointers in a
scatter/gather fashion.
And if you drop the 800B \n then you break the dmesg output facility.
Which means we can't get logs out from CI panic type situations.
>> at the end. And then the burden would be on the callers to add line
>> feeds as appropriate. That seems extremely messy, though. And it will
>> break the dmesg output facility. That does need the line wrapping at
>> ~800 characters which can't be done from outside the helper.
>>
>>>
>>> understand having to split it into multiple calls, but not something
>>> adding a \n. particularly for the functions dealing with seq_file and
>>> devcoredump.
>>>
>>>> + *
>>>> + * Only disable from devcoredump output.
>>>> */
>>>> - return;
>>>> + if (p->coredump)
>>>
>>> but we do want the guc log to be inside the devcoredump, so rather than
>>> adding more workarounds, can we fix it ?
>> The simplest fix is your suggestion of adding an extra white space
>> character at the end of the line in the helper function and updating
>> the mesa tool to support multi-line output. That seems like it should
>> be a trivial update to the mesa tool and it keeps the KMD side simple
>> and consistent across all potential output methods. But Jose seemed
>> absolutely dead set against any updates to the mesa tool at all :(.
>>
>> The more complex fix is Joonas' idea of using an official structured
>> output format of some sort - TOML, YAML, JSON or whatever. Something
>
> it actually doesn't fix this particular issue. It fixes others, but not
> this one.
If we do it properly then it fixes everything. The format library would
have helper functions that the devcoredump code can call to print
integers, bools, strings and BLOBs. The debugfs/sysfs output would just
stream the encoded output exactly as is. The dmesg output layer would
add the 800 character line wrapping in a way that the dmesg decoder can
undo before passing to the decoder library.
>
> "Guc Log": "<encoded data .........."
>
> if you need to split lines, you still need something to do do it in
> whatever format you choose. One json I remember working on in the past
> that embeds the binary is the .apj format... example:
> https://firmware.ardupilot.org/Copter/2024-12/2024-12-02-13:12/CubeBlack/arducopter.apj
>
>
> {
> "board_id": 9,
> "magic": "APJFWv1",
> "description": "Firmware for a STM32F427xx board",
> "image": "eNqUvAlcU1faMH5u7k1yCQGCICCgBoIaRW0EtVSxhg ...
>
> here "image" is base64-encoded afair. But in your case, just printing
> the json to dmesg wouldn't work because of needing to split the line.
> It wouldn't fix the kind of bugs we had in the past too: in json it
> would be the equivalent of adding a field to the wrong object and then
> move it elsewhere. An application trying to read that object in the
> previous place would be broken regardless.
>
> toml is more for configuration and yaml isn't a good one for this case
> as it further relies on indentation. I wouldn't oppose to use json, but
> don't expect to have dmesg to magically work because of that.
TOML basically says 'use base64 and print it as a string'. And it
supports multi-line strings, so the reader would have to drop newlines
within a base64 object before doing the base64 decode. Or just use a
base64 decoder that ignores whitespace.
YAML seems to make even more explicit. From the 1.2.2 spec page:
picture: !!binary |
R0lGODlhDAAMAIQAAP//9/X
17unp5WZmZgAAAOfn515eXv
Pz7Y6OjuDg4J+fn5OTk6enp
56enmleECcgggoBADs=
So the userland reader just gets binary data straight out of the decoder
library with no processing required.
>
> So let's fix the issue at hand and let the talk about xe's coredump
> format for a later time.
The issue at hand is that we have two fundamentally conflicting
requirements. DMesg output requires line wrapping but the mesa tool has
been coded in such a way that it breaks when given wrapped lines. There
is nothing we can do about dmesg but making the mesa tool's parser more
robust does not seem difficult.
Switching to an externally defined encoding scheme that internally
manages binary objects and supports wrapped lines would fix all issues
and guarantee future compatibility.
Note that line wrapping is not the only problem with the current mesa
tool. Unless it has been updated since? The tool breaks if the context
information is not part of the "GuC Communication Transport" section.
Which is totally incorrect. Context state is very definitely not part of
the GuC CT layer. That absolutely needs to be fixed. And generally,
anything we can do to make the decoding more robust and future proof
would be a very good thing.
Note that the devcoredump content is highly dependent upon the internal
workings of the driver. It is going to change over time. So the more
robust the decoder, the better chance it has of not breaking when some
new piece of data is added to the dump or an existing item removed or
modified.
>
>> that ideally supports binary data and compression natively and has a
>> helper library that can be included in the kernel source tree. That
>
> oh no, we'd need to define the schema - there are plenty of json
> libraries
> in multiple languages for users to choose from and that would be
> basically
> then main benefit of using it.
Well, we can certainly define extra formats on top of a base format but
it seems like it would be simpler if the base format supported all the
data types we need already. Doesn't really matter either way as long as
the underlying file format supports all the uses we want of it.
John.
>
> Lucas De Marchi
>
>> will then remove any and all confusion over interpretation of the
>> file forever more. But it would require updating all userland tools
>> to use the appropriate helper library.
>>
>> John.
>>
>>>
>>> Lucas De Marchi
>>
More information about the Intel-xe
mailing list