[PATCH v9 04/11] drm/xe/devcoredump: Add ASCII85 dump helper function

Fri Dec 13 16:42:09 UTC 2024

On 12/13/2024 06:18, Rodrigo Vivi wrote:
> On Thu, Dec 12, 2024 at 01:04:23PM -0800, John Harrison wrote:
>> On 12/12/2024 12:52, Lucas De Marchi wrote:
>>> On Thu, Dec 12, 2024 at 11:14:23AM -0800, John Harrison wrote:
>>>> On 12/12/2024 10:45, Lucas De Marchi wrote:
>>>>> On Thu, Dec 12, 2024 at 05:41:41PM +0000, Jose Souza wrote:
>>>>>> On Wed, 2024-10-02 at 17:46 -0700, John.C.Harrison at Intel.com wrote:
>>>>>>> From: John Harrison <John.C.Harrison at Intel.com>
>>>>>>>
>>>>>>> There is a need to include the GuC log and other large binary objects
>>>>>>> in core dumps and via dmesg. So add a helper for dumping to a printer
>>>>>>> function via conversion to ASCII85 encoding.
>>>>>>>
>>>>>>> Another issue with dumping such a large buffer is that
>>>>>>> it can be slow,
>>>>>>> especially if dumping to dmesg over a serial port. So add a yield to
>>>>>>> prevent the 'task has been stuck for 120s' kernel hang check feature
>>>>>>> from firing.
>>>>>>>
>>>>>>> v2: Add a prefix to the output string. Fix memory allocation bug.
>>>>>>> v3: Correct a string size calculation and clean up a define (review
>>>>>>> feedback from Julia F).
>>>>>>>
>>>>>>> Signed-off-by: John Harrison <John.C.Harrison at Intel.com>
>>>>>>> Reviewed-by: Julia Filipchuk <julia.filipchuk at intel.com>
>>>>>>> ---
>>>>>>>   drivers/gpu/drm/xe/xe_devcoredump.c | 87
>>>>>>> +++++++++++++++++++++++++++++
>>>>>>>   drivers/gpu/drm/xe/xe_devcoredump.h |  6 ++
>>>>>>>   2 files changed, 93 insertions(+)
>>>>>>>
>>>>>>> diff --git a/drivers/gpu/drm/xe/xe_devcoredump.c
>>>>>>> b/drivers/gpu/drm/xe/xe_devcoredump.c
>>>>>>> index 2690f1d1cde4..0884c49942fe 100644
>>>>>>> --- a/drivers/gpu/drm/xe/xe_devcoredump.c
>>>>>>> +++ b/drivers/gpu/drm/xe/xe_devcoredump.c
>>>>>>> @@ -6,6 +6,7 @@
>>>>>>>   #include "xe_devcoredump.h"
>>>>>>>   #include "xe_devcoredump_types.h"
>>>>>>>
>>>>>>> +#include <linux/ascii85.h>
>>>>>>>   #include <linux/devcoredump.h>
>>>>>>>   #include <generated/utsrelease.h>
>>>>>>>
>>>>>>> @@ -315,3 +316,89 @@ int xe_devcoredump_init(struct xe_device *xe)
>>>>>>>   }
>>>>>>>
>>>>>>>   #endif
>>>>>>> +
>>>>>>> +/**
>>>>>>> + * xe_print_blob_ascii85 - print a BLOB to some useful
>>>>>>> location in ASCII85
>>>>>>> + *
>>>>>>> + * The output is split to multiple lines because some
>>>>>>> print targets, e.g. dmesg
>>>>>>> + * cannot handle arbitrarily long lines. Note also that
>>>>>>> printing to dmesg in
>>>>>>> + * piece-meal fashion is not possible, each separate
>>>>>>> call to drm_puts() has a
>>>>>>> + * line-feed automatically added! Therefore, the entire
>>>>>>> output line must be
>>>>>>> + * constructed in a local buffer first, then printed in
>>>>>>> one atomic output call.
>>>>>>> + *
>>>>>>> + * There is also a scheduler yield call to prevent the
>>>>>>> 'task has been stuck for
>>>>>>> + * 120s' kernel hang check feature from firing when
>>>>>>> printing to a slow target
>>>>>>> + * such as dmesg over a serial port.
>>>>>>> + *
>>>>>>> + * TODO: Add compression prior to the ASCII85 encoding
>>>>>>> to shrink huge buffers down.
>>>>>>> + *
>>>>>>> + * @p: the printer object to output to
>>>>>>> + * @prefix: optional prefix to add to output string
>>>>>>> + * @blob: the Binary Large OBject to dump out
>>>>>>> + * @offset: offset in bytes to skip from the front of
>>>>>>> the BLOB, must be a multiple of sizeof(u32)
>>>>>>> + * @size: the size in bytes of the BLOB, must be a
>>>>>>> multiple of sizeof(u32)
>>>>>>> + */
>>>>>>> +void xe_print_blob_ascii85(struct drm_printer *p, const
>>>>>>> char *prefix,
>>>>>>> +               const void *blob, size_t offset, size_t size)
>>>>>>> +{
>>>>>>> +    const u32 *blob32 = (const u32 *)blob;
>>>>>>> +    char buff[ASCII85_BUFSZ], *line_buff;
>>>>>>> +    size_t line_pos = 0;
>>>>>>> +
>>>>>>> +#define DMESG_MAX_LINE_LEN    800
>>>>>>> +#define MIN_SPACE        (ASCII85_BUFSZ + 2)        /*
>>>>>>> 85 + "\n\0" */
>>>>>>> +
>>>>>>> +    if (size & 3)
>>>>>>> +        drm_printf(p, "Size not word aligned: %zu", size);
>>>>>>> +    if (offset & 3)
>>>>>>> +        drm_printf(p, "Offset not word aligned: %zu", size);
>>>>>>> +
>>>>>>> +    line_buff = kzalloc(DMESG_MAX_LINE_LEN, GFP_KERNEL);
>>>>>>> +    if (IS_ERR_OR_NULL(line_buff)) {
>>>>>>> +        drm_printf(p, "Failed to allocate line buffer:
>>>>>>> %pe", line_buff);
>>>>>>> +        return;
>>>>>>> +    }
>>>>>>> +
>>>>>>> +    blob32 += offset / sizeof(*blob32);
>>>>>>> +    size /= sizeof(*blob32);
>>>>>>> +
>>>>>>> +    if (prefix) {
>>>>>>> +        strscpy(line_buff, prefix, DMESG_MAX_LINE_LEN -
>>>>>>> MIN_SPACE - 2);
>>>>>>> +        line_pos = strlen(line_buff);
>>>>>>> +
>>>>>>> +        line_buff[line_pos++] = ':';
>>>>>>> +        line_buff[line_pos++] = ' ';
>>>>>>> +    }
>>>>>>> +
>>>>>>> +    while (size--) {
>>>>>>> +        u32 val = *(blob32++);
>>>>>>> +
>>>>>>> +        strscpy(line_buff + line_pos, ascii85_encode(val, buff),
>>>>>>> +            DMESG_MAX_LINE_LEN - line_pos);
>>>>>>> +        line_pos += strlen(line_buff + line_pos);
>>>>>>> +
>>>>>>> +        if ((line_pos + MIN_SPACE) >= DMESG_MAX_LINE_LEN) {
>>>>>>> +            line_buff[line_pos++] = '\n';
>>>>>>> +            line_buff[line_pos++] = 0;
>>>>>> This breaks ascii85 parser that we had up to now.
>>>> It should not break the decoding of existing sections. This helper
>>>> is only being used for the new GuC objects at present. As per the
>>>> TODO, the intent is to use it for all blobs in the future. But that
>>>> was deliberately left to a future update (which hasn't been posted
>>>> yet) to avoid breaking the mesa tool.
>>> it broke nonetheless because then it fails to decode this line the way
>>> it was doing:  each line starts a new key.
>> It barfs on any line it doesn't understand? Seems like it could be made to
>> be more robust in general.
> We do not break userspace.
> When we do we do not blame userspace.
There is a difference between blaming and suggesting ways to improve. 
This is an ongoing discussion to work out exactly what went wrong and 
what is the best way to fix it. I am not blaming anyone or saying that I 
don't care and am not going to help find the proper solution.

>
> I had warned it and was ignored:
>
> "We shouldn't be breaking the current userspace tools. Any change like this would
> need to be synchronized between all the current decode tools."
>
> [1] https://lore.kernel.org/intel-xe/ZuLzSMH_hBl9RWdv@intel.com/ and ignored
>
> Next time, please respect the only Linux rule: No regression.
As per other parts of discussion, it was never actually stated that this 
would be a regression. Various things were discussed but no definitive 
"this will break because..." was ever stated. Indeed, the patches were 
specifically written to only add new data and not change any existing 
data specifically to prevent any breakage of existing tools.

> Revert sent and applying it soon...
The revert will break the driver build. Please do not apply it. It will 
also not actually fix the mesa tool's problem with section headers.

John.

>
>>>>
>>>>>> And I think there is not safe way to parse it now, how would
>>>>>> the parser know that the blob reach to end?
>>>> The GuC decoder tool simply ignores leading/trailing whitespace and
>>>> keeps decoding until it finds a line with a new field - i.e.
>>>> anything with a non-ASCII85 character such as ':' or '*'. It is
>>>> totally
>>> character set for ascii85 is [33, 117] and 122, which includes both ':'
>>> and '*'. Hopefully it's hard to hit a collision, but we shouldn't design
>>> it in a way that is possible.
>> Sorry, getting myself confused - it was a while ago when I was writing this.
>> Yes, punctuation marks are part of the ASCII85 character set.
>>
>> The GuC decoder looks for white space - a blank line or a line with a space
>> in it. Any new field is guaranteed to be "name: data" so will have a space.
>> And sections are delimited with blank lines, plus section headers are "****
>> name ****" so also guaranteed to have a space.
>>
>> John.
>>
>>
>>>
>>>> reliable for me.
>>>>
>>>>> Just looked at this code to check if we also had a "Size" as I remember
>>>>> seeing it in previous related patches, but we don't. If
>>>>> we did then you could use that to calculate how much you still had to
>>>>> read, ignoring the \n. With the current scheme I think one
>>>>> way would be to continue the previous key when the line doesn't start
>>>>> with a new key. Awful, yes, and not future proof.
>>>> Any new field is guaranteed to have a colon in it and any new
>>>> section header is guaranteed to have a star in it. Sounds pretty
>>>> reliable to me.
>>> that's what I said by "continue the previous key when the line doesn't
>>> start with a new one".
>>>
>>>
>>>> The other option would be to have an explicit prefix at the start of
>>>> each continuation line. E.g. "A85:".
>>> let's not make it worse than it already is. And not future proof by
>>> given the encoding algorithm the meaning of "continuation line".
>>>
>>> Lucas De Marchi
>>>
>>>> The problem with a size field is that ASCII85 encoding is data
>>>> dependent - it has extremely basic run length encoding of strings of
>>>> zeros. So you only know the size after the encoding is complete.
>>>> Which means either the size field is in the dump after the encoded
>>>> data (which is therefore useless) or you can't stream the encode and
>>>> must encode to an in-memory buffer first, then print the size, then
>>>> print your pre-encoded data.
>>>>
>>>>
>>>>> John, I explicitly said *we can't break the existent users*,
>>>>> pointed to the one in the mesa repo and confirmed with José it would be
>>>>> indeed a breakage. Please check again the past email thread at
>>>>> https://lore.kernel.org/all/3jexgpnh3br3gqi4ol4c3hx3tyhwevq5nqo5xssyie3xglqohz@e7mnj4dewupu/
>>>>>
>>>>>
>>>> That discussion was a question for Jose not a statement. As there
>>>> was no response for several weeks, I assumed that it wasn't a
>>>> problem after all.
>>>>
>>>>> Did you at least prepare the mesa parser for that additional \n?
>>>>>
>>>>> We shouldn't have merged the kernel patch as is with the excuse
>>>>> it reads
>>>>> better in dmesg, when we also said we should not print that garbage to
>>>>> dmesg :(
>>>> Not following. Should not print what garbage in dmesg? As I keep
>>>> saying, none of this goes to dmesg by default except in the case of
>>>> catastrophic CT failure. And in that case, it is extremely useful
>>>> and the only way to debug issues.
>>>>
>>>> John.
>>>>
>>>>> Lucas De Marchi