[PATCH v9 04/11] drm/xe/devcoredump: Add ASCII85 dump helper function

Fri Dec 13 14:18:04 UTC 2024

On Thu, Dec 12, 2024 at 01:04:23PM -0800, John Harrison wrote:
> On 12/12/2024 12:52, Lucas De Marchi wrote:
> > On Thu, Dec 12, 2024 at 11:14:23AM -0800, John Harrison wrote:
> > > On 12/12/2024 10:45, Lucas De Marchi wrote:
> > > > On Thu, Dec 12, 2024 at 05:41:41PM +0000, Jose Souza wrote:
> > > > > On Wed, 2024-10-02 at 17:46 -0700, John.C.Harrison at Intel.com wrote:
> > > > > > From: John Harrison <John.C.Harrison at Intel.com>
> > > > > > 
> > > > > > There is a need to include the GuC log and other large binary objects
> > > > > > in core dumps and via dmesg. So add a helper for dumping to a printer
> > > > > > function via conversion to ASCII85 encoding.
> > > > > > 
> > > > > > Another issue with dumping such a large buffer is that it can be slow,
> > > > > > especially if dumping to dmesg over a serial port. So add a yield to
> > > > > > prevent the 'task has been stuck for 120s' kernel hang check feature
> > > > > > from firing.
> > > > > > 
> > > > > > v2: Add a prefix to the output string. Fix memory allocation bug.
> > > > > > v3: Correct a string size calculation and clean up a define (review
> > > > > > feedback from Julia F).
> > > > > > 
> > > > > > Signed-off-by: John Harrison <John.C.Harrison at Intel.com>
> > > > > > Reviewed-by: Julia Filipchuk <julia.filipchuk at intel.com>
> > > > > > ---
> > > > > >  drivers/gpu/drm/xe/xe_devcoredump.c | 87 +++++++++++++++++++++++++++++
> > > > > >  drivers/gpu/drm/xe/xe_devcoredump.h |  6 ++
> > > > > >  2 files changed, 93 insertions(+)
> > > > > > 
> > > > > > diff --git a/drivers/gpu/drm/xe/xe_devcoredump.c b/drivers/gpu/drm/xe/xe_devcoredump.c
> > > > > > index 2690f1d1cde4..0884c49942fe 100644
> > > > > > --- a/drivers/gpu/drm/xe/xe_devcoredump.c
> > > > > > +++ b/drivers/gpu/drm/xe/xe_devcoredump.c
> > > > > > @@ -6,6 +6,7 @@
> > > > > >  #include "xe_devcoredump.h"
> > > > > >  #include "xe_devcoredump_types.h"
> > > > > > 
> > > > > > +#include <linux/ascii85.h>
> > > > > >  #include <linux/devcoredump.h>
> > > > > >  #include <generated/utsrelease.h>
> > > > > > 
> > > > > > @@ -315,3 +316,89 @@ int xe_devcoredump_init(struct xe_device *xe)
> > > > > >  }
> > > > > > 
> > > > > >  #endif
> > > > > > +
> > > > > > +/**
> > > > > > + * xe_print_blob_ascii85 - print a BLOB to some useful location in ASCII85
> > > > > > + *
> > > > > > + * The output is split to multiple lines because some print targets, e.g. dmesg
> > > > > > + * cannot handle arbitrarily long lines. Note also that printing to dmesg in
> > > > > > + * piece-meal fashion is not possible, each separate call to drm_puts() has a
> > > > > > + * line-feed automatically added! Therefore, the entire output line must be
> > > > > > + * constructed in a local buffer first, then printed in one atomic output call.
> > > > > > + *
> > > > > > + * There is also a scheduler yield call to prevent the 'task has been stuck for
> > > > > > + * 120s' kernel hang check feature from firing when printing to a slow target
> > > > > > + * such as dmesg over a serial port.
> > > > > > + *
> > > > > > + * TODO: Add compression prior to the ASCII85 encoding to shrink huge buffers down.
> > > > > > + *
> > > > > > + * @p: the printer object to output to
> > > > > > + * @prefix: optional prefix to add to output string
> > > > > > + * @blob: the Binary Large OBject to dump out
> > > > > > + * @offset: offset in bytes to skip from the front of the BLOB, must be a multiple of sizeof(u32)
> > > > > > + * @size: the size in bytes of the BLOB, must be a multiple of sizeof(u32)
> > > > > > + */
> > > > > > +void xe_print_blob_ascii85(struct drm_printer *p, const char *prefix,
> > > > > > +               const void *blob, size_t offset, size_t size)
> > > > > > +{
> > > > > > +    const u32 *blob32 = (const u32 *)blob;
> > > > > > +    char buff[ASCII85_BUFSZ], *line_buff;
> > > > > > +    size_t line_pos = 0;
> > > > > > +
> > > > > > +#define DMESG_MAX_LINE_LEN    800
> > > > > > +#define MIN_SPACE        (ASCII85_BUFSZ + 2)        /* 85 + "\n\0" */
> > > > > > +
> > > > > > +    if (size & 3)
> > > > > > +        drm_printf(p, "Size not word aligned: %zu", size);
> > > > > > +    if (offset & 3)
> > > > > > +        drm_printf(p, "Offset not word aligned: %zu", offset);
> > > > > > +
> > > > > > +    line_buff = kzalloc(DMESG_MAX_LINE_LEN, GFP_KERNEL);
> > > > > > +    if (IS_ERR_OR_NULL(line_buff)) {
> > > > > > +        drm_printf(p, "Failed to allocate line buffer: %pe", line_buff);
> > > > > > +        return;
> > > > > > +    }
> > > > > > +
> > > > > > +    blob32 += offset / sizeof(*blob32);
> > > > > > +    size /= sizeof(*blob32);
> > > > > > +
> > > > > > +    if (prefix) {
> > > > > > +        strscpy(line_buff, prefix, DMESG_MAX_LINE_LEN - MIN_SPACE - 2);
> > > > > > +        line_pos = strlen(line_buff);
> > > > > > +
> > > > > > +        line_buff[line_pos++] = ':';
> > > > > > +        line_buff[line_pos++] = ' ';
> > > > > > +    }
> > > > > > +
> > > > > > +    while (size--) {
> > > > > > +        u32 val = *(blob32++);
> > > > > > +
> > > > > > +        strscpy(line_buff + line_pos, ascii85_encode(val, buff),
> > > > > > +            DMESG_MAX_LINE_LEN - line_pos);
> > > > > > +        line_pos += strlen(line_buff + line_pos);
> > > > > > +
> > > > > > +        if ((line_pos + MIN_SPACE) >= DMESG_MAX_LINE_LEN) {
> > > > > > +            line_buff[line_pos++] = '\n';
> > > > > > +            line_buff[line_pos++] = 0;
> > > > > 
> > > > > This breaks ascii85 parser that we had up to now.
> > > It should not break the decoding of existing sections. This helper
> > > is only being used for the new GuC objects at present. As per the
> > > TODO, the intent is to use it for all blobs in the future. But that
> > > was deliberately left to a future update (which hasn't been posted
> > > yet) to avoid breaking the mesa tool.
> > 
> > it broke nonetheless because then it fails to decode this line the way
> > it was doing:  each line starts a new key.
> It barfs on any line it doesn't understand? Seems like it could be made to
> be more robust in general.

We do not break userspace.
When we do, we do not blame userspace.

I had warned about this and was ignored [1]:

"We shouldn't be breaking the current userspace tools. Any change like this would
need to be synchronized between all the current decode tools."

[1] https://lore.kernel.org/intel-xe/ZuLzSMH_hBl9RWdv@intel.com/

Next time, please respect the only Linux rule: no regressions.

Revert sent and applying it soon...

> 
> > 
> > > 
> > > 
> > > > > And I think there is not safe way to parse it now, how would
> > > > > the parser know that the blob reach to end?
> > > The GuC decoder tool simply ignores leading/trailing whitespace and
> > > keeps decoding until it finds a line with a new field - i.e.
> > > anything with a non-ASCII85 character such as ':' or '*'. It is totally
> > 
> > character set for ascii85 is [33, 117] and 122, which includes both ':'
> > and '*'. Hopefully it's hard to hit a collision, but we shouldn't design
> > it in a way that is possible.
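
To make the collision concern above concrete, here is a minimal userspace
sketch of a character-set check, assuming the set is exactly '!' (33) through
'u' (117) plus 'z' (122) as stated. The helper names are hypothetical and are
not taken from the kernel or from any decode tool.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* True if c can appear in ASCII85 output: 33..117 inclusive, or 'z' (122). */
static bool is_ascii85_char(unsigned char c)
{
	return (c >= 33 && c <= 117) || c == 'z';
}

/* Hypothetical helper: true if every character of the line is valid ASCII85. */
static bool line_is_all_ascii85(const char *line)
{
	assert(line != NULL);

	for (; *line; line++)
		if (!is_ascii85_char((unsigned char)*line))
			return false;
	return true;
}
```

With this check, both ':' and '*' count as valid payload characters, so a
decoder cannot treat either one as proof that a new field or section header
has started. A space (32) is outside the set.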
> Sorry, getting myself confused - it was a while ago when I was writing this.
> Yes, punctuation marks are part of the ASCII85 character set.
> 
> The GuC decoder looks for white space - a blank line or a line with a space
> in it. Any new field is guaranteed to be "name: data" so will have a space.
> And sections are delimited with blank lines, plus section headers are "****
> name ****" so also guaranteed to have a space.
> 
> John.
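
The whitespace rule described above could be sketched roughly as follows. This
is only an illustration under the stated assumptions (a blank line or any line
containing a space or tab ends the blob); the function name is hypothetical
and this is not the actual GuC decoder or mesa parser code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: decide whether a line continues the ASCII85 blob
 * started by the previous "name: data" line. A blank line, or a line that
 * contains a space or tab, is assumed to start something new.
 */
static bool is_blob_continuation(const char *line)
{
	size_t len;

	assert(line != NULL);

	/* Ignore the trailing line terminator, if any. */
	len = strcspn(line, "\r\n");
	if (len == 0)
		return false;	/* blank line: section delimiter */

	/* "name: data" and "**** name ****" both contain a space. */
	return memchr(line, ' ', len) == NULL &&
	       memchr(line, '\t', len) == NULL;
}
```

Note that this relies on every key and every section header containing
whitespace, which is a convention of the current dump format rather than
something the encoding itself guarantees.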
> 
> 
> > 
> > 
> > > reliable for me.
> > > 
> > > > 
> > > > Just looked at this code to check if we also had a "Size" as I remember
> > > > seeing it in previous related patches, but we don't. If
> > > > we did then you could use that to calculate how much you still had to
> > > > read, ignoring the \n. With the current scheme I think one
> > > > way would be to continue the previous key when the line doesn't start
> > > > with a new key. Awful, yes, and not future proof.
> > > Any new field is guaranteed to have a colon in it and any new
> > > section header is guaranteed to have a star in it. Sounds pretty
> > > reliable to me.
> > 
> > that's what I said by "continue the previous key when the line doesn't
> > start with a new one".
> > 
> > 
> > > 
> > > The other option would be to have an explicit prefix at the start of
> > > each continuation line. E.g. "A85:".
> > 
> > let's not make it worse than it already is. And not future proof by
> > given the encoding algorithm the meaning of "continuation line".
> > 
> > Lucas De Marchi
> > 
> > > 
> > > The problem with a size field is that ASCII85 encoding is data
> > > dependent - it has extremely basic run length encoding of strings of
> > > zeros. So you only know the size after the encoding is complete.
> > > Which means either the size field is in the dump after the encoded
> > > data (which is therefore useless) or you can't stream the encode and
> > > must encode to an in-memory buffer first, then print the size, then
> > > print your pre-encoded data.
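
The data dependence mentioned above can be seen in a small userspace sketch
that mirrors, as far as I understand it, what ascii85_encode() in
<linux/ascii85.h> does: a zero word becomes a single 'z', any other 32-bit
word becomes five characters. The a85_* names are stand-ins for illustration,
not kernel API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define A85_BUFSZ 6	/* five characters plus the terminating NUL */

/* Userspace stand-in for the kernel helper: encode one 32-bit word. */
static const char *a85_encode(uint32_t in, char *out)
{
	int i;

	assert(out != NULL);

	if (in == 0)
		return "z";	/* all-zero word collapses to one character */

	out[5] = '\0';
	for (i = 5; i--; ) {
		out[i] = '!' + in % 85;
		in /= 85;
	}
	return out;
}

/* Encoded length of a buffer: only known after walking all of the data. */
static size_t a85_encoded_len(const uint32_t *words, size_t count)
{
	char buf[A85_BUFSZ];
	size_t total = 0;

	while (count--)
		total += strlen(a85_encode(*words++, buf));
	return total;
}
```

Two buffers with the same number of words can therefore encode to different
lengths, which is why an encoded-size field cannot be emitted ahead of a
streamed encode. A field giving the raw, pre-encoding size would not have
that problem, since it is known up front.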
> > > 
> > > 
> > > > 
> > > > John, I explicitly said *we can't break the existent users*,
> > > > pointed to the one in the mesa repo and confirmed with José it would be
> > > > indeed a breakage. Please check again the past email thread at
> > > > https://lore.kernel.org/all/3jexgpnh3br3gqi4ol4c3hx3tyhwevq5nqo5xssyie3xglqohz@e7mnj4dewupu/
> > > > 
> > > > 
> > > That discussion was a question for Jose not a statement. As there
> > > was no response for several weeks, I assumed that it wasn't a
> > > problem after all.
> > > 
> > > > 
> > > > Did you at least prepare the mesa parser for that additional \n?
> > > > 
> > > > > We shouldn't have merged the kernel patch as is with the excuse it reads
> > > > > better in dmesg, when we also said we should not print that garbage to
> > > > > dmesg :(
> > > Not following. Should not print what garbage in dmesg? As I keep
> > > saying, none of this goes to dmesg by default except in the case of
> > > catastrophic CT failure. And in that case, it is extremely useful
> > > and the only way to debug issues.
> > > 
> > > John.
> > > 
> > > > 
> > > > Lucas De Marchi
> > > 
>