[PATCH v2] drm/print: Introduce drm_line_printer

Thu May 30 18:47:24 UTC 2024

On 5/30/2024 02:33, Michal Wajdeczko wrote:
> On 30.05.2024 09:49, Jani Nikula wrote:
>> On Wed, 29 May 2024, John Harrison <john.c.harrison at intel.com> wrote:
>>> On 5/28/2024 06:06, Michal Wajdeczko wrote:
>>>> This drm printer wrapper can be used to increase the robustness of
>>>> the captured output generated by any other drm_printer to make sure
>>>> we didn't lost any intermediate lines of the output by adding line
>>>> numbers to each output line. Helpful for capturing some crash data.
>>>>
>>>> Signed-off-by: Michal Wajdeczko <michal.wajdeczko at intel.com>
>>>> Cc: Jani Nikula <jani.nikula at intel.com>
>>>> Cc: John Harrison <John.C.Harrison at Intel.com>
>>>> ---
>>>> v2: don't abuse prefix, use union instead (Jani)
>>>>       don't use 'dp' as name, prefer 'p' (Jani)
>>>>       add support for unique series identifier (John)
>>>> ---
>>>>    drivers/gpu/drm/drm_print.c | 14 ++++++++
>>>>    include/drm/drm_print.h     | 68 ++++++++++++++++++++++++++++++++++++-
>>>>    2 files changed, 81 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/drivers/gpu/drm/drm_print.c b/drivers/gpu/drm/drm_print.c
>>>> index cf2efb44722c..be9cbebff5b3 100644
>>>> --- a/drivers/gpu/drm/drm_print.c
>>>> +++ b/drivers/gpu/drm/drm_print.c
>>>> @@ -214,6 +214,20 @@ void __drm_printfn_err(struct drm_printer *p, struct va_format *vaf)
>>>>    }
>>>>    EXPORT_SYMBOL(__drm_printfn_err);
>>>>    
>>>> +void __drm_printfn_line(struct drm_printer *p, struct va_format *vaf)
>>>> +{
>>>> +	unsigned int counter = ++p->line.counter;
>>> Wrong units, but see below anyway...
> it really doesn't matter as it is temporary var used in printf()
> actual 'short' counter will wrap on its own unit boundary
It should still match. Otherwise the code is ambiguous. Was it meant to 
be an int? Was it meant to be a short? Just because code compiles 
doesn't mean it is good.

>
>>>> +	const char *prefix = p->prefix ?: "";
>>>> +	const char *pad = p->prefix ? " " : "";
>>>> +
>>>> +	if (p->line.series)
>>>> +		drm_printf(p->arg, "%s%s%u.%u: %pV",
>>>> +			   prefix, pad, p->line.series, counter, vaf);
>>>> +	else
>>>> +		drm_printf(p->arg, "%s%s%u: %pV", prefix, pad, counter, vaf);
>>>> +}
>>>> +EXPORT_SYMBOL(__drm_printfn_line);
>>>> +
>>>>    /**
>>>>     * drm_puts - print a const string to a &drm_printer stream
>>>>     * @p: the &drm printer
>>>> diff --git a/include/drm/drm_print.h b/include/drm/drm_print.h
>>>> index 089950ad8681..f4d9b98d7909 100644
>>>> --- a/include/drm/drm_print.h
>>>> +++ b/include/drm/drm_print.h
>>>> @@ -176,7 +176,13 @@ struct drm_printer {
>>>>    	void (*puts)(struct drm_printer *p, const char *str);
>>>>    	void *arg;
>>>>    	const char *prefix;
>>>> -	enum drm_debug_category category;
>>>> +	union {
>>>> +		enum drm_debug_category category;
>>>> +		struct {
>>>> +			unsigned short series;
>>>> +			unsigned short counter;
>>> Any particular reason for using 'short' rather than 'int'? Short is only
> didn't want to increase the size of the struct drm_printer and with
> little luck sizeof two shorts will be less/equal sizeof enum
Depending on the compiler, the architecture and what values have been 
defined within it, an enum is possibly (likely?) to be a char.

>
>>> 16bits right? That might seem huge but a GuC log buffer with 16MB debug
>>> log (and minimum sizes for the other sections) when dumped via the
>>> original debugfs hexdump mechanism is 1,245,444 lines. That line count
> if your capture relies on collecting all 1,245,444 lines then likely you
> have other problem that needs solving than line counter overflow
Have you ever used a full 16MB GuC log? And read it out via debugfs? 
Then that was 1.2 million lines of text that you read out. Did you have 
other problems that meant reading that file was a waste of your time? Or 
did it allow you to debug the issue you were working on?

The purpose of this patch is to 'improve' the fully working version that 
was already posted. Causing unwanted wraps in the line count is not an 
improvement. It is very definitely a bug. And now your argument is that 
we shouldn't be doing this in the first place? That's a given! Dumping 
huge streams of data to dmesg is a total hack. We absolutely should not 
be doing it. But we have no choice because there is no other way 
(without adding even bigger and more complicated mechanisms involving 
external debug modules or something).

>
>>> goes down a lot when you start using longer lines for the dump, but it
>>> is still in the tens of thousands of lines.  So limiting to 16 bits is
>>> an overflow just waiting to happen.
> but even in case of an overflow it should be relatively easy to teach
> any parser to deal with line .0 as indicator to restart any tracker
Wasn't your earlier argument that trivially parsing out the line count 
prefix from a debugfs file was far too much effort and cannot possibly 
be done by a developer. Now you are saying that coping with a broken 
count is "easy to teach a parser". The one single purpose of this entire 
change is to guarantee a valid dump can be extracted from a log. 
Anything that potentially prevents that from working is a fundamental 
failure.

>
> and it is highly unlikely that any captured logs will miss exactly
> 65,536 contiguous lines, but even then it should be noticeable gap
Or we could just use an integer count that is not going to wrap and be 
ambiguous.

>
>>>> +		} line;
>>>> +	};
>>>>    };
>>>>    
>>>>    void __drm_printfn_coredump(struct drm_printer *p, struct va_format *vaf);
>>>> @@ -186,6 +192,7 @@ void __drm_puts_seq_file(struct drm_printer *p, const char *str);
>>>>    void __drm_printfn_info(struct drm_printer *p, struct va_format *vaf);
>>>>    void __drm_printfn_dbg(struct drm_printer *p, struct va_format *vaf);
>>>>    void __drm_printfn_err(struct drm_printer *p, struct va_format *vaf);
>>>> +void __drm_printfn_line(struct drm_printer *p, struct va_format *vaf);
>>>>    
>>>>    __printf(2, 3)
>>>>    void drm_printf(struct drm_printer *p, const char *f, ...);
>>>> @@ -357,6 +364,65 @@ static inline struct drm_printer drm_err_printer(struct drm_device *drm,
>>>>    	return p;
>>>>    }
>>>>    
>>>> +/**
>>>> + * drm_line_printer - construct a &drm_printer that prefixes outputs with line numbers
>>>> + * @p: the &struct drm_printer which actually generates the output
>>>> + * @prefix: optional output prefix, or NULL for no prefix
>>>> + * @series: optional unique series identifier, or 0 to omit identifier in the output
>>>> + *
>>>> + * This printer can be used to increase the robustness of the captured output
>>>> + * to make sure we didn't lost any intermediate lines of the output. Helpful
>>>> + * while capturing some crash data.
>>>> + *
>>>> + * Example 1::
>>>> + *
>>>> + *	void crash_dump(struct drm_device *drm)
>>>> + *	{
>>>> + *		static unsigned short id;
>>>> + *		struct drm_printer p = drm_err_printer(drm, "crash");
>>>> + *		struct drm_printer lp = drm_line_printer(&p, "dump", ++id);
>>> Is there any benefit or other reason to split the prefix across two
>>> separate printers? It seems like a source of confusion. As in, the code
> it's not any kind of the required 'split', as both printers used here
> can treat prefix as optional if NULL, but rather a way to show how to
> pass two potentially separated prefixes, as one of them could be already
> prepared (like engine name or any other alias) or if the primary printer
> does not accept any prefix at all (and this a limitation of our existing
> custom GT oriented printers [1] [2])
>
> [1]
> https://elixir.bootlin.com/linux/latest/source/drivers/gpu/drm/xe/xe_gt_printk.h#L66
> [2]
> https://elixir.bootlin.com/linux/latest/source/drivers/gpu/drm/xe/xe_gt_printk.h#L81
As per earlier comments, my point is not that you should change the 
patch to remove one of the prefixes from the code. My point is that the 
documentation is confusing. You are explicitly splitting a single phrase 
"crash dump" across two separate printer objects with no explanation as 
to why. And as you just pointed out, there are many use cases where 
there would not be the option to split it. So it would be much, much 
clearer to pass NULL to your drm_err_printer example and have a single 
line comment saying that multiple prefixes could be used if allowed by 
the printer objects and if useful in the situation. Rather than having a 
bizarrely split string with no explanation as to why it has been split.

>
>>> will allow a double prefix, there is not much you can do about that
>>> because losing the prefix from drm_line_printer would mean no prefix at
> but why would we loose the prefix from the primary printer ?
I don't know what you mean by the primary printer? As above, I was 
simply trying to say that I am not requesting a code change but just a 
clarification of the documentation.

>
>>> all when not using drm_err_printer underneath. But why explicitly split
>>> the message across both printers in the usage example? This is saying
>>> that this is the recommended way to use the interface, but with no
>>> explanation of why the split is required or how the split should be done.
> the drm_line_printer is flexible and can be used in many configurations,
> above is just one of the potential uses that shows full output
>
>> You could have a printer, and then add two separate line counted blocks.
>>
>> 	struct drm_printer p = drm_err_printer(drm, "parent");
>> 	struct drm_printer lp1 = drm_line_printer(&p, "child 1", 0);
>>
>> 	...
>>
>> 	struct drm_printer lp2 = drm_line_printer(&p, "child 2", 0);
>>
>> 	...
>>
>> p could be defined elsewhere and passed into separate functions which
>> each have the line printing. The two prefixes can be useful.
> didn't considered that, but as stated above, drm_line_printer is
> flexible and can be used in many different ways, like this new one
And you really do not need to list them all out as massively verbose 
examples with confusing differences between them that are not explained. 
A single example plus a couple of lines of description would be much 
clearer.

>
>>> Also, there is really no specific connection to crashes. The purpose of
>>> this is for allowing the dumping of multi-line blocks of data. One use
>>> is for debugging crashes. But there are many debug operations that
>>> require the same dump and do not involve a crash. And this support is
>>> certainly not intended to be used on general end-user GPU hangs. For
>>> those, everything should be part of the devcoredump core file that is
>>> produced and accessible via sysfs. We absolutely should not be dumping
>>> huge data blocks in customer release drivers except in very extreme
>>> circumstances.
> if you are trying to convince me that we don't need any custom
> drm_printer that would take care of tracking and printing line numbers
> in kind of more robust way and instead we should be doing such line
> prints in a error prone way on it's own as you did in [3], then sorry,
> I'm not convinced, unless you just feel that it shouldn't be part of the
> drm yet, but then decision is drm maintainer hands (and also in the Xe
> maintainers who don't want to fall into i915-ish private solutions trap)
>
> [3] https://patchwork.freedesktop.org/patch/594021/?series=133349&rev=2
No. I am saying that your example use case seems to be implying a much 
greater usage for this mechanism than should be expected. I'm saying 
that it should never occur in an end user system because dumping 
megabytes of data to dmesg is a bad user experience. It absolutely 
should never be a standard part of handling a GPU hang type 'crash'. The 
primary purpose is for internal debug by developers only. If a use case 
gets shipped upstream then it should be an extremely hard to hit corner 
case for which we are desperate to get any useful debug logs by any 
means necessary.

As for error prone, I am not seeing how the original trivial (and 
working) code is error prone but this complex abstraction of it is less 
so. Especially given the integer truncation problem. I mean seriously, 
how 'error prone' can it be to add a "%d, line++" to a print?! And how 
much of a 'private solutions trap' is it to add such a trivial prefix to 
a couple of prints in a single function that is really a big ugly hack 
for getting logs out in a really dodgy manner anyway?

As you say, it is up to the DRM maintainers as to whether they want this 
support in the generic DRM layers. If it lands and it is functional 
(i.e. does not break its sole reason for being by truncating counts 
partway through a dump) then sure, I'll use it. I just don't see that it 
is even remotely worth the effort given that it is single use only and 
given how trivial the local version is.

John.

>
>> A devcoredump implementation could use a drm_printer too?
>>
>> Is this only about bikeshedding the example? I'm not sure I get the
>> point here.
>>
>>>> + *
>>>> + *		drm_printf(&lp, "foo");
>>>> + *		drm_printf(&lp, "bar");
>>>> + *	}
>>>> + *
>>>> + * Above code will print into the dmesg something like::
>>>> + *
>>>> + *	[ ] 0000:00:00.0: [drm] *ERROR* crash dump 1.1: foo
>>>> + *	[ ] 0000:00:00.0: [drm] *ERROR* crash dump 1.2: bar
>>>> + *
>>>> + * Example 2::
>>>> + *
>>>> + *	void line_dump(struct device *dev)
>>>> + *	{
>>>> + *		struct drm_printer p = drm_info_printer(dev);
>>>> + *		struct drm_printer lp = drm_line_printer(&p, NULL, 0);
>>>> + *
>>>> + *		drm_printf(&lp, "foo");
>>>> + *		drm_printf(&lp, "bar");
>>>> + *	}
>>>> + *
>>>> + * Above code will print::
>>>> + *
>>>> + *	[ ] 0000:00:00.0: [drm] 1: foo
>>>> + *	[ ] 0000:00:00.0: [drm] 2: bar
>>> Not really seeing the point of having two examples listed. The first
>>> includes all the optional extras, the second is just repeating with no
>>> new information.
>> You see how the "series" param behaves?
> exactly
>
>> BR,
>> Jani.
>>
>>> John.
>>>
>>>> + *
>>>> + * RETURNS:
>>>> + * The &drm_printer object
>>>> + */
>>>> +static inline struct drm_printer drm_line_printer(struct drm_printer *p,
>>>> +						  const char *prefix,
>>>> +						  unsigned short series)
>>>> +{
>>>> +	struct drm_printer lp = {
>>>> +		.printfn = __drm_printfn_line,
>>>> +		.arg = p,
>>>> +		.prefix = prefix,
>>>> +		.line = { .series = series, },
>>>> +	};
>>>> +	return lp;
>>>> +}
>>>> +
>>>>    /*
>>>>     * struct device based logging
>>>>     *