[Intel-gfx] [PATCH v2 4/5] drm/i915/guc: Add GuC CT specific debug print wrappers

Fri Dec 2 20:14:04 UTC 2022

On 12/1/2022 04:01, Tvrtko Ursulin wrote:
> On 01/12/2022 11:56, Michal Wajdeczko wrote:
>> On 01.12.2022 01:41, John Harrison wrote:
>>> On 11/23/2022 12:45, Michal Wajdeczko wrote:
>>>> On 23.11.2022 02:25, John Harrison wrote:
>>>>> On 11/22/2022 09:54, Michal Wajdeczko wrote:
>>>>>> On 18.11.2022 02:58, John.C.Harrison at Intel.com wrote:
>>>>>>> From: John Harrison <John.C.Harrison at Intel.com>
>>>>>>>
>>>>>>> Re-work the existing GuC CT printers and extend as required to 
>>>>>>> match
>>>>>>> the new wrapping scheme.
>>>>>>>
>>>>>>>
>>>>>>> [snip]...
>>>>>>>
>>>>>>>       /**
>>>>>>>      * DOC: CTB Blob
>>>>>>> @@ -170,7 +188,7 @@ static int ct_control_enable(struct 
>>>>>>> intel_guc_ct
>>>>>>> *ct, bool enable)
>>>>>>>         err = guc_action_control_ctb(ct_to_guc(ct), enable ?
>>>>>>>                          GUC_CTB_CONTROL_ENABLE :
>>>>>>> GUC_CTB_CONTROL_DISABLE);
>>>>>>>         if (unlikely(err))
>>>>>>> -        CT_PROBE_ERROR(ct, "Failed to control/%s CTB (%pe)\n",
>>>>>>> +        ct_probe_error(ct, "Failed to control/%s CTB (%pe)\n",
>>>>>>>                        str_enable_disable(enable), ERR_PTR(err));
>>>>>> btw, shouldn't we change all messages to start with lowercase ?
>>>>>>
>>>>>> was:
>>>>>>       "CT0: Failed to control/%s CTB (%pe)"
>>>>>> is:
>>>>>>       "GT0: GuC CT Failed to control/%s CTB (%pe)"
>>>>>>
>>>>>> unless we keep colon (as suggested by Tvrtko) as then:
>>>>>>
>>>>>>       "GT0: GuC CT: Failed to control/%s CTB (%pe)"
>>>>> Blanket added the colon makes it messy when a string actually 
>>>>> wants to
>>>>> start with the prefix. The rule I've been using is lower case word 
>>>>> when
>>>>> the prefix was part of the string, upper case word when the prefix is
>>>> Hmm, I'm not sure that we should attempt to have such a flexible 
>>>> rule as
>>>> we shouldn't rely too much on actual format of the prefix as it 
>>>> could be
>>>> changed any time.  All we should know about final log message is 
>>>> that it
>>>> _will_ properly identify the "GT" or "GuC" that this log is related 
>>>> to.
>>>>
>>>> So I would suggest to be just consistent and probably always start 
>>>> with
>>>> upper case, as that seems to be mostly used in kernel error logs, and
>>>> just make sure that any prefix will honor that (by including colon, or
>>>> braces), so this will always work like:
>>>>
>>>> "[drm] *ERROR* GT0: Failed to foo (-EIO)"
>>>> "[drm] *ERROR* GT0: GUC: Failed to foo (-EIO)"
>>>> "[drm] *ERROR* GT0: GUC: CT: Failed to foo (-EIO)"
>>>>
>>>> or
>>>>
>>>> "[drm] *ERROR* GT0: Failed to foo (-EIO)"
>>>> "[drm] *ERROR* GT0: [GUC] Failed to foo (-EIO)"
>>>> "[drm] *ERROR* GT0: [GUC] CT: Failed to foo (-EIO)"
>>>>
>>>> and even for:
>>>>
>>>> "[drm] *ERROR* GT(root) Failed to foo (-EIO)"
>>>> "[drm] *ERROR* GuC(media) Failed to foo (-EIO)"
>>>> "[drm] *ERROR* GT0 [GuC:CT] Failed to foo (-EIO)"
>>> All of which are hideous/complex/verbose/inconsistent. 'GT0: GUC: CT:'?
>>> Really? Or 'GT0: [GUC] CT:'? Why the random mix of separators? And how
>>> would you implement '[GUC:CT]' without having a CT definition that is
>>> entirely self contained and does chain on to the GuC level version?
>>
>> you missed the point, as those were just examples of different possible
>> prefixes that one could define, to show that actual message shall not
>> make any assumption how such prefix will look like or how it will end
>> (like with or w/o colon, with "GuC" or "GT" tag or whatever)
The point is that none of those are ever likely to happen so are 
meaningless to prepare for.

>>
>>>
>>> This is pointless bikeshedding. If you want to re-write every single
>>> debug print (yet again) and invent much more complicated macro
>>
>> the opposite, I want clear understanding how messages should be written
>> to *avoid* rewriting them if (for some reason) we decide to change or
>> update the prefix in the future
You say that like there is any consistency or regulation at all on how 
messages are currently written.

>>
>>> definitions then feel free to take over the patch set. If not can we
>>> just approve the v3 version and move on to doing some actual work?
>>
>> if everyone is happy that there is inconsistency in use between gt_xxx
>> messages where we shall be using messages starting with upper case
>> (since prefix ends with colon) and guc/ct_xxx messages where we shall be
>> using lower case message (since there is a known prefix without colon,
>> either "GuC" or "CT") then I'll be also fine, but for now that bothers
>> me a little, hence asking for clarifications/agreement
>>
I don't think anyone is happy with anything.

Personally, I don't like the idea of adding fixed prefixes for every 
subsystem just for the sake of doing so. Having a wrapper that abstracts 
out the messy structure mangling of getting from a 'ct' object back to a 
'gt' object maybe has use. But with regards to fixed prefixes, all I 
ever wanted was to add GT# (because that adds useful information that is 
other completely lacking) and, ideally, a display prefix for all of the 
display prints (because they completely swamp all other output in CI 
dmesg logs, so being able to trivially filter them out would be 
incredibly useful). Beyond that, it seems like adding work and forced 
formatting just for the sake of it. If a print wants to say 'Error 
received on CT read channel' then why should it be forced to be 'GuC CT 
Error received on read channel' which is coded as 'Error received on 
read channel' ? To me, that seems less clear both in dmesg and in code.

And for the record, I didn't want to add the GT thing as message 
'prefix' either. I would rather have just extended the PCI address that 
is printed by the dev level printer. But that would have been a much 
more invasive change at many levels outside of i915. As in, "i915 
0000:00:02.0: [drm]" -> "i915 0000:00:02.0:GT0 [drm]".

And here we are a month down the road and arguing over whether a three 
level deep prefix should have a colon or braces around each level's 
forced message prefix!? Meanwhile, it is still impossible to know which 
GT produced a given error in a multi-GT system and the ability to filter 
out display messages has been completely nacked for reasons I don't 
understand at all before I even got to suggest it.

>> and while for dbg level messages it doesn't matter, I assume we should
>> be consistent for err/warn/info messages (as those will eventually show
>> up to the end user) so let maintainers decide here what is 
>> expectation here
>
> Could we have some examples pasted here, of the end result of this 
> series, for all message "categories" (origins, macros, whatever)?

GT initialisation:
gt_err(gt, "Failed to allocate scratch page\n");
i915 0000:00:02.0: [drm] GT0: Failed to allocate scratch page

G2H notification handler:
guc_err(guc, "notification: Invalid length %u for deregister done\n", len);
i915 0000:00:02.0: [drm] GT0: GuC notification: Invalid length 0 for 
deregister done

CTB initialisation:
ct_probe_error(ct, "Failed to control/%s CTB (%pe)\n", 
str_enable_disable(enable), ERR_PTR(err));
i915 0000:00:02.0: [drm] GT0: GuC CT Failed to control/enable CTB (EINVAL)

Random meaningless (to me) message that is apparently a display thing:
drm_dbg_kms(&dev_priv->drm, "disabling %s\n", pll->info->name);
i915 0000:00:02.0: [drm:intel_disable_shared_dpll [i915]] disabling PORT 
PLL B

I'm sure you can extrapolate to all other forms of dbg, notice, info, 
etc. without me having to manually type each one out, given that they 
are all identical.

Personally, I think the above should be just:
gt_err(gt, "Failed to allocate scratch page\n");
i915 0000:00:02.0: [drm] GT0: Failed to allocate scratch page

gt_err(guc_to_gt(guc), "G2H: Invalid length for deregister done: %u\n", 
len);
i915 0000:00:02.0: [drm] GT0: G2H: Invalid length for deregister done: 0

gt_probe_error(ct_to_gt(ct), "Failed to %s CT %d buffer (%pe)\n", 
str_enable_disable(enable), send ? "SEND" : "RECV", ERR_PTR(err));
i915 0000:00:02.0: [drm] GT0: Failed to enable CT SEND buffer (EINVAL)

drm_dbg_kms(&dev_priv->drm, "disabling %s\n", pll->info->name);
i915 0000:00:02.0: [drm:intel_disable_shared_dpll [i915-KMS]] disabling 
PORT PLL B
But presumably that requires finishing the plan of splitting out the 
display code into a separate driver. So for now, something like this 
would still be a massive improvement:
kms_dbg(dev_priv, "disabling %s\n", pll->info->name);
i915 0000:00:02.0: [drm:intel_disable_shared_dpll [i915]] KMS: disabling 
PORT PLL B

John.

>
> Regards,
>
> Tvrtko