[Intel-gfx] [PATCH 1/2] drm/i915/gt: Add GT oriented dmesg output
John Harrison
john.c.harrison at intel.com
Mon Nov 14 22:14:54 UTC 2022
On 11/10/2022 01:43, Tvrtko Ursulin wrote:
> On 09/11/2022 17:46, John Harrison wrote:
>> On 11/9/2022 03:05, Tvrtko Ursulin wrote:
>>> On 08/11/2022 20:15, John Harrison wrote:
>>>> On 11/8/2022 01:01, Tvrtko Ursulin wrote:
>>>>> On 07/11/2022 19:14, John Harrison wrote:
>>>>>> On 11/7/2022 08:17, Tvrtko Ursulin wrote:
>>>>>>> On 07/11/2022 09:33, Tvrtko Ursulin wrote:
>>>>>>>> On 05/11/2022 01:03, Ceraolo Spurio, Daniele wrote:
>>>>>>>>> On 11/4/2022 10:25 AM, John.C.Harrison at Intel.com wrote:
>>>>>>>>>> From: John Harrison <John.C.Harrison at Intel.com>
>>>>>>>>>>
>>>>>>>>>> When trying to analyse bug reports from CI, customers, etc.
>>>>>>>>>> it can be
>>>>>>>>>> difficult to work out exactly what is happening on which GT in a
>>>>>>>>>> multi-GT system. So add GT oriented debug/error message
>>>>>>>>>> wrappers. If
>>>>>>>>>> used instead of the drm_ equivalents, you get the same output
>>>>>>>>>> but with
>>>>>>>>>> a GT# prefix on it.
>>>>>>>>>>
>>>>>>>>>> Signed-off-by: John Harrison <John.C.Harrison at Intel.com>
>>>>>>>>>
>>>>>>>>> The only downside to this is that we'll print "GT0: " even on
>>>>>>>>> single-GT devices. We could introduce a gt->info.name and
>>>>>>>>> print that, so we could have it different per-platform, but
>>>>>>>>> IMO it's not worth the effort.
>>>>>>>>>
>>>>>>>>> Reviewed-by: Daniele Ceraolo Spurio
>>>>>>>>> <daniele.ceraolospurio at intel.com>
>>>>>>>>>
>>>>>>>>> I think it might be worth getting an ack from one of the
>>>>>>>>> maintainers to make sure we're all aligned on transitioning to
>>>>>>>>> these new logging macro for gt code.
>>>>>>>>
>>>>>>>> Idea is I think a very good one. First I would suggest
>>>>>>>> standardising to lowercase GT in logs because:
>>>>>>>>
>>>>>>>> $ grep "GT%" i915/ -r
>>>>>>>> $ grep "gt%" i915/ -r
>>>>>>>> i915/gt/intel_gt_sysfs.c: gt->i915->sysfs_gt, "gt%d",
>>>>>>>> gt->info.id))
>>>>>>>> i915/gt/intel_gt_sysfs.c: "failed to initialize
>>>>>>>> gt%d sysfs root\n", gt->info.id);
>>>>>>>> i915/gt/intel_gt_sysfs_pm.c: "failed to create gt%u RC6 sysfs
>>>>>>>> files (%pe)\n",
>>>>>>>> i915/gt/intel_gt_sysfs_pm.c: "failed to create gt%u RC6p sysfs
>>>>>>>> files (%pe)\n",
>>>>>>>> i915/gt/intel_gt_sysfs_pm.c: "failed to create gt%u RPS sysfs
>>>>>>>> files (%pe)",
>>>>>>>> i915/gt/intel_gt_sysfs_pm.c: "failed to create gt%u
>>>>>>>> punit_req_freq_mhz sysfs (%pe)",
>>>>>>>> i915/gt/intel_gt_sysfs_pm.c: "failed to create gt%u throttle
>>>>>>>> sysfs files (%pe)",
>>>>>>>> i915/gt/intel_gt_sysfs_pm.c: "failed to create gt%u
>>>>>>>> media_perf_power_attrs sysfs (%pe)\n",
>>>>>>>> i915/gt/intel_gt_sysfs_pm.c: "failed to add gt%u rps defaults
>>>>>>>> (%pe)\n",
>>>>>>>> i915/i915_driver.c: drm_err(>->i915->drm, "gt%d:
>>>>>>>> intel_pcode_init failed %d\n", id, ret);
>>>>>>>> i915/i915_hwmon.c: snprintf(ddat_gt->name,
>>>>>>>> sizeof(ddat_gt->name), "i915_gt%u", i);
>>>>>>>>
>>>>>>
>>>>>> Just because there are 11 existing instances of one form doesn't
>>>>>> mean that the 275 instances that are waiting to be converted
>>>>>> should be done incorrectly. GT is an acronym and should be
>>>>>> capitalised.
>>>>>
>>>>> Okay just make it consistent then.
>>>>>
>>>>>> Besides:
>>>>>> grep -r "GT " i915 | grep '"'
>>>>>> i915/vlv_suspend.c: drm_err(&i915->drm, "timeout disabling GT
>>>>>> waking\n");
>>>>>> i915/vlv_suspend.c: "timeout waiting for GT
>>>>>> wells to go %s\n",
>>>>>> i915/vlv_suspend.c: drm_dbg(&i915->drm, "GT register access
>>>>>> while GT waking disabled\n");
>>>>>> i915/i915_gpu_error.c: err_printf(m, "GT awake: %s\n",
>>>>>> str_yes_no(gt->awake));
>>>>>> i915/i915_debugfs.c: seq_printf(m, "GT awake? %s [%d], %llums\n",
>>>>>> i915/selftests/i915_gem_evict.c: pr_err("Failed to idle GT (on
>>>>>> %s)", engine->name);
>>>>>> i915/intel_uncore.c: "GT thread status wait
>>>>>> timed out\n");
>>>>>> i915/gt/uc/selftest_guc_multi_lrc.c: drm_err(>->i915->drm, "GT
>>>>>> failed to idle: %d\n", ret);
>>>>>> i915/gt/uc/selftest_guc.c: drm_err(>->i915->drm, "GT failed to
>>>>>> idle: %d\n", ret);
>>>>>> i915/gt/uc/selftest_guc.c: drm_err(>->i915->drm, "GT failed to
>>>>>> idle: %d\n", ret);
>>>>>> i915/gt/intel_gt_mcr.c: * Some GT registers are designed as
>>>>>> "multicast" or "replicated" registers:
>>>>>> i915/gt/selftest_rps.c: pr_info("%s: rps counted
>>>>>> %d C0 cycles [%lldns] in %lldns [%d cycles], using GT clock
>>>>>> frequency of %uKHz\n",
>>>>>> i915/gt/selftest_hangcheck.c: pr_err("[%s] GT is wedged!\n",
>>>>>> engine->name);
>>>>>> i915/gt/selftest_hangcheck.c: pr_err("GT is wedged!\n");
>>>>>> i915/gt/intel_gt_clock_utils.c: "GT clock
>>>>>> frequency changed, was %uHz, now %uHz!\n",
>>>>>> i915/gt/selftest_engine_pm.c: pr_err("Unable to flush
>>>>>> GT pm before test\n");
>>>>>> i915/gt/selftest_engine_pm.c: pr_err("GT failed to idle\n");
>>>>>> i915/i915_sysfs.c: "failed to register GT
>>>>>> sysfs directory\n");
>>>>>> i915/intel_uncore.h: * of the basic non-engine GT registers
>>>>>> (referred to as "GSI" on
>>>>>> i915/intel_uncore.h: * newer platforms, or "GT block" on
>>>>>> older platforms)? If so, we'll
>>>>>>
>>>>>>
>>>>>>
>>>>>>>> Then there is a question of naming. Are we okay with GT_XXX or,
>>>>>>>> do we want intel_gt_, or something completely different. I
>>>>>>>> don't have a strong opinion at the moment so I'll add some more
>>>>>>>> folks to Cc.
>>>>>>>
>>>>>> You mean GT_ERR("msg") vs intel_gt_err("msg")? Personally, I
>>>>>> would prefer just gt_err("msg") to keep it as close to the
>>>>>> official drm_* versions as possible. Print lines tend to be
>>>>>> excessively long already. Taking a 'gt' parameter instead of a
>>>>>> '>->i915->drm' parameter does help with that but it seems like
>>>>>> calling the wrapper intel_gt_* is shooting ourselves in the foot
>>>>>> on that one. And GT_ERR vs gt_err just comes down to the fact
>>>>>> that it is a macro wrapper and therefore is required to be in
>>>>>> upper case.
>>>>>>
>>>>>>> There was a maintainer level mini-discussion on this topic which
>>>>>>> I will try to summarise.
>>>>>>>
>>>>>>> Main contention point was the maintenance cost and generally an
>>>>>>> undesirable pattern of needing to add many
>>>>>>> subsystem/component/directory specific macros. Which then
>>>>>>> typically need extra flavours and so on. But over verbosity of the
>>>>>> How many versions are you expecting to add? Beyond the tile
>>>>>> instance, what further addressing requirements are there? The
>>>>>> card instance is already printed as part of the PCI address. The
>>>>>> only other reason to add per component wrappers would be to wrap
>>>>>> the mechanism for getting from some random per component object
>>>>>> back to the intel_gt structure. But that is hardware a new issue
>>>>>> being added by this wrapper. It is also not a requirement. Much
>>>>>> of the code has a gt pointer already. For the parts that don't,
>>>>>> some of it would be a trivial engine->gt type dereference, some
>>>>>> of it is a more complex container_of type construction. But for
>>>>>> those, the given file will already have multiple instances of
>>>>>> that already (usually as the first or second line of the function
>>>>>> - 'intel_gt *gt = fancy_access_method(my_obj)' so adding one or
>>>>>> two more of those as necessary is not making the code harder to
>>>>>> read.
>>>>>>
>>>>>>> code is obviously also bad, so one compromise idea was to add a
>>>>>>> macro which builds the GT string and use drm logging helpers
>>>>>>> directly. This would be something like:
>>>>>>>
>>>>>>> drm_err(GT_LOG("something went wrong ret=%d\n", gt), ret);
>>>>>>> drm_info(GT_LOG(...same...));
>>>>>> Seriously? As above, some of these lines are already way too
>>>>>> long, this version makes them even longer with no obvious
>>>>>> benefit. Worse, it makes it harder to read what is going on. It
>>>>>> is much less intuitive to read than just replacing the drm_err
>>>>>> itself. And having two sets of parenthesis with some parameters
>>>>>> inside the first and some only inside the second is really
>>>>>> horrid! Also, putting the 'gt' parameter in the middle just
>>>>>> confuses it with the rest of the printf arguments even though
>>>>>> there is no %d in the string for it. So now a quick glances tells
>>>>>> you that your code is wrong because you have three format
>>>>>> specifiers but four parameters.
>>>>>>
>>>>>> Whereas, just replacing drm_err with gt_err (or GT_ERR or
>>>>>> intel_gt_err) keeps everything else consistent. The first
>>>>>> parameter changes from 'drm' to 'gt' but is still the master
>>>>>> object parameter and it matches the function/macro prefix so
>>>>>> inherently looks correct. Then you have your message plus
>>>>>> parameters. No confusing orders, no confusing parenthesis, no
>>>>>> excessive macro levels, no confusion at all. Just nice simple,
>>>>>> easy to read, easy to maintain code.
>>>>>
>>>>> I am personally okay with gt_err/GT_ERR some other folks might
>>>>> object though. And I can also understand the argument why it is
>>>>> better to not have to define gt_err, gt_warn, gt_info, gt_notice,
>>>>> gt_debug, gt_err_ratelimited, gt_warn_once.. and instead have only
>>>>> one macro.
>>>> A small set of trivial macro definitions vs a complicated and
>>>> unreadable construct on every single print? Erm, isn't that the
>>>> very definition of abstracting to helpers as generally required by
>>>> every code review ever?
>>>>
>>>> And what 'other folks might object'? People already CC'd? People
>>>> outside of i915?
>>>>
>>>>
>>>>>
>>>>> Because of that I was passing on to you the compromise option.
>>>>>
>>>>> It maybe still has net space savings since we wouldn't have to be
>>>>> repeating the gt->i915->drm whatever and gt->info.id on every line.
>>>>>
>>>>> You are free to try the most compact one and see how hard those
>>>>> objections will be.
>>>> Um. I already did. This patch. And you are the only person to have
>>>> objected in any manner at all.
>>>
>>> Where I have objected?
>> Only in everything you have written. Or are you saying that actually I
>
> ???
>
>> could have just taken the r-b from Daniele and merged it while
>> completely ignoring everything you have said because you didn't say
>> 'NACK' at the top of your email?
>
> No, but because we don't do that while there is an ongoing discussion.
Which is exactly my point. If it is not an ack or r-b then by definition
it is an objection.
>
>>> I was a) asking to convert all gt/ within one kernel release, b)
>>> transferring the maintainer discussion from IRC to this email chain
>>> to outlay one alternative, for which I said I could see the pros and
>>> cons of both, and c) raised the naming question early since that can
>>> usually become a churn point later on when we have large scale code
>>> transformations.
>>>
>>> As said, FWIW you have my ack for GT_XXX naming and approach, but
>>> please do convert the whole of gt/ so we don't ship with a mish-mash
>>> of log messages.
>>>
>> That sounds like a quite a lot of objections to me - don't do it that
>> way, do it this way; fine do it that way but expect lots of
>> complaints when you do; oh all right, do it that way but re-write the
>> entire driver all at once.
>
> No, I'll repeat what I said one last time.
>
> I said it's good idea, raised the topic of naming, since it is good to
> get consensus before starting a wide renaming exercise. I also chatted
> with other maintainers and passed on here one alternative proposed,
> with its pros and cons. Finally I said your initial approach has my
> ack at least, even though not everyone liked it. And I asked to
> convert all gt/, since I think it is a good idea, and frankly it's not
> a very big ask. It is just how it works.
>
>> I really don't get the 'we must not ship with a mish-mash of log
>> messages' argument. It's hardly like the current total mish-mash of
>> DRM_DEBUG, DRM_DEBUG_DRIVER, pr_dbg, drm_dbg, ... is even remotely
>> consistent. It's also not like the patch is massively changing the
>> format of the output messages. So some prints get a useful extra
>> three characters in advance of others. Does anyone care? Would anyone
>> except us even notice?
>>
>> Is it really a problem to merge this patch now to get the process
>> started? And other sub-components get updated as and when people get
>> the time to do them? You could maybe even help rather than posting
>
> It can be a problem. You opened up a topic which leads to an overall
> improvement and if we merge churn before we decide on the final
> direction, then we just get more churn later.
>
>> completely conflicting patch sets that basically duplicate all the
>> effort for no actual benefit.
>
> Completely conflicting?
>
> Yours:
>
> drivers/gpu/drm/i915/gt/uc/intel_guc.c | 25 +++--
> drivers/gpu/drm/i915/gt/uc/intel_guc_ads.c | 9 +-
> .../gpu/drm/i915/gt/uc/intel_guc_capture.c | 50 ++++------
> drivers/gpu/drm/i915/gt/uc/intel_guc_ct.c | 9 +-
> drivers/gpu/drm/i915/gt/uc/intel_guc_fw.c | 17 ++--
> drivers/gpu/drm/i915/gt/uc/intel_guc_log.c | 49 +++++-----
> drivers/gpu/drm/i915/gt/uc/intel_guc_rc.c | 3 +-
> drivers/gpu/drm/i915/gt/uc/intel_guc_slpc.c | 6 +-
> .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 56 ++++++------
> drivers/gpu/drm/i915/gt/uc/intel_huc.c | 20 ++--
> drivers/gpu/drm/i915/gt/uc/intel_uc.c | 84 ++++++++---------
> drivers/gpu/drm/i915/gt/uc/intel_uc_fw.c | 91 +++++++++----------
> drivers/gpu/drm/i915/gt/uc/selftest_guc.c | 36 ++++----
> .../drm/i915/gt/uc/selftest_guc_hangcheck.c | 22 ++---
> .../drm/i915/gt/uc/selftest_guc_multi_lrc.c | 10 +-
> 15 files changed, 228 insertions(+), 259 deletions(-)
>
> Mine:
>
> drivers/gpu/drm/i915/gem/i915_gem_context.c | 2 +-
> .../gpu/drm/i915/gem/i915_gem_execbuffer.c | 23 ++++++----
> .../drm/i915/gt/intel_execlists_submission.c | 13 +++---
> drivers/gpu/drm/i915/gt/intel_ggtt_fencing.c | 4 +-
> drivers/gpu/drm/i915/gt/intel_gt.c | 4 +-
> drivers/gpu/drm/i915/gt/intel_gt_irq.c | 8 ++--
> drivers/gpu/drm/i915/gt/intel_rps.c | 6 ++-
> drivers/gpu/drm/i915/gt/intel_workarounds.c | 43 +++++++++++--------
> .../gpu/drm/i915/gt/intel_workarounds_types.h | 4 ++
> .../gpu/drm/i915/gt/selftest_workarounds.c | 4 +-
> drivers/gpu/drm/i915/i915_debugfs.c | 4 +-
> drivers/gpu/drm/i915/i915_gem.c | 2 +-
> drivers/gpu/drm/i915/i915_getparam.c | 2 +-
> drivers/gpu/drm/i915/i915_irq.c | 12 +++---
> drivers/gpu/drm/i915/i915_perf.c | 14 +++---
> drivers/gpu/drm/i915/i915_query.c | 12 +++---
> drivers/gpu/drm/i915/i915_sysfs.c | 3 +-
> drivers/gpu/drm/i915/i915_vma.c | 16 ++++---
> drivers/gpu/drm/i915/intel_uncore.c | 21 +++++----
> 19 files changed, 116 insertions(+), 81 deletions(-)
>
> I deliberately did not touch gt/uc so they don't touch a single same
> file!
>
> Unless you are referring to the fact gt logging will need to touch
> some of the same lines my patch touches. Well yes, some, not all, and
> I reckoned it will be much quicker since there aren't really any
> contentious points with DRM_XXX -> drm_xxx. Plus I added struct
> intel_gt or whatever or it wasn't before so it would make your patch
> easier/smaller.
Except that it is quite clearly a change that is 'still under
discussion' and is quite clearly changing exactly the same lines of code
in a number of places. Therefore, by your definition, your patch set
cannot be merged because it is causing 'churn'. Alternatively, your set
can be merged because we don't actually care that a few lines will get
changed one way and then quickly changed another way in a follow up
patch and the overall benefit is worth the 'churn'. In which case
exactly the same rule should apply to this patch set. Merge something
that adds immediate benefit and if necessary update it to a better
version later.
Of course, there is also the view that you have just taken the task over
by posting a more encompassing change that cleans up debug prints in the
whole of i915. Therefore it is now up to you to update all
sub-components to use appropriate prints - GT, KMS, GuC, Gem, etc..
because "frankly it's not a very big ask. It is just how it works".
John.
>
> Regards,
>
> Tvrtko
More information about the Intel-gfx
mailing list