[PATCH v9 03/11] drm/xe/devcoredump: Improve section headings and add tile info
John Harrison
john.c.harrison at intel.com
Thu Dec 12 20:38:06 UTC 2024
+Matthew B
On 12/12/2024 12:30, Souza, Jose wrote:
> On Thu, 2024-12-12 at 12:06 -0800, John Harrison wrote:
>> On 12/12/2024 11:31, Souza, Jose wrote:
>>> On Thu, 2024-12-12 at 10:59 -0800, John Harrison wrote:
>>>> On 12/12/2024 10:17, Souza, Jose wrote:
>>>>> On Wed, 2024-10-02 at 17:46 -0700, John.C.Harrison at Intel.com wrote:
>>>>>> From: John Harrison <John.C.Harrison at Intel.com>
>>>>>>
>>>>>> The xe_guc_exec_queue_snapshot is not really a GuC internal thing and
>>>>>> is definitely not a GuC CT thing. So give it its own section heading.
>>>>>> The snapshot itself is really a capture of the submission backend's
>>>>>> internal state. Although all it currently prints out is the submission
>>>>>> contexts. So label it as 'Contexts'. If more general state is added
>>>>>> later then it could be changed to 'Submission backend' or some such.
>>>>>>
>>>>>> Further, everything from the GuC CT section onwards is GT specific but
>>>>>> there was no indication of which GT it was related to (and that is
>>>>>> impossible to work out from the other fields that are given). So add a
>>>>>> GT section heading. Also include the tile id of the GT, because that is,
>>>>>> again, significant information.
>>>>>>
>>>>>> Lastly, drop a couple of unnecessary line feeds within sections.
>>>>>>
>>>>>> v2: Add GT section heading, add tile id to device section.
>>>>>>
>>>>>> Signed-off-by: John Harrison <John.C.Harrison at Intel.com>
>>>>>> Reviewed-by: Julia Filipchuk <julia.filipchuk at intel.com>
>>>>>> ---
>>>>>> drivers/gpu/drm/xe/xe_devcoredump.c | 5 +++++
>>>>>> drivers/gpu/drm/xe/xe_devcoredump_types.h | 3 ++-
>>>>>> drivers/gpu/drm/xe/xe_device.c | 1 +
>>>>>> drivers/gpu/drm/xe/xe_guc_submit.c | 2 +-
>>>>>> drivers/gpu/drm/xe/xe_hw_engine.c | 1 -
>>>>>> 5 files changed, 9 insertions(+), 3 deletions(-)
>>>>>>
>>>>>> diff --git a/drivers/gpu/drm/xe/xe_devcoredump.c b/drivers/gpu/drm/xe/xe_devcoredump.c
>>>>>> index d23719d5c2a3..2690f1d1cde4 100644
>>>>>> --- a/drivers/gpu/drm/xe/xe_devcoredump.c
>>>>>> +++ b/drivers/gpu/drm/xe/xe_devcoredump.c
>>>>>> @@ -96,8 +96,13 @@ static ssize_t __xe_devcoredump_read(char *buffer, size_t count,
>>>>>> drm_printf(&p, "Process: %s\n", ss->process_name);
>>>>>> xe_device_snapshot_print(xe, &p);
>>>>>>
>>>>>> + drm_printf(&p, "\n**** GT #%d ****\n", ss->gt->info.id);
>>>>>> + drm_printf(&p, "\tTile: %d\n", ss->gt->tile->id);
>>>>>> +
>>>>>> drm_puts(&p, "\n**** GuC CT ****\n");
>>>>>> xe_guc_ct_snapshot_print(ss->ct, &p);
>>>>>> +
>>>>>> + drm_puts(&p, "\n**** Contexts ****\n");
>>>>>> xe_guc_exec_queue_snapshot_print(ss->ge, &p);
>>>>> This broke Mesa parser!
>>>>> It can't now parse the exec_queue context because it was expected to be on the '**** GuC CT ****' section.
>>>> Then the Mesa parser needs to be updated. That was clearly a bug - exec
>>>> queue contexts are absolutely not GuC CT data and should not be in the
>>>> GuC CT section.
>>> It doesn't matter whether it is a bug or not; it broke the parser.
>>> If this is not reverted we will have older kernel versions that don't work with newer Mesa and newer kernel versions that don't work with old Mesa.
>> Debug tools cannot count as UAPI that must never change.
> That is not my understanding from previous threads.
>
> Imagine that a big customer files a bug with us and attaches the devcoredump from an older kernel version.
> The devcoredump parser will not work. If the developer is aware of this "contract" break, he can check out an older UMD version, build it, and then parse
> the devcoredump, then check out the main/master branch again and work on the fix... Not viable at all.
>
> At least UMD teams should be notified. At the moment Mesa debugging is blocked because of these patches.
The alternative is we can never update the devcoredump output to add new
information, remove old entries that no longer make sense due to driver
re-work, etc.? That is even less viable.
For this particular issue, the fix is presumably trivial. The Mesa tool
can be updated to look under either the old, incorrect section header (I'm
assuming the exec queue info was actually just not in any section at all
previously, because it was never really part of the GuC CT info) or
the new, correct one. Then the new build will work on both the current
kernel and the old one.
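A minimal sketch of that fallback, written in Python purely for illustration. It is not Mesa's actual parser; the function name `find_guc_ids` and the sample dump lines in the usage note are hypothetical. It relies only on the `**** ... ****` headings and the `GuC ID: %d` line visible in the patch, and accepts exec queue entries under either the old 'GuC CT' heading or the new 'Contexts' heading:

```python
import re

# Section headings in the dump look like "**** GuC CT ****".
SECTION_RE = re.compile(r"^\*\*\*\* (.+?) \*\*\*\*$")

# Old kernels print exec queue entries after the GuC CT heading;
# kernels with this patch print them under a Contexts heading.
CONTEXT_SECTIONS = {"GuC CT", "Contexts"}

GUC_ID_PREFIX = "GuC ID: "


def find_guc_ids(dump_text):
    """Return the GuC IDs of exec queue entries, for either dump layout."""
    section = None
    ids = []
    for line in dump_text.splitlines():
        match = SECTION_RE.match(line.strip())
        if match:
            # Remember which section the following lines belong to.
            section = match.group(1)
            continue
        if section in CONTEXT_SECTIONS and line.startswith(GUC_ID_PREFIX):
            ids.append(int(line[len(GUC_ID_PREFIX):]))
    return ids
```

Called on a dump from either kernel, for example `find_guc_ids("**** Contexts ****\nGuC ID: 3\n")`, the function collects the same entries, so a single build of the tool would not need to know which kernel produced the file.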
Going forwards, as I said, we can start adding format version numbers
but my memory is that was argued against.
John.
>
>> The devcoredump contains much information that is essentially the
>> internals of the kernel. It is going to change. That is about the only
>> guarantee that we can make about it. And saying that we must
>> intentionally break the output of a developer-only debug feature in
>> order to support older mesa is plain wrong. End users do not care about
>> debug tools. All user applications will still work just perfectly.
>>
>> We can start adding version numbers to the devcoredump format if we
>> really need to. But that was already shot down as a bad idea. It is
>> debug information and not UAPI. So version incompatibilities are
>> expected from time to time.
>>
>> John.
>>
>>
>>>> John.
>>>>
>>>>>>
>>>>>> drm_puts(&p, "\n**** Job ****\n");
>>>>>> diff --git a/drivers/gpu/drm/xe/xe_devcoredump_types.h b/drivers/gpu/drm/xe/xe_devcoredump_types.h
>>>>>> index 440d05d77a5a..3cc2f095fdfb 100644
>>>>>> --- a/drivers/gpu/drm/xe/xe_devcoredump_types.h
>>>>>> +++ b/drivers/gpu/drm/xe/xe_devcoredump_types.h
>>>>>> @@ -37,7 +37,8 @@ struct xe_devcoredump_snapshot {
>>>>>> /* GuC snapshots */
>>>>>> /** @ct: GuC CT snapshot */
>>>>>> struct xe_guc_ct_snapshot *ct;
>>>>>> - /** @ge: Guc Engine snapshot */
>>>>>> +
>>>>>> + /** @ge: GuC Submission Engine snapshot */
>>>>>> struct xe_guc_submit_exec_queue_snapshot *ge;
>>>>>>
>>>>>> /** @hwe: HW Engine snapshot array */
>>>>>> diff --git a/drivers/gpu/drm/xe/xe_device.c b/drivers/gpu/drm/xe/xe_device.c
>>>>>> index 09a7ad830e69..030cf703e970 100644
>>>>>> --- a/drivers/gpu/drm/xe/xe_device.c
>>>>>> +++ b/drivers/gpu/drm/xe/xe_device.c
>>>>>> @@ -961,6 +961,7 @@ void xe_device_snapshot_print(struct xe_device *xe, struct drm_printer *p)
>>>>>>
>>>>>> for_each_gt(gt, xe, id) {
>>>>>> drm_printf(p, "GT id: %u\n", id);
>>>>>> + drm_printf(p, "\tTile: %u\n", gt->tile->id);
>>>>>> drm_printf(p, "\tType: %s\n",
>>>>>> gt->info.type == XE_GT_TYPE_MAIN ? "main" : "media");
>>>>>> drm_printf(p, "\tIP ver: %u.%u.%u\n",
>>>>>> diff --git a/drivers/gpu/drm/xe/xe_guc_submit.c b/drivers/gpu/drm/xe/xe_guc_submit.c
>>>>>> index 0ac4a19ec9cc..8690df699170 100644
>>>>>> --- a/drivers/gpu/drm/xe/xe_guc_submit.c
>>>>>> +++ b/drivers/gpu/drm/xe/xe_guc_submit.c
>>>>>> @@ -2240,7 +2240,7 @@ xe_guc_exec_queue_snapshot_print(struct xe_guc_submit_exec_queue_snapshot *snaps
>>>>>> if (!snapshot)
>>>>>> return;
>>>>>>
>>>>>> - drm_printf(p, "\nGuC ID: %d\n", snapshot->guc.id);
>>>>>> + drm_printf(p, "GuC ID: %d\n", snapshot->guc.id);
>>>>>> drm_printf(p, "\tName: %s\n", snapshot->name);
>>>>>> drm_printf(p, "\tClass: %d\n", snapshot->class);
>>>>>> drm_printf(p, "\tLogical mask: 0x%x\n", snapshot->logical_mask);
>>>>>> diff --git a/drivers/gpu/drm/xe/xe_hw_engine.c b/drivers/gpu/drm/xe/xe_hw_engine.c
>>>>>> index ea6d9ef7fab6..6c9c27304cdc 100644
>>>>>> --- a/drivers/gpu/drm/xe/xe_hw_engine.c
>>>>>> +++ b/drivers/gpu/drm/xe/xe_hw_engine.c
>>>>>> @@ -1084,7 +1084,6 @@ void xe_hw_engine_snapshot_print(struct xe_hw_engine_snapshot *snapshot,
>>>>>> if (snapshot->hwe->class == XE_ENGINE_CLASS_COMPUTE)
>>>>>> drm_printf(p, "\tRCU_MODE: 0x%08x\n",
>>>>>> snapshot->reg.rcu_mode);
>>>>>> - drm_puts(p, "\n");
>>>>>> }
>>>>>>
>>>>>> /**