[PATCH v9 03/11] drm/xe/devcoredump: Improve section headings and add tile info

Souza, Jose jose.souza at intel.com
Thu Dec 12 20:30:59 UTC 2024


On Thu, 2024-12-12 at 12:06 -0800, John Harrison wrote:
> On 12/12/2024 11:31, Souza, Jose wrote:
> > On Thu, 2024-12-12 at 10:59 -0800, John Harrison wrote:
> > > On 12/12/2024 10:17, Souza, Jose wrote:
> > > > On Wed, 2024-10-02 at 17:46 -0700, John.C.Harrison at Intel.com wrote:
> > > > > From: John Harrison <John.C.Harrison at Intel.com>
> > > > > 
> > > > > The xe_guc_exec_queue_snapshot is not really a GuC internal thing and
> > > > > is definitely not a GuC CT thing. So give it its own section heading.
> > > > > The snapshot itself is really a capture of the submission backend's
> > > > > internal state. Although all it currently prints out is the submission
> > > > > contexts. So label it as 'Contexts'. If more general state is added
> > > > > later then it could be changed to 'Submission backend' or some such.
> > > > > 
> > > > > Further, everything from the GuC CT section onwards is GT specific but
> > > > > there was no indication of which GT it was related to (and that is
> > > > > impossible to work out from the other fields that are given). So add a
> > > > > GT section heading. Also include the tile id of the GT, because again
> > > > > that is significant information.
> > > > > 
> > > > > Lastly, drop a couple of unnecessary line feeds within sections.
> > > > > 
> > > > > v2: Add GT section heading, add tile id to device section.
> > > > > 
> > > > > Signed-off-by: John Harrison <John.C.Harrison at Intel.com>
> > > > > Reviewed-by: Julia Filipchuk <julia.filipchuk at intel.com>
> > > > > ---
> > > > >    drivers/gpu/drm/xe/xe_devcoredump.c       | 5 +++++
> > > > >    drivers/gpu/drm/xe/xe_devcoredump_types.h | 3 ++-
> > > > >    drivers/gpu/drm/xe/xe_device.c            | 1 +
> > > > >    drivers/gpu/drm/xe/xe_guc_submit.c        | 2 +-
> > > > >    drivers/gpu/drm/xe/xe_hw_engine.c         | 1 -
> > > > >    5 files changed, 9 insertions(+), 3 deletions(-)
> > > > > 
> > > > > diff --git a/drivers/gpu/drm/xe/xe_devcoredump.c b/drivers/gpu/drm/xe/xe_devcoredump.c
> > > > > index d23719d5c2a3..2690f1d1cde4 100644
> > > > > --- a/drivers/gpu/drm/xe/xe_devcoredump.c
> > > > > +++ b/drivers/gpu/drm/xe/xe_devcoredump.c
> > > > > @@ -96,8 +96,13 @@ static ssize_t __xe_devcoredump_read(char *buffer, size_t count,
> > > > >    	drm_printf(&p, "Process: %s\n", ss->process_name);
> > > > >    	xe_device_snapshot_print(xe, &p);
> > > > >    
> > > > > +	drm_printf(&p, "\n**** GT #%d ****\n", ss->gt->info.id);
> > > > > +	drm_printf(&p, "\tTile: %d\n", ss->gt->tile->id);
> > > > > +
> > > > >    	drm_puts(&p, "\n**** GuC CT ****\n");
> > > > >    	xe_guc_ct_snapshot_print(ss->ct, &p);
> > > > > +
> > > > > +	drm_puts(&p, "\n**** Contexts ****\n");
> > > > >    	xe_guc_exec_queue_snapshot_print(ss->ge, &p);
> > > > This broke the Mesa parser!
> > > > It can no longer parse the exec_queue context, which it expects to find in the '**** GuC CT ****' section.
> > > Then the Mesa parser needs to be updated. That was clearly a bug - exec
> > > queue contexts are absolutely not GuC CT data and should not be in the
> > > GuC CT section.
> > It doesn't matter whether it is a bug or not, it broke the parser.
> > If this is not reverted we will have older kernel versions that don't work with newer Mesa and newer kernel versions that don't work with older Mesa.
> Debug tools cannot count as UAPI that must never change.

That is not my understanding from previous threads.

Imagine that a big customer files a bug with us and attaches the devcoredump from an older kernel version.
The devcoredump parser will not work. If the developer is aware of this "contract" break, he can check out an older UMD version, build it, and then
parse the devcoredump. Then he has to check out the main/master branch again and work on the fix... Not viable at all.

At least UMD teams should be notified. At the moment Mesa debugging is blocked because of these patches.
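
For illustration only, here is a minimal sketch of how a decoder could tolerate both layouts, accepting exec queue entries under either the '**** GuC CT ****' heading (older kernels) or the '**** Contexts ****' heading (kernels with this patch). This is not Mesa's actual parser: the enum, the function names, and the 256-byte line buffer are all hypothetical, and it assumes that any "GuC ID: <n>" line inside those two sections starts an exec queue entry, as the drm_printf() calls in the patch suggest.

```c
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical section identifiers; the names are illustrative only. */
enum dump_section {
	SECTION_UNKNOWN,
	SECTION_GUC_CT,
	SECTION_CONTEXTS,
	SECTION_OTHER,
};

/* Map a "**** name ****" heading to a section; return false for non-headings. */
static bool parse_heading(const char *line, enum dump_section *out)
{
	if (strncmp(line, "**** ", 5) != 0)
		return false;

	if (strcmp(line, "**** GuC CT ****") == 0)
		*out = SECTION_GUC_CT;
	else if (strcmp(line, "**** Contexts ****") == 0)
		*out = SECTION_CONTEXTS;
	else
		*out = SECTION_OTHER;
	return true;
}

/*
 * Count exec queue entries ("GuC ID: <n>" lines) in a devcoredump text.
 * Entries are accepted in the GuC CT section (old layout) and in the
 * Contexts section (new layout); the same text elsewhere is ignored.
 */
static int count_exec_queues(const char *dump)
{
	enum dump_section section = SECTION_UNKNOWN;
	int count = 0;

	while (*dump) {
		const char *eol = strchr(dump, '\n');
		size_t len = eol ? (size_t)(eol - dump) : strlen(dump);
		char line[256];	/* assumed large enough for the lines of interest */
		int id;

		/* Copy one line (truncated if too long) so it can be matched. */
		if (len >= sizeof(line))
			len = sizeof(line) - 1;
		memcpy(line, dump, len);
		line[len] = '\0';

		if (!parse_heading(line, &section) &&
		    (section == SECTION_GUC_CT || section == SECTION_CONTEXTS) &&
		    sscanf(line, "GuC ID: %d", &id) == 1)
			count++;

		/* Advance past the newline, or to the end of the text. */
		dump = eol ? eol + 1 : dump + strlen(dump);
	}

	return count;
}
```

Keying on the entry line rather than on a single heading lets one decoder build read dumps from both kernel generations, at the cost of having to extend the list of accepted sections whenever the layout changes again.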

> 
> The devcoredump contains much information that is essentially the 
> internals of the kernel. It is going to change. That is about the only 
> guarantee that we can make about it. And saying that we must 
> intentionally break the output of a developer only debug feature in 
> order to support older mesa is plain wrong. End users do not care about 
> debug tools. All user applications will still work just perfectly.
> 
> We can start adding version numbers to the devcoredump format if we 
> really need to. But that was already shot down as a bad idea. It is 
> debug information and not UAPI. So version incompatibilities are 
> expected from time to time.
> 
> John.
> 
> 
> > 
> > > John.
> > > 
> > > > >    
> > > > >    	drm_puts(&p, "\n**** Job ****\n");
> > > > > diff --git a/drivers/gpu/drm/xe/xe_devcoredump_types.h b/drivers/gpu/drm/xe/xe_devcoredump_types.h
> > > > > index 440d05d77a5a..3cc2f095fdfb 100644
> > > > > --- a/drivers/gpu/drm/xe/xe_devcoredump_types.h
> > > > > +++ b/drivers/gpu/drm/xe/xe_devcoredump_types.h
> > > > > @@ -37,7 +37,8 @@ struct xe_devcoredump_snapshot {
> > > > >    	/* GuC snapshots */
> > > > >    	/** @ct: GuC CT snapshot */
> > > > >    	struct xe_guc_ct_snapshot *ct;
> > > > > -	/** @ge: Guc Engine snapshot */
> > > > > +
> > > > > +	/** @ge: GuC Submission Engine snapshot */
> > > > >    	struct xe_guc_submit_exec_queue_snapshot *ge;
> > > > >    
> > > > >    	/** @hwe: HW Engine snapshot array */
> > > > > diff --git a/drivers/gpu/drm/xe/xe_device.c b/drivers/gpu/drm/xe/xe_device.c
> > > > > index 09a7ad830e69..030cf703e970 100644
> > > > > --- a/drivers/gpu/drm/xe/xe_device.c
> > > > > +++ b/drivers/gpu/drm/xe/xe_device.c
> > > > > @@ -961,6 +961,7 @@ void xe_device_snapshot_print(struct xe_device *xe, struct drm_printer *p)
> > > > >    
> > > > >    	for_each_gt(gt, xe, id) {
> > > > >    		drm_printf(p, "GT id: %u\n", id);
> > > > > +		drm_printf(p, "\tTile: %u\n", gt->tile->id);
> > > > >    		drm_printf(p, "\tType: %s\n",
> > > > >    			   gt->info.type == XE_GT_TYPE_MAIN ? "main" : "media");
> > > > >    		drm_printf(p, "\tIP ver: %u.%u.%u\n",
> > > > > diff --git a/drivers/gpu/drm/xe/xe_guc_submit.c b/drivers/gpu/drm/xe/xe_guc_submit.c
> > > > > index 0ac4a19ec9cc..8690df699170 100644
> > > > > --- a/drivers/gpu/drm/xe/xe_guc_submit.c
> > > > > +++ b/drivers/gpu/drm/xe/xe_guc_submit.c
> > > > > @@ -2240,7 +2240,7 @@ xe_guc_exec_queue_snapshot_print(struct xe_guc_submit_exec_queue_snapshot *snaps
> > > > >    	if (!snapshot)
> > > > >    		return;
> > > > >    
> > > > > -	drm_printf(p, "\nGuC ID: %d\n", snapshot->guc.id);
> > > > > +	drm_printf(p, "GuC ID: %d\n", snapshot->guc.id);
> > > > >    	drm_printf(p, "\tName: %s\n", snapshot->name);
> > > > >    	drm_printf(p, "\tClass: %d\n", snapshot->class);
> > > > >    	drm_printf(p, "\tLogical mask: 0x%x\n", snapshot->logical_mask);
> > > > > diff --git a/drivers/gpu/drm/xe/xe_hw_engine.c b/drivers/gpu/drm/xe/xe_hw_engine.c
> > > > > index ea6d9ef7fab6..6c9c27304cdc 100644
> > > > > --- a/drivers/gpu/drm/xe/xe_hw_engine.c
> > > > > +++ b/drivers/gpu/drm/xe/xe_hw_engine.c
> > > > > @@ -1084,7 +1084,6 @@ void xe_hw_engine_snapshot_print(struct xe_hw_engine_snapshot *snapshot,
> > > > >    	if (snapshot->hwe->class == XE_ENGINE_CLASS_COMPUTE)
> > > > >    		drm_printf(p, "\tRCU_MODE: 0x%08x\n",
> > > > >    			   snapshot->reg.rcu_mode);
> > > > > -	drm_puts(p, "\n");
> > > > >    }
> > > > >    
> > > > >    /**
> 


