[Intel-xe] [PATCH v3 1/3] drm/xe: Store xe_he_engine in xe_hw_engine_snapshot

Fri Oct 13 20:40:49 UTC 2023

On Thu, Oct 12, 2023 at 10:30:36AM -0700, Matt Roper wrote:
> On Thu, Oct 12, 2023 at 10:20:33AM -0700, Souza, Jose wrote:
> > On Thu, 2023-10-12 at 10:16 -0700, Matt Roper wrote:
> > > On Tue, Oct 10, 2023 at 11:41:49AM -0700, José Roberto de Souza wrote:
> > > > A future patch will require gt and xe device structs, so here
> > > > replacing class by hwe.
> > > > 
> > > > Cc: Rodrigo Vivi <rodrigo.vivi at intel.com>
> > > > Cc: Matt Roper <matthew.d.roper at intel.com>
> > > > Signed-off-by: José Roberto de Souza <jose.souza at intel.com>
> > > > ---
> > > >  drivers/gpu/drm/xe/xe_hw_engine.c       | 6 +++---
> > > >  drivers/gpu/drm/xe/xe_hw_engine_types.h | 4 ++--
> > > >  2 files changed, 5 insertions(+), 5 deletions(-)
> > > > 
> > > > diff --git a/drivers/gpu/drm/xe/xe_hw_engine.c b/drivers/gpu/drm/xe/xe_hw_engine.c
> > > > index b5b0845908887..37b3ee1268090 100644
> > > > --- a/drivers/gpu/drm/xe/xe_hw_engine.c
> > > > +++ b/drivers/gpu/drm/xe/xe_hw_engine.c
> > > > @@ -684,7 +684,7 @@ xe_hw_engine_snapshot_capture(struct xe_hw_engine *hwe)
> > > >  	if (snapshot->name)
> > > >  		strscpy(snapshot->name, hwe->name, len);
> > > >  
> > > > -	snapshot->class = hwe->class;
> > > > +	snapshot->hwe = hwe;
> > > 
> > > I haven't really looked into the devcoredump stuff much, but it's not
> > > clear to me from a really quick skim whether it's safe to save away the
> > > hwe like this.  Is there ever a case where the snapshot can persist
> > > after the engine itself has been destroyed (e.g., if we capture an error
> > > encountered during a device probe that results in the device being torn
> > > back down)?  If so, then you'd probably need to make sure that the hwe
> > > field here only gets used during the capture, and the 'print' routine
> > > would still need a cached snapshot->class to avoid use-after-free
> > > errors.

This is very real! We need to cache everything we can and consider the
device itself is dead at print. But here we are at the capture no?!

> > 
> > Xe devcoredump dump is also removed with Xe module, so no problems with this.
> 
> With the module?  Or the device?  Even if probe fails on a single
> device, the module may remain loaded (and might even still be driving
> other GPUs).  If the dumps for failed device(s) are still accessible,
> there might be a problem?

Well, this is a bad devcoredump limitation at this moment. It doesn't go
away with the device or with the module. It blocks for 5 seconds before
it is gone. I have couple patches [1] to suggest to devcoredump itself so the
driver can also decide when the dump device is deleted. But I'm trying to
get us first in upstream for a proper handling of that.

[1] https://gitlab.freedesktop.org/rodrigovivi/drm-xe/-/commits/devcoredump-removal/

> 
> 
> Matt
> 
> > 
> > > 
> > > Rodrigo probably has better insight as to whether that's something that
> > > we need to worry about.
> > > 
> > > 
> > > Matt
> > > 
> > > >  	snapshot->logical_instance = hwe->logical_instance;
> > > >  	snapshot->forcewake.domain = hwe->domain;
> > > >  	snapshot->forcewake.ref = xe_force_wake_ref(gt_to_fw(hwe->gt),
> > > > @@ -731,7 +731,7 @@ xe_hw_engine_snapshot_capture(struct xe_hw_engine *hwe)
> > > >  		hw_engine_mmio_read32(hwe, RING_DMA_FADD(0));
> > > >  	snapshot->reg.ipehr = hw_engine_mmio_read32(hwe, RING_IPEHR(0));
> > > >  
> > > > -	if (snapshot->class == XE_ENGINE_CLASS_COMPUTE)
> > > > +	if (snapshot->hwe->class == XE_ENGINE_CLASS_COMPUTE)
> > > >  		snapshot->reg.rcu_mode = xe_mmio_read32(hwe->gt, RCU_MODE);
> > > >  
> > > >  	return snapshot;
> > > > @@ -786,7 +786,7 @@ void xe_hw_engine_snapshot_print(struct xe_hw_engine_snapshot *snapshot,
> > > >  		   snapshot->reg.ring_dma_fadd_udw,
> > > >  		   snapshot->reg.ring_dma_fadd);
> > > >  	drm_printf(p, "\tIPEHR: 0x%08x\n\n", snapshot->reg.ipehr);
> > > > -	if (snapshot->class == XE_ENGINE_CLASS_COMPUTE)
> > > > +	if (snapshot->hwe->class == XE_ENGINE_CLASS_COMPUTE)
> > > >  		drm_printf(p, "\tRCU_MODE: 0x%08x\n",
> > > >  			   snapshot->reg.rcu_mode);
> > > >  }
> > > > diff --git a/drivers/gpu/drm/xe/xe_hw_engine_types.h b/drivers/gpu/drm/xe/xe_hw_engine_types.h
> > > > index 5d4ee29042407..71c0a0169e62a 100644
> > > > --- a/drivers/gpu/drm/xe/xe_hw_engine_types.h
> > > > +++ b/drivers/gpu/drm/xe/xe_hw_engine_types.h
> > > > @@ -156,8 +156,8 @@ struct xe_hw_engine {
> > > >  struct xe_hw_engine_snapshot {
> > > >  	/** @name: name of the hw engine */
> > > >  	char *name;
> > > > -	/** @class: class of this hw engine */
> > > > -	enum xe_engine_class class;
> > > > +	/** @hwe: hw engine */
> > > > +	struct xe_hw_engine *hwe;
> > > >  	/** @logical_instance: logical instance of this hw engine */
> > > >  	u16 logical_instance;
> > > >  	/** @forcewake: Force Wake information snapshot */
> > > > -- 
> > > > 2.42.0
> > > > 
> > > 
> > 
> 
> -- 
> Matt Roper
> Graphics Software Engineer
> Linux GPU Platform Enablement
> Intel Corporation