[Intel-xe] [PATCH 03/14] drm/xe: Do not take any action if our device was removed.

Tue May 2 23:06:02 UTC 2023

On Tue, May 02, 2023 at 01:21:43PM -0400, Rodrigo Vivi wrote:
> On Tue, May 02, 2023 at 03:40:50PM +0000, Matthew Brost wrote:
> > On Wed, Apr 26, 2023 at 04:57:02PM -0400, Rodrigo Vivi wrote:
> > > Unfortunately devcoredump infrastructure does not provide and
> > > interface for us to force the device removal upon the pci_remove
> > > time of our device.
> > > 
> > > The devcoredump is linked at the device level, so when in use
> > > it will prevent the module removal, but it doesn't prevent the
> > > call of the pci_remove callback. This callback cannot fail
> > > anyway and we end up clearing and freeing the entire pci device.
> > > 
> > > Hence, after we removed the pci device, we shouldn't allow any
> > > read or free operations to avoid segmentation fault.
> > > 
> > > Signed-off-by: Rodrigo Vivi <rodrigo.vivi at intel.com>
> > > ---
> > >  drivers/gpu/drm/xe/xe_devcoredump.c | 19 ++++++++++++++++---
> > >  1 file changed, 16 insertions(+), 3 deletions(-)
> > > 
> > > diff --git a/drivers/gpu/drm/xe/xe_devcoredump.c b/drivers/gpu/drm/xe/xe_devcoredump.c
> > > index d9531183f03a..a08929c01b75 100644
> > > --- a/drivers/gpu/drm/xe/xe_devcoredump.c
> > > +++ b/drivers/gpu/drm/xe/xe_devcoredump.c
> > > @@ -42,6 +42,11 @@
> > >   * hang capture.
> > >   */
> > >  
> > > +static struct xe_device *coredump_to_xe(const struct xe_devcoredump *coredump)
> > > +{
> > > +	return container_of(coredump, struct xe_device, devcoredump);
> > 
> > Confused how still would ever return NULL, can you explain?
> 
> Very good question! I'm honestly still confused myself.
> 
> There's something not quite right with the device relationship that
> is getting created with the failling_device and the virtual coredump device.
> 
> Once failing_device is removed, the devcoredump should be removed as well,
> or both removals blocked. However this is not what happens.
> 
> On rmmod xe, the device removal is called and we free all xe structs.
> The pci device removal is a void function. We cannot fail. The module
> removal ends up blocked because the relationship, but that doesn't
> saves the day since the device itself is already gone, by the pci
> removal function.
> 
> But the devcoredump device is there and active. There's no callback on
> devcoredump infra that we could call to force the device removal. Then
> any read function will hit a NULL xe device and BOOM!
> 
> This is one of the things I planned to tackle on the devcoredump side
> after we get this basic infra in use in our driver. This patch allows
> us to be protected from this scenario while we don't fix this at the
> devcoredump side.
> 
> It's worth saying that the devcoredump virtual device is auto removed
> after a time elapsed... couple minutes? (I can't remember by heart now).
> 

After some serious thought, I think I get this. With that:

Reviewed-by: Matthew Brost <matthew.brost at intel.com>

But agree this goofy and try to fix this properly after this is merged.

> > 
> > Matt
> > 
> > > +}
> > > +
> > >  static ssize_t xe_devcoredump_read(char *buffer, loff_t offset,
> > >  				   size_t count, void *data, size_t datalen)
> > >  {
> > > @@ -51,6 +56,10 @@ static ssize_t xe_devcoredump_read(char *buffer, loff_t offset,
> > >  	struct drm_print_iterator iter;
> > >  	struct timespec64 ts;
> > >  
> > > +	/* Our device is gone already... */
> > > +	if (!data || !coredump_to_xe(coredump))
> > > +		return -ENODEV;
> > > +
> > >  	iter.data = buffer;
> > >  	iter.offset = 0;
> > >  	iter.start = offset;
> > > @@ -80,12 +89,16 @@ static ssize_t xe_devcoredump_read(char *buffer, loff_t offset,
> > >  static void xe_devcoredump_free(void *data)
> > >  {
> > >  	struct xe_devcoredump *coredump = data;
> > > -	struct xe_device *xe = container_of(coredump, struct xe_device,
> > > -					    devcoredump);
> > > +
> > > +	/* Our device is gone. Nothing to do... */
> > > +	if (!data || !coredump_to_xe(coredump))
> > > +		return;
> > > +
> > >  	mutex_lock(&coredump->lock);
> > >  
> > >  	coredump->faulty_engine = NULL;
> > > -	drm_info(&xe->drm, "Xe device coredump has been deleted.\n");
> > > +	drm_info(&coredump_to_xe(coredump)->drm,
> > > +		 "Xe device coredump has been deleted.\n");
> > >  
> > >  	mutex_unlock(&coredump->lock);
> > >  }
> > > -- 
> > > 2.39.2
> > >