[Intel-xe] [PATCH 03/14] drm/xe: Do not take any action if our device was removed.
Matthew Brost
matthew.brost at intel.com
Tue May 2 23:06:02 UTC 2023
On Tue, May 02, 2023 at 01:21:43PM -0400, Rodrigo Vivi wrote:
> On Tue, May 02, 2023 at 03:40:50PM +0000, Matthew Brost wrote:
> > On Wed, Apr 26, 2023 at 04:57:02PM -0400, Rodrigo Vivi wrote:
> > > Unfortunately devcoredump infrastructure does not provide and
> > > interface for us to force the device removal upon the pci_remove
> > > time of our device.
> > >
> > > The devcoredump is linked at the device level, so when in use
> > > it will prevent the module removal, but it doesn't prevent the
> > > call of the pci_remove callback. This callback cannot fail
> > > anyway and we end up clearing and freeing the entire pci device.
> > >
> > > Hence, after we removed the pci device, we shouldn't allow any
> > > read or free operations to avoid segmentation fault.
> > >
> > > Signed-off-by: Rodrigo Vivi <rodrigo.vivi at intel.com>
> > > ---
> > > drivers/gpu/drm/xe/xe_devcoredump.c | 19 ++++++++++++++++---
> > > 1 file changed, 16 insertions(+), 3 deletions(-)
> > >
> > > diff --git a/drivers/gpu/drm/xe/xe_devcoredump.c b/drivers/gpu/drm/xe/xe_devcoredump.c
> > > index d9531183f03a..a08929c01b75 100644
> > > --- a/drivers/gpu/drm/xe/xe_devcoredump.c
> > > +++ b/drivers/gpu/drm/xe/xe_devcoredump.c
> > > @@ -42,6 +42,11 @@
> > > * hang capture.
> > > */
> > >
> > > +static struct xe_device *coredump_to_xe(const struct xe_devcoredump *coredump)
> > > +{
> > > + return container_of(coredump, struct xe_device, devcoredump);
> >
> > Confused how still would ever return NULL, can you explain?
>
> Very good question! I'm honestly still confused myself.
>
> There's something not quite right with the device relationship that
> is getting created with the failling_device and the virtual coredump device.
>
> Once failing_device is removed, the devcoredump should be removed as well,
> or both removals blocked. However this is not what happens.
>
> On rmmod xe, the device removal is called and we free all xe structs.
> The pci device removal is a void function. We cannot fail. The module
> removal ends up blocked because the relationship, but that doesn't
> saves the day since the device itself is already gone, by the pci
> removal function.
>
> But the devcoredump device is there and active. There's no callback on
> devcoredump infra that we could call to force the device removal. Then
> any read function will hit a NULL xe device and BOOM!
>
> This is one of the things I planned to tackle on the devcoredump side
> after we get this basic infra in use in our driver. This patch allows
> us to be protected from this scenario while we don't fix this at the
> devcoredump side.
>
> It's worth saying that the devcoredump virtual device is auto removed
> after a time elapsed... couple minutes? (I can't remember by heart now).
>
After some serious thought, I think I get this. With that:
Reviewed-by: Matthew Brost <matthew.brost at intel.com>
But agree this goofy and try to fix this properly after this is merged.
> >
> > Matt
> >
> > > +}
> > > +
> > > static ssize_t xe_devcoredump_read(char *buffer, loff_t offset,
> > > size_t count, void *data, size_t datalen)
> > > {
> > > @@ -51,6 +56,10 @@ static ssize_t xe_devcoredump_read(char *buffer, loff_t offset,
> > > struct drm_print_iterator iter;
> > > struct timespec64 ts;
> > >
> > > + /* Our device is gone already... */
> > > + if (!data || !coredump_to_xe(coredump))
> > > + return -ENODEV;
> > > +
> > > iter.data = buffer;
> > > iter.offset = 0;
> > > iter.start = offset;
> > > @@ -80,12 +89,16 @@ static ssize_t xe_devcoredump_read(char *buffer, loff_t offset,
> > > static void xe_devcoredump_free(void *data)
> > > {
> > > struct xe_devcoredump *coredump = data;
> > > - struct xe_device *xe = container_of(coredump, struct xe_device,
> > > - devcoredump);
> > > +
> > > + /* Our device is gone. Nothing to do... */
> > > + if (!data || !coredump_to_xe(coredump))
> > > + return;
> > > +
> > > mutex_lock(&coredump->lock);
> > >
> > > coredump->faulty_engine = NULL;
> > > - drm_info(&xe->drm, "Xe device coredump has been deleted.\n");
> > > + drm_info(&coredump_to_xe(coredump)->drm,
> > > + "Xe device coredump has been deleted.\n");
> > >
> > > mutex_unlock(&coredump->lock);
> > > }
> > > --
> > > 2.39.2
> > >
More information about the Intel-xe
mailing list