[PATCH v2] drm/xe/devcoredump: Defer devcoredump initialization during probe

Mon Jul 28 19:01:22 UTC 2025

On Mon, Jul 28, 2025 at 01:56:07PM -0400, Summers, Stuart wrote:
> On Mon, 2025-07-28 at 14:17 +0530, Balasubramani Vivekanandan wrote:
> > Initializing devcoredump before the GT looks harmless, but it leads to
> > a problem during driver unbind. Because of this order, the GT/engine
> > release functions are called before the xe devcoredump release function
> > (xe_driver_devcoredump_fini), leading to the following kernel crash [1]
> > because the devcoredump functions might still use GT/engine data
> > structures after those are freed.
> > 
> > The following crash is observed while running the IGT
> > xe_wedged@wedged-at-any-timeout. The test forces a wedged state by
> > submitting a workload which hangs. It then does an unbind/rebind of the
> > driver to recover from the wedged state.
> > The hung workload leads to a devcoredump. The following crash is
> > noticed when the devcoredump capture races with the driver unbind.
> > During driver unbind, the release function hw_engine_fini() is
> > called, which assigns NULL to hwe->gt. But the same data structure is
> > accessed during the coredump capture in the function
> > xe_engine_snapshot_print() by reading snapshot->hwe->gt.
> > 
> > With this patch, we make sure the devcoredump is stopped before
> > deinitializing the core driver functions.
> > 
> > [1]:
> > BUG: kernel NULL pointer dereference, address: 0000000000000000
> > Workqueue: events_unbound xe_devcoredump_deferred_snap_work [xe]
> > RIP: 0010:xe_engine_snapshot_print+0x47/0x420 [xe]
> > Call Trace:
> >  <TASK>
> >  ? drm_printf+0x64/0x90
> >  __xe_devcoredump_read+0x23f/0x2d0 [xe]
> >  ? __pfx___drm_printfn_coredump+0x10/0x10
> >  ? __pfx___drm_puts_coredump+0x10/0x10
> >  xe_devcoredump_deferred_snap_work+0x17a/0x190 [xe]
> >  process_one_work+0x22e/0x6f0
> >  worker_thread+0x1e8/0x3d0
> >  ? __pfx_worker_thread+0x10/0x10
> >  kthread+0x11f/0x250
> >  ? __pfx_kthread+0x10/0x10
> >  ret_from_fork+0x47/0x70
> >  ? __pfx_kthread+0x10/0x10
> >  ret_from_fork_asm+0x1a/0x30
> > 
> > v2: Detailed commit description (Rodrigo)

Thanks for that. Now I can see the path, but I also agree with
Stuart below...

> > 
> > Fixes: 4209d635a823 ("drm/xe: Remove devcoredump during driver release")
> > Signed-off-by: Balasubramani Vivekanandan <balasubramani.vivekanandan at intel.com>
> 
> So I can see how this fixes the problem from your description and
> looking over the code. I thought generally though we were trying to
> decouple the devcoredump from the underlying structures.
> xe_engine_snapshot_print() is grabbing a lot of information from the GT
> at the time of the print rather than purely as a snapshot which doesn't
> seem right to me - we should be taking the snapshot at the time of the
> error and the print should just be relaying that info.
> 
> So not that your change is bad, but I think it masks a problem we have
> in the implementation of that engine print. If we call
> xe_guc_capture_get_reg_desc_list() at the time of failure rather than
> from the print itself, do we still see the same problem?

Indeed, the real fix is to entirely decouple the capture from the read:
the capture should be done at snapshot time, and the read should not
depend on the GT. That said, this might not be the only such case, and
we probably need a quick fix for now.
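
To make the direction concrete, here is a minimal sketch of that
decoupling in plain userspace C. All names (toy_gt, toy_hw_engine,
toy_engine_snapshot and the two helpers) are hypothetical stand-ins,
not the actual xe structures or functions. The only point is that the
capture step copies what the print step needs, so the print never
follows a pointer back into the engine or GT:

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical, simplified stand-ins for the driver structures. */
struct toy_gt {
	int id;
};

struct toy_hw_engine {
	const char *name;
	struct toy_gt *gt;	/* set to NULL on teardown, as hw_engine_fini() does */
};

/* The snapshot owns copies of everything the print path needs. */
struct toy_engine_snapshot {
	char name[16];
	int gt_id;
};

/* Capture time: engine and GT are still alive, so copy the values out. */
static void toy_engine_snapshot_capture(struct toy_engine_snapshot *snap,
					const struct toy_hw_engine *hwe)
{
	snprintf(snap->name, sizeof(snap->name), "%s", hwe->name);
	snap->gt_id = hwe->gt->id;
}

/* Print time: reads only the snapshot, never the engine or the GT. */
static int toy_engine_snapshot_print(const struct toy_engine_snapshot *snap,
				     char *buf, size_t len)
{
	return snprintf(buf, len, "engine %s on gt%d", snap->name, snap->gt_id);
}
```

With that shape, clearing the engine's gt pointer between capture and
print has no effect on the print.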

Perhaps we go with this patch, but mark it with a FIXME comment and
ensure we have a GitLab issue + VLK opened for this work...
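
For the record, the reordering helps only because of teardown order.
The toy model below assumes, as the commit message implies, that
release actions run in reverse order of registration. The action stack
and every name in it are hypothetical and only illustrate the ordering,
not the real managed-resource API:

```c
#include <stddef.h>
#include <string.h>

#define TOY_MAX_ACTIONS 8

typedef void (*toy_release_fn)(void);

static toy_release_fn toy_actions[TOY_MAX_ACTIONS];
static size_t toy_nr_actions;
static char toy_release_log[TOY_MAX_ACTIONS + 1];
static size_t toy_nr_released;

/* Register a release action; returns 0 on success, -1 if the stack is full. */
static int toy_add_release_action(toy_release_fn fn)
{
	if (toy_nr_actions == TOY_MAX_ACTIONS)
		return -1;
	toy_actions[toy_nr_actions++] = fn;
	return 0;
}

/* Unbind: run the actions in reverse order of registration (LIFO). */
static void toy_release_all(void)
{
	while (toy_nr_actions)
		toy_actions[--toy_nr_actions]();
}

static void toy_gt_fini(void)
{
	toy_release_log[toy_nr_released++] = 'G';
}

static void toy_devcoredump_fini(void)
{
	toy_release_log[toy_nr_released++] = 'D';
}

/* Probe in the old or new order, unbind, and return the release order. */
static const char *toy_probe_then_unbind(int devcoredump_last)
{
	toy_nr_actions = 0;
	toy_nr_released = 0;
	memset(toy_release_log, 0, sizeof(toy_release_log));

	if (!devcoredump_last)
		toy_add_release_action(toy_devcoredump_fini);
	toy_add_release_action(toy_gt_fini);
	if (devcoredump_last)
		toy_add_release_action(toy_devcoredump_fini);

	toy_release_all();
	return toy_release_log;
}
```

In this model, registering devcoredump first releases the GT first
("GD"), while registering it last stops devcoredump before the GT goes
away ("DG"), which is the ordering the patch is after.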

> 
> Thanks,
> Stuart
> 
> > ---
> >  drivers/gpu/drm/xe/xe_device.c | 8 ++++----
> >  1 file changed, 4 insertions(+), 4 deletions(-)
> > 
> > diff --git a/drivers/gpu/drm/xe/xe_device.c b/drivers/gpu/drm/xe/xe_device.c
> > index d04a0ae018e6..ae48cd3c7bf0 100644
> > --- a/drivers/gpu/drm/xe/xe_device.c
> > +++ b/drivers/gpu/drm/xe/xe_device.c
> > @@ -821,10 +821,6 @@ int xe_device_probe(struct xe_device *xe)
> >                         return err;
> >         }
> >  
> > -       err = xe_devcoredump_init(xe);
> > -       if (err)
> > -               return err;
> > -
> >         /*
> >          * From here on, if a step fails, make sure a Driver-FLR is triggereed
> >          */
> > @@ -889,6 +885,10 @@ int xe_device_probe(struct xe_device *xe)
> >             XE_WA(xe->tiles->media_gt, 15015404425_disable))
> >                 XE_DEVICE_WA_DISABLE(xe, 15015404425);
> >  
> > +       err = xe_devcoredump_init(xe);
> > +       if (err)
> > +               return err;
> > +
> >         xe_nvm_init(xe);
> >  
> >         err = xe_heci_gsc_init(xe);
>