[Intel-xe] [PATCH 2/2] drm/xe: Add coredump to wa_bb timeouts

Summers, Stuart stuart.summers at intel.com
Thu Oct 5 15:39:27 UTC 2023


On Wed, 2023-10-04 at 09:40 -0400, Rodrigo Vivi wrote:
> On Tue, Sep 26, 2023 at 09:20:56PM +0000, Stuart Summers wrote:
> > We're seeing some hangs during driver load on some platforms
> > in CI which are hard to catch manually. As such, add the dump
> > at the time of the hang.
> > 
> > Signed-off-by: Stuart Summers <stuart.summers at intel.com>
> > ---
> >  drivers/gpu/drm/xe/xe_gt.c | 5 ++++-
> >  1 file changed, 4 insertions(+), 1 deletion(-)
> > 
> > diff --git a/drivers/gpu/drm/xe/xe_gt.c b/drivers/gpu/drm/xe/xe_gt.c
> > index 1aa44d4f9ac1..80ea076197e5 100644
> > --- a/drivers/gpu/drm/xe/xe_gt.c
> > +++ b/drivers/gpu/drm/xe/xe_gt.c
> > @@ -46,6 +46,7 @@
> >  #include "xe_vm.h"
> >  #include "xe_wa.h"
> >  #include "xe_wopcm.h"
> > +#include "xe_devcoredump.h"
> >  
> >  struct xe_gt *xe_gt_alloc(struct xe_tile *tile)
> >  {
> > @@ -187,8 +188,10 @@ static int emit_wa_job(struct xe_gt *gt, struct xe_exec_queue *q)
> 
> Please note that xe_devcoredump doesn't have any kind of locking
> mechanism, because it relies on the serialization of the gt_reset.
> 
> Once you start calling from other places, then we should probably
> add some data protection there.
> 
> But also, maybe we should define some kind of 'type' var that is an
> argument to xe_devcoredump() and gets printed at the top, so that we
> have a clear indication of whether a dump came from a gt_reset or
> from some other timeout?

Yeah, definitely not a bad idea, almost like a trace point. I'll think
it over and repost.
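
For illustration only, here is a minimal userspace sketch of what such a
'type' tag could look like. It is not the xe driver's API: the names
dump_reason, dump_reason_name() and dump_format_header(), and the header
text itself, are all hypothetical. The idea is just that the caller
passes a reason and the dump starts with a line naming it.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical: reasons a dump might be captured. Not part of the xe driver. */
enum dump_reason {
	DUMP_REASON_GT_RESET,
	DUMP_REASON_WA_BB_TIMEOUT,
	DUMP_REASON_COUNT
};

/* Map a reason to the short string printed in the dump header. */
static const char *dump_reason_name(enum dump_reason reason)
{
	static const char * const names[DUMP_REASON_COUNT] = {
		[DUMP_REASON_GT_RESET] = "gt_reset",
		[DUMP_REASON_WA_BB_TIMEOUT] = "wa_bb_timeout",
	};

	assert(reason < DUMP_REASON_COUNT);
	return names[reason];
}

/*
 * Format the line that would sit at the top of a dump.
 * Returns the result of snprintf().
 */
static int dump_format_header(char *buf, size_t len, enum dump_reason reason)
{
	return snprintf(buf, len, "Dump reason: %s\n", dump_reason_name(reason));
}
```

With something along these lines, the call in emit_wa_job() would pass
the timeout reason and the gt_reset path would pass its own, so the two
can be told apart in the captured output.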

Also thanks for the comments on the data protection. I'll review a
little closer on the next rev.
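
As a rough sketch of the data protection being discussed, the following
userspace example guards a single snapshot slot with a lock so that the
first caller wins and later callers leave the captured data untouched.
This is an assumption about one possible shape, not the driver's code:
struct dump_slot and its helpers are hypothetical, and a pthread mutex
stands in for whatever kernel locking primitive would actually be used.

```c
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-in for a single snapshot slot shared by all callers. */
struct dump_slot {
	pthread_mutex_t lock;
	bool captured;
	char reason[32];
};

static void dump_slot_init(struct dump_slot *slot)
{
	pthread_mutex_init(&slot->lock, NULL);
	slot->captured = false;
	slot->reason[0] = '\0';
}

/*
 * Try to take the snapshot. Returns true if this caller captured it,
 * false if a snapshot was already held and nothing was changed.
 */
static bool dump_slot_capture(struct dump_slot *slot, const char *reason)
{
	bool took = false;

	pthread_mutex_lock(&slot->lock);
	if (!slot->captured) {
		snprintf(slot->reason, sizeof(slot->reason), "%s", reason);
		slot->captured = true;
		took = true;
	}
	pthread_mutex_unlock(&slot->lock);

	return took;
}
```

Checking and setting the captured flag under the same lock is what keeps
a gt_reset and a wa_bb timeout from both writing into the slot if they
happen to race.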

Thanks,
Stuart

> 
> Cc: Maarten
> 
> >         xe_bb_free(bb, NULL);
> >         if (timeout < 0)
> >                 return timeout;
> > -       else if (!timeout)
> > +       else if (!timeout) {
> > +               xe_devcoredump(q);
> >                 return -ETIME;
> > +       }
> >  
> >         return 0;
> >  }
> > -- 
> > 2.34.1
> > 


