[PATCH] drm/xe: skip error capture when exec queue is killed

Upadhyay, Tejas tejas.upadhyay at intel.com
Tue Apr 30 05:19:48 UTC 2024



> -----Original Message-----
> From: Vivi, Rodrigo <rodrigo.vivi at intel.com>
> Sent: Tuesday, April 30, 2024 2:02 AM
> To: Brost, Matthew <matthew.brost at intel.com>; Maarten Lankhorst
> <maarten.lankhorst at linux.intel.com>
> Cc: Upadhyay, Tejas <tejas.upadhyay at intel.com>; intel-xe at lists.freedesktop.org
> Subject: Re: [PATCH] drm/xe: skip error capture when exec queue is killed
> 
> On Thu, Apr 25, 2024 at 04:23:27PM +0000, Matthew Brost wrote:
> > On Thu, Apr 25, 2024 at 05:59:31PM +0530, Tejas Upadhyay wrote:
> > > > When the user closes an exec queue soon after job submission, we
> > > > generate an error coredump. Instead, check whether the exec queue
> > > > was killed when the job times out and, if so, skip the error coredump
> > > > capture, just free the job and return the proper scheduler state.
> > >
> > > Signed-off-by: Tejas Upadhyay <tejas.upadhyay at intel.com>
> > > ---
> > >  drivers/gpu/drm/xe/xe_guc_submit.c | 3 ++-
> > >  1 file changed, 2 insertions(+), 1 deletion(-)
> > >
> > > > diff --git a/drivers/gpu/drm/xe/xe_guc_submit.c b/drivers/gpu/drm/xe/xe_guc_submit.c
> > > index 93e1ee183e4a..376a2c04e899 100644
> > > --- a/drivers/gpu/drm/xe/xe_guc_submit.c
> > > +++ b/drivers/gpu/drm/xe/xe_guc_submit.c
> > > @@ -971,7 +971,8 @@ guc_exec_queue_timedout_job(struct
> drm_sched_job *drm_job)
> > >  	 * TDR has fired before free job worker. Common if exec queue
> > >  	 * immediately closed after last fence signaled.
> > >  	 */
> > > -	if (test_bit(DMA_FENCE_FLAG_SIGNALED_BIT, &job->fence->flags)) {
> > > +	if (exec_queue_killed(q) ||
> >
> > > You still need to time out the job if the DMA_FENCE_FLAG_SIGNALED_BIT
> > > is clear; otherwise it will never signal.
> >
> > So it should be something like this:
> >
> > -       simple_error_capture(q);
> > -       xe_devcoredump(job);
> > +       if (!exec_queue_killed(q)) {
> > +               simple_error_capture(q);
> > +               xe_devcoredump(job);
> > +       }

OK, that makes sense. I will change it accordingly, assuming the others agree, and send out v2.
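
For reference, the control flow suggested above can be sketched as a small
standalone mock. This is only an illustration under stated assumptions, not
the driver code: the `mock_*` types, the `MOCK_JOB_*` return values and the
boolean fields are hypothetical stand-ins for the real xe/drm_sched
structures, and the capture step merely records that it would have run.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the driver objects; not the real xe types. */
enum mock_result {
	MOCK_JOB_FREED,     /* fence had already signaled, job was just freed */
	MOCK_JOB_TIMED_OUT, /* job went through the timeout path */
};

struct mock_queue {
	bool killed;        /* models exec_queue_killed(q) */
};

struct mock_job {
	struct mock_queue *q;
	bool fence_signaled; /* models DMA_FENCE_FLAG_SIGNALED_BIT */
	bool freed;
	bool captured;       /* models simple_error_capture() + xe_devcoredump() */
};

/* Sketch of the suggested ordering inside the timed-out-job handler. */
static enum mock_result mock_timedout_job(struct mock_job *job)
{
	assert(job != NULL && job->q != NULL);

	/* Fence already signaled: nothing to time out, just free the job. */
	if (job->fence_signaled) {
		job->freed = true;
		return MOCK_JOB_FREED;
	}

	/* Skip only the error capture when the queue was killed. */
	if (!job->q->killed)
		job->captured = true;

	/*
	 * The job is still timed out even for a killed queue, so that its
	 * fence is signaled eventually (modeled here by setting the flag).
	 */
	job->fence_signaled = true;
	return MOCK_JOB_TIMED_OUT;
}
```

The point of this ordering is that a killed queue changes only whether the
capture runs; the unsignaled job still goes through the timeout path.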

Thanks,
Tejas
> >
> > > I think I've convinced myself that skipping the error capture is
> > > correct in this case, e.g. if a user ctrl-c's an app, we shouldn't do
> > > a job capture on the jobs which the KMD kills.
> > >
> > > @Rodrigo, @Jose, thoughts? I know both of you have done a bit of work
> > > here.
> 
> Cc: @Maarten
> 
> Yep, it does make sense to me to skip the error capture on canceled jobs.
> 
> >
> > Matt
> >
> > > +	    test_bit(DMA_FENCE_FLAG_SIGNALED_BIT, &job->fence->flags)) {
> > >  		guc_exec_queue_free_job(drm_job);
> > >
> > >  		return DRM_GPU_SCHED_STAT_NOMINAL;
> > > --
> > > 2.25.1
> > >

