[PATCH] drm/v3d: Add job to pending list if the reset was skipped
Iago Toral
itoral at igalia.com
Tue Apr 29 11:00:36 UTC 2025
Never mind, somehow I missed you're already doing that in this patch.
Reviewed-by: Iago Toral Quiroga <itoral at igalia.com>
On Mon, 2025-04-28 at 07:45 +0200, Iago Toral wrote:
> Hi Maira,
>
> Looks good to me, but don't we need to do the same in
> v3d_csd_job_timedout?
>
> Iago
>
> On Sun, 2025-04-27 at 17:28 -0300, Maíra Canal wrote:
> > When a CL/CSD job times out, we check if the GPU has made any progress
> > since the last timeout. If so, instead of resetting the hardware, we skip
> > the reset and let the timer get rearmed. This gives long-running jobs a
> > chance to complete.
> >
> > However, when `timedout_job()` is called, the job in question is removed
> > from the pending list, which means it won't be automatically freed through
> > `free_job()`. Consequently, when we skip the reset and keep the job
> > running, the job won't be freed when it finally completes.
> >
> > This situation leads to a memory leak, as exposed in [1].
> >
> > Similarly to commit 704d3d60fec4 ("drm/etnaviv: don't block scheduler when
> > GPU is still active"), this patch ensures the job is put back on the
> > pending list when extending the timeout.
> >
> > Cc: stable at vger.kernel.org # 6.0
> > Link: https://gitlab.freedesktop.org/mesa/mesa/-/issues/12227 [1]
> > Reported-by: Daivik Bhatia <dtgs1208 at gmail.com>
> > Signed-off-by: Maíra Canal <mcanal at igalia.com>
> > ---
> > drivers/gpu/drm/v3d/v3d_sched.c | 18 +++++++++++-------
> > 1 file changed, 11 insertions(+), 7 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/v3d/v3d_sched.c b/drivers/gpu/drm/v3d/v3d_sched.c
> > index b3be08b0ca91..a98dcf9d75b1 100644
> > --- a/drivers/gpu/drm/v3d/v3d_sched.c
> > +++ b/drivers/gpu/drm/v3d/v3d_sched.c
> > @@ -734,11 +734,6 @@ v3d_gpu_reset_for_timeout(struct v3d_dev *v3d, struct drm_sched_job *sched_job)
> >  	return DRM_GPU_SCHED_STAT_NOMINAL;
> >  }
> >  
> > -/* If the current address or return address have changed, then the GPU
> > - * has probably made progress and we should delay the reset. This
> > - * could fail if the GPU got in an infinite loop in the CL, but that
> > - * is pretty unlikely outside of an i-g-t testcase.
> > - */
> >  static enum drm_gpu_sched_stat
> >  v3d_cl_job_timedout(struct drm_sched_job *sched_job, enum v3d_queue q,
> >  		    u32 *timedout_ctca, u32 *timedout_ctra)
> > @@ -748,9 +743,16 @@ v3d_cl_job_timedout(struct drm_sched_job *sched_job, enum v3d_queue q,
> >  	u32 ctca = V3D_CORE_READ(0, V3D_CLE_CTNCA(q));
> >  	u32 ctra = V3D_CORE_READ(0, V3D_CLE_CTNRA(q));
> >  
> > +	/* If the current address or return address have changed, then the GPU
> > +	 * has probably made progress and we should delay the reset. This
> > +	 * could fail if the GPU got in an infinite loop in the CL, but that
> > +	 * is pretty unlikely outside of an i-g-t testcase.
> > +	 */
> >  	if (*timedout_ctca != ctca || *timedout_ctra != ctra) {
> >  		*timedout_ctca = ctca;
> >  		*timedout_ctra = ctra;
> > +
> > +		list_add(&sched_job->list, &sched_job->sched->pending_list);
> >  		return DRM_GPU_SCHED_STAT_NOMINAL;
> >  	}
> >  
> > @@ -790,11 +792,13 @@ v3d_csd_job_timedout(struct drm_sched_job *sched_job)
> >  	struct v3d_dev *v3d = job->base.v3d;
> >  	u32 batches = V3D_CORE_READ(0, V3D_CSD_CURRENT_CFG4(v3d->ver));
> >  
> > -	/* If we've made progress, skip reset and let the timer get
> > -	 * rearmed.
> > +	/* If we've made progress, skip reset, add the job to the pending
> > +	 * list, and let the timer get rearmed.
> >  	 */
> >  	if (job->timedout_batches != batches) {
> >  		job->timedout_batches = batches;
> > +
> > +		list_add(&sched_job->list, &sched_job->sched->pending_list);
> >  		return DRM_GPU_SCHED_STAT_NOMINAL;
> >  	}
> >  
>
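For context on why the job has to be re-added at all: the scheduler core takes
the timed-out job off pending_list before it calls the driver's timedout_job()
hook, so a job whose reset is skipped is no longer tracked anywhere. Below is a
simplified sketch of that scheduler-side path, paraphrased from
drivers/gpu/drm/scheduler/sched_main.c (names match the upstream scheduler, but
status handling and error paths are abridged; this is not the verbatim code):

#include <linux/list.h>
#include <linux/spinlock.h>
#include <drm/gpu_scheduler.h>

static void sched_timeout_sketch(struct work_struct *work)
{
        struct drm_gpu_scheduler *sched =
                container_of(work, struct drm_gpu_scheduler, work_tdr.work);
        struct drm_sched_job *job;
        enum drm_gpu_sched_stat status;

        spin_lock(&sched->job_list_lock);
        /* The oldest job on pending_list is the one that timed out. */
        job = list_first_entry_or_null(&sched->pending_list,
                                       struct drm_sched_job, list);
        if (job) {
                /* The job is removed here so a concurrent free_job() cannot
                 * touch it. If the driver then skips the reset and simply
                 * returns NOMINAL, nothing puts it back on the list. */
                list_del_init(&job->list);
                spin_unlock(&sched->job_list_lock);

                status = job->sched->ops->timedout_job(job);
                /* status handling elided in this sketch */
        } else {
                spin_unlock(&sched->job_list_lock);
        }
}

As I understand it, in the normal reset path the bad job is reinserted into
pending_list by drm_sched_stop() before the scheduler is restarted, which is
why only the skip-reset branches in v3d_cl_job_timedout() and
v3d_csd_job_timedout() need the explicit list_add() that this patch adds.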