[PATCH v2] drm/v3d: Add job to pending list if the reset was skipped

Tvrtko Ursulin tvrtko.ursulin at igalia.com
Thu May 1 08:48:10 UTC 2025


On 30/04/2025 21:51, Maíra Canal wrote:
> When a CL/CSD job times out, we check if the GPU has made any progress
> since the last timeout. If so, instead of resetting the hardware, we skip
> the reset and let the timer get rearmed. This gives long-running jobs a
> chance to complete.
> 
> However, when `timedout_job()` is called, the job in question is removed
> from the pending list, which means it won't be automatically freed through
> `free_job()`. Consequently, when we skip the reset and keep the job
> running, the job won't be freed when it finally completes.
> 
> This situation leads to a memory leak, as exposed in [1] and [2].
> 
> Similarly to commit 704d3d60fec4 ("drm/etnaviv: don't block scheduler when
> GPU is still active"), this patch ensures the job is put back on the
> pending list when extending the timeout.
> 
> Cc: stable at vger.kernel.org # 6.0
> Reported-by: Daivik Bhatia <dtgs1208 at gmail.com>
> Closes: https://gitlab.freedesktop.org/mesa/mesa/-/issues/12227 [1]
> Closes: https://github.com/raspberrypi/linux/issues/6817 [2]
> Signed-off-by: Maíra Canal <mcanal at igalia.com>
> Reviewed-by: Iago Toral Quiroga <itoral at igalia.com>
> ---
> 
> Hi,
> 
> While we typically strive to avoid exposing the scheduler's internals
> within the drivers, I'm proposing this fix as an interim solution. I'm aware
> that a comprehensive fix will need some adjustments on the DRM sched side,
> and I plan to address that soon.
> 
> However, it would be hard to justify the backport of such patches to the
> stable branches and this issue is affecting users in the moment.
> Therefore, I'd like to push this patch to drm-misc-fixes in order to
> address this leak as soon as possible, while working in a more generic
> solution.

Acked-by: Tvrtko Ursulin <tvrtko.ursulin at igalia.com>

Regards,

Tvrtko

> 
> Best Regards,
> - Maíra
> 
> v1 -> v2: https://lore.kernel.org/dri-devel/20250427202907.94415-2-mcanal@igalia.com/
> 
> - Protect the pending list by using its spinlock.
> - Add the URL of another downstream issue related to this patch.
> 
>   drivers/gpu/drm/v3d/v3d_sched.c | 28 +++++++++++++++++++++-------
>   1 file changed, 21 insertions(+), 7 deletions(-)
> 
> diff --git a/drivers/gpu/drm/v3d/v3d_sched.c b/drivers/gpu/drm/v3d/v3d_sched.c
> index b3be08b0ca91..c612363181df 100644
> --- a/drivers/gpu/drm/v3d/v3d_sched.c
> +++ b/drivers/gpu/drm/v3d/v3d_sched.c
> @@ -734,11 +734,16 @@ v3d_gpu_reset_for_timeout(struct v3d_dev *v3d, struct drm_sched_job *sched_job)
>   	return DRM_GPU_SCHED_STAT_NOMINAL;
>   }
>   
> -/* If the current address or return address have changed, then the GPU
> - * has probably made progress and we should delay the reset.  This
> - * could fail if the GPU got in an infinite loop in the CL, but that
> - * is pretty unlikely outside of an i-g-t testcase.
> - */
> +static void
> +v3d_sched_skip_reset(struct drm_sched_job *sched_job)
> +{
> +	struct drm_gpu_scheduler *sched = sched_job->sched;
> +
> +	spin_lock(&sched->job_list_lock);
> +	list_add(&sched_job->list, &sched->pending_list);
> +	spin_unlock(&sched->job_list_lock);
> +}
> +
>   static enum drm_gpu_sched_stat
>   v3d_cl_job_timedout(struct drm_sched_job *sched_job, enum v3d_queue q,
>   		    u32 *timedout_ctca, u32 *timedout_ctra)
> @@ -748,9 +753,16 @@ v3d_cl_job_timedout(struct drm_sched_job *sched_job, enum v3d_queue q,
>   	u32 ctca = V3D_CORE_READ(0, V3D_CLE_CTNCA(q));
>   	u32 ctra = V3D_CORE_READ(0, V3D_CLE_CTNRA(q));
>   
> +	/* If the current address or return address have changed, then the GPU
> +	 * has probably made progress and we should delay the reset. This
> +	 * could fail if the GPU got in an infinite loop in the CL, but that
> +	 * is pretty unlikely outside of an i-g-t testcase.
> +	 */
>   	if (*timedout_ctca != ctca || *timedout_ctra != ctra) {
>   		*timedout_ctca = ctca;
>   		*timedout_ctra = ctra;
> +
> +		v3d_sched_skip_reset(sched_job);
>   		return DRM_GPU_SCHED_STAT_NOMINAL;
>   	}
>   
> @@ -790,11 +802,13 @@ v3d_csd_job_timedout(struct drm_sched_job *sched_job)
>   	struct v3d_dev *v3d = job->base.v3d;
>   	u32 batches = V3D_CORE_READ(0, V3D_CSD_CURRENT_CFG4(v3d->ver));
>   
> -	/* If we've made progress, skip reset and let the timer get
> -	 * rearmed.
> +	/* If we've made progress, skip reset, add the job to the pending
> +	 * list, and let the timer get rearmed.
>   	 */
>   	if (job->timedout_batches != batches) {
>   		job->timedout_batches = batches;
> +
> +		v3d_sched_skip_reset(sched_job);
>   		return DRM_GPU_SCHED_STAT_NOMINAL;
>   	}
>   



More information about the dri-devel mailing list