[PATCH v2 6/8] drm/etnaviv: Use DRM_GPU_SCHED_STAT_NO_HANG to skip the reset
Philipp Stanner
phasta at mailbox.org
Mon Jun 2 11:59:14 UTC 2025
On Mon, 2025-06-02 at 08:36 -0300, Maíra Canal wrote:
> Hi Philipp,
>
> On 02/06/25 04:28, Philipp Stanner wrote:
> > On Fri, 2025-05-30 at 11:01 -0300, Maíra Canal wrote:
>
> [...]
>
> > > diff --git a/drivers/gpu/drm/etnaviv/etnaviv_sched.c b/drivers/gpu/drm/etnaviv/etnaviv_sched.c
> > > index 7146069a98492f5fab2a49d96e2054f649e1fe3d..46f5391e84a12232b247886cf1311f8e09f42f04 100644
> > > --- a/drivers/gpu/drm/etnaviv/etnaviv_sched.c
> > > +++ b/drivers/gpu/drm/etnaviv/etnaviv_sched.c
> > > @@ -40,11 +40,11 @@ static enum drm_gpu_sched_stat etnaviv_sched_timedout_job(struct drm_sched_job
> > >  	int change;
> > >  
> > >  	/*
> > > -	 * If the GPU managed to complete this jobs fence, the timout is
> > > -	 * spurious. Bail out.
> > > +	 * If the GPU managed to complete this jobs fence, the timeout has
> > > +	 * fired before free-job worker. The timeout is spurious, so bail out.
> > >  	 */
> > >  	if (dma_fence_is_signaled(submit->out_fence))
> > > -		goto out_no_timeout;
> > > +		return DRM_GPU_SCHED_STAT_NO_HANG;
> > >  
> > >  	/*
> > >  	 * If the GPU is still making forward progress on the front-end (which
> > > @@ -70,7 +70,7 @@ static enum drm_gpu_sched_stat etnaviv_sched_timedout_job(struct drm_sched_job
> > >  		gpu->hangcheck_dma_addr = dma_addr;
> > >  		gpu->hangcheck_primid = primid;
> > >  		gpu->hangcheck_fence = gpu->completed_fence;
> > > -		goto out_no_timeout;
> > > +		return DRM_GPU_SCHED_STAT_NO_HANG;
> > >  	}
> > >  
> > >  	/* block scheduler */
> > > @@ -86,10 +86,7 @@ static enum drm_gpu_sched_stat etnaviv_sched_timedout_job(struct drm_sched_job
> > >  	drm_sched_resubmit_jobs(&gpu->sched);
> > >  
> > >  	drm_sched_start(&gpu->sched, 0);
> > > -	return DRM_GPU_SCHED_STAT_RESET;
> > >  
> > > -out_no_timeout:
> > > -	list_add(&sched_job->list, &sched_job->sched->pending_list);
> >
> > Here you actually remove the manipulation of the scheduler
> > internals, but you didn't in v3d. Just to point that out.
> >
> >
> > And BTW I'm just seeing that the pending_list gets manipulated here
> > with the scheduler's workqueues running and no locks being held.
> >
> > Oh man :(
> >
> > That is most certainly a bug, and I recommend that the etnaviv
> > maintainers at least add the appropriate lock here and backport
> > that, since it can race at any time.
> >
> >
> > But thx for working on that, Maíra. Good that we can remove the
> > stuff this way.
> >
> > Thinking about it, this patch even fixes a bug, so it could carry a
> > Fixes: tag. But I'm not sure if it's worth it to mark the entire
> > series for stable. Opinions?
>
> I believe the best way to fix this bug would be doing something
> similar to what I did for v3d: send a temporary fix adding the locks,
> which will be backported to stable. I'll send a fix to Etnaviv today.
Having the locking is significantly better than not having it, and
adding it gets my blessing, but it still doesn't solve all issues.
As Matt pointed out and as you address in your patch №2, a job could
still leak if no one makes sure that free_job_work gets kicked off. And
exposing yet another scheduler function publicly is obviously
unacceptable, so it's either live with the potential leak, or use the
new status code.
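
For reference, a minimal sketch of what the interim locking fix discussed
above could look like. This is illustrative only, not the actual etnaviv
patch; it assumes the driver-side list_add stays for now and that
pending_list is protected by drm_sched's job_list_lock, which is what the
scheduler core itself takes around pending_list:

```c
/*
 * Illustrative sketch (not the actual patch): if a driver re-adds a
 * job to the scheduler's pending_list from its timeout handler, it
 * must at minimum hold the scheduler's job_list_lock, since the
 * scheduler's free-job worker walks pending_list under that lock.
 */
spin_lock(&sched_job->sched->job_list_lock);
list_add(&sched_job->list, &sched_job->sched->pending_list);
spin_unlock(&sched_job->sched->job_list_lock);
```

With DRM_GPU_SCHED_STAT_NO_HANG, none of this lives in the driver at all:
the scheduler core keeps the job on its own list under its own lock, which
is the whole point of the new status code.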
P.
>
> Thanks for the review, Philipp!
>
> Best Regards,
> - Maíra
>
> >
> >
> > P.
> >
> >
> > > return DRM_GPU_SCHED_STAT_RESET;
> > > }
> > >
> > >
> >
>