[PATCH v3] drm/panthor: Make the timeout per-queue instead of per-job
Boris Brezillon
boris.brezillon at collabora.com
Mon Apr 14 16:56:23 UTC 2025
On Mon, 14 Apr 2025 17:44:27 +0100
Ashley Smith <ashley.smith at collabora.com> wrote:
> On Fri, 11 Apr 2025 16:51:52 +0100 Steven Price wrote:
> > Hi Ashley,
> >
> > On 10/04/2025 13:57, Ashley Smith wrote:
> > > The timeout logic provided by drm_sched leads to races when we try
> > > to suspend it while the drm_sched workqueue queues more jobs. Let's
> > > overhaul the timeout handling in panthor to use our own delayed work
> > > that is suspended/resumed when a group is suspended/resumed. When an
> > > actual timeout occurs, we still call drm_sched_fault() so the fault
> > > is reported through drm_sched. Otherwise, the drm_sched timeout is
> > > disabled (set to MAX_SCHEDULE_TIMEOUT), which leaves us in control
> > > of how modifications to the timer are protected.
> > >
> > > One issue is that we call drm_sched_suspend_timeout() from both
> > > queue_run_job() and tick_work(), which can race because
> > > drm_sched_suspend_timeout() takes no lock. Another issue is in
> > > queue_run_job(): if the group is not scheduled, we suspend the
> > > timeout again, which undoes what drm_sched_job_begin() did when it
> > > called drm_sched_start_timeout(), so the timeout is not reset when
> > > a job finishes.
> > >
> > > Co-developed-by: Boris Brezillon <boris.brezillon at collabora.com>
> > > Signed-off-by: Boris Brezillon <boris.brezillon at collabora.com>
> > > Tested-by: Daniel Stone <daniels at collabora.com>
> > > Fixes: de8548813824 ("drm/panthor: Add the scheduler logical block")
> > > Signed-off-by: Ashley Smith <ashley.smith at collabora.com>
> > > ---
> > > drivers/gpu/drm/panthor/panthor_sched.c | 244 +++++++++++++++++-------
> > > 1 file changed, 177 insertions(+), 67 deletions(-)
> > >
> > > diff --git a/drivers/gpu/drm/panthor/panthor_sched.c b/drivers/gpu/drm/panthor/panthor_sched.c
> > > index 446ec780eb4a..32f5a75bc4f6 100644
> > > --- a/drivers/gpu/drm/panthor/panthor_sched.c
> > > +++ b/drivers/gpu/drm/panthor/panthor_sched.c
> >
> > [...]
> >
> > > @@ -2727,8 +2784,17 @@ void panthor_sched_suspend(struct panthor_device *ptdev)
> > > * automatically terminate all active groups, so let's
> > > * force the state to halted here.
> > > */
> > > - if (csg_slot->group->state != PANTHOR_CS_GROUP_TERMINATED)
> > > + if (csg_slot->group->state != PANTHOR_CS_GROUP_TERMINATED) {
> > > csg_slot->group->state = PANTHOR_CS_GROUP_TERMINATED;
> > > +
> > > + /* Reset the queue slots manually if the termination
> > > + * request failed.
> > > + */
> > > + for (i = 0; i < group->queue_count; i++) {
> > > + if (group->queues[i])
> > > + cs_slot_reset_locked(ptdev, csg_id, i);
> > > + }
> > > + }
> > > slot_mask &= ~BIT(csg_id);
> > > }
> > > }
> >
> > So this seems to be the only change from v2 (a changelog can be
> > helpful!). And I'm not convinced it belongs in this patch? It's not just
> > "[making] the timeout per-queue instead of per-job".
> >
> > I haven't dug through the details, but I think this belongs in a
> > separate patch.
Actually, it's related, but I agree it's not obvious: we call
cs_slot_reset_locked(), but what we're really interested in is the
cancellation of the timeout work it implies. Before the timeout
changes, the timeout work belonged to drm_sched and was cancelled
inside drm_sched_stop(). That said, maybe we do need to reset the CS
slot regardless, in which case it might make sense to do that in a
separate fix that lands before the timeout changes.
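
To make the dependency more concrete, here's roughly the shape of the
timeout handling this series moves to. This is a hand-written sketch,
not code lifted from the patch: helper names like queue_timeout_work()
and queue_suspend_timeout(), the timeout_work field, and JOB_TIMEOUT_MS
are illustrative placeholders standing in for whatever the driver
actually uses.

static void queue_timeout_work(struct work_struct *work)
{
	struct panthor_queue *queue = container_of(work, struct panthor_queue,
						   timeout_work.work);

	/* A real timeout fired: report it through drm_sched, which still
	 * owns the recovery path. drm_sched's own timer stays disabled,
	 * since it was initialized with MAX_SCHEDULE_TIMEOUT.
	 */
	drm_sched_fault(&queue->scheduler);
}

static void queue_resume_timeout(struct panthor_queue *queue)
{
	/* The group gets a CSG slot: (re)arm the queue-owned timer. */
	mod_delayed_work(system_wq, &queue->timeout_work,
			 msecs_to_jiffies(JOB_TIMEOUT_MS));
}

static void queue_suspend_timeout(struct panthor_queue *queue)
{
	/* The group goes off the hardware: make sure the timer can't
	 * fire while the queue isn't running. This is the cancellation
	 * drm_sched_stop() used to do for us, and it's the reason the
	 * panthor_sched_suspend() hunk above has to reset the queue
	 * slots by hand when the termination request fails.
	 */
	cancel_delayed_work_sync(&queue->timeout_work);
}

If we do split the CS slot reset into a separate fix, the part this
series strictly depends on is that cancel path.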