[PATCH v4] drm/panthor: Make the timeout per-queue instead of per-job

Sat May 24 15:03:37 UTC 2025

Hi Ashley,

On Fri, 23 May 2025 at 16:10, Ashley Smith <ashley.smith at collabora.com> wrote:
> The timeout logic provided by drm_sched leads to races when we try
> to suspend it while the drm_sched workqueue queues more jobs. Let's
> overhaul the timeout handling in panthor to have our own delayed work
> that's resumed/suspended when a group is resumed/suspended. When an
> actual timeout occurs, we call drm_sched_fault() to report it
> through drm_sched, still. But otherwise, the drm_sched timeout is
> disabled (set to MAX_SCHEDULE_TIMEOUT), which leaves us in control of
> how we protect modifications on the timer.
>
> One issue seems to be that we call drm_sched_suspend_timeout() from
> both queue_run_job() and tick_work(), which could lead to races
> because drm_sched_suspend_timeout() has no lock. Another issue seems
> to be in queue_run_job(): if the group is not scheduled, we suspend
> the timeout again, which undoes what drm_sched_job_begin() did when
> calling drm_sched_start_timeout(). So the timeout is not reset when a
> job finishes.
>
> Co-developed-by: Boris Brezillon <boris.brezillon at collabora.com>
> Signed-off-by: Boris Brezillon <boris.brezillon at collabora.com>
> Tested-by: Daniel Stone <daniels at collabora.com>
> Fixes: de8548813824 ("drm/panthor: Add the scheduler logical block")
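
For anyone following along, here is a rough userspace sketch of the idea
described in the quoted commit message: a per-queue timeout that keeps its
unused budget across suspend/resume and gets its full budget back when a
job finishes. This is not the panthor code. Every name here (struct
queue_timeout, qt_init(), qt_resume(), qt_suspend(), qt_reset(),
qt_expired()) is hypothetical, time is passed in explicitly instead of
coming from a delayed work, and locking is left out. In the driver, per the
description above, an expiry would be reported through drm_sched_fault().

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of a per-queue timeout; not taken from panthor. */
struct queue_timeout {
    uint64_t period_ms;    /* full timeout budget for the queue */
    uint64_t remaining_ms; /* unused budget, valid while suspended */
    uint64_t deadline_ms;  /* absolute expiry time, valid while running */
    bool running;          /* true while the timer is armed */
};

static void qt_init(struct queue_timeout *t, uint64_t period_ms)
{
    assert(period_ms > 0);
    t->period_ms = period_ms;
    t->remaining_ms = period_ms;
    t->deadline_ms = 0;
    t->running = false;
}

/* Group resumed: arm the timer with whatever budget is left. */
static void qt_resume(struct queue_timeout *t, uint64_t now_ms)
{
    if (t->running)
        return;
    t->deadline_ms = now_ms + t->remaining_ms;
    t->running = true;
}

/* Group suspended: stop the timer and remember the unused budget. */
static void qt_suspend(struct queue_timeout *t, uint64_t now_ms)
{
    if (!t->running)
        return;
    t->remaining_ms = t->deadline_ms > now_ms ? t->deadline_ms - now_ms : 0;
    t->running = false;
}

/* A job finished: the queue made progress, so restore the full budget. */
static void qt_reset(struct queue_timeout *t, uint64_t now_ms)
{
    t->remaining_ms = t->period_ms;
    if (t->running)
        t->deadline_ms = now_ms + t->period_ms;
}

/* True only when an armed timer has reached its deadline. */
static bool qt_expired(const struct queue_timeout *t, uint64_t now_ms)
{
    return t->running && now_ms >= t->deadline_ms;
}
```

In this model a suspended queue never expires, and a resume after a long
suspension only consumes the budget that was left at suspend time. A real
implementation would need to do all of these updates under a single lock.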

Unfortunately I have to revoke my Tested-by, as we're seeing a pile of
failures in a CI stress test with this patch applied, e.g.
https://gitlab.freedesktop.org/daniels/mesa/-/jobs/77004047

Cheers,
Daniel