[PATCH 1/1] drm/xe: Don't short circuit TDR on jobs not started

Wed Oct 23 16:47:05 UTC 2024

On Tue, 2024-10-22 at 16:27 -0700, Matthew Brost wrote:
> Short circuiting TDR on jobs not started is an optimization which is not
> required. On LNL we are facing an issue where jobs do not get scheduled
> by the GuC for an unknown reason. Removing this optimization allows jobs
> to get scheduled after the TDR fires once, which is a big improvement.
> Remove this optimization for now while root causing the job scheduling
> issue on LNL.
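
For readers following along, the behavior change described above can be
modeled in a few lines of standalone C. This is only an illustrative
sketch, not driver code: struct model_job, enum tdr_action and both
functions are hypothetical names, and the two flags merely stand in for
xe_sched_job_started(job) and skip_timeout_check from the diff below.
What the rest of the real handler does once the check is gone is outside
the scope of this model.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the state the timeout handler inspects. */
struct model_job {
	bool started;            /* models xe_sched_job_started(job) */
	bool skip_timeout_check; /* models skip_timeout_check in the diff */
};

/* Hypothetical outcome of the check that the patch removes. */
enum tdr_action {
	TDR_REARM,  /* re-arm the timer and keep waiting */
	TDR_HANDLE, /* continue into the rest of the timeout handling */
};

/* Before the patch: a job that never started short circuits to rearm. */
enum tdr_action tdr_before_patch(const struct model_job *job)
{
	if (!job->skip_timeout_check && !job->started)
		return TDR_REARM;
	return TDR_HANDLE;
}

/* After the patch: the check is gone, so the handler always continues. */
enum tdr_action tdr_after_patch(const struct model_job *job)
{
	(void)job; /* the started state no longer matters at this point */
	return TDR_HANDLE;
}
```

In this model, a job with both flags false is re-armed every time the
timer fires before the patch, which matches the "waiting forever"
symptom if the GuC never starts the job. After the patch the same job
falls through to the rest of the handler on the first timeout.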

I just tested it and it seems to do what it promises. Thanks! Having a
5 second hiccup is still horribly bad, but it is - checks math notes -
infinitely better than waiting forever for a syncobj that will never be
signaled.

This patch will *tremendously* help Mesa CI, since we can reproduce
this bug all the time with Vulkan CTS tests.

Suggestions:

- Can we get a message in dmesg every time this hiccup happens? We're
not sure if it's happening on real workloads on people's machines, so
maybe having some sort of indication "oops, we just unstuck the batch
you submitted 300 frames ago!" would help.

- Since we don't know how long it will be until the real fix, can this be
tagged for stable? If it turns out the real fix requires a special GuC,
it would be even more valuable to have this in stable, since GuC updates
tend to take longer to propagate to people's machines.

Thanks a lot!

> 
> Cc: Paulo Zanoni <paulo.r.zanoni at intel.com>
> Signed-off-by: Matthew Brost <matthew.brost at intel.com>
> ---
>  drivers/gpu/drm/xe/xe_guc_submit.c | 4 ----
>  1 file changed, 4 deletions(-)
> 
> diff --git a/drivers/gpu/drm/xe/xe_guc_submit.c b/drivers/gpu/drm/xe/xe_guc_submit.c
> index 0b81972ff651..25ab675e9c7d 100644
> --- a/drivers/gpu/drm/xe/xe_guc_submit.c
> +++ b/drivers/gpu/drm/xe/xe_guc_submit.c
> @@ -1052,10 +1052,6 @@ guc_exec_queue_timedout_job(struct drm_sched_job *drm_job)
>  		exec_queue_killed_or_banned_or_wedged(q) ||
>  		exec_queue_destroyed(q);
>  
> -	/* Job hasn't started, can't be timed out */
> -	if (!skip_timeout_check && !xe_sched_job_started(job))
> -		goto rearm;
> -
>  	/*
>  	 * If devcoredump not captured and GuC capture for the job is not ready
>  	 * do manual capture first and decide later if we need to use it