[PATCH v3 09/13] drm/sched: Submit job before starting TDR

Thu Sep 14 02:56:10 UTC 2023

On 2023-09-11 22:16, Matthew Brost wrote:
> If the TDR is set to a value, it can fire before a job is submitted in
> drm_sched_main. The job should be always be submitted before the TDR
> fires, fix this ordering.
> 
> v2:
>   - Add to pending list before run_job, start TDR after (Luben, Boris)
> 
> Signed-off-by: Matthew Brost <matthew.brost at intel.com>
> ---
>  drivers/gpu/drm/scheduler/sched_main.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
> index c627d3e6494a..9dbfab7be2c6 100644
> --- a/drivers/gpu/drm/scheduler/sched_main.c
> +++ b/drivers/gpu/drm/scheduler/sched_main.c
> @@ -498,7 +498,6 @@ static void drm_sched_job_begin(struct drm_sched_job *s_job)
>  
>  	spin_lock(&sched->job_list_lock);
>  	list_add_tail(&s_job->list, &sched->pending_list);
> -	drm_sched_start_timeout(sched);
>  	spin_unlock(&sched->job_list_lock);
>  }
>  
> @@ -1234,6 +1233,7 @@ static void drm_sched_run_job_work(struct work_struct *w)
>  		fence = sched->ops->run_job(sched_job);
>  		complete_all(&entity->entity_idle);
>  		drm_sched_fence_scheduled(s_fence, fence);
> +		drm_sched_start_timeout_unlocked(sched);
>  
>  		if (!IS_ERR_OR_NULL(fence)) {
>  			/* Drop for original kref_init of the fence */

So, sched->ops->run_job() is a "job inflection point" from the point of view of
the DRM scheduler. After that call, DRM has relinquished control of the job to the
firmware/hardware.

Putting the job in the pending list, before submitting it down to the firmware/hardware,
goes along with starting a timeout timer for the job. The timeout always includes
time for the firmware/hardware to get it prepped, as well as time for the actual
execution of the job (task). Thus, we want to do this:
	1. Put the job in pending list. "Pending list" means "pends in hardware".
	2. Start a timeout timer for the job.
	3. Start executing the job/task. This usually involves giving it to firmware/hardware,
	   i.e. ownership of the job/task changes to another domain. In our case this is accomplished
	   by calling sched->ops->run_job().
Perhaps move drm_sched_start_timeout() closer to sched->ops->run_job(), but still ahead
of it, and/or increase the timeout value?
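
To make the ordering in steps 1-3 concrete, here is a minimal user-space sketch.
It is not kernel code and not a proposed patch: every mock_* name is hypothetical
and only stands in for the pending list, the timeout timer, and
sched->ops->run_job(). The assert in mock_run_job() states the invariant I have in
mind: by the time the job is handed over, it is already pending and its timer is
already armed.

```c
/* User-space mock of the submit ordering; all mock_* names are hypothetical. */
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum mock_event { EV_PENDING = 1, EV_TIMER_ARMED, EV_RUN_JOB };

struct mock_sched {
	enum mock_event log[3];	/* order in which the steps happened */
	size_t n;		/* number of recorded steps */
	size_t pending;		/* stand-in for the pending list length */
	bool timer_armed;	/* stand-in for the timeout timer state */
};

static void mock_record(struct mock_sched *s, enum mock_event ev)
{
	if (s->n < 3)
		s->log[s->n++] = ev;
}

/* Step 1: put the job in the pending list. */
static void mock_add_pending(struct mock_sched *s)
{
	s->pending++;
	mock_record(s, EV_PENDING);
}

/* Step 2: start the timeout timer for the job. */
static void mock_start_timeout(struct mock_sched *s)
{
	s->timer_armed = true;
	mock_record(s, EV_TIMER_ARMED);
}

/* Step 3: hand the job over; stands in for sched->ops->run_job(). */
static void mock_run_job(struct mock_sched *s)
{
	/* Invariant: job is pending and its timer runs before hand-over. */
	assert(s->pending > 0 && s->timer_armed);
	mock_record(s, EV_RUN_JOB);
}

/* The three steps, in the order argued for above. */
static void mock_submit(struct mock_sched *s)
{
	mock_add_pending(s);
	mock_start_timeout(s);
	mock_run_job(s);
}
```

Calling mock_submit() on a zero-initialized struct mock_sched records the three
events in the order pending, timer armed, run_job. Swapping steps 2 and 3, which
is roughly what the patch does, would trip the assert in this mock.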
-- 
Regards,
Luben