[PATCH v2] drm/scheduler: fix timeout worker setup for out of order job completions

Christian König ckoenig.leichtzumerken at gmail.com
Mon Aug 6 08:12:47 UTC 2018


On 03.08.2018 at 19:31, Lucas Stach wrote:
> On Monday, 06.08.2018 at 14:57 +0200, Christian König wrote:
>> On 03.08.2018 at 16:29, Lucas Stach wrote:
>>> drm_sched_job_finish() is a work item scheduled for each finished job on
>>> an unbound system workqueue. This means the workers can execute out of order
>>> with regard to the real hardware job completions.
>>>
>>> If this happens, queueing a timeout worker for the first job on the ring
>>> mirror list is wrong, as this may be a job which has already finished
>>> executing. Fix this by reorganizing the code to always queue the worker
>>> for the next job on the list, if that job hasn't finished yet. This is
>>> robust against a potential reordering of the finish workers.
>>>
>>> Also move out the timeout worker cancelling, so that we don't need to
>>> take the job list lock twice. As a small optimization, list_del is used
>>> to remove the job from the ring mirror list, as there is no need to
>>> reinit the list head in the job we are about to free.
>>>
>>> Signed-off-by: Lucas Stach <l.stach at pengutronix.de>
>>> ---
>>> v2: - properly handle last job in the ring
>>>       - check correct fence for completion
>>> ---
>>>    drivers/gpu/drm/scheduler/gpu_scheduler.c | 22 ++++++++++------------
>>>    1 file changed, 10 insertions(+), 12 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/scheduler/gpu_scheduler.c b/drivers/gpu/drm/scheduler/gpu_scheduler.c
>>> index 44d480768dfe..574875e2c206 100644
>>> --- a/drivers/gpu/drm/scheduler/gpu_scheduler.c
>>> +++ b/drivers/gpu/drm/scheduler/gpu_scheduler.c
>>> @@ -452,24 +452,22 @@ static void drm_sched_job_finish(struct work_struct *work)
>>>    						   finish_work);
>>>    	struct drm_gpu_scheduler *sched = s_job->sched;
>>>    
>>> -	/* remove job from ring_mirror_list */
>>> -	spin_lock(&sched->job_list_lock);
>>> -	list_del_init(&s_job->node);
>>> -	if (sched->timeout != MAX_SCHEDULE_TIMEOUT) {
>>> -		struct drm_sched_job *next;
>>> -
>>> -		spin_unlock(&sched->job_list_lock);
>>> +	if (sched->timeout != MAX_SCHEDULE_TIMEOUT)
>>>    		cancel_delayed_work_sync(&s_job->work_tdr);
>> That is unfortunately still racy here.
>>
>> Between canceling the job and removing it from the list, someone could
>> actually start the timer (in theory) :)
>>
>> Cancel it, remove it from the list and cancel it again.
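
In code, the ordering suggested above would be roughly this (a sketch of the
idea only, reusing the fields from the patch below; not a tested change):

	/* First cancel: stop a timer that may already be running for s_job. */
	cancel_delayed_work_sync(&s_job->work_tdr);

	spin_lock(&sched->job_list_lock);
	list_del(&s_job->node);
	spin_unlock(&sched->job_list_lock);

	/* Second cancel: catch a timer that a concurrent finish worker might
	 * have re-armed while s_job was still on the ring mirror list.
	 */
	cancel_delayed_work_sync(&s_job->work_tdr);
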
> I don't see how. If we end up in this worker the finished fence of the
> job is already certainly signaled (as this is what triggers queueing of
> the worker). So even if some other worker manages to find this job as
> the next job in the list, the dma_fence_is_signaled check should
> prevent the timeout worker from getting scheduled again.

Well, that makes sense, but it is a bit hard to understand.
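
Spelled out next to the code, the argument goes like this (this is just the v2
hunk from below with explanatory comments added, nothing new):

	/* We only get here once s_job's finished fence has signaled. */
	if (sched->timeout != MAX_SCHEDULE_TIMEOUT)
		cancel_delayed_work_sync(&s_job->work_tdr);

	spin_lock(&sched->job_list_lock);
	if (sched->timeout != MAX_SCHEDULE_TIMEOUT &&
	    !list_is_last(&s_job->node, &sched->ring_mirror_list)) {
		struct drm_sched_job *next = list_next_entry(s_job, node);

		/* If next's own finish worker is already queued, next's
		 * finished fence is signaled and we skip re-arming its
		 * timeout. The same check in the predecessor's worker is
		 * what keeps s_job's timer from being re-armed between the
		 * cancel above and the list_del() below.
		 */
		if (!dma_fence_is_signaled(&next->s_fence->finished))
			schedule_delayed_work(&next->work_tdr, sched->timeout);
	}
	list_del(&s_job->node);
	spin_unlock(&sched->job_list_lock);
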

Anyway, please remove the extra "if" check around cancel_delayed_work_sync(). With that done, the patch is 
Reviewed-by: Christian König <christian.koenig at amd.com>.
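
With that check gone, the top of the function would simply read (a sketch,
untested; the rest stays as in the v2 patch):

	/* Canceling is harmless even when no timeout is configured, as long
	 * as the delayed work was initialized, so no sched->timeout check is
	 * needed before the cancel.
	 */
	cancel_delayed_work_sync(&s_job->work_tdr);

	spin_lock(&sched->job_list_lock);
	/* ... queue TDR for the next job and unlink s_job, as in v2 ... */
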

Thanks,
Christian.


>
>> BTW: You could completely drop the "if (sched->timeout !=
>> MAX_SCHEDULE_TIMEOUT)" here because canceling is harmless as long as the
>> structure is initialized.
> Right.
>
> Regards,
> Lucas
>
>> Christian.
>>
>>> -		spin_lock(&sched->job_list_lock);
>>>    
>>> -		/* queue TDR for next job */
>>> -		next = list_first_entry_or_null(&sched->ring_mirror_list,
>>> -						struct drm_sched_job, node);
>>> +	spin_lock(&sched->job_list_lock);
>>> +	/* queue TDR for next job */
>>> +	if (sched->timeout != MAX_SCHEDULE_TIMEOUT &&
>>> +	    !list_is_last(&s_job->node, &sched->ring_mirror_list)) {
>>> +		struct drm_sched_job *next = list_next_entry(s_job, node);
>>>    
>>> -		if (next)
>>> +		if (!dma_fence_is_signaled(&next->s_fence->finished))
>>>    			schedule_delayed_work(&next->work_tdr, sched->timeout);
>>>    	}
>>> +	/* remove job from ring_mirror_list */
>>> +	list_del(&s_job->node);
>>>    	spin_unlock(&sched->job_list_lock);
>>> +
>>>    	dma_fence_put(&s_job->s_fence->finished);
>>>    	sched->ops->free_job(s_job);
>>>    }
>>


