[PATCH 2/2] drm/sched: serialize job_timeout and scheduler

Tue Aug 31 15:06:45 UTC 2021

On 2021-08-31 08:59, Daniel Vetter wrote:
> Can we please have some actual commit message here, with detailed
> explanation of the race/bug/whatever, how you fix it and why this is the
> best option?

I agree with Daniel--a narrative form of a commit message is so much easier
for humans to digest. The "[what]"/"[why]"/"[how]" and "issue"/"fix" format is
somewhat dry and uninformative, and leaves much to be desired.

Regards,
Luben

>
> On Tue, Aug 31, 2021 at 06:35:39PM +0800, Monk Liu wrote:
>> tested-by: jingwen chen <jingwen.chen at amd.com>
>> Signed-off-by: Monk Liu <Monk.Liu at amd.com>
>> Signed-off-by: jingwen chen <jingwen.chen at amd.com>
>> ---
>>  drivers/gpu/drm/scheduler/sched_main.c | 24 ++++--------------------
>>  1 file changed, 4 insertions(+), 20 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
>> index ecf8140..894fdb24 100644
>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>> @@ -319,19 +319,17 @@ static void drm_sched_job_timedout(struct work_struct *work)
>>  	sched = container_of(work, struct drm_gpu_scheduler, work_tdr.work);
>>  
>>  	/* Protects against concurrent deletion in drm_sched_get_cleanup_job */
>> +	if (!__kthread_should_park(sched->thread))
> This is a __ function, i.e. considered internal, and it's lockless atomic,
> i.e. unordered. And you're not explaining why this works.
>
> Iow it's probably buggy, and just unconditionally parking the kthread
> is probably the right thing to do. If it's not the right thing to do,
> there's a bug here for sure.
> -Daniel
>
>> +		kthread_park(sched->thread);
>> +
>>  	spin_lock(&sched->job_list_lock);
>>  	job = list_first_entry_or_null(&sched->pending_list,
>>  				       struct drm_sched_job, list);
>>  
>>  	if (job) {
>> -		/*
>> -		 * Remove the bad job so it cannot be freed by concurrent
>> -		 * drm_sched_cleanup_jobs. It will be reinserted back after sched->thread
>> -		 * is parked at which point it's safe.
>> -		 */
>> -		list_del_init(&job->list);
>>  		spin_unlock(&sched->job_list_lock);
>>  
>> +		/* vendor's timeout_job should call drm_sched_start() */
>>  		status = job->sched->ops->timedout_job(job);
>>  
>>  		/*
>> @@ -393,20 +391,6 @@ void drm_sched_stop(struct drm_gpu_scheduler *sched, struct drm_sched_job *bad)
>>  	kthread_park(sched->thread);
>>  
>>  	/*
>> -	 * Reinsert back the bad job here - now it's safe as
>> -	 * drm_sched_get_cleanup_job cannot race against us and release the
>> -	 * bad job at this point - we parked (waited for) any in progress
>> -	 * (earlier) cleanups and drm_sched_get_cleanup_job will not be called
>> -	 * now until the scheduler thread is unparked.
>> -	 */
>> -	if (bad && bad->sched == sched)
>> -		/*
>> -		 * Add at the head of the queue to reflect it was the earliest
>> -		 * job extracted.
>> -		 */
>> -		list_add(&bad->list, &sched->pending_list);
>> -
>> -	/*
>>  	 * Iterate the job list from later to  earlier one and either deactive
>>  	 * their HW callbacks or remove them from pending list if they already
>>  	 * signaled.
>> -- 
>> 2.7.4
>>
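
One possible reading of Daniel's concern about the lockless check before
kthread_park() is a check-then-act gap: two callers can both see "not
parked" before either of them parks. The sketch below is a userspace model
only, not kernel code. The names model_thread, model_should_park,
model_park and replay_check_then_park are hypothetical, and the -EBUSY
return for a repeated park request is an assumption of the model rather
than a statement about the kernel API. The interleaving is replayed by
hand in a single thread so the outcome is deterministic.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical userspace stand-in for a parkable thread; not kernel code. */
struct model_thread {
	bool should_park;
};

/* Unlocked peek at the flag, standing in for a lockless "should park" check. */
static bool model_should_park(const struct model_thread *t)
{
	return t->should_park;
}

/* Assumption of this model: a second park request is refused with -EBUSY. */
static int model_park(struct model_thread *t)
{
	if (t->should_park)
		return -EBUSY;
	t->should_park = true;
	return 0;
}

/*
 * Replay one bad interleaving by hand: both callers check the flag before
 * either one parks. Returns the result seen by the second caller.
 */
static int replay_check_then_park(void)
{
	struct model_thread t = { .should_park = false };
	bool a_sees_parked = model_should_park(&t);	/* caller A checks */
	bool b_sees_parked = model_should_park(&t);	/* caller B checks */
	int a_ret = a_sees_parked ? 0 : model_park(&t);	/* caller A acts */
	int b_ret = b_sees_parked ? 0 : model_park(&t);	/* caller B acts */

	/* The first caller wins the race and parks successfully. */
	assert(a_ret == 0);
	return b_ret;
}
```

In this model the guard does not stop the second caller from issuing a
redundant park request, because the check and the park are not one atomic
step. Whether the real code paths can interleave this way is exactly what
the commit message would need to explain.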