[PATCH 2/2] drm/sched: serialize job_timeout and scheduler
Luben Tuikov
luben.tuikov at amd.com
Tue Aug 31 15:06:45 UTC 2021
On 2021-08-31 08:59, Daniel Vetter wrote:
> Can we please have some actual commit message here, with detailed
> explanation of the race/bug/whatever, how you fix it and why this is the
> best option?
I agree with Daniel--a narrative form of a commit message is so much easier
for humans to digest. The "[what]"/"[why]"/"[how]" and "issue"/"fix" format is
somewhat dry and uninformative, and leaves much to be desired.
Regards,
Luben
>
> On Tue, Aug 31, 2021 at 06:35:39PM +0800, Monk Liu wrote:
>> tested-by: jingwen chen <jingwen.chen at amd.com>
>> Signed-off-by: Monk Liu <Monk.Liu at amd.com>
>> Signed-off-by: jingwen chen <jingwen.chen at amd.com>
>> ---
>> drivers/gpu/drm/scheduler/sched_main.c | 24 ++++--------------------
>> 1 file changed, 4 insertions(+), 20 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
>> index ecf8140..894fdb24 100644
>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>> @@ -319,19 +319,17 @@ static void drm_sched_job_timedout(struct work_struct *work)
>> sched = container_of(work, struct drm_gpu_scheduler, work_tdr.work);
>>
>> /* Protects against concurrent deletion in drm_sched_get_cleanup_job */
>> + if (!__kthread_should_park(sched->thread))
> This is a __ function, i.e. considered internal, and it's lockless atomic,
> i.e. unordered. And you're not explaining why this works.
>
> Iow it's probably buggy, and just unconditionally parking the kthread
> is probably the right thing to do. If it's not the right thing to do,
> there's a bug here for sure.
> -Daniel
>
>> + kthread_park(sched->thread);
>> +
>> spin_lock(&sched->job_list_lock);
>> job = list_first_entry_or_null(&sched->pending_list,
>> struct drm_sched_job, list);
>>
>> if (job) {
>> - /*
>> - * Remove the bad job so it cannot be freed by concurrent
>> - * drm_sched_cleanup_jobs. It will be reinserted back after sched->thread
>> - * is parked at which point it's safe.
>> - */
>> - list_del_init(&job->list);
>> spin_unlock(&sched->job_list_lock);
>>
>> + /* vendor's timeout_job should call drm_sched_start() */
>> status = job->sched->ops->timedout_job(job);
>>
>> /*
>> @@ -393,20 +391,6 @@ void drm_sched_stop(struct drm_gpu_scheduler *sched, struct drm_sched_job *bad)
>> kthread_park(sched->thread);
>>
>> /*
>> - * Reinsert back the bad job here - now it's safe as
>> - * drm_sched_get_cleanup_job cannot race against us and release the
>> - * bad job at this point - we parked (waited for) any in progress
>> - * (earlier) cleanups and drm_sched_get_cleanup_job will not be called
>> - * now until the scheduler thread is unparked.
>> - */
>> - if (bad && bad->sched == sched)
>> - /*
>> - * Add at the head of the queue to reflect it was the earliest
>> - * job extracted.
>> - */
>> - list_add(&bad->list, &sched->pending_list);
>> -
>> - /*
>> * Iterate the job list from later to earlier one and either deactive
>> * their HW callbacks or remove them from pending list if they already
>> * signaled.
>> --
>> 2.7.4
>>
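
Daniel's point above, that unconditionally parking the thread is safer than
first peeking at a lockless "should park" flag, can be illustrated in user
space. The following is only a minimal sketch under stated assumptions: it
uses pthreads as a stand-in and is not the kernel kthread API. The names
struct worker, worker_park(), worker_unpark() and timeout_handler() are
hypothetical, and the integer "pending" merely stands in for the scheduler's
pending list. The park request is made under a mutex and is idempotent, so
the handler has no need for a check-then-act step.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical user-space stand-in for a parkable worker thread. */
struct worker {
	pthread_mutex_t lock;
	pthread_cond_t cond;
	bool should_park;	/* park request, only touched under lock */
	bool parked;		/* acknowledgement set by the worker */
	bool stop;
	int pending;		/* stand-in for the pending job list */
	pthread_t thread;
};

static void *worker_fn(void *arg)
{
	struct worker *w = arg;

	pthread_mutex_lock(&w->lock);
	while (!w->stop) {
		if (w->should_park) {
			/* Acknowledge the request, then sleep until unparked. */
			w->parked = true;
			pthread_cond_broadcast(&w->cond);
			while (w->should_park && !w->stop)
				pthread_cond_wait(&w->cond, &w->lock);
			w->parked = false;
			continue;
		}
		if (w->pending > 0) {
			w->pending--;	/* "clean up" one finished job */
			continue;
		}
		pthread_cond_wait(&w->cond, &w->lock);
	}
	pthread_mutex_unlock(&w->lock);
	return NULL;
}

static int worker_start(struct worker *w)
{
	pthread_mutex_init(&w->lock, NULL);
	pthread_cond_init(&w->cond, NULL);
	w->should_park = w->parked = w->stop = false;
	w->pending = 0;
	return pthread_create(&w->thread, NULL, worker_fn, w);
}

/* Idempotent: returns immediately if the worker is already parked. */
static void worker_park(struct worker *w)
{
	pthread_mutex_lock(&w->lock);
	w->should_park = true;
	pthread_cond_broadcast(&w->cond);
	while (!w->parked)
		pthread_cond_wait(&w->cond, &w->lock);
	pthread_mutex_unlock(&w->lock);
}

static void worker_unpark(struct worker *w)
{
	pthread_mutex_lock(&w->lock);
	w->should_park = false;
	pthread_cond_broadcast(&w->cond);
	pthread_mutex_unlock(&w->lock);
}

static void worker_stop(struct worker *w)
{
	pthread_mutex_lock(&w->lock);
	w->stop = true;
	pthread_cond_broadcast(&w->cond);
	pthread_mutex_unlock(&w->lock);
	pthread_join(w->thread, NULL);
}

/* Park unconditionally, then read state the worker can no longer change. */
static int timeout_handler(struct worker *w)
{
	int seen;

	worker_park(w);
	pthread_mutex_lock(&w->lock);
	seen = w->pending;
	pthread_mutex_unlock(&w->lock);
	return seen;
}
```

Because worker_park() takes the mutex and waits for the acknowledgement,
calling it on an already parked worker is harmless. A caller would pair
timeout_handler() with worker_unpark() once recovery is done; on older
toolchains the sketch may need to be built with -pthread.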