[PATCH v3 1/5] drm/scheduler: rework job destruction

Tue Apr 16 09:47:37 UTC 2019

Am 15.04.19 um 23:17 schrieb Eric Anholt:
> Andrey Grodzovsky <andrey.grodzovsky at amd.com> writes:
>
>> From: Christian König <christian.koenig at amd.com>
>>
>> We now destroy finished jobs from the worker thread to make sure that
>> we never destroy a job currently in timeout processing.
>> By this we avoid holding lock around ring mirror list in drm_sched_stop
>> which should solve a deadlock reported by a user.
>>
>> v2: Remove unused variable.
>>
>> Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=109692
>>
>> Signed-off-by: Christian König <christian.koenig at amd.com>
>> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky at amd.com>
>> ---
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |  17 ++--
>>   drivers/gpu/drm/etnaviv/etnaviv_dump.c     |   4 -
>>   drivers/gpu/drm/etnaviv/etnaviv_sched.c    |   9 +-
>>   drivers/gpu/drm/scheduler/sched_main.c     | 138 +++++++++++++++++------------
>>   drivers/gpu/drm/v3d/v3d_sched.c            |   9 +-
> Missing corresponding panfrost and lima updates.  You should probably
> pull in drm-misc for hacking on the scheduler.
>
>> diff --git a/drivers/gpu/drm/v3d/v3d_sched.c b/drivers/gpu/drm/v3d/v3d_sched.c
>> index ce7c737b..8efb091 100644
>> --- a/drivers/gpu/drm/v3d/v3d_sched.c
>> +++ b/drivers/gpu/drm/v3d/v3d_sched.c
>> @@ -232,11 +232,18 @@ v3d_gpu_reset_for_timeout(struct v3d_dev *v3d, struct drm_sched_job *sched_job)
>>   
>>   	/* block scheduler */
>>   	for (q = 0; q < V3D_MAX_QUEUES; q++)
>> -		drm_sched_stop(&v3d->queue[q].sched);
>> +		drm_sched_stop(&v3d->queue[q].sched, sched_job);
>>   
>>   	if(sched_job)
>>   		drm_sched_increase_karma(sched_job);
>>   
>> +	/*
>> +	 * Guilty job did complete and hence needs to be manually removed
>> +	 * See drm_sched_stop doc.
>> +	 */
>> +	if (list_empty(&sched_job->node))
>> +		sched_job->sched->ops->free_job(sched_job);
> If the if (sched_job) is necessary up above, then this should clearly be
> under it.
>
> But, can we please have a core scheduler thing we call here instead of
> drivers all replicating it?

Yeah that's also something I noted before.

Essential problem is that we remove finished jobs from the mirror list 
and so need to destruct them because we otherwise leak them.

Alternative approach here would be to keep the jobs on the ring mirror 
list, but not submit them again.

Regards,
Christian.

>
>> +
>>   	/* get the GPU back into the init state */
>>   	v3d_reset(v3d);
>>