[PATCH 5/5] drm/amd/sched: signal and free remaining fences in amd_sched_entity_fini

Christian König ckoenig.leichtzumerken at gmail.com
Mon Oct 9 08:02:09 UTC 2017


> For the gpu reset patches (already submitted to the public list) I would make the kernel return -ENODEV if the fence being waited on (in the cs_wait or wait_fences IOCTL) is found to be in an error state; that way the UMD would take the robustness extension path and consider that a GPU hang occurred.
Well, that is closed-source-only behavior, which is completely irrelevant 
for upstream development.

As far as I know we haven't pushed the change to return -ENODEV upstream.
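
[Editor's note: what Monk describes would look roughly like the following in a wait IOCTL return path. This is a minimal sketch only; the change is not upstream, and the helper name and surrounding context are assumptions for illustration.]

	#include <linux/dma-fence.h>

	/* Hypothetical helper, not upstream: wait for a fence and map a
	 * fence error to -ENODEV so user space takes its robustness
	 * (device-lost) path. Name and placement are illustrative only.
	 */
	static long amdgpu_wait_fence_robust(struct dma_fence *fence,
					     long timeout)
	{
		long r = dma_fence_wait_timeout(fence, true, timeout);

		if (r > 0 && dma_fence_get_status(fence) < 0)
			return -ENODEV;	/* fence carries an error: report as hang */
		return r;
	}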

Regards,
Christian.

On 09.10.2017 at 08:42, Liu, Monk wrote:
> Christian
>
>> It would be really nice to have an error code set on s_fence->finished before it is signaled; use dma_fence_set_error() for this.
> For the gpu reset patches (already submitted to the public list) I would make the kernel return -ENODEV if the fence being waited on (in the cs_wait or wait_fences IOCTL) is found to be in an error state; that way the UMD would take the robustness extension path and consider that a GPU hang occurred.
>
> I don't know whether this is expected in the case of a normal process being killed or crashing, as Nicolai hit ... since no GPU hang occurs there.
>
>
> BR Monk
>
>
>
>
> -----Original Message-----
> From: amd-gfx [mailto:amd-gfx-bounces at lists.freedesktop.org] On Behalf Of Christian König
> Sent: September 28, 2017 23:01
> To: Nicolai Hähnle <nhaehnle at gmail.com>; amd-gfx at lists.freedesktop.org
> Cc: Haehnle, Nicolai <Nicolai.Haehnle at amd.com>
> Subject: Re: [PATCH 5/5] drm/amd/sched: signal and free remaining fences in amd_sched_entity_fini
>
> On 28.09.2017 at 16:55, Nicolai Hähnle wrote:
>> From: Nicolai Hähnle <nicolai.haehnle at amd.com>
>>
>> Highly concurrent Piglit runs can trigger a race condition where a
>> pending SDMA job on a buffer object is never executed because the
>> corresponding process is killed (perhaps due to a crash). Since the
>> job's fences were never signaled, the buffer object was effectively
>> leaked. Worse, the buffer was stuck wherever it happened to be at the time, possibly in VRAM.
>>
>> The symptom was user space processes stuck in interruptible waits with
>> kernel stacks like:
>>
>>       [<ffffffffbc5e6722>] dma_fence_default_wait+0x112/0x250
>>       [<ffffffffbc5e6399>] dma_fence_wait_timeout+0x39/0xf0
>>       [<ffffffffbc5e82d2>] reservation_object_wait_timeout_rcu+0x1c2/0x300
>>       [<ffffffffc03ce56f>] ttm_bo_cleanup_refs_and_unlock+0xff/0x1a0 [ttm]
>>       [<ffffffffc03cf1ea>] ttm_mem_evict_first+0xba/0x1a0 [ttm]
>>       [<ffffffffc03cf611>] ttm_bo_mem_space+0x341/0x4c0 [ttm]
>>       [<ffffffffc03cfc54>] ttm_bo_validate+0xd4/0x150 [ttm]
>>       [<ffffffffc03cffbd>] ttm_bo_init_reserved+0x2ed/0x420 [ttm]
>>       [<ffffffffc042f523>] amdgpu_bo_create_restricted+0x1f3/0x470 [amdgpu]
>>       [<ffffffffc042f9fa>] amdgpu_bo_create+0xda/0x220 [amdgpu]
>>       [<ffffffffc04349ea>] amdgpu_gem_object_create+0xaa/0x140 [amdgpu]
>>       [<ffffffffc0434f97>] amdgpu_gem_create_ioctl+0x97/0x120 [amdgpu]
>>       [<ffffffffc037ddba>] drm_ioctl+0x1fa/0x480 [drm]
>>       [<ffffffffc041904f>] amdgpu_drm_ioctl+0x4f/0x90 [amdgpu]
>>       [<ffffffffbc23db33>] do_vfs_ioctl+0xa3/0x5f0
>>       [<ffffffffbc23e0f9>] SyS_ioctl+0x79/0x90
>>       [<ffffffffbc864ffb>] entry_SYSCALL_64_fastpath+0x1e/0xad
>>       [<ffffffffffffffff>] 0xffffffffffffffff
>>
>> Signed-off-by: Nicolai Hähnle <nicolai.haehnle at amd.com>
>> Acked-by: Christian König <christian.koenig at amd.com>
>> ---
>>    drivers/gpu/drm/amd/scheduler/gpu_scheduler.c | 7 ++++++-
>>    1 file changed, 6 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c
>> b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c
>> index 54eb77cffd9b..32a99e980d78 100644
>> --- a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c
>> +++ b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c
>> @@ -220,22 +220,27 @@ void amd_sched_entity_fini(struct amd_gpu_scheduler *sched,
>>    					amd_sched_entity_is_idle(entity));
>>    	amd_sched_rq_remove_entity(rq, entity);
>>    	if (r) {
>>    		struct amd_sched_job *job;
>>    
>>    		/* Park the kernel for a moment to make sure it isn't processing
>>    		 * our entity.
>>    		 */
>>    		kthread_park(sched->thread);
>>    		kthread_unpark(sched->thread);
>> -		while (kfifo_out(&entity->job_queue, &job, sizeof(job)))
>> +		while (kfifo_out(&entity->job_queue, &job, sizeof(job))) {
>> +			struct amd_sched_fence *s_fence = job->s_fence;
>> +			amd_sched_fence_scheduled(s_fence);
> It would be really nice to have an error code set on s_fence->finished before it is signaled; use dma_fence_set_error() for this.
>
> In addition, it would be nice to note in the subject line that this is a rather important bug fix.
>
> With that fixed the whole series is Reviewed-by: Christian König <christian.koenig at amd.com>.
>
> Regards,
> Christian.
>
>> +			amd_sched_fence_finished(s_fence);
>> +			dma_fence_put(&s_fence->finished);
>>    			sched->ops->free_job(job);
>> +		}
>>    
>>    	}
>>    	kfifo_free(&entity->job_queue);
>>    }
>>    
>>    static void amd_sched_entity_wakeup(struct dma_fence *f, struct dma_fence_cb *cb)
>>    {
>>    	struct amd_sched_entity *entity =
>>    		container_of(cb, struct amd_sched_entity, cb);
>>    	entity->dependency = NULL;
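
[Editor's note: Christian's dma_fence_set_error() suggestion, applied to the cleanup loop in the patch above, would look roughly like this. A sketch only; the choice of -ECANCELED as the error code is an assumption, not something specified in the thread.]

	while (kfifo_out(&entity->job_queue, &job, sizeof(job))) {
		struct amd_sched_fence *s_fence = job->s_fence;

		amd_sched_fence_scheduled(s_fence);
		/* Record an error on the finished fence before signaling it,
		 * so waiters can tell the job was torn down rather than run.
		 * -ECANCELED is an assumed choice of error code.
		 */
		dma_fence_set_error(&s_fence->finished, -ECANCELED);
		amd_sched_fence_finished(s_fence);
		dma_fence_put(&s_fence->finished);
		sched->ops->free_job(job);
	}

The ordering matters: dma_fence_set_error() must be called before the fence is signaled, so the error is set ahead of the amd_sched_fence_finished() call.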
>
> _______________________________________________
> amd-gfx mailing list
> amd-gfx at lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/amd-gfx