[PATCH 5/5] drm/amd/sched: signal and free remaining fences in amd_sched_entity_fini

Thu Oct 12 08:05:25 UTC 2017

Am 11.10.2017 um 18:30 schrieb Michel Dänzer:
> On 28/09/17 04:55 PM, Nicolai Hähnle wrote:
>> From: Nicolai Hähnle <nicolai.haehnle at amd.com>
>>
>> Highly concurrent Piglit runs can trigger a race condition where a pending
>> SDMA job on a buffer object is never executed because the corresponding
>> process is killed (perhaps due to a crash). Since the job's fences were
>> never signaled, the buffer object was effectively leaked. Worse, the
>> buffer was stuck wherever it happened to be at the time, possibly in VRAM.
>>
>> The symptom was user space processes stuck in interruptible waits with
>> kernel stacks like:
>>
>>      [<ffffffffbc5e6722>] dma_fence_default_wait+0x112/0x250
>>      [<ffffffffbc5e6399>] dma_fence_wait_timeout+0x39/0xf0
>>      [<ffffffffbc5e82d2>] reservation_object_wait_timeout_rcu+0x1c2/0x300
>>      [<ffffffffc03ce56f>] ttm_bo_cleanup_refs_and_unlock+0xff/0x1a0 [ttm]
>>      [<ffffffffc03cf1ea>] ttm_mem_evict_first+0xba/0x1a0 [ttm]
>>      [<ffffffffc03cf611>] ttm_bo_mem_space+0x341/0x4c0 [ttm]
>>      [<ffffffffc03cfc54>] ttm_bo_validate+0xd4/0x150 [ttm]
>>      [<ffffffffc03cffbd>] ttm_bo_init_reserved+0x2ed/0x420 [ttm]
>>      [<ffffffffc042f523>] amdgpu_bo_create_restricted+0x1f3/0x470 [amdgpu]
>>      [<ffffffffc042f9fa>] amdgpu_bo_create+0xda/0x220 [amdgpu]
>>      [<ffffffffc04349ea>] amdgpu_gem_object_create+0xaa/0x140 [amdgpu]
>>      [<ffffffffc0434f97>] amdgpu_gem_create_ioctl+0x97/0x120 [amdgpu]
>>      [<ffffffffc037ddba>] drm_ioctl+0x1fa/0x480 [drm]
>>      [<ffffffffc041904f>] amdgpu_drm_ioctl+0x4f/0x90 [amdgpu]
>>      [<ffffffffbc23db33>] do_vfs_ioctl+0xa3/0x5f0
>>      [<ffffffffbc23e0f9>] SyS_ioctl+0x79/0x90
>>      [<ffffffffbc864ffb>] entry_SYSCALL_64_fastpath+0x1e/0xad
>>      [<ffffffffffffffff>] 0xffffffffffffffff
>>
>> Signed-off-by: Nicolai Hähnle <nicolai.haehnle at amd.com>
>> Acked-by: Christian König <christian.koenig at amd.com>
> Since Christian's commit which introduced the problem (6af0883ed977
> "drm/amdgpu: discard commands of killed processes") is in 4.14, we need
> a solution for that. Should we backport Nicolai's five commits fixing
> the problem, or revert 6af0883ed977?
>
>
> While looking into this, I noticed that the following commits by
> Christian in 4.14 each also cause hangs for me when running the piglit
> gpu profile on Tonga:
>
> 457e0fee04b0 "drm/amdgpu: remove the GART copy hack"
> 1d00402b4da2 "drm/amdgpu: fix amdgpu_ttm_bind"
>
> Are there fixes for these that can be backported to 4.14, or do they
> need to be reverted there?
Well I'm not aware that any of those two can cause problems.

For "drm/amdgpu: remove the GART copy hack" I also don't have the 
slightest idea how that could be an issue. It just removes an unused 
code path.

Is amd-staging-drm-next stable for you?

Thanks,
Christian.