[PATCH 5/5] drm/amd/sched: signal and free remaining fences in amd_sched_entity_fini

Wed Oct 11 16:30:15 UTC 2017

On 28/09/17 04:55 PM, Nicolai Hähnle wrote:
> From: Nicolai Hähnle <nicolai.haehnle at amd.com>
> 
> Highly concurrent Piglit runs can trigger a race condition where a pending
> SDMA job on a buffer object is never executed because the corresponding
> process is killed (perhaps due to a crash). Since the job's fences were
> never signaled, the buffer object was effectively leaked. Worse, the
> buffer was stuck wherever it happened to be at the time, possibly in VRAM.
> 
> The symptom was user space processes stuck in interruptible waits with
> kernel stacks like:
> 
>     [<ffffffffbc5e6722>] dma_fence_default_wait+0x112/0x250
>     [<ffffffffbc5e6399>] dma_fence_wait_timeout+0x39/0xf0
>     [<ffffffffbc5e82d2>] reservation_object_wait_timeout_rcu+0x1c2/0x300
>     [<ffffffffc03ce56f>] ttm_bo_cleanup_refs_and_unlock+0xff/0x1a0 [ttm]
>     [<ffffffffc03cf1ea>] ttm_mem_evict_first+0xba/0x1a0 [ttm]
>     [<ffffffffc03cf611>] ttm_bo_mem_space+0x341/0x4c0 [ttm]
>     [<ffffffffc03cfc54>] ttm_bo_validate+0xd4/0x150 [ttm]
>     [<ffffffffc03cffbd>] ttm_bo_init_reserved+0x2ed/0x420 [ttm]
>     [<ffffffffc042f523>] amdgpu_bo_create_restricted+0x1f3/0x470 [amdgpu]
>     [<ffffffffc042f9fa>] amdgpu_bo_create+0xda/0x220 [amdgpu]
>     [<ffffffffc04349ea>] amdgpu_gem_object_create+0xaa/0x140 [amdgpu]
>     [<ffffffffc0434f97>] amdgpu_gem_create_ioctl+0x97/0x120 [amdgpu]
>     [<ffffffffc037ddba>] drm_ioctl+0x1fa/0x480 [drm]
>     [<ffffffffc041904f>] amdgpu_drm_ioctl+0x4f/0x90 [amdgpu]
>     [<ffffffffbc23db33>] do_vfs_ioctl+0xa3/0x5f0
>     [<ffffffffbc23e0f9>] SyS_ioctl+0x79/0x90
>     [<ffffffffbc864ffb>] entry_SYSCALL_64_fastpath+0x1e/0xad
>     [<ffffffffffffffff>] 0xffffffffffffffff
> 
> Signed-off-by: Nicolai Hähnle <nicolai.haehnle at amd.com>
> Acked-by: Christian König <christian.koenig at amd.com>
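
For readers following the backport question, the failure mode described in the
commit message can be modeled in user space. This is only a rough sketch under
stated assumptions, not the kernel code: Fence, Entity, fini_leaky and
fini_signaling are hypothetical names, and the error code is chosen purely for
illustration. It shows why tearing down an entity without signaling the fences
of its queued jobs leaves waiters stuck, and how signaling them with an error
at teardown releases them.

```python
import errno
from collections import deque


class Fence:
    """Toy stand-in for a fence: starts unsignaled and can be signaled once."""

    def __init__(self):
        self.signaled = False
        self.error = 0

    def signal(self, error=0):
        # Only the first signal takes effect, like a one-shot completion.
        if not self.signaled:
            self.error = error
            self.signaled = True


class Entity:
    """Toy stand-in for a scheduler entity holding queued, unexecuted jobs."""

    def __init__(self):
        self.queue = deque()

    def push_job(self):
        # Each queued job is represented only by its fence in this model.
        fence = Fence()
        self.queue.append(fence)
        return fence

    def fini_leaky(self):
        # Models the bug: queued jobs are dropped, their fences never signal,
        # so anything waiting on them waits forever.
        self.queue.clear()

    def fini_signaling(self, error=-errno.ECANCELED):
        # Models the fix: every remaining fence is signaled with an error
        # (illustrative value) before the job is dropped.
        while self.queue:
            self.queue.popleft().signal(error)


def demo():
    leaky, fixed = Entity(), Entity()
    stuck = leaky.push_job()
    released = fixed.push_job()
    leaky.fini_leaky()
    fixed.fini_signaling()
    return stuck.signaled, released.signaled, released.error
```

In this model, a waiter on the first fence would block indefinitely, which
corresponds to the stuck dma_fence_default_wait frames in the trace above; the
second fence completes with an error that a waiter can observe and handle.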

Since Christian's commit that introduced the problem (6af0883ed977
"drm/amdgpu: discard commands of killed processes") is in 4.14, we need
a fix for 4.14 as well. Should we backport Nicolai's five commits that
fix the problem, or revert 6af0883ed977 there?

While looking into this, I noticed that the following commits by
Christian in 4.14 each also cause hangs for me when running the piglit
gpu profile on Tonga:

457e0fee04b0 "drm/amdgpu: remove the GART copy hack"
1d00402b4da2 "drm/amdgpu: fix amdgpu_ttm_bind"

Are there fixes for these that can be backported to 4.14, or do they
need to be reverted there?

-- 
Earthling Michel Dänzer               |               http://www.amd.com
Libre software enthusiast             |             Mesa and X developer