[PATCH 1/2] Revert "drm/sched: fix timeout handling v2"

Thu Nov 8 16:48:48 UTC 2018

Am 08.11.18 um 17:19 schrieb Eric Anholt:
> "Koenig, Christian" <Christian.Koenig at amd.com> writes:
>
>> Am 08.11.18 um 17:04 schrieb Eric Anholt:
>>> This reverts commit 0efd2d2f68cd5dbddf4ecd974c33133257d16a8e.  Fixes
>>> this failure in V3D GPU reset:
>>>
>>> [ 1418.227796] Unable to handle kernel NULL pointer dereference at virtual address 00000018
>>> [ 1418.235947] pgd = dc4c55ca
>>> [ 1418.238695] [00000018] *pgd=80000040004003, *pmd=00000000
>>> [ 1418.244132] Internal error: Oops: 206 [#1] SMP ARM
>>> [ 1418.248934] Modules linked in:
>>> [ 1418.252001] CPU: 0 PID: 10253 Comm: kworker/0:0 Not tainted 4.19.0-rc6+ #486
>>> [ 1418.259058] Hardware name: Broadcom STB (Flattened Device Tree)
>>> [ 1418.265002] Workqueue: events drm_sched_job_timedout
>>> [ 1418.269986] PC is at dma_fence_remove_callback+0x8/0x50
>>> [ 1418.275218] LR is at drm_sched_job_timedout+0x4c/0x118
>>> ...
>>> [ 1418.415891] [<c086b754>] (dma_fence_remove_callback) from [<c06e7e6c>] (drm_sched_job_timedout+0x4c/0x118)
>>> [ 1418.425571] [<c06e7e6c>] (drm_sched_job_timedout) from [<c0242500>] (process_one_work+0x2c8/0x7bc)
>>> [ 1418.434552] [<c0242500>] (process_one_work) from [<c0242a38>] (worker_thread+0x44/0x590)
>>> [ 1418.442663] [<c0242a38>] (worker_thread) from [<c0249b10>] (kthread+0x160/0x168)
>>> [ 1418.450076] [<c0249b10>] (kthread) from [<c02010ac>] (ret_from_fork+0x14/0x28)
>>>
>>> Cc: Christian König <christian.koenig at amd.com>
>>> Cc: Nayan Deshmukh <nayan26deshmukh at gmail.com>
>>> Cc: Alex Deucher <alexander.deucher at amd.com>
>>> Signed-off-by: Eric Anholt <eric at anholt.net>
>> Well NAK. The problem here is that fence->parent is NULL which is most
>> likely caused by an issue somewhere else.
>>
>> We could easily work around that with an extra NULL check, but reverting
>> the patch would break GPU recovery again.
> My GPU recovery works with the revert and reliably doesn't work without
> it, so my idea of "break GPU recovery" is the opposite of yours.  Can
> you help figure out what in this change broke my driver?

The problem is here:

> -	list_for_each_entry_reverse(job, &sched->ring_mirror_list, node) {
> -		struct drm_sched_fence *fence = job->s_fence;
> -
> -		if (!dma_fence_remove_callback(fence->parent, &fence->cb))
> -			goto already_signaled;

dma_fence_remove_callback() will fault if fence->parent is NULL. A 
simple "if (!fence->parent) continue;" should be enough to work around that.

But I'm not sure how exactly fence->parent became NULL in the first place.

Going to double check the code once more,
Christian.