[PATCH] drm/amd/display: Fix deadlock with display during hanged ring recovery.
Kazlauskas, Nicholas
Nicholas.Kazlauskas at amd.com
Wed Feb 13 19:16:51 UTC 2019
On 2/13/19 2:10 PM, Grodzovsky, Andrey wrote:
>
> On 2/13/19 2:00 PM, Kazlauskas, Nicholas wrote:
>> On 2/13/19 1:58 PM, Andrey Grodzovsky wrote:
>>> When a ring hang happens, amdgpu_dm_commit_planes holds the BO reserved
>>> during a flip and is then stuck waiting for fences to signal in
>>> reservation_object_wait_timeout_rcu (which won't signal because there
>>> was a hang). Then, when we try to shut down the display block during reset
>>> recovery from drm_atomic_helper_suspend, we also try to reserve the BO
>>> from dm_plane_helper_cleanup_fb, ending in a deadlock.
>>> Also remove useless WARN_ON
>>>
>>> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky at amd.com>
>>> ---
>>> drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c | 19 +++++++++++++------
>>> 1 file changed, 13 insertions(+), 6 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
>>> index acc4ff8..f8dec36 100644
>>> --- a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
>>> +++ b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
>>> @@ -4802,14 +4802,21 @@ static void amdgpu_dm_commit_planes(struct drm_atomic_state *state,
>>> */
>>> abo = gem_to_amdgpu_bo(fb->obj[0]);
>>> r = amdgpu_bo_reserve(abo, true);
>>> - if (unlikely(r != 0)) {
>>> + if (unlikely(r != 0))
>>> DRM_ERROR("failed to reserve buffer before flip\n");
>>> - WARN_ON(1);
>>> - }
>>>
>>> - /* Wait for all fences on this FB */
>>> - WARN_ON(reservation_object_wait_timeout_rcu(abo->tbo.resv, true, false,
>>> - MAX_SCHEDULE_TIMEOUT) < 0);
>>> + /*
>>> + * Wait for all fences on this FB. Do limited wait to avoid
>>> + * deadlock during GPU reset when this fence will not signal
>>> + * but we hold reservation lock for the BO.
>>> + */
>>> + r = reservation_object_wait_timeout_rcu(abo->tbo.resv,
>>> + true, false,
>>> + msecs_to_jiffies(5000));
>>> + if (unlikely(r == 0))
>>> + DRM_ERROR("Waiting for fences timed out.");
>>> +
>>> +
>>>
>>> amdgpu_bo_get_tiling_flags(abo, &tiling_flags);
>>>
>>>
>> Is it safe that we're just continuing like this? It's probably better to
>> just unreserve the buffer and continue to the next plane, no?
>>
>> Nicholas Kazlauskas
>
> As far as I can see it should be safe, as you are simply flipping to a buffer
> for which rendering hasn't finished (or is actually stuck in this case), so
> you might see visual corruption, but that's the least of your problems: if
> the BO is still not finalized for presentation after 5s, the system is
> probably already in very bad shape. Also, if we do want to do
> error handling, we should also take care of the amdgpu_bo_reserve failure
> just before that.
>
> Andrey
>
>
Yeah, I guess this whole block needs to be cleaned up in that case.
This is a good first step at least. Technically
reservation_object_wait_timeout_rcu will also return an error code (< 0)
when it's been interrupted, but I guess that will just be silently
ignored here.
If you want you can change the condition to:
if (unlikely(r <= 0))
	DRM_ERROR("Waiting for FB fence failed: id=%d res=%d\n",
		  plane->id, r);
But with or without that change this patch is:
Reviewed-by: Nicholas Kazlauskas <nicholas.kazlauskas at amd.com>
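
For reference, the return-value cases discussed above (a positive remaining timeout when the fences signaled, 0 on timeout, a negative error code when interrupted) can be sketched in plain user-space C. This is only an illustration, not kernel code: the enum and both helper names are hypothetical, and -EINTR stands in for the kernel-internal -ERESTARTSYS, which user-space headers don't define.

```c
#include <errno.h>

/* Hypothetical names for illustration; not part of amdgpu or the DRM core. */
enum wait_outcome {
	WAIT_SIGNALED,	/* r > 0: fences signaled, r is the remaining timeout */
	WAIT_TIMED_OUT,	/* r == 0: timeout elapsed with fences still pending */
	WAIT_FAILED	/* r < 0: negative error code (interrupted wait) */
};

/* Map the result of a timed fence wait onto the three cases. */
static enum wait_outcome classify_wait_result(long r)
{
	if (r > 0)
		return WAIT_SIGNALED;
	if (r == 0)
		return WAIT_TIMED_OUT;
	return WAIT_FAILED;
}

/*
 * A single "r <= 0" check covers both the timeout and the error case,
 * whereas "r == 0" alone would let an interrupted wait pass silently.
 */
static int wait_needs_error_log(long r)
{
	return r <= 0;
}
```

With this split, a caller that later wants real error handling can tell a timeout apart from an interrupted wait instead of logging both the same way.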