[PATCH] drm/amd/display: Fix deadlock with display during hanged ring recovery.

Kazlauskas, Nicholas Nicholas.Kazlauskas at amd.com
Wed Feb 13 19:16:51 UTC 2019


On 2/13/19 2:10 PM, Grodzovsky, Andrey wrote:
> 
> On 2/13/19 2:00 PM, Kazlauskas, Nicholas wrote:
>> On 2/13/19 1:58 PM, Andrey Grodzovsky wrote:
>>> When ring hang happens amdgpu_dm_commit_planes during flip is holding
>>> the BO reserved and then stack waiting for fences to signal in
>>> reservation_object_wait_timeout_rcu (which won't signal because there
>>> was a hnag). Then when we try to shutdown display block during reset
>>> recovery from drm_atomic_helper_suspend we also try to reserve the BO
>>> from dm_plane_helper_cleanup_fb ending in deadlock.
>>> Also remove useless WARN_ON
>>>
>>> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky at amd.com>
>>> ---
>>>     drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c | 19 +++++++++++++------
>>>     1 file changed, 13 insertions(+), 6 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
>>> index acc4ff8..f8dec36 100644
>>> --- a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
>>> +++ b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
>>> @@ -4802,14 +4802,21 @@ static void amdgpu_dm_commit_planes(struct drm_atomic_state *state,
>>>     			 */
>>>     			abo = gem_to_amdgpu_bo(fb->obj[0]);
>>>     			r = amdgpu_bo_reserve(abo, true);
>>> -			if (unlikely(r != 0)) {
>>> +			if (unlikely(r != 0))
>>>     				DRM_ERROR("failed to reserve buffer before flip\n");
>>> -				WARN_ON(1);
>>> -			}
>>>     
>>> -			/* Wait for all fences on this FB */
>>> -			WARN_ON(reservation_object_wait_timeout_rcu(abo->tbo.resv, true, false,
>>> -										    MAX_SCHEDULE_TIMEOUT) < 0);
>>> +			/*
>>> +			 * Wait for all fences on this FB. Do limited wait to avoid
>>> +			 * deadlock during GPU reset when this fence will not signal
>>> +			 * but we hold reservation lock for the BO.
>>> +			 */
>>> +			r = reservation_object_wait_timeout_rcu(abo->tbo.resv,
>>> +								true, false,
>>> +								msecs_to_jiffies(5000));
>>> +			if (unlikely(r == 0))
>>> +				DRM_ERROR("Waiting for fences timed out.");
>>> +
>>> +
>>>     
>>>     			amdgpu_bo_get_tiling_flags(abo, &tiling_flags);
>>>     
>>>
>> Is it safe that we're just continuing like this? It's probably better to
>> just unreserve the buffer and continue to the next plane, no?
>>
>> Nicholas Kazlauskas
> 
> As far as I see it should be safe as you are simply flipping to a buffer
> for which rendering hasn't finished (or stack actually in this case) so
> you might see visual corruption but that the least of your problems if
> after 5s the BO still not finalized for presentation, the system is
> already probably in very bad shape. Also, in case we do want to  do
> error handling we should also take care of  amdgpu_bo_reserve failure
> just before that.
> 
> Andrey
> 
> 

Yeah, I guess this whole blocks needs to be cleaned up in that case. 
This is a good first step at least. Technically 
reservation_object_wait_timeout_rcu will return < 0 when it's been 
interrupted too as an error code but I guess that will just be silently 
ignored here.

If you want you can change the condition to:

if (unlikely(r >= 0))
      DRM_ERROR("Waiting for FB fence failed: id=%d res=%d\n", 
plane->id, r);

But with or without that change this patch is:

Reviewed-by: Nicholas Kazlauskas <nicholas.kazlauskas at amd.com>


More information about the amd-gfx mailing list