[BUG] gpu: drm: radeon: two possible deadlocks involving locking and waiting
Jia-Ju Bai
baijiaju1990 at gmail.com
Sat Feb 5 01:55:36 UTC 2022
Hi Christian,
Thanks for the reply :)
On 2022/2/1 15:56, Christian König wrote:
> Hi Jia-Ju,
>
> interesting that you have found those issues with an automated tool.
>
> And yes that is a well design flaw within the radeon driver which can
> happen on hardware faults, e.g. when radeon_ring_backup() needs to be
> called.
In fact, my tool finds dozens of similar possible deadlocks caused by
wait_event_timeout() in radeon_fence_wait_seq_timeout().
There are three other examples in Linux 5.16:
#BUG 1
radeon_dpm_change_power_state_locked()
mutex_lock(&rdev->ring_lock); --> Line 1133 (Lock A)
radeon_fence_wait_empty()
radeon_fence_wait_seq_timeout()
wait_event_timeout(rdev->fence_queue, ...) --> Line 504 (Wait X)
radeon_fence_driver_fini()
mutex_lock(&rdev->ring_lock); --> Line 917 (Lock A)
wake_up_all(&rdev->fence_queue); --> Line 927 (Wake X)
#BUG 2
radeon_set_pm_profile()
mutex_lock(&rdev->pm.mutex); --> Line 382 (Lock A)
radeon_pm_set_clocks()
radeon_fence_wait_empty()
radeon_fence_wait_seq_timeout()
wait_event_timeout(rdev->fence_queue, ...) --> Line 504 (Wait X)
radeon_dynpm_idle_work_handler()
mutex_lock(&rdev->pm.mutex); --> Line 1861 (Lock A)
radeon_fence_count_emitted()
radeon_fence_process()
wake_up_all(&rdev->fence_queue); --> Line 323 (Wake X)
#BUG 3
radeon_pm_fini_old()
mutex_lock(&rdev->pm.mutex); --> Line 1642 (Lock A)
radeon_pm_set_clocks()
radeon_fence_wait_empty()
radeon_fence_wait_seq_timeout()
wait_event_timeout(rdev->fence_queue, ...) --> Line 504 (Wait X)
radeon_dynpm_idle_work_handler()
mutex_lock(&rdev->pm.mutex); --> Line 1861 (Lock A)
radeon_fence_count_emitted()
radeon_fence_process()
wake_up_all(&rdev->fence_queue); --> Line 323 (Wake X)
Thus, to fix these possible deadlocks, we could moditify the code
related to radeon_fence_wait_seq_timeout().
But I am not quite familar with the radeon driver, so I am not sure how
to moditify the code properly.
>
> But that happens so rarely and the driver is not developed further
> that we decided to not address this any more.
Ah, okay.
>
> Regards,
> Christian.
>
> Am 01.02.22 um 08:40 schrieb Jia-Ju Bai:
>> Hello,
>>
>> My static analysis tool reports a possible deadlock in the radeon
>> driver in Linux 5.16:
>>
>> #BUG 1
>> radeon_dpm_change_power_state_locked()
>> mutex_lock(&rdev->ring_lock); --> Line 1133 (Lock A)
>> radeon_fence_wait_empty()
>> radeon_fence_wait_seq_timeout()
>> wait_event_timeout(rdev->fence_queue, ...) --> Line 504 (Wait X)
>>
>> radeon_ring_backup()
>> mutex_lock(&rdev->ring_lock); --> Line 289(Lock A)
>> radeon_fence_count_emitted()
>> radeon_fence_process()
>> wake_up_all(&rdev->fence_queue); --> Line 323 (Wake X)
>>
>> When radeon_dpm_change_power_state_locked() is executed, "Wait X" is
>> performed by holding "Lock A". If radeon_ring_backup() is executed at
>> this time, "Wake X" cannot be performed to wake up "Wait X" in
>> radeon_dpm_change_power_state_locked(), because "Lock A" has been
>> already hold by radeon_dpm_change_power_state_locked(), causing a
>> possible deadlock.
>> I find that "Wait X" is performed with a timeout
>> MAX_SCHEDULE_TIMEOUT, to relieve the possible deadlock; but I think
>> this timeout can cause inefficient execution.
>>
>> #BUG 2
>> radeon_ring_lock()
>> mutex_lock(&rdev->ring_lock); --> Line 147 (Lock A)
>> radeon_ring_alloc()
>> radeon_fence_wait_next()
>> radeon_fence_wait_seq_timeout()
>> wait_event_timeout(rdev->fence_queue, ...) --> Line 504 (Wait X)
>>
>> radeon_ring_backup()
>> mutex_lock(&rdev->ring_lock); --> Line 289(Lock A)
>> radeon_fence_count_emitted()
>> radeon_fence_process()
>> wake_up_all(&rdev->fence_queue); --> Line 323 (Wake X)
>>
>> When radeon_ring_lock() is executed, "Wait X" is performed by holding
>> "Lock A". If radeon_ring_backup() is executed at this time, "Wake X"
>> cannot be performed to wake up "Wait X" in radeon_ring_lock(),
>> because "Lock A" has been already hold by radeon_ring_lock(), causing
>> a possible deadlock.
>> I find that "Wait X" is performed with a timeout
>> MAX_SCHEDULE_TIMEOUT, to relieve the possible deadlock; but I think
>> this timeout can cause inefficient execution.
>>
>> I am not quite sure whether these possible problems are real and how
>> to fix them if they are real.
>> Any feedback would be appreciated, thanks :)
>>
>>
>> Best wishes,
>> Jia-Ju Bai
>>
>
More information about the amd-gfx
mailing list