AMDGPU error: "[drm:amdgpu_dm_atomic_commit_tail [amdgpu]] *ERROR* Waiting for fences timed out!"

Michel Dänzer michel at daenzer.net
Thu Jun 24 10:51:15 UTC 2021


On 2021-06-04 3:08 p.m., Michel Dänzer wrote:
> On 2021-06-04 2:33 p.m., Alex Deucher wrote:
>> On Fri, Jun 4, 2021 at 3:47 AM Michel Dänzer <michel at daenzer.net> wrote:
>>>
>>> On 2021-05-19 3:57 p.m., Alex Deucher wrote:
>>>> On Wed, May 19, 2021 at 4:48 AM Michel Dänzer <michel at daenzer.net> wrote:
>>>>>
>>>>> On 2021-05-19 12:05 a.m., Alex Deucher wrote:
>>>>>> On Tue, May 18, 2021 at 10:11 AM Michel Dänzer <michel at daenzer.net> wrote:
>>>>>>>
>>>>>>> On 2021-05-17 11:33 a.m., xgqt wrote:
>>>>>>>> Hello!
>>>>>>>>
>>>>>>>> I run an AMD laptop "81NC Lenovo IdeaPad S340-15API" - AMD Ryzen 5 3500U with Radeon Vega 8 Graphics.
>>>>>>>> Recently some breakages started happening for me. About 1h after boot-up, while I am using a KDE desktop, the machine's GUI freezes. Sometimes it is still possible to move the mouse, but everything else is frozen. The screen may start blinking or go black.
>>>>>>>>
>>>>>>>> I'm not sure if this is my kernel, firmware or the hardware.
>>>>>>>> I don't understand dmesg, so I'm guessing, but I think it is the firmware, since this behavior started around 2021-05-15.
>>>>>>>> From my Portage logs I see that I updated my firmware on 2021-05-14 at 18:16:06.
>>>>>>>> So breakages started with my kernel: 5.10.27 and FW: 20210511.
>>>>>>>> After the breakage I jumped to an older kernel, 5.4.97, and compiled 5.12.4. I didn't notice a breakage on 5.4.97, but the system only ran for ~40 minutes.
>>>>>>>> So I booted into the newly compiled 5.12.4, where it broke after ~1h.
>>>>>>>> After that I booted to 5.4.97 again and downgraded my FW.
>>>>>>>> While I'm writing this I'm booted to kernel: 5.12.4 with FW: 20210315.
>>>>>>>>
>>>>>>>> I also described my situation on the Gentoo bugzilla: https://bugs.gentoo.org/790566
>>>>>>>>
>>>>>>>> "dmesg.log", attached here, is from a time when the machine runs fine (the current boot); "errors_sat_may_15_072825_pm_cest_2021.log" is a dmesg log from the time the system broke.
>>>>>>>>
>>>>>>>> Can I get any help with this? What are the next steps I should take? Any other files I should provide?
>>>>>>>
>>>>>>> I've hit similar hangs with a Lenovo ThinkPad E595 (Ryzen 7 3700U / Picasso / RAVEN 0x1002:0x15D8 0x17AA:0x5124 0xC1). I also suspect they are firmware related. The hangs occurred with firmware from the AMD 20.50 release. I'm currently running with firmware from the 20.40 release, with no hang in almost 2 weeks (the hangs happened within 1-2 days after boot).
>>>>>>
>>>>>> Can you narrow down which firmware(s) cause the problem?
>>>>>
>>>>> I'll try, but note I'm not really sure yet my hangs were related to firmware (only). Anyway, I'll try narrowing it down.
>>>>
>>>> Thanks.  Does this patch help?
>>>> https://patchwork.freedesktop.org/patch/433701/
>>>
>>> Unfortunately not. After no hangs for two weeks with older firmware, I just got a hang again within a day with newer firmware and a kernel with this fix.
>>>
>>>
>>> I'll try and narrow down which firmware triggers it now. Does Picasso use the picasso_*.bin ones only, or others as well?
>>
>> The picasso ones and raven_dmcu.bin.
> 
> Thanks. raven_dmcu.bin hasn't changed, so I'm trying to bisect the 8 Picasso ones which have changed:
> 
> picasso_asd.bin
> picasso_ce.bin
> picasso_me.bin
> picasso_mec2.bin
> picasso_mec.bin
> picasso_pfp.bin
> picasso_sdma.bin
> picasso_vcn.bin

Things are pointing to picasso_sdma.bin. I'm currently running with only that one reverted to linux-firmware 20210315, and haven't got any hangs for a week.
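
For reference, the bisection over the eight changed files can be sketched roughly as below. This is only an illustration, not the exact procedure used: it assumes a single file is responsible, and `hangs_with_new` is a hypothetical stand-in for the manual step of installing a mix of old and new files and running the machine long enough to see whether it hangs.

```python
from typing import Callable, List, Set

# The eight Picasso firmware files that changed (taken from the list above).
CHANGED = [
    "picasso_asd.bin", "picasso_ce.bin", "picasso_me.bin", "picasso_mec2.bin",
    "picasso_mec.bin", "picasso_pfp.bin", "picasso_sdma.bin", "picasso_vcn.bin",
]


def bisect_firmware(files: List[str],
                    hangs_with_new: Callable[[Set[str]], bool]) -> str:
    """Return the one file whose new version triggers the hang.

    hangs_with_new(subset) is a hypothetical manual test: use the new
    version of every file in `subset`, the old version of all others,
    and report whether a hang was observed.
    Assumption: exactly one file is responsible.
    """
    candidates = list(files)
    while len(candidates) > 1:
        half = candidates[: len(candidates) // 2]
        if hangs_with_new(set(half)):
            candidates = half                    # culprit is in the tested half
        else:
            candidates = candidates[len(half):]  # culprit is in the other half
    return candidates[0]


# Simulated run only: pretend picasso_sdma.bin is the culprit.
print(bisect_firmware(CHANGED, lambda subset: "picasso_sdma.bin" in subset))
```

With eight files this needs three test rounds. Because the hang is intermittent, a "no hang" result in any round is weaker evidence than a "hang" result, so each clean round has to run for a long time before it is trusted.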

Note that I've previously gone for a week without a hang even with firmware which had hung before. So there's still a small chance that I'm just on another lucky run.

That said, Pierre-Eric has also homed in on raven_sdma.bin for similar hangs, and reverting to older firmware seems to have helped multiple people on bug reports.

So, I think it makes sense for you guys to start looking for what could be going wrong with the Picasso/Raven SDMA firmware from 20.50. One thing I noticed is that the SDMA firmware from 20.50 advertises the same feature version, but a *lower* firmware version than the one from 18.50. So it might be worth double-checking that there wasn't an accidental downgrade to some older version.
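
A rough sketch of how such a check could be scripted, in case it helps anyone comparing their own setups. It assumes the versions are read from the amdgpu debugfs file amdgpu_firmware_info (usually under /sys/kernel/debug/dri/0/, root required), and that each entry looks like "SDMA0 feature version: N, firmware version: 0x...". The sample dumps and version numbers below are made up for illustration and are not the values from the 18.50 or 20.50 releases.

```python
import re
from typing import Dict, Tuple

# Assumed line format of amdgpu_firmware_info entries, e.g.
#   "SDMA0 feature version: 41, firmware version: 0x000000a9"
LINE_RE = re.compile(
    r"^(?P<name>.+?) feature version: (?P<feature>\d+), "
    r"firmware version: (?P<fw>0x[0-9a-fA-F]+)"
)


def parse_firmware_info(text: str) -> Dict[str, Tuple[int, int]]:
    """Map firmware name -> (feature version, firmware version)."""
    versions = {}
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            versions[match["name"]] = (int(match["feature"]),
                                       int(match["fw"], 16))
    return versions


def looks_like_downgrade(old: Tuple[int, int], new: Tuple[int, int]) -> bool:
    """True if the feature version is unchanged but the firmware version dropped."""
    return new[0] == old[0] and new[1] < old[1]


# Hypothetical dumps captured while booted with the older and newer firmware.
old_dump = "SDMA0 feature version: 41, firmware version: 0x000000a9\n"
new_dump = "SDMA0 feature version: 41, firmware version: 0x000000a2\n"

old = parse_firmware_info(old_dump)
new = parse_firmware_info(new_dump)
for name in sorted(old.keys() & new.keys()):
    if looks_like_downgrade(old[name], new[name]):
        print(f"{name}: firmware version went down "
              f"({old[name][1]:#x} -> {new[name][1]:#x})")
```

To use it on a real system, replace the two sample strings with the contents of amdgpu_firmware_info saved while booted with each firmware set.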


-- 
Earthling Michel Dänzer               |               https://redhat.com
Libre software enthusiast             |             Mesa and X developer


More information about the amd-gfx mailing list