[Bug 110674] Crashes / Resets From AMDGPU / Radeon VII

bugzilla-daemon at freedesktop.org bugzilla-daemon at freedesktop.org
Mon Aug 12 15:42:17 UTC 2019


https://bugs.freedesktop.org/show_bug.cgi?id=110674

--- Comment #83 from ReddestDream <reddestdream at gmail.com> ---
> Here's what I found: The value of hard_min_level is 1001 in both 5.0.13 and 5.2.7 so the issue is not the value from the dpm table. The dpm table is probably correct. 

Fantastic! Glad you tested this. I had suspected that hard_min_level was bogus
and that the card was rejecting it, which would explain the failure. Glad to
know that's not the case.

> However, what is interesting is that it doesn't always fail.

Yeah. I've had boots where both of my 4K DP monitors are plugged in and I don't
get a powerplay error on boot. In fact, it can run for a bit and seem stable.
But then the powerplay errors suddenly start showing up again (not related to
any high load on the card) and the graphics become unstable. Similarly, others
have reported that hotplugging a second monitor after boot makes the powerplay
errors start showing up.

So, maybe there is a timing problem involved with sending the message. It's
generally a question of when rather than if it's going to fail.
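
To put a number on the "when", here is a minimal sketch that pulls the
timestamp of the first matching line out of saved dmesg output. It assumes the
usual "[   12.345678] " timestamp prefix and that the error lines contain some
known substring (the caller supplies it; "powerplay" is only a guess). The
function name first_match_time is made up for illustration.

```python
import re

# Matches the standard dmesg timestamp prefix, e.g. "[   12.345678] ".
TIMESTAMP = re.compile(r"^\[\s*(\d+\.\d+)\]")


def first_match_time(lines, needle):
    """Return the timestamp (seconds since boot) of the first line that
    contains `needle`, or None if no line matches or that line has no
    timestamp prefix."""
    for line in lines:
        if needle in line:
            match = TIMESTAMP.match(line)
            return float(match.group(1)) if match else None
    return None
```

Running this over logs from several boots would show whether the first error
clusters around a particular point (boot, hotplug) or is spread out.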

> 1. vega20_set_fclk_to_highest_dpm_level is called twice between the "ring vce2" line and "Initialized"

Is it always called twice? Even on 5.2.7? Because it looks like it might get
called two times right before "Initialized" on 5.0.13 but then only once on
5.2.7 before "Initialized" kicks in. Maybe "Initialized" is interrupting on
5.2.7 but not on 5.0.13. It's possible that initialization of the card is
messing up values that powerplay needs to read off the card, or making the card
unavailable for receiving messages, or something . . .
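
A quick way to check this across boots, as a rough sketch: count how many log
lines mention the function before the first "Initialized" line. This assumes
your debug prints include the function name in the message text (I don't know
the exact format you used), and count_before_marker is just a name I made up.

```python
def count_before_marker(lines, needle, marker="[drm] Initialized amdgpu"):
    """Count lines containing `needle` that appear before the first line
    containing `marker`.

    Returns (count, marker_seen). If the marker never shows up, every
    matching line in the input is counted and marker_seen is False.
    """
    count = 0
    for line in lines:
        if marker in line:
            return count, True
        if needle in line:
            count += 1
    return count, False
```

Feeding it a saved dmesg from each kernel, e.g.
count_before_marker(open("dmesg-5.2.7.txt"), "vega20_set_fclk_to_highest_dpm_level")
(file name hypothetical), would show whether it is 2 on 5.0.13 and 1 on 5.2.7.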

> So initialization is happening between (and possibly a result of) sending the message and getting the response

Yeah. Something is definitely happening while
vega20_set_uclk_to_highest_dpm_level is running . . . Not 100% sure that's
really problematic tho . . . But it could be an atomicity issue. Need to
figure out exactly what is generating the line "[drm] Initialized amdgpu
3.27.0 20150101 for 0000:44:00.0 on minor 0." Looks like it's coming from the
drm core rather than amdgpu specifically.

> I'm going to see if I can disable/revert BACO entirely to at least rule it out.

I thought BACO was reverted for Vega 20 here:

https://github.com/torvalds/linux/commit/7db329e57b90ddebcb58fc88eedbb3082d22a957#diff-8a4d25be8ad5d9c3ff27bb54b678dab2

Your commit seems to have been introduced in 5.2-rc1, not 5.1.
