<html>
<head>
<base href="https://bugs.freedesktop.org/">
</head>
<body>
<p>
<div>
<b><a class="bz_bug_link
bz_status_NEW "
title="NEW - Crashes / Resets From AMDGPU / Radeon VII"
href="https://bugs.freedesktop.org/show_bug.cgi?id=110674#c110">Comment # 110</a>
on <a class="bz_bug_link
bz_status_NEW "
title="NEW - Crashes / Resets From AMDGPU / Radeon VII"
href="https://bugs.freedesktop.org/show_bug.cgi?id=110674">bug 110674</a>
from <span class="vcard"><a class="email" href="mailto:reddestdream@gmail.com" title="ReddestDream <reddestdream@gmail.com>"> <span class="fn">ReddestDream</span></a>
</span></b>
<pre><span class="quote">> 1. The functions in vega20_ppt.c are used with this new patch so that answers my question from earlier, that's what this file is for and why it contains similar/identical functions.</span >
I was hoping this was the case as the duplicated functions were confusing me
too. Glad we got this figured out! :)
<span class="quote">> I tried it, it didn't help the crashing issue and I was stuck at 30w. As soon as I started sddm the system froze. I've attached my dmesg from amdgpu.dpm=2 boot. It doesn't fix the issue but it does help answer a few questions I had:</span >
This is disappointing, though. I was hoping that setting amdgpu.dpm=2 would use
the more "actively developed" path and that this would fix the issue. :/
<span class="quote">> Given that two different versions of the code produce the same result, my hunch is that the problem is B. The card is not in a state where it's able to receive power changes.</span >
I tend to agree, but it's still not clear why or how the card ends up in a bad
state where commands sent to it via smu_send_smc_msg_with_param suddenly stop
working. And given the number of identical or near-identical functions in
vega20_hwmgr.c and vega20_ppt.c, it's hard to rule out A entirely.
Since amdgpu.dpm=0 resolves the issue (albeit at the cost of being stuck at
minimum clocks inherited from the VBIOS/GOP/UEFI/firmware), it seems that the
card is starting out in a reasonable state and then being thrown into a bad
state later by bad driver code. And that code is part of the DPM (Dynamic Power
Management) system. We are pretty confident that dpm_state.hard_min_level is
stable the whole time, so that's probably not what's throwing the card into a
bad state. But perhaps another value in the DPM table is . . .
It doesn't make intuitive sense that the soft min/max values would be
problematic, since they are presumably "more flexible," but it's possible that
they get calculated out of spec. Logging them should be possible in the same
way that dpm_state.hard_min_level was logged.</pre>
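<pre>
To make "out of spec" a bit more concrete, here is a rough sketch of the kind
of sanity check that could sit next to such logging. This is a standalone mock,
not driver code: the struct, the flag names, and dpm_state_check are all
hypothetical, and it assumes that "in spec" means the soft range is ordered and
sits inside the hard range, which may not match what the firmware expects.

```c
/* Hypothetical mock of a DPM state; not the driver's actual struct. */
struct dpm_state_mock {
    unsigned int soft_min_level;
    unsigned int soft_max_level;
    unsigned int hard_min_level;
    unsigned int hard_max_level;
};

/* Hypothetical problem flags, combined into a bitmask. */
enum {
    DPM_SOFT_INVERTED = 1, /* soft min is above soft max */
    DPM_SOFT_MIN_LOW  = 2, /* soft min is below hard min */
    DPM_SOFT_MAX_HIGH = 4  /* soft max is above hard max */
};

/*
 * Returns 0 when the soft range looks consistent under the assumption
 * above, otherwise a bitmask of the problems found.
 */
static unsigned int dpm_state_check(const struct dpm_state_mock *s)
{
    unsigned int flags = 0;

    if (s->soft_min_level > s->soft_max_level)
        flags |= DPM_SOFT_INVERTED;
    if (s->hard_min_level > s->soft_min_level)
        flags |= DPM_SOFT_MIN_LOW;
    if (s->soft_max_level > s->hard_max_level)
        flags |= DPM_SOFT_MAX_HIGH;

    return flags;
}
```

A nonzero result logged right before a message is sent to the SMU would at
least show whether the soft values look odd when the card stops responding.
</pre>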
</div>
</p>
</body>
</html>