Kernel 6.7+ broke under-powering of my RX 6700XT. (Archlinux, mesa/amdgpu)
Hans de Goede
hdegoede at redhat.com
Tue Feb 20 15:27:07 UTC 2024
Hi,
On 2/20/24 16:15, Alex Deucher wrote:
> On Tue, Feb 20, 2024 at 10:03 AM Linux regression tracking (Thorsten
> Leemhuis) <regressions at leemhuis.info> wrote:
>>
>> On 20.02.24 15:45, Alex Deucher wrote:
>>> On Mon, Feb 19, 2024 at 9:47 AM Linux regression tracking (Thorsten
>>> Leemhuis) <regressions at leemhuis.info> wrote:
>>>>
>>>> On 17.02.24 14:30, Greg KH wrote:
>>>>> On Sat, Feb 17, 2024 at 02:01:54PM +0100, Roman Benes wrote:
>>>>>> The minimum power limit on the latest (6.7+) kernels is 190W for my GPU (RX 6700XT,
>>>>>> mesa, archlinux), and I cannot set the power cap as low as before (115W),
>>>>>> neither with CoreCtrl, LACT nor TuxClocker; the variable in /sys is read-only
>>>>>> even for root. This is not an issue with those apps but with the kernel; I have
>>>>>> seen similar reports in their bug trackers. I downgraded to the v6.6.10
>>>>>> kernel and my 115W (under-power) cap works again as before.
>>>>>
>>>> For the record and everyone that lands here: the cause is known now
>>>> (it's 1958946858a62b ("drm/amd/pm: Support for getting power1_cap_min
>>>> value") [v6.7-rc1]) and the issue afaics tracked here:
>>>>
>>>> https://gitlab.freedesktop.org/drm/amd/-/issues/3183
>>>>
>>>> Other mentions:
>>>> https://gitlab.freedesktop.org/drm/amd/-/issues/3137
>>>> https://gitlab.freedesktop.org/drm/amd/-/issues/2992
>>>>
>>>> Haven't seen any statement from the amdgpu developers (now CCed) yet on
>>>> this there (but might have missed something!). From what I can see I
>>>> assume this will likely be somewhat tricky to handle, as a revert
>>>> overall might be a bad idea here. We'll see I guess.
>>>
>>> The change aligns the driver with what has been validated on each board
>>> design. Windows uses the same limits. Using values lower than the
>>> validated range can lead to undefined behavior and could potentially
>>> damage your hardware.
>>
>> Thx for the reply! Yeah, I was expecting something along those lines.
>>
>> Nevertheless it afaics still is a regression in the eyes of many users.
>> I'm not sure how Linus feels about this, but I wonder if we can find
>> some solution here so that users who really want to can continue to do
>> what was possible out of the box before. Is that feasible, or
>> even supported already?
>>
>> And sure, those users would be running their hardware outside of its
>> specifications. But is that different from overclocking (which the
>> driver allows, doesn't it? If not by all means please correct me!)?
>
> Sure. The driver has always had upper bound limits for overclocking,
> this change adds lower bounds checking for underclocking as well.
> When the silicon validation teams set the bounding box for a device,
> they set a range of values where it's reasonable to operate based on
> the characteristics of the design.
>
> If we did want to allow extended underclocking, we need a big warning
> in the logs at the very least.
Requiring a module option to be set to allow this, as well as a big
warning in the logs, sounds like a good solution to me.
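
As a rough illustration of that policy only (this is not driver code), here is
a minimal Python sketch. The function name, the allow_below_min flag and the
230W maximum are assumptions made up for the example; the 190W minimum and the
115W request are the figures from the report quoted below. Values are in
microwatts, the unit the hwmon power1_cap files use.

```python
import logging

log = logging.getLogger("powercap_sketch")


def resolve_power_cap(requested_uw, min_uw, max_uw, allow_below_min=False):
    """Return the cap to apply (microwatts) or raise ValueError.

    Hypothetical helper: allow_below_min stands in for the proposed
    opt-in module option; it is not an existing amdgpu parameter.
    """
    if requested_uw > max_uw:
        # The upper bound is always enforced.
        raise ValueError("requested cap is above the validated maximum")
    if requested_uw < min_uw:
        if not allow_below_min:
            # Default behaviour: stay inside the validated range.
            raise ValueError("requested cap is below the validated minimum")
        # Opt-in path: allow it, but leave a loud trace in the logs.
        log.warning(
            "power cap %d uW is below the validated minimum %d uW; "
            "the hardware is running outside its validated range",
            requested_uw,
            min_uw,
        )
    return requested_uw


if __name__ == "__main__":
    logging.basicConfig(level=logging.WARNING)
    # 115W request against a 190W minimum, with the override enabled.
    print(resolve_power_cap(115_000_000, 190_000_000, 230_000_000,
                            allow_below_min=True))
```

Without the flag the same 115W request is rejected, which matches the current
behaviour; with it the request goes through and a warning is logged.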
Regards,
Hans
>>>> Roman posted something that apparently was meant to go to the list, so
>>>> let me put it here:
>>>>
>>>> """
>>>> UPDATE: User fililip already posted a patch, but it still needs to be merged;
>>>> the discussion is at the gitlab link below.
>>>>
>>>> (PS: I hope I am replying correctly to "all" now? - using original addr.)
>>>>
>>>>
>>>>> it seems the commit was already found (see user fililip's comment):
>>>>>
>>>>> https://gitlab.freedesktop.org/drm/amd/-/issues/3183
>>>>> commit 1958946858a62b6b5392ed075aa219d199bcae39
>>>>> Author: Ma Jun <Jun.Ma2 at amd.com>
>>>>> Date: Thu Oct 12 09:33:45 2023 +0800
>>>>>
>>>>> drm/amd/pm: Support for getting power1_cap_min value
>>>>>
>>>>> Support for getting power1_cap_min value on smu13 and smu11.
>>>>> For other Asics, we still use 0 as the default value.
>>>>>
>>>>> Signed-off-by: Ma Jun <Jun.Ma2 at amd.com>
>>>>> Reviewed-by: Kenneth Feng <kenneth.feng at amd.com>
>>>>> Signed-off-by: Alex Deucher <alexander.deucher at amd.com>
>>>>>
>>>>> However, this is not good, as it restricts the under-powering range too much. I
>>>>> was getting only about 7% less performance but 90W(!) lower consumption
>>>>> with my 115W cap before. Also, I wonder whether we, as an OS of options and
>>>>> freedom, have to stick to such very high reference minimum values without
>>>>> the ability to override them through some sys controls. The commit was made by an
>>>>> AMD developer, and I wonder whether it was prompted by this post that I made a
>>>>> few months ago (business strategy?):
>>>>>
>>>>>
>>>>> https://www.reddit.com/r/Amd/comments/183gye7/rx_6700xt_from_230w_to_capped_115w_at_only_10/
>>>>>
>>>>> This is not a dangerous upward OC, where I can understand the desire to
>>>>> protect the HW; it is downward. Having the min cap at 190W when the card delivers
>>>>> almost the same speed at 115W is IMO crazy. We are not talking about default
>>>>> or reference values here either, just a move to narrow the range of
>>>>> options for whatever reason.
>>>>>
>>>>> I don't know how much power you guys have over them, but please
>>>>> consider either reverting this change or giving us an option to set
>>>>> min_cap through, say, /sys (right now the param is read-only, even for root).
>>>>>
>>>>>
>>>>> Thank you in advance for looking into this, with regards: Romano
>>>> """
>>>>
>>>> And while at it, let me add this issue to the tracking as well
>>>>
>>>> [TLDR: I'm adding this report to the list of tracked Linux kernel
>>>> regressions; the text you find below is based on a few templates
>>>> paragraphs you might have encountered already in similar form.
>>>> See link in footer if these mails annoy you.]
>>>>
>>>> Thanks for the report. To be sure the issue doesn't fall through the
>>>> cracks unnoticed, I'm adding it to regzbot, the Linux kernel regression
>>>> tracking bot:
>>>>
>>>> #regzbot introduced 1958946858a62b /
>>>> #regzbot title drm: amdgpu: under-powering broke
>>>>
>>>> Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
>>>> --
>>>> Everything you wanna know about Linux kernel regression tracking:
>>>> https://linux-regtracking.leemhuis.info/about/#tldr
>>>> That page also explains what to do if mails like this annoy you.
>>>
>>>
>