[Bug 110674] Crashes / Resets From AMDGPU / Radeon VII

Mon Aug 12 14:34:52 UTC 2019

https://bugs.freedesktop.org/show_bug.cgi?id=110674

--- Comment #81 from Tom B <tom at r.je> ---
Created attachment 145038
  --> https://bugs.freedesktop.org/attachment.cgi?id=145038&action=edit
5.2.7 dmesg with hard_min_level logged

As mentioned in the previous post, I started logging the value of
hard_min_level. I hadn't realised that vega20_set_uclk_to_highest_dpm_level
would be called so many times.

Here's what I found: the value of hard_min_level is 1001 in both 5.0.13 and
5.2.7, so the issue is not the value from the dpm table, which is probably
correct. Something prevents smum_send_msg_to_smc_with_parameter from
accepting the value.

However, what is interesting is that it doesn't always fail.

[    4.082105] amdgpu: [powerplay] hard_min_level: 1001
[    4.372684] [drm] Initialized amdgpu 3.32.0 20150101 for 0000:44:00.0 on minor 0
[    4.517204] amdgpu: [powerplay] Failed to send message 0x28, response 0x0
[    4.517205] amdgpu: [powerplay] [SetUclkToHightestDpmLevel] Set hard min uclk failed!

Each hard_min_level line in the log is from
vega20_set_uclk_to_highest_dpm_level and there are multiple calls to it, which
don't fail, before the card is initialised.

This is from 5.2.7:

[    3.698907] amdgpu 0000:44:00.0: ring vce2 uses VM inv eng 14 on hub 1
[    4.082105] amdgpu: [powerplay] hard_min_level: 1001
[    4.372684] [drm] Initialized amdgpu 3.32.0 20150101 for 0000:44:00.0 on minor 0
[    4.517204] amdgpu: [powerplay] Failed to send message 0x28, response 0x0
[    4.517205] amdgpu: [powerplay] [SetUclkToHightestDpmLevel] Set hard min uclk failed!
[    5.361482] amdgpu: [powerplay] Failed to send message 0x28, response 0x0

And the same from 5.0.13:

[    3.352380] amdgpu 0000:44:00.0: ring vce2 uses VM inv eng 14 on hub 1
[    3.722422] amdgpu: [powerplay] hard_min_level: 1001
[    3.766269] amdgpu: [powerplay] hard_min_level: 1001
[    4.029679] [drm] Initialized amdgpu 3.27.0 20150101 for 0000:44:00.0 on minor 0
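
For anyone who wants to compare the two boots mechanically rather than by eye,
here is a rough sketch of a helper that counts how many hard_min_level lines
appear before the first "Initialized amdgpu" line. The function name is made up
for illustration; the marker strings are copied from the excerpts above, and
the input is assumed to be dmesg lines that have already been unwrapped.

```c
#include <stddef.h>
#include <string.h>

/*
 * Count the "hard_min_level:" debug lines that appear before the first
 * "[drm] Initialized amdgpu" line. Hypothetical helper, not driver code.
 * Assumes each array entry is one complete (unwrapped) dmesg line.
 */
static size_t count_hard_min_before_init(const char *const *lines, size_t n)
{
    size_t count = 0;

    for (size_t i = 0; i < n; i++) {
        /* Stop at the first initialisation message. */
        if (strstr(lines[i], "[drm] Initialized amdgpu") != NULL)
            break;
        if (strstr(lines[i], "hard_min_level:") != NULL)
            count++;
    }
    return count;
}
```

Fed the excerpts above, this should give 2 for 5.0.13 and 1 for 5.2.7.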

There are a couple of things here:

1. In 5.0.13, vega20_set_uclk_to_highest_dpm_level is called twice between the
"ring vce2" line and "Initialized"; in 5.2.7 it is called only once.

2. My patched code looks like this:

                pr_err("hard_min_level: %d\n",
                                dpm_table->dpm_state.hard_min_level);

                PP_ASSERT_WITH_CODE(!(ret = smum_send_msg_to_smc_with_parameter(hwmgr,
                                PPSMC_MSG_SetHardMinByFreq,
                                (PPCLK_UCLK << 16 ) | dpm_table->dpm_state.hard_min_level)),
                                "[SetUclkToHightestDpmLevel] Set hard min uclk failed!",
                                return ret);
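
To make the message parameter easier to read, here is a small standalone sketch
of the packing used in that call: the clock ID goes in the upper 16 bits and
the frequency in the lower 16. The helper names are hypothetical, and the value
used for EXAMPLE_PPCLK_UCLK is an assumption for illustration only; the real
PPCLK_UCLK comes from the driver's SMU interface header. The 0xFFFF mask is
also my addition; the patched code above does not mask the value.

```c
#include <stdint.h>

/* ASSUMPTION: illustrative clock ID only; check the driver header
 * for the real PPCLK_UCLK value. */
#define EXAMPLE_PPCLK_UCLK 5u

/* Pack a clock ID (upper 16 bits) and a frequency (lower 16 bits)
 * the same way the call above builds its parameter. */
static uint32_t pack_hard_min_param(uint32_t clk_id, uint32_t freq)
{
    return (clk_id << 16) | (freq & 0xFFFFu);
}

/* Recover the two fields from a packed parameter. */
static uint32_t param_clk_id(uint32_t param)
{
    return param >> 16;
}

static uint32_t param_freq(uint32_t param)
{
    return param & 0xFFFFu;
}
```

With the assumed clock ID of 5 and the logged hard_min_level of 1001 (0x3E9),
the packed parameter would be 0x000503E9.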

Yet the log shows:

- My debug line 
- Initialized amdgpu 3.32.0 20150101 for 0000:44:00.0 on minor 0
- [SetUclkToHightestDpmLevel] Set hard min uclk failed!

So initialization is happening between (and possibly a result of) sending the
message and getting the response.
