[Bug 110674] Crashes / Resets From AMDGPU / Radeon VII

bugzilla-daemon at freedesktop.org bugzilla-daemon at freedesktop.org
Sun Aug 11 15:26:13 UTC 2019


https://bugs.freedesktop.org/show_bug.cgi?id=110674

--- Comment #70 from Tom B <tom at r.je> ---
> Based on all the data you (Tom B) and others have provided as well as my own tests, my current suspicion is that there is a bug in the display mode/type detection and enumeration, leading to the driver losing state consistency and eventually contact entirely with the hardware.

I looked through the commits and the code trying to find anything that dealt
with multiple displays as that seems to be the trigger but couldn't find
anything that looked promising.

It's probably worth noting what I tried and found, even though I was
unsuccessful, as it may help someone. I'm fairly sure that the problem must be
in this file:
https://github.com/torvalds/linux/blob/master/drivers/gpu/drm/amd/powerplay/vega20_ppt.c
It contains a message ID called SMU_MSG_NumOfDisplays and related code. Maybe
someone who understands driver development can point me in the right direction:

Line 2049 seems promising.

        smu_send_smc_msg_with_param(smu, SMU_MSG_NumOfDisplays, 0);
        ret = vega20_set_uclk_to_highest_dpm_level(smu,
                                                   &dpm_table->mem_table);
        if (ret)
                pr_err("Failed to set uclk to highest dpm level");

Although that error message is not displayed in dmesg, this function deals with
multiple displays and the power levels. Unfortunately I cannot find
documentation for the driver code. What does smu_send_smc_msg_with_param do?
Here the last argument is 0, whereas in the next function,
vega20_display_config_changed, the final argument is the number of displays:

        smu_send_smc_msg_with_param(smu,
                                    SMU_MSG_NumOfDisplays,
                                    smu->display_config->num_display);

The next point of interest is line 2091. I don't think it's the cause of the
bug but:

        disable_mclk_switching = ((1 < smu->display_config->num_display) &&
                                  !smu->display_config->multi_monitor_in_sync) ||
                                 vblank_too_short;

disable_mclk_switching is set if the number of displays is more than 1 and
"multi_monitor_in_sync" (whatever that is, possibly mirrored displays?) is not
set, or if "vblank_too_short" is set. I don't believe this is a problem because
the code has existed since January, presumably for the February release, but
perhaps the contents of the variables have changed so this code runs
differently.

I only mention this because it's the only point in the code I found where it
does something different if more than one display is connected. 
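
To make that condition easier to reason about, here is a stand-alone sketch of
the same boolean expression. It is not driver code: the struct
display_config_sketch, the function name should_disable_mclk_switching and the
check_examples helper are names I made up for illustration, and only the two
fields used in the expression are modelled.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the two smu->display_config fields used above. */
struct display_config_sketch {
        unsigned int num_display;
        bool multi_monitor_in_sync;
};

/*
 * Same expression as the quoted driver line: memory clock switching is
 * disabled when more than one display is active and the displays are not
 * reported as in sync, or when vblank is too short.
 */
bool should_disable_mclk_switching(const struct display_config_sketch *cfg,
                                   bool vblank_too_short)
{
        return ((1 < cfg->num_display) && !cfg->multi_monitor_in_sync) ||
               vblank_too_short;
}

/* A few example configurations, checked with assert(). */
void check_examples(void)
{
        struct display_config_sketch single = { 1, false };
        struct display_config_sketch dual_unsynced = { 2, false };
        struct display_config_sketch dual_synced = { 2, true };

        /* One display: switching stays enabled unless vblank is too short. */
        assert(!should_disable_mclk_switching(&single, false));
        assert(should_disable_mclk_switching(&single, true));

        /* Two displays that are not in sync: switching is disabled. */
        assert(should_disable_mclk_switching(&dual_unsynced, false));

        /* Two displays that are in sync: switching stays enabled. */
        assert(!should_disable_mclk_switching(&dual_synced, false));
}
```

So with a single display the multi-display term never fires, and with two
displays the result depends entirely on multi_monitor_in_sync, assuming
vblank_too_short is false in both cases.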

My questions for the driver devs:

1. Why is smu_send_smc_msg_with_param called with zero in the function
vega20_pre_display_config_changed but the number of displays in the next
function?
2. Is num_display an index (so 0 is actually the first display and we're
assuming 1 display in index 0) or is it actually 0, meaning no displays?
3. Is there any way to see which code appears in which kernel version? The tags
are definitely incorrect, the first commit for that file:
https://github.com/torvalds/linux/commit/74e07f9d3b77034cd1546617afce1d014a68d1ca#diff-2575675126169f3c0c971db736852af9
says 5.2 but was done in December last year so I can't imagine this file isn't
used.



However, as a customer this is very frustrating. I bought the VII instead of an
nvidia card because AMD were supporting open source drivers.

As it stands:

- The AMDGPU driver worked for 4 months after the VII's release and now we've
had nearly the same amount of time where it hasn't worked with the latest
kernel.
- The AMDGPU-Pro driver only supports Ubuntu. I've never managed to get it to
run successfully on Arch, and the latest version only supports the RX 5700
cards anyway.

I emailed AMD technical support about this bug over a month ago and never got a
reply.

The VII appears to be completely unsupported other than the initial driver
release when the card came out. I'll be going back to nvidia next time and
although I had intended to keep the VII for several years it looks like that
won't be possible as I can't run an old kernel forever.
