[Bug 108781] 4.19 Regression - Hawaii (R9 390) boot failure - Invalid PCC GPIO / invalid powerlevel state / Fatal error during GPU init

bugzilla-daemon at freedesktop.org bugzilla-daemon at freedesktop.org
Thu Nov 22 00:13:17 UTC 2018


https://bugs.freedesktop.org/show_bug.cgi?id=108781

--- Comment #13 from jamespharvey20 at gmail.com ---
In all seriousness, can the AMD devs please tell me exactly which make and
model video card the devs use?  As long as it's something that has 3+
DisplayPorts, and can display 5 monitors using chaining, I'd honestly rather
have to buy that and be done with all this, and sell mine on eBay saying
"windows only".

The symptom I see, and others are seeing, is that 4.18.16 boots to a tty just
fine, and 4.19 goes to a black screen when I'd expect it to automatically use
kms to go to a higher resolution.

Bisecting between 4.18.16 and 4.19 unfortunately runs across multiple other
amdgpu bugs that make this a tangled mess of spaghetti.  Bisecting using "Do I
get to see a tty on my monitor" as the deciding factor for good/bad absolutely
lands on 0d998891 as bad and its parent c91b007e as good.  I've confirmed this
by booting each of them a bunch of times.  See line 3 of the newly attached
journalctl's, which includes the auto-generated kernel version confirming this.

I really hope I'm wrong about this, but I don't think I've found the bug making
my screen go black in 4.19.  I'm saying this because the journalctl differences
illustrating what's wrong with 0d998891 do not show up in 4.19.  I think the
0d998891 bug was fixed by a later commit, and I think I haven't yet reached the
bug I really care about in 4.19.  The prospect of having to continue bisecting
thousands of other commits with the multiple amdgpu bugs discussed below
between these versions, plus who knows how many other bugs pop up and are fixed
infuriates me.

This isn't just about complaining about bisecting.  It's about what in the
world am I supposed to use as the deciding factor on "good" vs "bad"?  For
commits more recent than 0d998891, the screen is going to be black a lot of the
time, but I can't use that because I'm hunting for the "other black screen"
bug.  There are so many errors in the 4.19 journalctl that I'd be comparing
tons of journalctl's, since I couldn't go by whether the screen is on, and
would maybe have to go by the "amdgpu_device_ip_init failed" message.  But,
what if that isn't the deciding factor?

I think all of this is why you were saying you don't think 0d998891 is the
problem, because the 4.18.16 vs 4.19 original journalctl's I attached are
showing a bug from somewhere else.
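
One way to keep a bisect going when a second bug masks the one being hunted is
to mark such commits as untestable instead of good or bad.  The following is
only a minimal sketch of that idea, not something that was run for this report.
It assumes the journal of each test boot has been saved as text, and it assumes
(without any confirmation) that "Cannot find any crtc or sizes" marks the
black-screen bug while "amdgpu_device_ip_init failed" marks a different one.
The function name classify and the choice of markers are hypothetical.  The
return values follow the exit codes that "git bisect run" understands: 0 for
good, 1 for bad, 125 for skip.

```python
# Marker strings copied from the journal excerpts in this comment.
# Which one identifies "the" bug is an assumption, not an established fact.
BLACK_SCREEN_MARKER = "Cannot find any crtc or sizes"
INIT_FAILURE_MARKER = "amdgpu_device_ip_init failed"

# Exit codes understood by `git bisect run`.
GOOD, BAD, SKIP = 0, 1, 125


def classify(journal_text, target=BLACK_SCREEN_MARKER, other=INIT_FAILURE_MARKER):
    """Return a bisect verdict for the journal text of one test boot."""
    if target in journal_text:
        return BAD
    if other in journal_text:
        # A different failure hides the one being hunted, so this commit
        # cannot be judged either way.
        return SKIP
    return GOOD
```

If the markers are swapped, the same function hunts for the init failure
instead.  Skipped commits make the bisect less precise, so the result may be a
range of commits rather than a single one.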

With there being multiple bugs that pop up and back out, I honestly think AMD
needs to revert all changes between 4.18.16 and 4.19, and only re-add them once
it has actually tested the commits with its own products.  Cards being
discussed here are not unusual or old.  I don't mind doing a bisect for an open
source project once in a while, but I think having to get this deep is going
too far, and with this being a company making code for its own product rather
than something like a filesystem bug, I don't feel like this depth of bug
hunting should be on me.

If I'm wrong and 0d998891 is truly the source of the problem, and for some
reason the 4.19 journalctl just doesn't show the errors at the bottom of this
comment, then let me apologize and retract most of my rant here.  But, with its
journalctl errors disappearing somewhere between it and 4.19, I don't feel like
I'm wrong.



In my last comment, I was thinking it was at least possible I had the wrong
commit at the very end, because I couldn't help but notice that the parent/good
commit and the ones before it are regarding vkms.  With the worst symptom being
a black screen at the kms stage, it seemed to make sense that vkms was
somehow turning my system into a headless system, making the screen black.
But, that's *NOT* what's happening.  Parent/good commit has vkms=n.  Although
Arch 4.19 has vkms=m, I've been using Arch's 4.18 config which doesn't even
have vkms, so it winds up using the default of =n.  (Furthermore, I've tested
Arch 4.19 as it is but changing vkms=n and I still get a black screen.)

-----

Issue 1

We have to start somewhere, and the biggest issue to me right now is obviously
the screen going black preventing a tty.

Interestingly, using the 0d998891 (bad) commit, the system does boot and I can
ssh in.  Just all the screens are black.

Like I explained above, I don't know if this turns out to be the cause of the
4.19 black screen.

-----

Issue 2

[drm] Invalid PCC GPIO: 13!

This error is a red herring as far as the usable screen / black screen issue
goes.  It appears in both 0d998891 (bad) and its parent c91b007e (good).  So,
it comes from an earlier commit.  No idea if it's harmful, but even with it,
booting c91b007e (good) to a tty works.  So, another bisect towards older
commits would be needed to find what causes this.

-----

Issue 3 - Maybe an issue 4 or 5 in here too?

[drm:dm_pp_get_static_clocks [amdgpu]] *ERROR* DM_PPLIB: invalid powerlevel
state: 0!
...
[drm:amdgpu_vce_ring_test_ring [amdgpu]] *ERROR* amdgpu: ring 12 test failed
[drm:amdgpu_device_init.cold.14 [amdgpu]] *ERROR* hw_init of IP block
<vce_v2_0> failed -110
amdgpu 0000:03:00.0: amdgpu_device_ip_init failed
amdgpu 0000:03:00.0: Fatal error during GPU init
(stacktrace)

The rest of the errors in my original attachment, such as the ones briefly
shown just above this paragraph, don't show in my good or bad commit.  So,
another bisect towards newer commits would be needed to find what causes these.
Is this a single commit that introduces all of these errors?  Could there be
multiple commits causing all of this?  Who knows.





-----

Deeper on issue 1, regarding this bad commit

I'm vimdiff'ing the new attached journalctl's with ":%s/Nov 21 ..:..:.. //g". 
These are interesting (to me) differences:


archlinux kernel:   Magic number: 10:966:801
archlinux kernel: acpi PNP0F03:00: hash matches
===good above becomes bad below - probably pseudo-random noise but not sure so
including===
archlinux kernel:   Magic number: 10:413:850
archlinux kernel:  index2: hash matches
(line repeats 32 times, number of cores I have)
archlinux kernel: processor cpu14: hash matches


Then at :1625(good) and :1663(bad) we see what changes between the good and bad
commits, regarding drm/fbcon.

[drm] amdgpu_dm_irq_schedule_work FAILED src 10
[drm] DM_MST: added connector: (____ptrval____) [id: 76] [master:
(____ptrval____)]
[drm] fb mappable at 0xC05BC000
[drm] vram apper at 0xC0000000
[drm] size 14745600
[drm] fb depth is 24
[drm]    pitch is 10240
fbcon: amdgpudrmfb (fb0) is primary device
switching from power state:
        ui class: performance
        internal class: none
        caps:
        uvd    vclk: 0 dclk: 0
                power level 0    sclk: 76600 mclk: 150000 pcie gen: 3 pcie
lanes: 16
                power level 1    sclk: 105000 mclk: 150000 pcie gen: 3 pcie
lanes: 16
        status: c
switching to power state:
        ui class: performance
        internal class: none
        caps:
        uvd    vclk: 0 dclk: 0
                power level 0    sclk: 30000 mclk: 15000 pcie gen: 3 pcie
lanes: 16
                power level 1    sclk: 105000 mclk: 150000 pcie gen: 3 pcie
lanes: 16
        status: r
[drm] dce_get_required_clocks_state: clocks unsupported disp_clk 681000 pix_clk
241500
===good above becomes bad below===
[drm] amdgpu_dm_irq_schedule_work FAILED src 10
[drm] amdgpu_dm_irq_schedule_work FAILED src 8
[drm] amdgpu_dm_irq_schedule_work FAILED src 10
[drm] DM_MST: added connector: (____ptrval____) [id: 76] [master:
(____ptrval____)]
[drm] Cannot find any crtc or sizes
[drm] amdgpu_dm_irq_schedule_work FAILED src 12
[drm] DM_MST: added connector: (____ptrval____) [id: 143] [master:
(____ptrval____)]
[drm] Cannot find any crtc or sizes
[drm] DM_MST: added connector: (____ptrval____) [id: 220] [master:
(____ptrval____)]
[drm] Cannot find any crtc or sizes
[drm] DM_MST: added connector: (____ptrval____) [id: 183] [master:
(____ptrval____)]
[drm] DM_MST: added connector: (____ptrval____) [id: 236] [master:
(____ptrval____)]
[drm] Cannot find any crtc or sizes
[drm] DM_MST: added connector: (____ptrval____) [id: 266] [master:
(____ptrval____)]
[drm] Cannot find any crtc or sizes
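
The comparison above was done by hand in vimdiff after stripping the
timestamps.  A rough scripted equivalent, offered only as a sketch, could look
like the following.  The regular expression mirrors the vim substitution
":%s/Nov 21 ..:..:.. //g"; the function names strip_timestamps and journal_diff
are made up for illustration, and the journals are assumed to be available as
lists of lines.

```python
import difflib
import re

# Same pattern as the vim substitution: "Nov 21 hh:mm:ss " with any characters
# in the time fields, followed by one space.
TIMESTAMP = re.compile(r"Nov 21 ..:..:.. ")


def strip_timestamps(lines):
    """Remove the timestamp from every line so only real changes differ."""
    return [TIMESTAMP.sub("", line) for line in lines]


def journal_diff(good_lines, bad_lines):
    """Return unified-diff lines between two journals, ignoring timestamps."""
    good = strip_timestamps(good_lines)
    bad = strip_timestamps(bad_lines)
    return list(difflib.unified_diff(good, bad, "good", "bad", lineterm=""))
```

Lines that differ only in their timestamp drop out of the result, which leaves
the drm/fbcon changes and noise such as the "Magic number" lines.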


My original comment gave kernel parameters relating to radeon/amd.  The
journalctl's had it all.  At first, I worried that abbreviating what I said in
the comment might have thrown things off for the devs, because the "bad"
commit has to do with fb, and I do use some fbcon kernel parameters.  But,
trying my "bad" commit and even Arch 4.19 without the fbcon kernel parameters
still leads to a black screen.  It's in the journalctl's, but my full kernel
line is:

initrd=intel-ucode.img initrd=initramfs-linux.img root=/dev/lvm/arch rw
consoleblank=0 fbcon=scrollback:128k fbcon=rotate:3 intel_iommu=on
radeon.cik_support=0 amdgpu.cik_support=1 amdgpu.dpm=1 amdgpu.dc=1
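
The test without the fbcon parameters amounts to dropping every "fbcon=" entry
from that line and booting with the rest.  As a small sketch (the helper name
drop_params is hypothetical, and it assumes no parameter value contains quoted
spaces, which holds for the line above):

```python
def drop_params(cmdline, prefix):
    """Return cmdline without the space-separated parameters that start with prefix."""
    return " ".join(p for p in cmdline.split() if not p.startswith(prefix))
```

Calling drop_params(cmdline, "fbcon=") on the line above removes
fbcon=scrollback:128k and fbcon=rotate:3 and leaves consoleblank=0 and the
radeon/amdgpu options in place.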

-- 
You are receiving this mail because:
You are the assignee for the bug.