[Advice Request] Trying to debug amdgpu fatal error

Christian König christian.koenig at amd.com
Tue Apr 10 06:50:56 UTC 2018


Hi Daniel,

nice to know that you got it working.

And it is an interesting finding that disabling Thunderbolt boot 
support in the BIOS fixes things. I'm going to keep that in mind when 
other users run into the same issue.
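
For anyone who wants to check for the same symptom quickly, here is a
rough sketch (not an official tool) that scans saved dmesg and
"lspci -vv" text for the two signs seen in this thread: BAR claims the
kernel could not satisfy, and regions lspci reports as <ignored>. The
function names and regular expressions are my own assumptions, based
only on the message formats quoted below; other kernel versions may
word these lines differently.

```python
import re

# Assumed format, taken from the dmesg lines quoted in this thread, e.g.
#   pci 0000:16:00.0: can't claim BAR 0 [mem 0x...]: no compatible bridge window
BAR_FAIL = re.compile(
    r"pci (?P<dev>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]): "
    r"can't claim BAR (?P<bar>\d+) \[[^\]]+\]: (?P<why>.+)"
)

# Assumed format, taken from the lspci output quoted in this thread, e.g.
#   Region 0: Memory at <ignored> (64-bit, prefetchable)
IGNORED_REGION = re.compile(r"Region (\d+): .* at <ignored>")


def failed_bar_claims(dmesg_text):
    """Return (device, BAR number, reason) for each failed BAR claim."""
    return [
        (m.group("dev"), int(m.group("bar")), m.group("why").strip())
        for m in BAR_FAIL.finditer(dmesg_text)
    ]


def ignored_regions(lspci_text):
    """Return the region numbers that lspci reports as <ignored>."""
    return [int(n) for n in IGNORED_REGION.findall(lspci_text)]
```

Feed it unwrapped output (for example "dmesg > dmesg.txt"), since the
patterns match one log message per line. Any hit for the GPU's address
suggests the device never received usable resources.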

Thanks,
Christian.
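
A side note on the "probe of 0000:16:00.0 failed with error -12" line
from the original report quoted further down: the kernel prints the
negated errno value, and 12 is ENOMEM, which matches Andrey's reading
of the ioremap failure. The small helper below is only an illustrative
sketch (the function name is hypothetical); it relies on Python's
errno table, which reflects the host system's numbering.

```python
import errno
import os


def describe_kernel_error(code):
    """Map a negative kernel return code such as -12 to (name, message)."""
    number = abs(code)
    # errno.errorcode maps numeric values to symbolic names on this host.
    name = errno.errorcode.get(number, "UNKNOWN")
    return name, os.strerror(number)


name, message = describe_kernel_error(-12)
print(f"-12 -> {name}: {message}")
```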

On 09.04.2018 at 18:20, Daniel Moran wrote:
> Christian,
>
> Thanks for the response. That got me in the right direction.
> After trial and error I found the cause - Thunderbolt Boot Support 
> option must be disabled in BIOS.
> If I disable it, I can boot to Ubuntu and it looks like amdgpu 
> initializes fine. If I enable it with no other changes, init fails.
>
> The last issue was one of my own - forgetting to use DRI_PRIME and 
> xrandr correctly.
> Happy to say the Red Devil is working now in eGPU mode!
> It's about a 20% performance loss compared with a PCIe slot, right in 
> line with our previous tests.
>
> As always thank you for your continued time and support.
> We'll be happy to give a shout out to you guys for the help at 
> article/video time.
>
>
> Respectfully,
> Daniel S. Moran (garwynn)
> PC Hardware Editor - XDA-Developers
> Phone: 1-559-316-0760/+81-90-5484-4155
> Article Links: http://www.xda-developers.com/author/garwynn
> E-mail: xdagarwynn at gmail.com | Twitter: @xdagarwynn
>
> On Mon, Apr 9, 2018 at 10:48 PM, Christian König 
> <christian.koenig at amd.com> wrote:
>
>     Hi Daniel,
>
>     your problem is that the system BIOS is buggy and doesn't assign
>     resources to the card:
>>         Region 0: Memory at <ignored> (64-bit, prefetchable)
>>         Region 2: Memory at <ignored> (64-bit, prefetchable)
>>         Region 4: I/O ports at 9000 [size=256]
>>         Region 5: Memory at <ignored> (32-bit, non-prefetchable)
>>         Expansion ROM at <ignored> [disabled]
>
>     The kernel actually tries to assign resources to the bridges, but
>     fails as well because the BIOS didn't reserve any during startup.
>>     [    0.179743] pci 0000:12:00.0: can't claim BAR 14 [mem
>>     0x01c00000-0xef0fffff]: no compatible bridge window
>>     [    0.179745] pci 0000:12:00.0: [mem 0x01c00000-0xef0fffff]
>>     clipped to [mem 0xef000000-0xef0fffff]
>>     [    0.179747] pci 0000:12:00.0:   bridge window [mem
>>     0xef000000-0xef0fffff]
>>     [    0.179751] pci 0000:13:01.0: can't claim BAR 14 [mem
>>     0x01c00000-0x01ffffff]: no compatible bridge window
>>     [    0.179753] pci 0000:14:00.0: can't claim BAR 14 [mem
>>     0x01c00000-0x01ffffff]: no compatible bridge window
>>     [    0.179754] pci 0000:15:00.0: can't claim BAR 14 [mem
>>     0x01d00000-0x01dfffff]: no compatible bridge window
>>     [    0.179756] pci 0000:08:04.0: can't claim BAR 13 [io 
>>     0xb000-0xcfff]: address conflict with PCI Bus 0000:12 [io 
>>     0x9000-0xbfff]
>>     [    0.179782] pci 0000:14:00.0: can't claim BAR 0 [mem
>>     0x01c00000-0x01c03fff]: no compatible bridge window
>>     [    0.179789] pci 0000:16:00.0: can't claim BAR 0 [mem
>>     0xd0000000-0xdfffffff 64bit pref]: no compatible bridge window
>>     [    0.179791] pci 0000:16:00.0: can't claim BAR 2 [mem
>>     0xe0200000-0xe03fffff 64bit pref]: no compatible bridge window
>>     [    0.179793] pci 0000:16:00.0: can't claim BAR 5 [mem
>>     0x01d00000-0x01d7ffff]: no compatible bridge window
>>     [    0.179798] pci 0000:16:00.1: can't claim BAR 0 [mem
>>     0x01da0000-0x01da3fff]: no compatible bridge window
>
>     There isn't much you can do except for trying to update the BIOS
>     and if that doesn't help replace your motherboard.
>
>     Regards,
>     Christian.
>
>
>     On 09.04.2018 at 15:33, Daniel Moran wrote:
>>     Christian,
>>     Andrey,
>>
>>     Thank you for the responses.
>>     Here's the requested dmesg/lspci. Also pulled journalctl just in
>>     case but didn't see anything that stands out.
>>
>>     I'll take another look at the BIOS settings to see if anything
>>     else may explain the memory error.
>>     I've got 16GB in the system at the moment, can bump up to 32 -
>>     also added a larger swap just in case that was the issue. (No
>>     change.)
>>
>>     As always thank you for your continued time and support.
>>
>>     Respectfully,
>>     Daniel S. Moran (garwynn)
>>     PC Hardware Editor - XDA-Developers
>>     Phone: 1-559-316-0760/+81-90-5484-4155
>>     Article Links: http://www.xda-developers.com/author/garwynn
>>     E-mail: xdagarwynn at gmail.com | Twitter: @xdagarwynn
>>
>>     On Mon, Apr 9, 2018 at 3:52 PM, Christian König
>>     <christian.koenig at amd.com> wrote:
>>
>>         Please provide the full dmesg of the system as well as the
>>         output of "lspci -s 0000:16:00.0 -vvvv" as attachment.
>>
>>         Thanks,
>>         Christian.
>>
>>         On 09.04.2018 at 06:00, Andrey Grodzovsky wrote:
>>>
>>>         Just from a quick look it seems to fail in
>>>         amdgpu_device_init->ioremap with ENOMEM, that would explain
>>>         why you don't see any more prints - this failure is very
>>>         early in the device init process.
>>>
>>>         No idea why ioremap would fail in this case and not even
>>>         sure which implementation of ioremap to look into for your case.
>>>
>>>         Adding Christian for this.
>>>
>>>         Andrey
>>>
>>>
>>>         On 04/07/2018 03:16 AM, Daniel Moran wrote:
>>>>         Also, to clarify... if I move it into a regular slot and
>>>>         turn off the eGPU, it works as expected.
>>>>         Tested with Intel iGPU enabled and disabled, made sure i915
>>>>         loaded without error and can connect display to it.
>>>>
>>>>
>>>>
>>>>         Again, thank you in advance for any time/support offered.
>>>>
>>>>         Respectfully,
>>>>         Daniel S. Moran (garwynn)
>>>>         PC Hardware Editor - XDA-Developers
>>>>         Phone: 1-559-316-0760/+81-90-5484-4155
>>>>         Article Links: http://www.xda-developers.com/author/garwynn
>>>>         E-mail: xdagarwynn at gmail.com | Twitter: @xdagarwynn
>>>>
>>>>         On Sat, Apr 7, 2018 at 3:58 PM, Daniel Moran
>>>>         <xdagarwynn at gmail.com> wrote:
>>>>
>>>>             Hello all,
>>>>
>>>>             I've got a Powercolor Red Devil Vega 56 here that I'm
>>>>             trying to get working in eGPU mode.
>>>>             I think on the BIOS/hardware side it's now all sorted out.
>>>>             Now I'm at a point where amdgpu tries to init and
>>>>             reaches a fatal error.
>>>>
>>>>             Setting loglevel=8 doesn't produce any additional messages.
>>>>             Here's what it does report (full dmesg attached):
>>>>
>>>>             [  429.005909] [drm] amdgpu kernel modesetting enabled.
>>>>             [  429.006080] [drm] initializing kernel modesetting
>>>>             (VEGA10 0x1002:0x687F 0x148C:0x2388 0xC3).
>>>>             [  429.006082] amdgpu 0000:16:00.0: Fatal error during
>>>>             GPU init
>>>>             [  429.006155] amdgpu: probe of 0000:16:00.0 failed
>>>>             with error -12
>>>>
>>>>             I'm using the following commands to unload and reload the
>>>>             module for testing. Since the card runs as an eGPU, the
>>>>             i7-7700K iGPU (i915 module) is the primary, so these
>>>>             commands work in a terminal without requiring a reboot.
>>>>
>>>>             sudo rmmod amdgpu
>>>>             sudo modprobe -v amdgpu
>>>>
>>>>             Pulled UMR and tried to build it, but it fails at the
>>>>             CMake step. I'll attach the log as a text file.
>>>>             I'll also attach a full dmesg and lspci dump. uname -a
>>>>             below:
>>>>             Linux testbox 4.15.15-041515-generic #201803311331 SMP
>>>>             Sat Mar 31 17:34:21 UTC 2018 x86_64 x86_64 x86_64
>>>>             GNU/Linux
>>>>
>>>>             Any other ideas on how I can debug this further? Feel
>>>>             I'm so close, don't want to let this go.
>>>>             Thank you in advance for your time.
>>>>
>>>>             Respectfully,
>>>>             Daniel S. Moran (garwynn)
>>>>             PC Hardware Editor - XDA-Developers
>>>>             Phone: 1-559-316-0760/+81-90-5484-4155
>>>>             Article Links: http://www.xda-developers.com/author/garwynn
>>>>             E-mail: xdagarwynn at gmail.com | Twitter: @xdagarwynn
>>>>
>>>>
>>>>
>>>>
>>>>         _______________________________________________
>>>>         amd-gfx mailing list
>>>>         amd-gfx at lists.freedesktop.org
>>>>         https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>>>
>>
>>
>
>
>
>

-------------- next part --------------
A non-text attachment was scrubbed...
Name: Screenshot from 2018-04-07 16-08-59.png
Type: image/png
Size: 60529 bytes
Desc: not available
URL: <https://lists.freedesktop.org/archives/amd-gfx/attachments/20180410/7ecd0a60/attachment-0001.png>

