[Advice Request] Trying to debug amdgpu fatal error

Christian König christian.koenig at amd.com
Mon Apr 9 13:48:21 UTC 2018


Hi Daniel,

Your problem is that the system BIOS is buggy and doesn't assign 
resources to the card:
>     Region 0: Memory at <ignored> (64-bit, prefetchable)
>     Region 2: Memory at <ignored> (64-bit, prefetchable)
>     Region 4: I/O ports at 9000 [size=256]
>     Region 5: Memory at <ignored> (32-bit, non-prefetchable)
>     Expansion ROM at <ignored> [disabled]
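
To spot this kind of problem quickly, a small script can scan the lspci 
output for regions without an address. The following is only a minimal 
sketch in Python: the function name unassigned_regions is made up for 
illustration, and it assumes the text comes from something like 
"lspci -s 0000:16:00.0 -vvvv". Besides <ignored> it also checks for 
<unassigned>, which lspci can print as well, as far as I know.

```python
import re

# Matches lines such as "Region 0: Memory at <ignored> (64-bit, prefetchable)"
# or "Expansion ROM at <ignored> [disabled]".
_REGION_RE = re.compile(
    r"^\s*(Region \d+|Expansion ROM)\b.*?\bat (<ignored>|<unassigned>)"
)


def unassigned_regions(lspci_text):
    """Return the labels of all regions that have no address assigned."""
    found = []
    for line in lspci_text.splitlines():
        m = _REGION_RE.match(line)
        if m:
            found.append(m.group(1))
    return found


# Sample taken from the lspci output quoted above.
sample = """\
    Region 0: Memory at <ignored> (64-bit, prefetchable)
    Region 2: Memory at <ignored> (64-bit, prefetchable)
    Region 4: I/O ports at 9000 [size=256]
    Region 5: Memory at <ignored> (32-bit, non-prefetchable)
    Expansion ROM at <ignored> [disabled]
"""

print(unassigned_regions(sample))
# ['Region 0', 'Region 2', 'Region 5', 'Expansion ROM']
```

Only the I/O port region got an address here; every memory BAR of the 
card is missing.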

The kernel actually tries to assign resources to the bridges, but fails 
as well because the BIOS didn't reserve any during startup.
> [    0.179743] pci 0000:12:00.0: can't claim BAR 14 [mem 
> 0x01c00000-0xef0fffff]: no compatible bridge window
> [    0.179745] pci 0000:12:00.0: [mem 0x01c00000-0xef0fffff] clipped 
> to [mem 0xef000000-0xef0fffff]
> [    0.179747] pci 0000:12:00.0:   bridge window [mem 
> 0xef000000-0xef0fffff]
> [    0.179751] pci 0000:13:01.0: can't claim BAR 14 [mem 
> 0x01c00000-0x01ffffff]: no compatible bridge window
> [    0.179753] pci 0000:14:00.0: can't claim BAR 14 [mem 
> 0x01c00000-0x01ffffff]: no compatible bridge window
> [    0.179754] pci 0000:15:00.0: can't claim BAR 14 [mem 
> 0x01d00000-0x01dfffff]: no compatible bridge window
> [    0.179756] pci 0000:08:04.0: can't claim BAR 13 [io 
> 0xb000-0xcfff]: address conflict with PCI Bus 0000:12 [io 0x9000-0xbfff]
> [    0.179782] pci 0000:14:00.0: can't claim BAR 0 [mem 
> 0x01c00000-0x01c03fff]: no compatible bridge window
> [    0.179789] pci 0000:16:00.0: can't claim BAR 0 [mem 
> 0xd0000000-0xdfffffff 64bit pref]: no compatible bridge window
> [    0.179791] pci 0000:16:00.0: can't claim BAR 2 [mem 
> 0xe0200000-0xe03fffff 64bit pref]: no compatible bridge window
> [    0.179793] pci 0000:16:00.0: can't claim BAR 5 [mem 
> 0x01d00000-0x01d7ffff]: no compatible bridge window
> [    0.179798] pci 0000:16:00.1: can't claim BAR 0 [mem 
> 0x01da0000-0x01da3fff]: no compatible bridge window
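
The same failures can be pulled out of a full dmesg and grouped per 
device. A rough Python sketch follows: failed_claims is a hypothetical 
helper, the regular expression is modelled only on the messages quoted 
above, and it expects the unwrapped log lines (one message per line) 
rather than the mail-wrapped quotes.

```python
import re
from collections import defaultdict

# Modelled on messages like:
#   pci 0000:16:00.0: can't claim BAR 0 [mem 0x...-0x... 64bit pref]: <reason>
_CLAIM_RE = re.compile(
    r"pci (?P<dev>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]): "
    r"can't claim BAR (?P<bar>\d+) \[(?P<res>[^\]]+)\]: (?P<reason>.+)$"
)


def failed_claims(dmesg_text):
    """Map each PCI device to the (BAR, resource, reason) tuples it failed to claim."""
    failures = defaultdict(list)
    for line in dmesg_text.splitlines():
        m = _CLAIM_RE.search(line)
        if m:
            failures[m["dev"]].append(
                (int(m["bar"]), m["res"], m["reason"].strip())
            )
    return dict(failures)


# A few of the lines quoted above, rejoined into single log lines.
sample = """\
[    0.179756] pci 0000:08:04.0: can't claim BAR 13 [io 0xb000-0xcfff]: address conflict with PCI Bus 0000:12 [io 0x9000-0xbfff]
[    0.179789] pci 0000:16:00.0: can't claim BAR 0 [mem 0xd0000000-0xdfffffff 64bit pref]: no compatible bridge window
[    0.179791] pci 0000:16:00.0: can't claim BAR 2 [mem 0xe0200000-0xe03fffff 64bit pref]: no compatible bridge window
[    0.179793] pci 0000:16:00.0: can't claim BAR 5 [mem 0x01d00000-0x01d7ffff]: no compatible bridge window
"""

for dev, claims in failed_claims(sample).items():
    print(dev, [bar for bar, _res, _reason in claims])
```

For the GPU at 0000:16:00.0 this lists BARs 0, 2 and 5, which are the 
same memory regions that lspci reports as <ignored>.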

There isn't much you can do except try updating the BIOS and, if 
that doesn't help, replace your motherboard.
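
Before looking for an update it helps to know which BIOS version is 
currently installed. On Linux this is usually exposed under 
/sys/class/dmi/id. The sketch below is Python, bios_info is a made-up 
name, and it assumes those sysfs files are present and readable on 
your system; anything missing is simply reported as None.

```python
from pathlib import Path


def bios_info(dmi_dir="/sys/class/dmi/id"):
    """Read BIOS vendor, version and date; unreadable entries become None."""
    info = {}
    for name in ("bios_vendor", "bios_version", "bios_date"):
        try:
            info[name] = (Path(dmi_dir) / name).read_text().strip()
        except OSError:
            # File is missing or not readable (e.g. not a Linux system).
            info[name] = None
    return info


print(bios_info())
```

The reported version and date can then be compared against what the 
board vendor offers for download.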

Regards,
Christian.


On 09.04.2018 at 15:33, Daniel Moran wrote:
> Christian,
> Andrey,
>
> Thank you for the responses.
> Here's the requested dmesg/lspci. Also pulled journalctl just in case 
> but didn't see anything that stands out.
>
> I'll take another look at the BIOS settings to see if anything else 
> may explain the memory error.
> I've got 16GB in the system at the moment, can bump up to 32 - also 
> added a larger swap just in case that was the issue. (No change.)
>
> As always thank you for your continued time and support.
>
> Respectfully,
> Daniel S. Moran (garwynn)
> PC Hardware Editor - XDA-Developers
> Phone: 1-559-316-0760/+81-90-5484-4155
> Article Links: http://www.xda-developers.com/author/garwynn
> E-mail: xdagarwynn at gmail.com | Twitter: @xdagarwynn
>
> On Mon, Apr 9, 2018 at 3:52 PM, Christian König 
> <christian.koenig at amd.com> wrote:
>
>     Please provide the full dmesg of the system as well as the output
>     of "lspci -s 0000:16:00.0 -vvvv" as attachment.
>
>     Thanks,
>     Christian.
>
>     On 09.04.2018 at 06:00, Andrey Grodzovsky wrote:
>>
>>     Just from a quick look it seems to fail in
>>     amdgpu_device_init->ioremap with ENOMEM, that would explain why
>>     you don't see any more prints - this failure is very early in the
>>     device init process.
>>
>>     No idea why ioremap would fail in this case and not even sure
>>     which implementation of ioremap to look into for your case.
>>
>>     Adding Christian for this.
>>
>>     Andrey
>>
>>
>>     On 04/07/2018 03:16 AM, Daniel Moran wrote:
>>>     Also, to clarify... if I move it into a regular slot and turn
>>>     off the eGPU, it works as expected.
>>>     Tested with Intel iGPU enabled and disabled, made sure i915
>>>     loaded without error and can connect display to it.
>>>
>>>
>>>
>>>     Again, thank you in advance for any time/support offered.
>>>
>>>     Respectfully,
>>>     Daniel S. Moran (garwynn)
>>>     PC Hardware Editor - XDA-Developers
>>>     Phone: 1-559-316-0760/+81-90-5484-4155
>>>     Article Links: http://www.xda-developers.com/author/garwynn
>>>     E-mail: xdagarwynn at gmail.com | Twitter: @xdagarwynn
>>>
>>>     On Sat, Apr 7, 2018 at 3:58 PM, Daniel Moran
>>>     <xdagarwynn at gmail.com> wrote:
>>>
>>>         Hello all,
>>>
>>>         I've got a Powercolor Red Devil Vega 56 here that I'm trying
>>>         to get working in eGPU mode.
>>>         I think on the BIOS/hardware side it's now all fleshed out.
>>>         Now I'm at a point where amdgpu tries to init and reaches a
>>>         fatal error.
>>>
>>>         Setting loglevel=8 doesn't produce any additional messages.
>>>         Here's what it does report (full dmesg attached):
>>>
>>>         [  429.005909] [drm] amdgpu kernel modesetting enabled.
>>>         [  429.006080] [drm] initializing kernel modesetting (VEGA10
>>>         0x1002:0x687F 0x148C:0x2388 0xC3).
>>>         [  429.006082] amdgpu 0000:16:00.0: Fatal error during GPU init
>>>         [  429.006155] amdgpu: probe of 0000:16:00.0 failed with
>>>         error -12
>>>
>>>         I'm using the following commands to unload and reload the
>>>         module for testing. Since the card runs as an eGPU, I'm using
>>>         the i7-7700K iGPU (i915 module) as the primary, and these
>>>         commands work in a terminal without requiring a reboot.
>>>
>>>         sudo rmmod amdgpu
>>>         sudo modprobe -v amdgpu
>>>
>>>         Pulled UMR and tried to build it, but it fails at the CMake
>>>         step. I'll attach the log as a text file.
>>>         I'll also attach a full dmesg and lspci dump. uname -a below:
>>>         Linux testbox 4.15.15-041515-generic #201803311331 SMP Sat
>>>         Mar 31 17:34:21 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
>>>
>>>         Any other ideas on how I can debug this further? Feel I'm so
>>>         close, don't want to let this go.
>>>         Thank you in advance for your time.
>>>
>>>         Respectfully,
>>>         Daniel S. Moran (garwynn)
>>>         PC Hardware Editor - XDA-Developers
>>>         Phone: 1-559-316-0760/+81-90-5484-4155
>>>         Article Links: http://www.xda-developers.com/author/garwynn
>>>         E-mail: xdagarwynn at gmail.com | Twitter: @xdagarwynn
>>>
>>>
>>>
>>>
>>
>
>

-------------- next part --------------
A non-text attachment was scrubbed...
Name: Screenshot from 2018-04-07 16-08-59.png
Type: image/png
Size: 60529 bytes
Desc: not available
URL: <https://lists.freedesktop.org/archives/amd-gfx/attachments/20180409/31ccd2fc/attachment-0001.png>

