amdgpu didn't start with pci=nocrs parameter, get error "Fatal error during GPU init"

Fri Dec 15 12:37:35 UTC 2023

On 15.12.23 at 12:45, Mikhail Gavrilov wrote:
> On Tue, Feb 28, 2023 at 5:43 PM Christian König
> <ckoenig.leichtzumerken at gmail.com> wrote:
>> The point is it doesn't need to talk to the amdgpu hardware. What it
>> does is that it talks to the good old VGA/VESA emulation and that just
>> happens to be still enabled by the BIOS/GRUB.
>>
>> And that VGA/VESA emulation doesn't need any BAR or whatever to keep the
>> hw running in the state where it was initialized before the kernel
>> started. The kernel just grabs the addresses where it needs to write the
>> display data and keeps going with that.
>>
>> But when a hw specific driver wants to load this is the first thing
>> which gets disabled because we need to load new firmware. And with the
>> BARs disabled this can't be re-enabled without rebooting the system.
>>
>>> My suggestion is that if
>>> amdgpu fails to talk to the hardware, then let another suitable driver
>>> do it. I attached a system log when I apply "pci=nocrs" with
>>> "modprobe.blacklist=amdgpu" for showing that graphics work right in
>>> this case.
>>> To do this, does the Linux module loading mechanism need to be refined?
>> That's actually working as expected. The real problem is that the BIOS
>> on that system is so broken that we can't access the hw correctly.
>>
>> What we could do is check the BARs very early on and refuse to
>> load when they are disabled. The problem with this approach is that there
>> are systems where it is normal that the BARs are disabled until the
>> driver loads and get enabled during the hardware initialization process.
>>
>> What you might want to look into is to find a quirk for the BIOS to
>> properly enable the nvme controller.
>>
> That's interesting. I noticed that now amdgpu could work even with
> parameter [pci=nocrs] on 6.7.0-0.rc4 and higher kernels.
> It means BARs became available?
> I attached here the kernel log and lspci. What's changed?

I have no idea :)

From the logs I can see that the AMDGPU now has the proper BARs assigned:

[    5.722015] pci 0000:03:00.0: [1002:73df] type 00 class 0x038000
[    5.722051] pci 0000:03:00.0: reg 0x10: [mem 0xf800000000-0xfbffffffff 64bit pref]
[    5.722081] pci 0000:03:00.0: reg 0x18: [mem 0xfc00000000-0xfc0fffffff 64bit pref]
[    5.722112] pci 0000:03:00.0: reg 0x24: [mem 0xfca00000-0xfcafffff]
[    5.722134] pci 0000:03:00.0: reg 0x30: [mem 0xfcb00000-0xfcb1ffff pref]
[    5.722368] pci 0000:03:00.0: PME# supported from D1 D2 D3hot D3cold
[    5.722484] pci 0000:03:00.0: 63.008 Gb/s available PCIe bandwidth, limited by 8.0 GT/s PCIe x8 link at 0000:00:01.1 (capable of 252.048 Gb/s with 16.0 GT/s PCIe x16 link)

And with that the driver can work perfectly fine.
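
For anyone who wants to compare the BAR assignment between two boots, here is a minimal sketch of how the "reg 0x..: [mem ...]" lines can be pulled out of a kernel log and turned into sizes. It assumes the log format shown in the excerpt above. The helper name parse_bars and the SAMPLE text are only illustrative and are not part of any kernel tooling.

```python
import re

# Matches kernel log lines such as:
#   pci 0000:03:00.0: reg 0x10: [mem 0xf800000000-0xfbffffffff 64bit pref]
BAR_RE = re.compile(
    r"pci (?P<dev>[0-9a-f:.]+): reg (?P<reg>0x[0-9a-f]+): "
    r"\[mem (?P<start>0x[0-9a-f]+)-(?P<end>0x[0-9a-f]+)(?P<flags>[^\]]*)\]"
)


def parse_bars(log_text):
    """Return one dict per memory BAR line found in log_text."""
    bars = []
    for m in BAR_RE.finditer(log_text):
        start = int(m.group("start"), 16)
        end = int(m.group("end"), 16)
        bars.append({
            "dev": m.group("dev"),
            "reg": m.group("reg"),
            "start": start,
            "size": end - start + 1,  # the logged range is inclusive
            "flags": m.group("flags").split(),
        })
    return bars


# Sample input copied from the log excerpt in this mail.
SAMPLE = """\
[    5.722051] pci 0000:03:00.0: reg 0x10: [mem 0xf800000000-0xfbffffffff 64bit pref]
[    5.722081] pci 0000:03:00.0: reg 0x18: [mem 0xfc00000000-0xfc0fffffff 64bit pref]
[    5.722112] pci 0000:03:00.0: reg 0x24: [mem 0xfca00000-0xfcafffff]
[    5.722134] pci 0000:03:00.0: reg 0x30: [mem 0xfcb00000-0xfcb1ffff pref]
"""

for bar in parse_bars(SAMPLE):
    print(bar["dev"], bar["reg"], hex(bar["start"]),
          bar["size"] // 1024, "KiB", " ".join(bar["flags"]))
```

For the excerpt above this gives a 16 GiB range at reg 0x10 and a 256 MiB range at reg 0x18. Running the same script on a log from a boot where amdgpu fails would show whether those lines are missing or carry different ranges.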

Have you updated the BIOS or added/removed some other hardware? Maybe 
somebody added a quirk for your BIOS into the PCIe code or something 
like that.

Regards,
Christian.