Crashes under Xen with Radeon graphics card

Juergen Gross jgross at suse.com
Fri Dec 15 16:12:37 UTC 2023


On 15.12.23 17:04, Deucher, Alexander wrote:
> [Public]
> 
>> -----Original Message-----
>> From: Juergen Gross <jgross at suse.com>
>> Sent: Friday, December 15, 2023 6:57 AM
>> To: lkml <linux-kernel at vger.kernel.org>; xen-devel at lists.xenproject.org;
>> amd-gfx at lists.freedesktop.org
>> Cc: Deucher, Alexander <Alexander.Deucher at amd.com>; Koenig, Christian
>> <Christian.Koenig at amd.com>; Pan, Xinhui <Xinhui.Pan at amd.com>
>> Subject: Crashes under Xen with Radeon graphics card
>>
>> Hi,
>>
>> I recently stumbled over a test system which showed crashes probably
>> resulting from memory being overwritten randomly.
>>
>> The problem is occurring only in Dom0 when running under Xen. It seems to
>> be present since at least kernel 6.3 (I didn't go back further yet), and it seems
>> NOT to be present in kernel 5.14.
>>
>> I tracked the problem down to the initialization of the graphics card (the
>> problem might surface only later, but at least an early initialization failure made
>> the problem go away).
>>
>> # lspci
>> 01:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Caicos XTX [Radeon HD 8490 / R5 235X OEM]
>> 01:00.1 Audio device: Advanced Micro Devices, Inc. [AMD/ATI] Caicos HDMI Audio [Radeon HD 6450 / 7450/8450/8490 OEM / R5 230/235/235X OEM]
>>
>> I had one working .config and one that produced the crashes, and I narrowed
>> the relevant difference down to firmware loading: the working .config didn't
>> have CONFIG_FW_LOADER_COMPRESS_XZ set, so firmware loading for the card
>> failed. This was of course not the real problem, but it caused the card
>> initialization to fail.
>>
>> I manually decompressed the firmware files one by one to see whether the
>> problem was in the decompressor or rather in the driver of the card.
>>
>> The last step without crash was:
>>
>> # dmesg | grep radeon
>> [   10.106405] [drm] radeon kernel modesetting enabled.
>> [   10.106455] radeon 0000:01:00.0: vgaarb: deactivate vga console
>> [   10.222944] radeon 0000:01:00.0: VRAM: 1024M 0x0000000000000000 - 0x000000003FFFFFFF (1024M used)
>> [   10.252921] radeon 0000:01:00.0: GTT: 1024M 0x0000000040000000 - 0x000000007FFFFFFF
>> [   10.278255] [drm] radeon: 1024M of VRAM memory ready
>> [   10.295828] [drm] radeon: 1024M of GTT memory ready.
>> [   10.295867] radeon 0000:01:00.0: Direct firmware load for radeon/CAICOS_pfp.bin succeeded
>> [   10.330846] radeon 0000:01:00.0: Direct firmware load for radeon/CAICOS_me.bin succeeded
>> [   10.330858] radeon 0000:01:00.0: Direct firmware load for radeon/BTC_rlc.bin succeeded
>> [   10.330870] radeon 0000:01:00.0: Direct firmware load for radeon/CAICOS_mc.bin failed with error -2
>> [   10.380979] ni_cp: Failed to load firmware "radeon/CAICOS_mc.bin"
>> [   10.381006] [drm:evergreen_init [radeon]] *ERROR* Failed to load firmware!
>> [   10.405765] radeon 0000:01:00.0: Fatal error during GPU init
>> [   10.432107] [drm] radeon: finishing device.
>> [   10.439179] [drm] radeon: ttm finalized
>> [   10.463203] radeon: probe of 0000:01:00.0 failed with error -2
>>
>> And with decompressing radeon/CAICOS_mc.bin I got:
>>
>> # dmesg | grep radeon
>> [   10.266491] [drm] radeon kernel modesetting enabled.
>> [   10.266552] radeon 0000:01:00.0: vgaarb: deactivate vga console
>> [   10.456047] radeon 0000:01:00.0: VRAM: 1024M 0x0000000000000000 - 0x000000003FFFFFFF (1024M used)
>> [   10.470270] radeon 0000:01:00.0: GTT: 1024M 0x0000000040000000 - 0x000000007FFFFFFF
>> [   10.566946] [drm] radeon: 1024M of VRAM memory ready
>> [   10.576891] [drm] radeon: 1024M of GTT memory ready.
>> [   10.586971] radeon 0000:01:00.0: Direct firmware load for radeon/CAICOS_pfp.bin succeeded
>> [   10.611886] radeon 0000:01:00.0: Direct firmware load for radeon/CAICOS_me.bin succeeded
>> [   10.611909] radeon 0000:01:00.0: Direct firmware load for radeon/BTC_rlc.bin succeeded
>> [   10.611938] radeon 0000:01:00.0: Direct firmware load for radeon/CAICOS_mc.bin succeeded
>> [   10.660599] radeon 0000:01:00.0: Direct firmware load for radeon/CAICOS_smc.bin failed with error -2
>> [   10.660601] smc: error loading firmware "radeon/CAICOS_smc.bin"
> 
> You also need to make sure CAICOS_smc.bin is available.

Of course. But the system also crashes when all firmware files can be loaded.

I thought it might help to see after which firmware the crashes start.
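
To make the "after which firmware" question easier to answer from a log, the
load results can be pulled out of the dmesg output. The following is only a
minimal sketch: the function names are made up for illustration, and it
assumes each kernel message is on a single line, as dmesg prints it, using
the "Direct firmware load for ..." wording seen in the logs quoted here.

```python
import re

# Matches messages such as:
#   "... Direct firmware load for radeon/CAICOS_mc.bin failed with error -2"
#   "... Direct firmware load for radeon/BTC_rlc.bin succeeded"
FW_LINE = re.compile(
    r"Direct firmware load for (?P<name>\S+) "
    r"(?P<status>succeeded|failed with error (?P<err>-?\d+))"
)


def firmware_load_results(lines):
    """Return (firmware name, error code or None) per load attempt, in log order."""
    results = []
    for line in lines:
        match = FW_LINE.search(line)
        if match:
            err = match.group("err")
            results.append((match.group("name"), int(err) if err is not None else None))
    return results


def first_failure(lines):
    """Return the first firmware that failed to load, or None if none failed."""
    for name, err in firmware_load_results(lines):
        if err is not None:
            return name
    return None
```

Fed the output of "dmesg | grep radeon" split into lines, first_failure()
names the first firmware file that could not be loaded. Error -2 is -ENOENT,
i.e. the file was not found.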

> 
>> [   10.661676] [drm] radeon: power management initialized
>> [   10.713666] radeon 0000:01:00.0: Direct firmware load for radeon/SUMO_uvd.bin failed with error -2
>> [   10.713668] radeon 0000:01:00.0: radeon_uvd: Can't load firmware "radeon/SUMO_uvd.bin"
>> [   10.713669] radeon 0000:01:00.0: failed UVD (-2) init.
> 
> And SUMO_uvd.bin.

Sure.

> 
>> [   10.714787] [drm] enabling PCIE gen 2 link speeds, disable with radeon.pcie_gen2=0
>> [   10.809213] radeon 0000:01:00.0: WB enabled
>> [   10.817528] radeon 0000:01:00.0: fence driver on ring 0 use gpu addr 0x0000000040000c00
>> [   10.833755] radeon 0000:01:00.0: fence driver on ring 3 use gpu addr 0x0000000040000c0c
>> [   10.850330] radeon 0000:01:00.0: radeon: MSI limited to 32-bit
>> [   10.862154] radeon 0000:01:00.0: radeon: using MSI.
>> [   10.871930] [drm] radeon: irq initialized.
>> [   11.062028] [drm] Initialized radeon 2.50.0 20080528 for 0000:01:00.0 on minor 0
>> [   11.119723] [drm:radeon_dvi_detect [radeon]] *ERROR* DVI-I-1: probed a monitor but no|invalid EDID
>> [   11.411370] fbcon: radeondrmfb (fb0) is primary device
>> [   11.507252] radeon 0000:01:00.0: [drm] fb0: radeondrmfb frame buffer device
>> [   11.674028] [drm:radeon_dvi_detect [radeon]] *ERROR* DVI-I-1: probed a monitor but no|invalid EDID
>> [   11.834317] [drm:radeon_dvi_detect [radeon]] *ERROR* DVI-I-1: probed a monitor but no|invalid EDID
>> [   28.313041] snd_hda_intel 0000:01:00.1: bound 0000:01:00.0 (ops radeon_audio_component_bind_ops [radeon])
>> [   44.371991] [drm:radeon_dvi_detect [radeon]] *ERROR* DVI-I-1: probed a monitor but no|invalid EDID
>> [   44.428068] [drm:radeon_dvi_detect [radeon]] *ERROR* DVI-I-1: probed a monitor but no|invalid EDID
>>
>> followed by a crash some seconds after the system was up.
>>
>> The crashes vary, but often the kernel accesses non-canonical addresses or
>> tries to map illegal physical addresses. Sometimes the system is just hanging,
>> either with softlockups or without any further signs of being alive.
>>
>> I can easily reproduce the problem, so any debug patches to narrow down the
>> problem are welcome.
> 
> Some firmware required for proper operation is still missing.  Please fix that up.

That was the starting point, of course!

BTW, in the meantime I have tested kernel 5.19, which works. I had suspected
that the patch series merging swiotlb and swiotlb-xen could be to blame, but
that series already went into v5.19, so it seems unlikely to be the cause.


Juergen