Crashes under Xen with Radeon graphics card
Juergen Gross
jgross at suse.com
Fri Dec 15 16:32:56 UTC 2023
On 15.12.23 17:19, Deucher, Alexander wrote:
> [AMD Official Use Only - General]
>
>> -----Original Message-----
>> From: Juergen Gross <jgross at suse.com>
>> Sent: Friday, December 15, 2023 11:13 AM
>> To: Deucher, Alexander <Alexander.Deucher at amd.com>; lkml <linux-kernel at vger.kernel.org>;
>> xen-devel at lists.xenproject.org; amd-gfx at lists.freedesktop.org
>> Cc: Koenig, Christian <Christian.Koenig at amd.com>; Pan, Xinhui
>> <Xinhui.Pan at amd.com>
>> Subject: Re: Crashes under Xen with Radeon graphics card
>>
>> On 15.12.23 17:04, Deucher, Alexander wrote:
>>> [Public]
>>>
>>>> -----Original Message-----
>>>> From: Juergen Gross <jgross at suse.com>
...
>>>> The crashes vary, but often the kernel accesses non-canonical
>>>> addresses or tries to map illegal physical addresses. Sometimes the
>>>> system is just hanging, either with softlockups or without any further signs
>>>> of being alive.
>>>>
>>>> I can easily reproduce the problem, so any debug patches to narrow
>>>> down the problem are welcome.
>>>
>>> There are still missing firmware required for proper operation. Please fix
>>> them up.
>>
>> That was the starting point, of course!
>
> Ah, ok. Thanks for clarifying. What exactly happens when you get this crash? System hang? Kernel oops? Is there anything in the dmesg when it happens?
As I wrote above, the symptoms vary quite a bit. The crash normally happens
within 20 seconds after the system is completely up. In one case it
survived for about 2 minutes.
One example:
[ 64.549114] BUG: unable to handle page fault for address: ffff888121291000
[ 64.562850] #PF: supervisor write access in kernel mode
[ 64.573352] #PF: error_code(0x0003) - permissions violation
[ 64.584589] PGD 2836067 P4D 2836067 PUD 3e73f7067 PMD 3e72ed067 PTE 8010000121291025
[ 64.600212] Oops: 0003 [#1] PREEMPT SMP NOPTI
[ 64.608985] CPU: 3 PID: 2090 Comm: kioslave5 Tainted: G E 6.7.0-rc5-default #974
[ 64.626721] Hardware name: Dell Inc. OptiPlex 9020/0PC5F7, BIOS A25 05/30/2019
[ 64.641193] RIP: e030:clear_page_erms+0x7/0x10
[ 64.650161] Code: 48 89 47 38 48 8d 7f 40 75 d9 90 c3 cc cc cc cc 0f 1f 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 b9 00 10 00 00 31 c0 <f3> aa c3 cc cc cc cc 66 90 90 90 90 90 90 90 90 90 90 90 90 90 90
[ 64.687996] RSP: e02b:ffffc9004206fb50 EFLAGS: 00010246
[ 64.698378] RAX: 0000000000000000 RBX: ffffea000484a400 RCX: 0000000000001000
[ 64.712780] RDX: 0000000000052dc0 RSI: 0000000000000003 RDI: ffff888121291000
[ 64.727154] RBP: 0000000000000901 R08: ffffea000484a440 R09: ffffea000484a600
[ 64.741491] R10: 0000000000000002 R11: 000000000000241e R12: ffff8883e7d21d80
[ 64.755843] R13: 000000000028d834 R14: 0000000000000901 R15: ffffea000484a400
[ 64.770207] FS: 00007f4c2b79d280(0000) GS:ffff888409380000(0000) knlGS:0000000000000000
[ 64.786487] CS: e030 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 64.798019] CR2: ffff888121291000 CR3: 000000014fef4000 CR4: 0000000000050660
[ 64.812411] Call Trace:
[ 64.817308] <TASK>
[ 64.821625] ? __die_body+0x1a/0x60
[ 64.828746] ? page_fault_oops+0x151/0x470
[ 64.837065] ? search_bpf_extables+0x65/0x70
[ 64.845717] ? fixup_exception+0x22/0x320
[ 64.853844] ? exc_page_fault+0xb3/0x150
[ 64.861792] ? asm_exc_page_fault+0x22/0x30
[ 64.870275] ? clear_page_erms+0x7/0x10
[ 64.878050] prep_new_page+0x97/0xb0
[ 64.885308] get_page_from_freelist+0x7a4/0x1f40
[ 64.894678] __alloc_pages+0x18b/0x350
[ 64.902270] ? kvmalloc_node+0x3a/0xd0
[ 64.909892] __kmalloc_large_node+0x7a/0x140
[ 64.918542] __kmalloc_node+0xc1/0x130
[ 64.926149] kvmalloc_node+0x3a/0xd0
[ 64.933399] proc_sys_call_handler+0xfa/0x230
[ 64.942259] vfs_read+0x22f/0x2e0
[ 64.949007] ksys_read+0xa5/0xe0
[ 64.955527] do_syscall_64+0x5d/0xe0
[ 64.962806] ? do_user_addr_fault+0x5b3/0x8a0
[ 64.971647] ? exc_page_fault+0x6f/0x150
[ 64.979587] entry_SYSCALL_64_after_hwframe+0x6f/0x77
[ 64.989821] RIP: 0033:0x7f4c29f06a3e
[ 64.997098] Code: 08 e8 f4 1e 02 00 66 0f 1f 44 00 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 64 8b 04 25 18 00 00 00 85 c0 75 14 0f 05 <48> 3d 00 f0 ff ff 77 5a f3 c3 0f 1f 84 00 00 00 00 00 41 54 55 49
[ 65.034962] RSP: 002b:00007ffd5a86f2b8 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
[ 65.050071] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f4c29f06a3e
[ 65.064415] RDX: 0000000000004000 RSI: 0000000002562c18 RDI: 0000000000000004
[ 65.078775] RBP: 0000000002561d60 R08: 00007f4c2abd3418 R09: 0000000000000028
[ 65.093155] R10: 000000000253b010 R11: 0000000000000246 R12: 0000000000004000
[ 65.107492] R13: 0000000000004000 R14: 0000000000000004 R15: 0000000002562c18
[ 65.121850] </TASK>
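
For anyone reading the oops above, the error code and the PTE value can be decoded by hand. The following is a minimal sketch in Python, assuming the standard x86-64 bit layout for page-fault error codes and 4 KiB page table entries. The helper names decode_pf_error and decode_pte are made up for illustration and are not kernel APIs. Bits not listed in the tables are simply left undecoded.

```python
# Sketch only: bit positions follow the x86-64 architecture definition.
# Function and table names are illustrative, not taken from the kernel.

# Page-fault error code bits (low bits only).
PF_ERROR_BITS = {
    0: "protection violation (page was present)",
    1: "write access",
    2: "user-mode access",
    3: "reserved bit set in a paging entry",
    4: "instruction fetch",
}

# Flag bits of a 4 KiB page table entry.
PTE_FLAG_BITS = {
    0: "PRESENT",
    1: "RW",
    2: "USER",
    3: "PWT",
    4: "PCD",
    5: "ACCESSED",
    6: "DIRTY",
    7: "PAT",
    8: "GLOBAL",
    63: "NX",
}

# Frame address field of a PTE: bits 12..51.
PTE_ADDR_MASK = 0x000F_FFFF_FFFF_F000


def decode_pf_error(code):
    """Return the meanings of the bits set in a page-fault error code."""
    return [name for bit, name in PF_ERROR_BITS.items() if code & (1 << bit)]


def decode_pte(pte):
    """Return (list of set flag names, frame address) for a PTE value."""
    flags = [name for bit, name in PTE_FLAG_BITS.items() if pte & (1 << bit)]
    return flags, pte & PTE_ADDR_MASK


if __name__ == "__main__":
    # Values copied from the oops: error_code(0x0003), PTE 8010000121291025.
    print(decode_pf_error(0x0003))
    flags, frame = decode_pte(0x8010000121291025)
    print(flags, hex(frame))
```

With the values from the log, the error code decodes to a write to a present page, and the PTE has PRESENT, USER, ACCESSED and NX set but not RW. That is consistent with the "supervisor write access" and "permissions violation" lines: clear_page_erms wrote to a page whose mapping is read-only. The decoding does not say why the mapping is read-only.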
>
>>
>> BTW, meanwhile I have tested kernel 5.19, which is working. I suspected that
>> the patch series merging swiotlb and swiotlb-xen could be to blame, but that
>> went into v5.19.
>
> Can you bisect?
I can try to find the offending commit, sure. I just wanted to share my current
findings in the hope that someone might have an idea ...
Juergen