Crashes under Xen with Radeon graphics card

Juergen Gross jgross at suse.com
Fri Dec 15 16:32:56 UTC 2023


On 15.12.23 17:19, Deucher, Alexander wrote:
> [AMD Official Use Only - General]
> 
>> -----Original Message-----
>> From: Juergen Gross <jgross at suse.com>
>> Sent: Friday, December 15, 2023 11:13 AM
>> To: Deucher, Alexander <Alexander.Deucher at amd.com>; lkml <linux-kernel at vger.kernel.org>;
>> xen-devel at lists.xenproject.org; amd-gfx at lists.freedesktop.org
>> Cc: Koenig, Christian <Christian.Koenig at amd.com>; Pan, Xinhui
>> <Xinhui.Pan at amd.com>
>> Subject: Re: Crashes under Xen with Radeon graphics card
>>
>> On 15.12.23 17:04, Deucher, Alexander wrote:
>>> [Public]
>>>
>>>> -----Original Message-----
>>>> From: Juergen Gross <jgross at suse.com>

...

>>>> The crashes vary, but often the kernel accesses non-canonical
>>>> addresses or tries to map illegal physical addresses. Sometimes the
>>>> system is just hanging, either with softlockups or without any further
>>>> signs of being alive.
>>>>
>>>> I can easily reproduce the problem, so any debug patches to narrow
>>>> down the problem are welcome.
>>>
>>> There are still missing firmware required for proper operation.  Please
>>> fix them up.
>>
>> That was the starting point, of course!
> 
> Ah, ok.  Thanks for clarifying.  What exactly happens when you get this crash?  System hang?  Kernel oops?  Is there anything in the dmesg when it happens?

As I wrote above, the symptoms vary quite a bit. The crash normally
happens within 20 seconds after the system is completely up. I had one
case where it survived for about 2 minutes.

One example:

[   64.549114] BUG: unable to handle page fault for address: ffff888121291000
[   64.562850] #PF: supervisor write access in kernel mode
[   64.573352] #PF: error_code(0x0003) - permissions violation
[   64.584589] PGD 2836067 P4D 2836067 PUD 3e73f7067 PMD 3e72ed067 PTE 8010000121291025
[   64.600212] Oops: 0003 [#1] PREEMPT SMP NOPTI
[   64.608985] CPU: 3 PID: 2090 Comm: kioslave5 Tainted: G            E 6.7.0-rc5-default #974
[   64.626721] Hardware name: Dell Inc. OptiPlex 9020/0PC5F7, BIOS A25 05/30/2019
[   64.641193] RIP: e030:clear_page_erms+0x7/0x10
[   64.650161] Code: 48 89 47 38 48 8d 7f 40 75 d9 90 c3 cc cc cc cc 0f 1f 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 b9 00 10 00 00 31 c0 <f3> aa c3 cc cc cc cc 66 90 90 90 90 90 90 90 90 90 90 90 90 90 90
[   64.687996] RSP: e02b:ffffc9004206fb50 EFLAGS: 00010246
[   64.698378] RAX: 0000000000000000 RBX: ffffea000484a400 RCX: 0000000000001000
[   64.712780] RDX: 0000000000052dc0 RSI: 0000000000000003 RDI: ffff888121291000
[   64.727154] RBP: 0000000000000901 R08: ffffea000484a440 R09: ffffea000484a600
[   64.741491] R10: 0000000000000002 R11: 000000000000241e R12: ffff8883e7d21d80
[   64.755843] R13: 000000000028d834 R14: 0000000000000901 R15: ffffea000484a400
[   64.770207] FS:  00007f4c2b79d280(0000) GS:ffff888409380000(0000) knlGS:0000000000000000
[   64.786487] CS:  e030 DS: 0000 ES: 0000 CR0: 0000000080050033
[   64.798019] CR2: ffff888121291000 CR3: 000000014fef4000 CR4: 0000000000050660
[   64.812411] Call Trace:
[   64.817308]  <TASK>
[   64.821625]  ? __die_body+0x1a/0x60
[   64.828746]  ? page_fault_oops+0x151/0x470
[   64.837065]  ? search_bpf_extables+0x65/0x70
[   64.845717]  ? fixup_exception+0x22/0x320
[   64.853844]  ? exc_page_fault+0xb3/0x150
[   64.861792]  ? asm_exc_page_fault+0x22/0x30
[   64.870275]  ? clear_page_erms+0x7/0x10
[   64.878050]  prep_new_page+0x97/0xb0
[   64.885308]  get_page_from_freelist+0x7a4/0x1f40
[   64.894678]  __alloc_pages+0x18b/0x350
[   64.902270]  ? kvmalloc_node+0x3a/0xd0
[   64.909892]  __kmalloc_large_node+0x7a/0x140
[   64.918542]  __kmalloc_node+0xc1/0x130
[   64.926149]  kvmalloc_node+0x3a/0xd0
[   64.933399]  proc_sys_call_handler+0xfa/0x230
[   64.942259]  vfs_read+0x22f/0x2e0
[   64.949007]  ksys_read+0xa5/0xe0
[   64.955527]  do_syscall_64+0x5d/0xe0
[   64.962806]  ? do_user_addr_fault+0x5b3/0x8a0
[   64.971647]  ? exc_page_fault+0x6f/0x150
[   64.979587]  entry_SYSCALL_64_after_hwframe+0x6f/0x77
[   64.989821] RIP: 0033:0x7f4c29f06a3e
[   64.997098] Code: 08 e8 f4 1e 02 00 66 0f 1f 44 00 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 64 8b 04 25 18 00 00 00 85 c0 75 14 0f 05 <48> 3d 00 f0 ff ff 77 5a f3 c3 0f 1f 84 00 00 00 00 00 41 54 55 49
[   65.034962] RSP: 002b:00007ffd5a86f2b8 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
[   65.050071] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f4c29f06a3e
[   65.064415] RDX: 0000000000004000 RSI: 0000000002562c18 RDI: 0000000000000004
[   65.078775] RBP: 0000000002561d60 R08: 00007f4c2abd3418 R09: 0000000000000028
[   65.093155] R10: 000000000253b010 R11: 0000000000000246 R12: 0000000000004000
[   65.107492] R13: 0000000000004000 R14: 0000000000000004 R15: 0000000002562c18
[   65.121850]  </TASK>
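
For reading the two raw values in the oops above (error_code 0x0003 and
the PTE 8010000121291025), here is a minimal Python sketch that decodes
them using the architectural x86-64 bit layout. The function names
decode_pf_error and decode_pte are hypothetical helpers written for this
illustration, not kernel code. Software-defined bits (such as bit 52 in
this PTE) are deliberately not interpreted, and the sketch assumes the
printed PTE value can be read with the plain architectural layout.

```python
# Illustrative helpers only (hypothetical names, not from the kernel).
# Bit meanings follow the architectural x86-64 definitions.

def decode_pf_error(code):
    """Decode the low bits of an x86 page fault error code."""
    return {
        "protection_violation": bool(code & 0x1),  # 0 = page not present
        "write": bool(code & 0x2),                 # 0 = read access
        "user_mode": bool(code & 0x4),             # 0 = supervisor mode
        "reserved_bit": bool(code & 0x8),
        "instruction_fetch": bool(code & 0x10),
    }

# Architectural flag bits of a 4 KiB page table entry.
PTE_FLAGS = {
    0: "present", 1: "writable", 2: "user", 3: "pwt", 4: "pcd",
    5: "accessed", 6: "dirty", 7: "pat", 8: "global", 63: "nx",
}

def decode_pte(pte):
    """Return (list of set flag names, frame address in bits 12..51)."""
    flags = [name for bit, name in PTE_FLAGS.items() if (pte >> bit) & 1]
    frame_addr = pte & 0x000FFFFFFFFFF000
    return flags, frame_addr

err = decode_pf_error(0x0003)
flags, frame_addr = decode_pte(0x8010000121291025)
print(err)
print(flags, hex(frame_addr))
```

With the values from the log, the error code decodes to a supervisor-mode
write that hit a present page, and the PTE has the present bit set but
the writable bit clear, which is consistent with the "permissions
violation" line. The sketch does not say why the entry is read-only.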

> 
>>
>> BTW, meanwhile I have tested kernel 5.19, which is working. I suspected that
>> the patch series merging swiotlb and swiotlb-xen could be to blame, but that
>> went into v5.19.
> 
> Can you bisect?

I can try to find the offending commit, sure. I just wanted to share my current
findings in the hope that someone might have an idea ...


Juergen