Regression on linux-next (next-20250321)

Tue Mar 25 05:39:39 UTC 2025

Hello Nicolin,

Hope you are doing well. I am Chaitanya from the Linux graphics team at Intel.

This mail is regarding a regression we are seeing in our CI runs [1] on the linux-next repository.

Since next-20250321 [2], we have been seeing the following regression:

`````````````````````````````````````````````````````````````````````````````````
<4>[    0.226495] Unpatched return thunk in use. This should not happen!
<4>[    0.226502] WARNING: CPU: 0 PID: 1 at arch/x86/kernel/cpu/bugs.c:3107 __warn_thunk+0x62/0x70
<4>[    0.226513] Modules linked in:
<4>[    0.226521] CPU: 0 UID: 0 PID: 1 Comm: swapper/0 Not tainted 6.14.0-rc7-next-20250321-next-20250321-g9388ec571cb1+ #1 PREEMPT(voluntary) 
<4>[    0.226532] Hardware name: ASUS System Product Name/PRIME Z790-P WIFI, BIOS 0812 02/24/2023
<4>[    0.226539] RIP: 0010:__warn_thunk+0x62/0x70
<4>[    0.226544] Code: 34 4c 5d 02 01 e8 fe f6 a7 00 84 c0 75 d9 48 c7 c7 f8 bf 0d 83 e8 7e c6 08 00 48 c7 c7 a0 a2 a0 82 e8 e2 f6 a7 00 84 c0 75 bd <0f> 0b eb b9 cc cc cc cc cc cc cc cc cc cc 90 90 90 90 90 90 90 90
<4>[    0.226559] RSP: 0000:ffffc90000067d78 EFLAGS: 00010246
<4>[    0.226565] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
<4>[    0.226571] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
<4>[    0.226577] RBP: ffffc90000067d80 R08: 0000000000000000 R09: 0000000000000000
<4>[    0.226583] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
<4>[    0.226589] R13: ffffffff83c9417c R14: ffff88887f344bc0 R15: ffff888102370100
<4>[    0.226595] FS:  0000000000000000(0000) GS:ffff8888dacfd000(0000) knlGS:0000000000000000
<4>[    0.226602] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
<4>[    0.226608] CR2: ffff88887f7ff000 CR3: 000000000344a000 CR4: 0000000000f50ef0
<4>[    0.226614] PKRU: 55555554
<4>[    0.226617] Call Trace:
<4>[    0.226620]  <TASK>
<4>[    0.226624]  ? show_regs+0x6c/0x80
<4>[    0.226630]  ? __warn+0x94/0x210
<4>[    0.226635]  ? __warn_thunk+0x62/0x70
<4>[    0.226640]  ? __report_bug+0x110/0x280
<4>[    0.227000]  ? __lock_acquire+0x447/0x2c70
<4>[    0.227011]  ? _prb_read_valid+0x25a/0x310
<4>[    0.227018]  ? __lock_acquire+0x447/0x2c70
<4>[    0.227024]  ? prb_read_valid+0x1c/0x30
<4>[    0.227037]  ? lock_acquire+0xc4/0x330
<4>[    0.227055]  ? _prb_read_valid+0x25a/0x310
<4>[    0.227073]  ? __warn_thunk+0x62/0x70
<4>[    0.227081]  ? report_bug+0x24/0x80
<4>[    0.227089]  ? handle_bug+0x16a/0x2a0
<4>[    0.227098]  ? exc_invalid_op+0x18/0x80
<4>[    0.227106]  ? asm_exc_invalid_op+0x1b/0x20
<4>[    0.227122]  ? __warn_thunk+0x62/0x70
<4>[    0.227130]  ? __warn_thunk+0x5e/0x70
<4>[    0.227135]  ? iommu_dma_ranges_sort+0x40/0x40
<4>[    0.227144]  warn_thunk_thunk+0x16/0x30
<4>[    0.227157]  do_one_initcall+0x5d/0x460
<4>[    0.227171]  kernel_init_freeable+0x3ac/0x530
<4>[    0.227187]  ? __pfx_kernel_init+0x10/0x10
<4>[    0.227196]  kernel_init+0x1b/0x200
<4>[    0.227203]  ret_from_fork+0x44/0x70
<4>[    0.227210]  ? __pfx_kernel_init+0x10/0x10
<4>[    0.227217]  ret_from_fork_asm+0x1a/0x30
<4>[    0.227236]  </TASK>
`````````````````````````````````````````````````````````````````````````````````
The detailed log can be found in [3].

After bisecting the tree, the following patch [4] seems to be the first "bad"
commit:

`````````````````````````````````````````````````````````````````````````````````````````````````````````
commit e009e088d88e8402539f9595b10c0014125a70c1
Author: Nicolin Chen <nicolinc at nvidia.com>
Date:   Thu Mar 6 13:00:49 2025 -0800

    iommu: Drop sw_msi from iommu_domain

    There are only two sw_msi implementations in the entire system, thus it's
    not very necessary to have an sw_msi pointer.

    Instead, check domain->cookie_type to call the two sw_msi implementations
    directly from the core code.
`````````````````````````````````````````````````````````````````````````````````````````````````````````

We also verified that the issue is not seen when the patch is reverted.

Could you please check why the patch causes this regression and provide a fix if necessary?

Thank you.

Regards,

Chaitanya

[1] https://intel-gfx-ci.01.org/tree/linux-next/combined-alt.html?
[2] https://web.git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/commit/?h=next-20250321 
[3] https://intel-gfx-ci.01.org/tree/linux-next/next-20250321/bat-rpls-4/boot0.txt 
[4] https://web.git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/commit/?h=next-20250321&id=e009e088d88e8402539f9595b10c0014125a70c1