Regression on linux-next (next-20241120) and drm-tip
Lucas De Marchi
lucas.demarchi at intel.com
Tue Dec 3 17:29:36 UTC 2024
On Tue, Dec 03, 2024 at 03:50:28PM +0100, Thomas Weißschuh wrote:
>On 2024-12-03 15:33:21+0100, Rafael J. Wysocki wrote:
>> On Tue, Dec 3, 2024 at 1:04 PM Thomas Weißschuh <linux at weissschuh.net> wrote:
>> >
>> > On 2024-12-03 12:54:54+0100, Rafael J. Wysocki wrote:
>> > > On Tue, Dec 3, 2024 at 7:51 AM Thomas Weißschuh <linux at weissschuh.net> wrote:
>> > > >
>> > > > (+Cc Sebastian)
>> > > >
>> > > > Hi Chaitanya,
>> > > >
>> > > > On 2024-12-03 05:07:47+0000, Borah, Chaitanya Kumar wrote:
>> > > > > Hope you are doing well. I am Chaitanya from the linux graphics team in Intel.
>> > > > >
>> > > > > This mail is regarding a regression we are seeing in our CI runs[1] on linux-next repository.
>> > > >
>> > > > Thanks for the report.
>> > > >
>> > > > > Since the version next-20241120 [2], we are seeing the following regression
>> > > > >
>> > > > > `````````````````````````````````````````````````````````````````````````````````
>> > > > > <4>[ 19.990743] Oops: general protection fault, probably for non-canonical address 0xb11675ef8d1ccbce: 0000 [#1] PREEMPT SMP NOPTI
>> > > > > <4>[ 19.990760] CPU: 21 UID: 110 PID: 867 Comm: prometheus-node Not tainted 6.12.0-next-20241120-next-20241120-gac24e26aa08f+ #1
>> > > > > <4>[ 19.990771] Hardware name: Intel Corporation Arrow Lake Client Platform/MTL-S UDIMM 2DPC EVCRB, BIOS MTLSFWI1.R00.4400.D85.2410100007 10/10/2024
>> > > > > <4>[ 19.990782] RIP: 0010:power_supply_get_property+0x3e/0xe0
>> > > > > `````````````````````````````````````````````````````````````````````````````````
>> > > > > Details log can be found in [3].
>> > > > >
>> > > > > After bisecting the tree, the following patch [4] seems to be the first "bad"
>> > > > > commit
>> > > > >
>> > > > > `````````````````````````````````````````````````````````````````````````````````````````````````````````
>> > > > > Commit 49000fee9e639f62ba1f965ed2ae4c5ad18d19e2
>> > > > > Author: Thomas Weißschuh <mailto:linux at weissschuh.net>
>> > > > > AuthorDate: Sat Oct 5 12:05:03 2024 +0200
>> > > > > Commit: Sebastian Reichel <mailto:sebastian.reichel at collabora.com>
>> > > > > CommitDate: Tue Oct 15 22:22:20 2024 +0200
>> > > > > power: supply: core: add wakeup source inhibit by power_supply_config
>> > > > > `````````````````````````````````````````````````````````````````````````````````````````````````````````
>> > > > >
>> > > > > This is now seen in our drm-tip runs as well. [5]
>> > > > >
>> > > > > Could you please check why the patch causes this regression and provide a fix if necessary?
>> > > >
>> > > > I don't see how this patch can lead to this error.
>> > >
>> > > It looks like the cfg->no_wakeup_source access reaches beyond the
>> > > struct boundary for some reason.
>> >
>> > But the access to this field is only done in __power_supply_register().
>> > The error reports however don't show this function at all,
>> > they come from power_supply_uevent() and power_supply_get_property() by
>> > which time the call to __power_supply_register() is long over.
>> >
>> > FWIW there is an uninitialized 'struct power_supply_config' in
>> > drivers/hid/hid-corsair-void.c. But I highly doubt the test machines are
>> > using that. (I'll send a patch later for it)
>>
>> So the only way I can think about in which the commit in question may
>> lead to the reported issues is that changing the size of struct
>> power_supply_config or its alignment makes an unexpected functional
>> difference somewhere.
>
>Indeed. I'd really like to see this issue reproduced with KASAN.
I just reproduced it on another machine, a Lunar Lake. The signature is
slightly different. See the output with KASAN:
[ 86.490814] Oops: general protection fault, probably for non-canonical address 0xe95b10e51c79f1ba: 0000 [#2] PREEMPT SMP KASAN NOPTI
[ 86.502796] KASAN: maybe wild-memory-access in range [0x4ad8a728e3cf8dd0-0x4ad8a728e3cf8dd7]
[ 86.511296] CPU: 1 UID: 0 PID: 900 Comm: systemd-logind Tainted: G D 6.13.0-rc1-xe+ #2
[ 86.520675] Tainted: [D]=DIE
[ 86.523602] Hardware name: Intel Corporation Lunar Lake Client Platform/LNL-M LP5 RVP1, BIOS LNLMFWI1.R00.3220.D98.2407250720 07/25/2024
[ 86.532322] loop0: detected capacity change from 0 to 8
[ 86.535928] RIP: 0010:power_supply_get_property+0x105/0x2f0
[ 86.546843] Code: ed 31 c0 48 be 00 00 00 00 00 fc ff df eb 10 41 83 c5 01 49 63 c5 48 39 c8 0f 83 e9 00 00 00 49 8d 1c 80 48 89 d8 48 c1 e8 03 <0f> b6 14 30 48 89 d8 83 e0 07 83 c0 03 38 d0 7c 08 84 d2 0f 85 43
[ 86.565712] RSP: 0018:ffff888119d97900 EFLAGS: 00010203
[ 86.565720] RAX: 095b14e51c79f1ba RBX: 4ad8a728e3cf8dd5 RCX: 16c0acab92af81fb
[ 86.565724] RDX: 1ffff11020b6449a RSI: dffffc0000000000 RDI: ffff888105b224d0
[ 86.565727] RBP: ffff888119d97940 R08: 4ad8a728e3cf8dd5 R09: ffff888105b224b8
[ 86.565731] R10: 0000000000000000 R11: 0000000000000000 R12: ffff888119d97988
[ 86.565734] R13: 0000000000000000 R14: 0000000000000004 R15: ffff888110182000
[ 86.565737] FS: 000071c34d5f69c0(0000) GS:ffff888821080000(0000) knlGS:0000000000000000
[ 86.565742] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 86.565745] CR2: 00005eada2cefb98 CR3: 00000001286c0003 CR4: 0000000000f70ef0
[ 86.565749] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 86.565752] DR3: 0000000000000000 DR6: 00000000ffff07f0 DR7: 0000000000000400
[ 86.565755] PKRU: 55555554
[ 86.565758] Call Trace:
[ 86.565760] <TASK>
[ 86.565764] ? show_regs+0x6c/0x80
[ 86.565772] ? die_addr+0x41/0xc0
[ 86.565778] ? exc_general_protection+0x163/0x260
[ 86.661418] ? asm_exc_general_protection+0x27/0x30
[ 86.661431] ? power_supply_get_property+0x105/0x2f0
[ 86.661440] ? kernfs_seq_start+0x4b/0x3e0
[ 86.675515] power_supply_show_property+0x16c/0x5a0
[ 86.675524] ? __pfx_power_supply_show_property+0x10/0x10
[ 86.675532] ? kasan_save_track+0x14/0x40
[ 86.675542] dev_attr_show+0x48/0xe0
[ 86.675548] ? __asan_memset+0x39/0x50
[ 86.675554] sysfs_kf_seq_show+0x1f4/0x3e0
[ 86.701536] kernfs_seq_show+0x117/0x170
[ 86.701543] seq_read_iter+0x410/0x12d0
[ 86.701556] kernfs_fop_read_iter+0x428/0x5a0
[ 86.713826] ? rw_verify_area+0x6f/0x600
[ 86.713834] vfs_read+0x7ee/0xd30
[ 86.713842] ? vfs_getattr_nosec+0x236/0x380
[ 86.713849] ? __pfx_vfs_read+0x10/0x10
[ 86.713858] ? __do_sys_newfstat+0xf8/0x100
[ 86.733600] ksys_read+0x11a/0x240
[ 86.733608] ? __pfx_ksys_read+0x10/0x10
[ 86.741034] __x64_sys_read+0x72/0xc0
[ 86.741040] ? syscall_trace_enter+0xa6/0x290
[ 86.749155] x64_sys_call+0x1bd1/0x2650
[ 86.749162] do_syscall_64+0x91/0x180
[ 86.749169] ? do_syscall_64+0x9d/0x180
[ 86.760632] ? trace_irq_enable+0xed/0x130
[ 86.760642] ? syscall_exit_to_user_mode+0x95/0x250
[ 86.760650] ? do_syscall_64+0x9d/0x180
[ 86.760656] ? syscall_exit_to_user_mode+0x95/0x250
[ 86.778537] ? do_syscall_64+0x9d/0x180
[ 86.778543] entry_SYSCALL_64_after_hwframe+0x76/0x7e
[ 86.778548] RIP: 0033:0x71c34cf25701
[ 86.778553] Code: 00 48 8b 15 21 b7 0e 00 f7 d8 64 89 02 b8 ff ff ff ff eb bd e8 c0 cb 01 00 f3 0f 1e fa 80 3d 65 39 0f 00 00 74 13 31 c0 0f 05 <48> 3d 00 f0 ff ff 77 4f c3 66 0f 1f 44 00 00 55 48 89 e5 48 83 ec
[ 86.810010] RSP: 002b:00007ffcb91ef858 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
[ 86.810015] RAX: ffffffffffffffda RBX: 0000000000001008 RCX: 000071c34cf25701
[ 86.810019] RDX: 0000000000001008 RSI: 00005eada2cd2fa0 RDI: 000000000000000b
[ 86.810022] RBP: 00007ffcb91ef960 R08: 0000000000000000 R09: 0000000000000001
[ 86.810025] R10: 0000000000000003 R11: 0000000000000246 R12: 000000000000000b
[ 86.810028] R13: 0000000000001008 R14: ffffffffffffffff R15: 00005eada2cd2fa0
[ 86.810039] </TASK>
[ 86.810042] Modules linked in: overlay cmdlinepart spi_nor mtd intel_rapl_msr sunrpc wmi_bmof spi_intel_pci binfmt_misc processor_thermal_device_pci spi_intel processor_thermal_device processor_thermal_wt_hint processor_thermal_rfim intel_vpu processor_thermal_rapl mei_me drm_shmem_helper intel_rapl_common mei processor_thermal_wt_req idma64 drm_kms_helper processor_thermal_power_floor processor_thermal_mbox nls_iso8859_1 intel_skl_int3472_tps68470 tps68470_regulator clk_tps68470 int3403_thermal int340x_thermal_zone intel_pmc_core intel_hid int3400_thermal intel_vsec pmt_telemetry intel_skl_int3472_discrete sparse_keymap pmt_class acpi_thermal_rel intel_skl_int3472_common acpi_pad acpi_tad input_leds joydev msr fuse efi_pstore dm_multipath nfnetlink ip_tables x_tables raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx raid1 raid0 hid_sensor_custom hid_sensor_hub hid_generic intel_ishtp_hid crct10dif_pclmul crc32_pclmul polyval_clmulni polyval_generic ghash_clmulni_intel sha256_ssse3 usbhid
[ 86.856030] sha1_ssse3 hid e1000e intel_ish_ipc intel_ishtp ucsi_acpi thunderbolt typec_ucsi typec video wmi pinctrl_intel_platform aesni_intel crypto_simd cryptd
[ 86.960822] ---[ end trace 0000000000000000 ]---
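For anyone reading the register dump, the faulting address is consistent with a KASAN shadow lookup on a garbage pointer rather than a direct dereference. A minimal sketch of the arithmetic, assuming generic KASAN with a scale shift of 3 and a shadow offset equal to the value seen in RSI (0xdffffc0000000000); the helper names are made up for illustration, and the canonical check assumes 48-bit virtual addresses:

```python
# Sketch only: reproduces the address arithmetic visible in the oops above.
# Assumptions: generic KASAN (1 shadow byte per 8 bytes of memory) and a
# shadow offset equal to RSI in the trace. Helper names are hypothetical.

KASAN_SHADOW_SCALE_SHIFT = 3
KASAN_SHADOW_OFFSET = 0xDFFFFC0000000000  # value of RSI in the trace
MASK64 = (1 << 64) - 1


def shadow_address(addr):
    """Address of the shadow byte that KASAN checks for addr."""
    return ((addr >> KASAN_SHADOW_SCALE_SHIFT) + KASAN_SHADOW_OFFSET) & MASK64


def granule_range(addr):
    """First and last byte of the 8-byte granule that contains addr."""
    start = addr & ~0x7
    return start, start + 7


def is_canonical(addr, va_bits=48):
    """True if bits 63..(va_bits - 1) are all zero or all one."""
    top = addr >> (va_bits - 1)
    return top == 0 or top == (1 << (64 - va_bits + 1)) - 1


rbx = 0x4AD8A728E3CF8DD5  # pointer being checked (RBX and R08 in the trace)
print(hex(shadow_address(rbx)))                 # address that faulted
print([hex(a) for a in granule_range(rbx)])     # "wild-memory-access" range
print(is_canonical(shadow_address(rbx)))        # non-canonical, hence the GP fault
```

With RBX from the dump, the shadow address comes out as the reported 0xe95b10e51c79f1ba, and the granule matches the range KASAN printed. So the pointer the code was about to dereference was already garbage before the access.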
>
>> AFAICS, this commit cannot be reverted by itself, so which commits on
>> top of it need to be reverted in order to revert it cleanly?
>
>All the other patches from this series:
>https://lore.kernel.org/lkml/20241005-power-supply-no-wakeup-source-v1-0-1d62bf9bcb1d@weissschuh.net/
It doesn't seem like these changes in power_supply are really the
culprit. I tried taking all the power_supply changes on top of v6.12
and I don't see any issue there. It's looking more like memory
corruption.
>
>Could you point me to the full boot log in the drm-tip CI?
Here is one:
https://intel-gfx-ci.01.org/tree/intel-xe/xe-2310-5779fb3c12faf12589054127d60b1d36d56ba219/bat-lnl-1/boot0.txt
You can get more, with slightly different signatures, by heading to
https://intel-gfx-ci.01.org/tree/intel-xe/bat-all.html?testfilter=igt@runner@aborted
and clicking on one of the red cells for bat-lnl and then boot0.
Lucas De Marchi