Regression on linux-next (next-20241120) and drm-tip

Lucas De Marchi lucas.demarchi at intel.com
Tue Dec 3 17:29:36 UTC 2024


On Tue, Dec 03, 2024 at 03:50:28PM +0100, Thomas Weißschuh wrote:
>On 2024-12-03 15:33:21+0100, Rafael J. Wysocki wrote:
>> On Tue, Dec 3, 2024 at 1:04 PM Thomas Weißschuh <linux at weissschuh.net> wrote:
>> >
>> > On 2024-12-03 12:54:54+0100, Rafael J. Wysocki wrote:
>> > > On Tue, Dec 3, 2024 at 7:51 AM Thomas Weißschuh <linux at weissschuh.net> wrote:
>> > > >
>> > > > (+Cc Sebastian)
>> > > >
>> > > > Hi Chaitanya,
>> > > >
>> > > > On 2024-12-03 05:07:47+0000, Borah, Chaitanya Kumar wrote:
>> > > > > Hope you are doing well. I am Chaitanya from the linux graphics team in Intel.
>> > > > >
>> > > > > This mail is regarding a regression we are seeing in our CI runs[1] on linux-next repository.
>> > > >
>> > > > Thanks for the report.
>> > > >
>> > > > > Since the version next-20241120 [2], we are seeing the following regression
>> > > > >
>> > > > > `````````````````````````````````````````````````````````````````````````````````
>> > > > > <4>[   19.990743] Oops: general protection fault, probably for non-canonical address 0xb11675ef8d1ccbce: 0000 [#1] PREEMPT SMP NOPTI
>> > > > > <4>[   19.990760] CPU: 21 UID: 110 PID: 867 Comm: prometheus-node Not tainted 6.12.0-next-20241120-next-20241120-gac24e26aa08f+ #1
>> > > > > <4>[   19.990771] Hardware name: Intel Corporation Arrow Lake Client Platform/MTL-S UDIMM 2DPC EVCRB, BIOS MTLSFWI1.R00.4400.D85.2410100007 10/10/2024
>> > > > > <4>[   19.990782] RIP: 0010:power_supply_get_property+0x3e/0xe0
>> > > > > `````````````````````````````````````````````````````````````````````````````````
>> > > > > Details log can be found in [3].
>> > > > >
>> > > > > After bisecting the tree, the following patch [4] seems to be the first "bad"
>> > > > > commit
>> > > > >
>> > > > > `````````````````````````````````````````````````````````````````````````````````````````````````````````
>> > > > > Commit 49000fee9e639f62ba1f965ed2ae4c5ad18d19e2
>> > > > > Author:     Thomas Weißschuh <mailto:linux at weissschuh.net>
>> > > > > AuthorDate: Sat Oct 5 12:05:03 2024 +0200
>> > > > > Commit:     Sebastian Reichel <mailto:sebastian.reichel at collabora.com>
>> > > > > CommitDate: Tue Oct 15 22:22:20 2024 +0200
>> > > > >     power: supply: core: add wakeup source inhibit by power_supply_config
>> > > > > `````````````````````````````````````````````````````````````````````````````````````````````````````````
>> > > > >
>> > > > > This is now seen in our drm-tip runs as well. [5]
>> > > > >
>> > > > > Could you please check why the patch causes this regression and provide a fix if necessary?
>> > > >
>> > > > I don't see how this patch can lead to this error.
>> > >
>> > > It looks like the cfg->no_wakeup_source access reaches beyond the
>> > > struct boundary for some reason.
>> >
>> > But the access to this field is only done in __power_supply_register().
>> > The error reports however don't show this function at all,
>> > they come from power_supply_uevent() and power_supply_get_property() by
>> > which time the call to __power_supply_register() is long over.
>> >
>> > FWIW there is an uninitialized 'struct power_supply_config' in
>> > drivers/hid/hid-corsair-void.c. But I highly doubt the test machines are
>> > using that. (I'll send a patch later for it)
>>
>> So the only way I can think about in which the commit in question may
>> lead to the reported issues is that changing the size of struct
>> power_supply_config or its alignment makes an unexpected functional
>> difference somewhere.
>
>Indeed. I'd really like to see this issue reproduced with KASAN.

I just reproduced it in another machine, Lunar Lake. Signature is
slightly different. See the output with KASAN:


[   86.490814] Oops: general protection fault, probably for non-canonical address 0xe95b10e51c79f1ba: 0000 [#2] PREEMPT SMP KASAN NOPTI
[   86.502796] KASAN: maybe wild-memory-access in range [0x4ad8a728e3cf8dd0-0x4ad8a728e3cf8dd7]
[   86.511296] CPU: 1 UID: 0 PID: 900 Comm: systemd-logind Tainted: G      D            6.13.0-rc1-xe+ #2
[   86.520675] Tainted: [D]=DIE
[   86.523602] Hardware name: Intel Corporation Lunar Lake Client Platform/LNL-M LP5 RVP1, BIOS LNLMFWI1.R00.3220.D98.2407250720 07/25/2024
[   86.532322] loop0: detected capacity change from 0 to 8
[   86.535928] RIP: 0010:power_supply_get_property+0x105/0x2f0
[   86.546843] Code: ed 31 c0 48 be 00 00 00 00 00 fc ff df eb 10 41 83 c5 01 49 63 c5 48 39 c8 0f 83 e9 00 00 00 49 8d 1c 80 48 89 d8 48 c1 e8 03 <0f> b6 14 30 48 89 d8 83 e0 07 83 c0 03 38 d0 7c 08 84 d2 0f 85 43
[   86.565712] RSP: 0018:ffff888119d97900 EFLAGS: 00010203
[   86.565720] RAX: 095b14e51c79f1ba RBX: 4ad8a728e3cf8dd5 RCX: 16c0acab92af81fb
[   86.565724] RDX: 1ffff11020b6449a RSI: dffffc0000000000 RDI: ffff888105b224d0
[   86.565727] RBP: ffff888119d97940 R08: 4ad8a728e3cf8dd5 R09: ffff888105b224b8
[   86.565731] R10: 0000000000000000 R11: 0000000000000000 R12: ffff888119d97988
[   86.565734] R13: 0000000000000000 R14: 0000000000000004 R15: ffff888110182000
[   86.565737] FS:  000071c34d5f69c0(0000) GS:ffff888821080000(0000) knlGS:0000000000000000
[   86.565742] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   86.565745] CR2: 00005eada2cefb98 CR3: 00000001286c0003 CR4: 0000000000f70ef0
[   86.565749] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[   86.565752] DR3: 0000000000000000 DR6: 00000000ffff07f0 DR7: 0000000000000400
[   86.565755] PKRU: 55555554
[   86.565758] Call Trace:
[   86.565760]  <TASK>
[   86.565764]  ? show_regs+0x6c/0x80
[   86.565772]  ? die_addr+0x41/0xc0
[   86.565778]  ? exc_general_protection+0x163/0x260
[   86.661418]  ? asm_exc_general_protection+0x27/0x30
[   86.661431]  ? power_supply_get_property+0x105/0x2f0
[   86.661440]  ? kernfs_seq_start+0x4b/0x3e0
[   86.675515]  power_supply_show_property+0x16c/0x5a0
[   86.675524]  ? __pfx_power_supply_show_property+0x10/0x10
[   86.675532]  ? kasan_save_track+0x14/0x40
[   86.675542]  dev_attr_show+0x48/0xe0
[   86.675548]  ? __asan_memset+0x39/0x50
[   86.675554]  sysfs_kf_seq_show+0x1f4/0x3e0
[   86.701536]  kernfs_seq_show+0x117/0x170
[   86.701543]  seq_read_iter+0x410/0x12d0
[   86.701556]  kernfs_fop_read_iter+0x428/0x5a0
[   86.713826]  ? rw_verify_area+0x6f/0x600
[   86.713834]  vfs_read+0x7ee/0xd30
[   86.713842]  ? vfs_getattr_nosec+0x236/0x380
[   86.713849]  ? __pfx_vfs_read+0x10/0x10
[   86.713858]  ? __do_sys_newfstat+0xf8/0x100
[   86.733600]  ksys_read+0x11a/0x240
[   86.733608]  ? __pfx_ksys_read+0x10/0x10
[   86.741034]  __x64_sys_read+0x72/0xc0
[   86.741040]  ? syscall_trace_enter+0xa6/0x290
[   86.749155]  x64_sys_call+0x1bd1/0x2650
[   86.749162]  do_syscall_64+0x91/0x180
[   86.749169]  ? do_syscall_64+0x9d/0x180
[   86.760632]  ? trace_irq_enable+0xed/0x130
[   86.760642]  ? syscall_exit_to_user_mode+0x95/0x250
[   86.760650]  ? do_syscall_64+0x9d/0x180
[   86.760656]  ? syscall_exit_to_user_mode+0x95/0x250
[   86.778537]  ? do_syscall_64+0x9d/0x180
[   86.778543]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
[   86.778548] RIP: 0033:0x71c34cf25701
[   86.778553] Code: 00 48 8b 15 21 b7 0e 00 f7 d8 64 89 02 b8 ff ff ff ff eb bd e8 c0 cb 01 00 f3 0f 1e fa 80 3d 65 39 0f 00 00 74 13 31 c0 0f 05 <48> 3d 00 f0 ff ff 77 4f c3 66 0f 1f 44 00 00 55 48 89 e5 48 83 ec
[   86.810010] RSP: 002b:00007ffcb91ef858 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
[   86.810015] RAX: ffffffffffffffda RBX: 0000000000001008 RCX: 000071c34cf25701
[   86.810019] RDX: 0000000000001008 RSI: 00005eada2cd2fa0 RDI: 000000000000000b
[   86.810022] RBP: 00007ffcb91ef960 R08: 0000000000000000 R09: 0000000000000001
[   86.810025] R10: 0000000000000003 R11: 0000000000000246 R12: 000000000000000b
[   86.810028] R13: 0000000000001008 R14: ffffffffffffffff R15: 00005eada2cd2fa0
[   86.810039]  </TASK>
[   86.810042] Modules linked in: overlay cmdlinepart spi_nor mtd intel_rapl_msr sunrpc wmi_bmof spi_intel_pci binfmt_misc processor_thermal_device_pci spi_intel processor_thermal_device processor_thermal_wt_hint processor_thermal_rfim intel_vpu processor_thermal_rapl mei_me drm_shmem_helper intel_rapl_common mei processor_thermal_wt_req idma64 drm_kms_helper processor_thermal_power_floor processor_thermal_mbox nls_iso88
59_1 intel_skl_int3472_tps68470 tps68470_regulator clk_tps68470 int3403_thermal int340x_thermal_zone intel_pmc_core intel_hid int3400_thermal intel_vsec pmt_telemetry intel_skl_int3472_discrete sparse_keymap pmt_class acpi_thermal_rel intel_skl_int3472_common acpi_pad acpi_tad input_leds joydev msr fuse efi_pstore dm_multipath nfnetlink ip_tables x_tables raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor a
sync_tx raid1 raid0 hid_sensor_custom hid_sensor_hub hid_generic intel_ishtp_hid crct10dif_pclmul crc32_pclmul polyval_clmulni polyval_generic ghash_clmulni_intel sha256_ssse3 usbhid
[   86.856030]  sha1_ssse3 hid e1000e intel_ish_ipc intel_ishtp ucsi_acpi thunderbolt typec_ucsi typec video wmi pinctrl_intel_platform aesni_intel crypto_simd cryptd
[   86.960822] ---[ end trace 0000000000000000 ]---


>
>> AFAICS, this commit cannot be reverted by itself, so which commits on
>> top of it need to be reverted in order to revert it cleanly?
>
>All the other patches from this series:
>https://lore.kernel.org/lkml/20241005-power-supply-no-wakeup-source-v1-0-1d62bf9bcb1d@weissschuh.net/

it doesn's seem like these changes in power_supply are really the
culprit.  I tried taking all the power_supply changes on top of v6.12
and I don't see any issue there. It's looking more like a memory
corruption

>
>Could you point me to the full boot log in the drm-tip CI?

Here is one:
https://intel-gfx-ci.01.org/tree/intel-xe/xe-2310-5779fb3c12faf12589054127d60b1d36d56ba219/bat-lnl-1/boot0.txt

You can get more, with slightly different signatures by heading to
https://intel-gfx-ci.01.org/tree/intel-xe/bat-all.html?testfilter=igt@runner@aborted
  - click in one of the red cells for bat-lnl and then boot0

Lucas De Marchi


More information about the Intel-xe mailing list