[Intel-gfx] Regression in linux-next

Wysocki, Rafael J rafael.j.wysocki at intel.com
Wed Oct 11 16:14:03 UTC 2023


Hi,

On 10/11/2023 6:00 AM, Borah, Chaitanya Kumar wrote:
> Hello Rafael,
>
>> -----Original Message-----
>> From: Wysocki, Rafael J <rafael.j.wysocki at intel.com>
>> Sent: Tuesday, October 10, 2023 12:54 AM
>> To: Borah, Chaitanya Kumar <chaitanya.kumar.borah at intel.com>
>> Cc: intel-gfx at lists.freedesktop.org; Kurmi, Suresh Kumar
>> <suresh.kumar.kurmi at intel.com>; Saarinen, Jani <jani.saarinen at intel.com>
>> Subject: Re: Regression in linux-next
>>
>> Hi,
>>
>> On 10/9/2023 7:10 AM, Borah, Chaitanya Kumar wrote:
>>> Hello Rafael
>>>
>>>> Thanks for the report, I think that this is a lockdep assertion failing.
>>>> If that is correct, it should be straightforward to fix.
>>>> I'll take care of this early next week.
>>>> Thanks!
>>> Thank you for your response.  Please let us know when a fix is available.
>> It should be fixed in linux-next from today, by this commit:
>>
>> https://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm.git/commit/?h=linux-next&id=b44444027ce7714f309e96b804b7fb088a40d708
>>
>> Thanks!
> Thanks a lot for the fix. It seems to have fixed the issue on most of the machines, but we are still seeing a similar problem on a few of them.

Thanks for reporting this!


> This one has a different call stack but also seems to come from the thermal subsystem. Full logs are in [1].
>
> <4>[    4.392015] WARNING: CPU: 1 PID: 306 at drivers/thermal/thermal_trip.c:178 thermal_zone_trip_id+0x61/0x70
> <4>[    4.392022] Modules linked in: x86_pkg_temp_thermal coretemp kvm_intel mei_pxp mei_hdcp wmi_bmof kvm e1000e irqbypass crct10dif_pclmul video ptp crc32_pclmul ghash_clmulni_intel i2c_i801 mei_me pps_core mei i2c_smbus wmi
> <4>[    4.392057] CPU: 1 PID: 306 Comm: thermald Not tainted 6.6.0-rc5-next-20231010-next-20231010-gc0a6edb636cb+ #1
> <4>[    4.392061] Hardware name: System manufacturer System Product Name/Z170M-PLUS, BIOS 3610 03/29/2018
> <4>[    4.392063] RIP: 0010:thermal_zone_trip_id+0x61/0x70
> <4>[    4.392066] Code: 74 0c 83 c0 01 39 c8 75 f0 b8 c3 ff ff ff 5b 5d c3 cc cc cc cc 48 8d bf f0 05 00 00 be ff ff ff ff e8 63 a4 2d 00 85 c0 75 b5 <0f> 0b eb b1 66 2e 0f 1f 84 00 00 00 00 00 90 90 90 90 90 90 90 90
> <4>[    4.392069] RSP: 0018:ffffc9000156bda8 EFLAGS: 00010246
> <4>[    4.392073] RAX: 0000000000000000 RBX: ffff888103828ae8 RCX: 0000000000000001
> <4>[    4.392075] RDX: 0000000080000000 RSI: ffffffff823de5ab RDI: ffffffff823fdfba
> <4>[    4.392078] RBP: ffff888103a88800 R08: ffff888103828ae8 R09: 0000000000000001
> <4>[    4.392080] R10: 0000000000000001 R11: ffff88811494d3c0 R12: ffff888103a88818
> <4>[    4.392082] R13: ffff8881108bfa00 R14: ffff888103794408 R15: 0000000000000001
> <4>[    4.392084] FS:  00007f1f0d6d28c0(0000) GS:ffff88822e680000(0000) knlGS:0000000000000000
> <4>[    4.392087] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> <4>[    4.392089] CR2: 000055857c50b750 CR3: 0000000111efa005 CR4: 00000000003706f0
> <4>[    4.392091] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> <4>[    4.392093] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> <4>[    4.392095] Call Trace:
> <4>[    4.392097]  <TASK>
> <4>[    4.392100]  ? __warn+0x7f/0x170
> <4>[    4.392104]  ? thermal_zone_trip_id+0x61/0x70
> <4>[    4.392109]  ? report_bug+0x1f8/0x200
> <4>[    4.392116]  ? handle_bug+0x3c/0x70
> <4>[    4.392119]  ? exc_invalid_op+0x18/0x70
> <4>[    4.392123]  ? asm_exc_invalid_op+0x1a/0x20
> <4>[    4.392133]  ? thermal_zone_trip_id+0x61/0x70
> <4>[    4.392137]  ? thermal_zone_trip_id+0x5d/0x70
> <4>[    4.392141]  trip_point_show+0x18/0x40
> <4>[    4.392145]  dev_attr_show+0x15/0x60
> <4>[    4.392149]  sysfs_kf_seq_show+0xb5/0x100
> <4>[    4.392154]  seq_read_iter+0x111/0x450
> <4>[    4.392158]  ? check_object+0x133/0x320
> <4>[    4.392164]  vfs_read+0x20d/0x300
> <4>[    4.392175]  ksys_read+0x64/0xe0
> <4>[    4.392180]  do_syscall_64+0x3c/0x90
> <4>[    4.392183]  entry_SYSCALL_64_after_hwframe+0x6e/0xd8
> <4>[    4.392187] RIP: 0033:0x7f1f0e193392
>
> Can you please check what could be the reason for this issue?

Well, one more unhelpful lockdep assertion was recently added to the
thermal core, sorry about that.

This commit

https://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm.git/commit/?h=linux-next&id=108ffd12be24ba1d74b3314df8db32a0a6d55ba5

that will be merged into linux-next tomorrow if all goes well, should 
address this.
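
For readers unfamiliar with this class of warning, here is a minimal user-space sketch of the pattern, assuming the splat comes from a lookup helper that asserts a lock is held while its caller (trip_point_show() in the trace above) does not take that lock. Every name in it (struct zone, assert_zone_locked(), zone_trip_id(), show_unlocked(), show_locked()) is hypothetical; it is neither the thermal core code nor the commit linked above, and the ownership flag is only a crude single-threaded stand-in for what lockdep tracks per task.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-in for a thermal zone: a lock plus a small trip table. */
struct zone {
	pthread_mutex_t lock;
	bool lock_held;		/* crude ownership flag; single-threaded demo only */
	int warnings;		/* number of failed "lock held" assertions */
	int trips[3];
	int num_trips;
};

/* Imitation of a lockdep-style assertion: it warns but does not abort. */
static void assert_zone_locked(struct zone *z, const char *caller)
{
	if (!z->lock_held) {
		z->warnings++;
		fprintf(stderr, "WARNING: %s called without the zone lock\n", caller);
	}
}

static void zone_lock(struct zone *z)
{
	pthread_mutex_lock(&z->lock);
	z->lock_held = true;
}

static void zone_unlock(struct zone *z)
{
	z->lock_held = false;
	pthread_mutex_unlock(&z->lock);
}

/* Lookup helper carrying the assertion: maps a trip pointer to its index. */
static int zone_trip_id(struct zone *z, const int *trip)
{
	assert_zone_locked(z, __func__);

	for (int i = 0; i < z->num_trips; i++) {
		if (&z->trips[i] == trip)
			return i;
	}
	return -1;
}

/* Read path that never takes the lock: the result is right, but it warns. */
static int show_unlocked(struct zone *z, const int *trip)
{
	return zone_trip_id(z, trip);
}

/* Read path that takes the lock around the lookup: no warning. */
static int show_locked(struct zone *z, const int *trip)
{
	int id;

	zone_lock(z);
	id = zone_trip_id(z, trip);
	zone_unlock(z);
	return id;
}
```

In this sketch, show_unlocked() still returns the correct index but bumps the warning counter, while show_locked() returns the same index silently. A warning of this kind can be silenced either by taking the lock in the caller or by dropping the assertion when the lookup does not actually need the lock.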

Thanks!


> [1] https://intel-gfx-ci.01.org/tree/linux-next/next-20231010/fi-kbl-guc/boot0.txt
>
> Regards
>
> Chaitanya
>
>
>
>
>>
>>> From: Wysocki, Rafael J <rafael.j.wysocki at intel.com>
>>> Sent: Saturday, October 7, 2023 2:01 AM
>>> To: Borah, Chaitanya Kumar <chaitanya.kumar.borah at intel.com>
>>> Cc: intel-gfx at lists.freedesktop.org; Kurmi, Suresh Kumar
>>> <suresh.kumar.kurmi at intel.com>; Saarinen, Jani
>>> <jani.saarinen at intel.com>
>>> Subject: Re: Regression in linux-next
>>>
>>> Hi,
>>> On 10/5/2023 5:58 PM, Borah, Chaitanya Kumar wrote:
>>> Hello Rafael,
>>>
>>> Hope you are doing well. I am Chaitanya from the Linux graphics team at Intel.
>>> This mail is regarding a regression we are seeing in our CI runs [1] on the linux-next repository.
>>> Thanks for the report, I think that this is a lockdep assertion failing.
>>> If that is correct, it should be straightforward to fix.
>>> I'll take care of this early next week.
>>> Thanks!
>>>
>>> On next-20231003 [2], we are seeing the following error
>>>
>>> <4>[   14.093075] ------------[ cut here ]------------
>>> <4>[   14.097664] WARNING: CPU: 0 PID: 1 at drivers/thermal/thermal_trip.c:18 for_each_thermal_trip+0x83/0x90
>>> <4>[   14.106977] Modules linked in:
>>> <4>[   14.110017] CPU: 0 PID: 1 Comm: swapper/0 Tainted: G        W 6.6.0-rc4-next-20231003-next-20231003-gc9f2baaa18b5+ #1
>>> <4>[   14.121305] Hardware name: Intel Corporation Meteor Lake Client Platform/MTL-P DDR5 SODIMM SBS RVP, BIOS MTLPFWI1.R00.3323.D89.2309110529 09/11/2023
>>> <4>[   14.134478] RIP: 0010:for_each_thermal_trip+0x83/0x90
>>> <4>[   14.139496] Code: 5c 41 5d c3 cc cc cc cc 5b 31 c0 5d 41 5c 41 5d c3 cc cc cc cc 48 8d bf f0 05 00 00 be ff ff ff ff e8 21 a2 2d 00 85 c0 75 9a <0f> 0b eb 96 66 0f 1f 84 00 00 00 00 00 90 90 90 90 90 90 90 90 90
>>>
>>> Detailed logs can be found in [3].
>>>
>>> After bisecting the tree, the following patch [4] seems to be causing the regression.
>>> commit d5ea889246b112e228433a5f27f57af90ca0c1fb
>>> Author: Rafael J. Wysocki <rafael.j.wysocki at intel.com>
>>> Date:   Thu Sep 21 20:02:59 2023 +0200
>>>
>>>       ACPI: thermal: Do not use trip indices for cooling device binding
>>>
>>>       Rearrange the ACPI thermal driver's callback functions used for cooling
>>>       device binding and unbinding, acpi_thermal_bind_cooling_device() and
>>>       acpi_thermal_unbind_cooling_device(), respectively, so that they use trip
>>>       pointers instead of trip indices which is more straightforward and allows
>>>       the driver to become independent of the ordering of trips in the thermal
>>>       zone structure.
>>>
>>>       The general functionality is not expected to be changed.
>>>
>>>       Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki at intel.com>
>>>       Reviewed-by: Daniel Lezcano <daniel.lezcano at linaro.org>
>>>
>>> We also verified by moving the head of the tree to the previous commit.
>>>
>>> Could you please check why this patch causes the regression and if we can find a solution for it soon?
>>> [1] https://intel-gfx-ci.01.org/tree/linux-next/combined-alt.html?
>>> [2] https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/commit/?h=next-20231003
>>> [3] https://intel-gfx-ci.01.org/tree/linux-next/next-20231003/bat-mtlp-6/boot0.txt
>>> [4] https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/commit/?h=next-20231003&id=d5ea889246b112e228433a5f27f57af90ca0c1fb

