[Intel-gfx] Regression in linux-next

Borah, Chaitanya Kumar chaitanya.kumar.borah at intel.com
Fri Oct 13 14:05:28 UTC 2023


Hello Rafael,

> -----Original Message-----
> From: Borah, Chaitanya Kumar
> Sent: Wednesday, October 11, 2023 10:19 PM
> To: Wysocki, Rafael J <rafael.j.wysocki at intel.com>
> Cc: intel-gfx at lists.freedesktop.org; Kurmi, Suresh Kumar
> <Suresh.Kumar.Kurmi at intel.com>; Saarinen, Jani <jani.saarinen at intel.com>
> Subject: RE: Regression in linux-next
> 
> Hello Rafael,
> 
> > -----Original Message-----
> > From: Wysocki, Rafael J <rafael.j.wysocki at intel.com>
> > Sent: Wednesday, October 11, 2023 9:44 PM
> > To: Borah, Chaitanya Kumar <chaitanya.kumar.borah at intel.com>
> > Cc: intel-gfx at lists.freedesktop.org; Kurmi, Suresh Kumar
> > <suresh.kumar.kurmi at intel.com>; Saarinen, Jani
> > <jani.saarinen at intel.com>
> > Subject: Re: Regression in linux-next
> >
> > Hi,
> >
> > On 10/11/2023 6:00 AM, Borah, Chaitanya Kumar wrote:
> > > Hello Rafael,
> > >
> > >> -----Original Message-----
> > >> From: Wysocki, Rafael J <rafael.j.wysocki at intel.com>
> > >> Sent: Tuesday, October 10, 2023 12:54 AM
> > >> To: Borah, Chaitanya Kumar <chaitanya.kumar.borah at intel.com>
> > >> Cc: intel-gfx at lists.freedesktop.org; Kurmi, Suresh Kumar
> > >> <suresh.kumar.kurmi at intel.com>; Saarinen, Jani
> > >> <jani.saarinen at intel.com>
> > >> Subject: Re: Regression in linux-next
> > >>
> > >> Hi,
> > >>
> > >> On 10/9/2023 7:10 AM, Borah, Chaitanya Kumar wrote:
> > >>> Hello Rafael
> > >>>
> > >>>> Thanks for the report, I think that this is a lockdep assertion failing.
> > >>>> If that is correct, it should be straightforward to fix.
> > >>>> I'll take care of this early next week.
> > >>>> Thanks!
> > >>> Thank you for your response.  Please let us know when a fix is available.
> > >> It should be fixed in linux-next from today, by this commit:
> > >>
> > > >> https://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm.git/commit/?h=linux-next&id=b44444027ce7714f309e96b804b7fb088a40d708
> > >>
> > >> Thanks!
> > > > Thanks a lot for the fix. This seems to have fixed the issue on most of
> > > > the machines, but we are still seeing a similar problem on a few of them.
> >
> > Thanks for reporting this!
> >
> >
> > > This has a different call stack but seems to be from the same
> > > thermal subsystem. Full logs in [1]
> > >
> > > <4>[    4.392015] WARNING: CPU: 1 PID: 306 at
> > drivers/thermal/thermal_trip.c:178 thermal_zone_trip_id+0x61/0x70
> > > <4>[    4.392022] Modules linked in: x86_pkg_temp_thermal coretemp
> > kvm_intel mei_pxp mei_hdcp wmi_bmof kvm e1000e irqbypass
> > crct10dif_pclmul video ptp crc32_pclmul ghash_clmulni_intel i2c_i801
> > mei_me pps_core mei i2c_smbus wmi
> > > <4>[    4.392057] CPU: 1 PID: 306 Comm: thermald Not tainted 6.6.0-rc5-
> > next-20231010-next-20231010-gc0a6edb636cb+ #1
> > > <4>[    4.392061] Hardware name: System manufacturer System Product
> > Name/Z170M-PLUS, BIOS 3610 03/29/2018
> > > <4>[    4.392063] RIP: 0010:thermal_zone_trip_id+0x61/0x70
> > > <4>[    4.392066] Code: 74 0c 83 c0 01 39 c8 75 f0 b8 c3 ff ff ff 5b 5d c3 cc
> cc
> > cc cc 48 8d bf f0 05 00 00 be ff ff ff ff e8 63 a4 2d 00 85 c0 75 b5
> > <0f> 0b eb b1
> > 66 2e 0f 1f 84 00 00 00 00 00 90 90 90 90 90 90 90 90
> > > <4>[    4.392069] RSP: 0018:ffffc9000156bda8 EFLAGS: 00010246
> > > <4>[    4.392073] RAX: 0000000000000000 RBX: ffff888103828ae8 RCX:
> > 0000000000000001
> > > <4>[    4.392075] RDX: 0000000080000000 RSI: ffffffff823de5ab RDI:
> > ffffffff823fdfba
> > > <4>[    4.392078] RBP: ffff888103a88800 R08: ffff888103828ae8 R09:
> > 0000000000000001
> > > <4>[    4.392080] R10: 0000000000000001 R11: ffff88811494d3c0 R12:
> > ffff888103a88818
> > > <4>[    4.392082] R13: ffff8881108bfa00 R14: ffff888103794408 R15:
> > 0000000000000001
> > > <4>[    4.392084] FS:  00007f1f0d6d28c0(0000)
> GS:ffff88822e680000(0000)
> > knlGS:0000000000000000
> > > <4>[    4.392087] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > > <4>[    4.392089] CR2: 000055857c50b750 CR3: 0000000111efa005
> CR4:
> > 00000000003706f0
> > > <4>[    4.392091] DR0: 0000000000000000 DR1: 0000000000000000
> DR2:
> > 0000000000000000
> > > <4>[    4.392093] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7:
> > 0000000000000400
> > > <4>[    4.392095] Call Trace:
> > > <4>[    4.392097]  <TASK>
> > > <4>[    4.392100]  ? __warn+0x7f/0x170
> > > <4>[    4.392104]  ? thermal_zone_trip_id+0x61/0x70
> > > <4>[    4.392109]  ? report_bug+0x1f8/0x200
> > > <4>[    4.392116]  ? handle_bug+0x3c/0x70
> > > <4>[    4.392119]  ? exc_invalid_op+0x18/0x70
> > > <4>[    4.392123]  ? asm_exc_invalid_op+0x1a/0x20
> > > <4>[    4.392133]  ? thermal_zone_trip_id+0x61/0x70
> > > <4>[    4.392137]  ? thermal_zone_trip_id+0x5d/0x70
> > > <4>[    4.392141]  trip_point_show+0x18/0x40
> > > <4>[    4.392145]  dev_attr_show+0x15/0x60
> > > <4>[    4.392149]  sysfs_kf_seq_show+0xb5/0x100
> > > <4>[    4.392154]  seq_read_iter+0x111/0x450
> > > <4>[    4.392158]  ? check_object+0x133/0x320
> > > <4>[    4.392164]  vfs_read+0x20d/0x300
> > > <4>[    4.392175]  ksys_read+0x64/0xe0
> > > <4>[    4.392180]  do_syscall_64+0x3c/0x90
> > > <4>[    4.392183]  entry_SYSCALL_64_after_hwframe+0x6e/0xd8
> > > <4>[    4.392187] RIP: 0033:0x7f1f0e193392
> > >
> > > Can you please check what could be the reason for this issue?
> >
> > Well, one more unuseful lockdep assertion has been added recently to
> > the thermal core, sorry about that.
> >
> > This commit
> >
> > https://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm.git/commit/?h=linux-next&id=108ffd12be24ba1d74b3314df8db32a0a6d55ba5
> >
> > that will be merged into linux-next tomorrow if all goes well, should
> > address this.
> 
> Thank you for the fix. We will wait for it to get merged in linux-next.
> 

Happy to let you know that we did not see these issues in the latest linux-next run.

Thanks a lot for your quick resolutions.

Regards

Chaitanya

> Regards
> 
> Chaitanya
> 
> >
> > Thanks!
> >
> >
> > > [1]
> > > https://intel-gfx-ci.01.org/tree/linux-next/next-20231010/fi-kbl-guc
> > > /b
> > > oot0.txt
> > >
> > > Regards
> > >
> > > Chaitanya
> > >
> > >
> > >
> > >
> > >>
> > >>> From: Wysocki, Rafael J <rafael.j.wysocki at intel.com>
> > >>> Sent: Saturday, October 7, 2023 2:01 AM
> > >>> To: Borah, Chaitanya Kumar <chaitanya.kumar.borah at intel.com>
> > >>> Cc: intel-gfx at lists.freedesktop.org; Kurmi, Suresh Kumar
> > >>> <suresh.kumar.kurmi at intel.com>; Saarinen, Jani
> > >>> <jani.saarinen at intel.com>
> > >>> Subject: Re: Regression in linux-next
> > >>>
> > >>> Hi,
> > >>> On 10/5/2023 5:58 PM, Borah, Chaitanya Kumar wrote:
> > >>> Hello Rafael,
> > >>>
> > > >>> Hope you are doing well. I am Chaitanya from the linux graphics team in Intel.
> > > >>> This mail is regarding a regression we are seeing in our CI runs[1] on linux-next repository.
> > >>> Thanks for the report, I think that this is a lockdep assertion failing.
> > >>> If that is correct, it should be straightforward to fix.
> > >>> I'll take care of this early next week.
> > >>> Thanks!
> > >>>
> > >>> On next-20231003 [2], we are seeing the following error
> > >>>
> > >>> ``````````````````````````````````````````````````````````````````
> > >>> `` `` ````````` <4>[   14.093075] ------------[ cut here
> > >>> ]------------ <4>[ 14.097664] WARNING: CPU: 0 PID: 1 at
> > >>> drivers/thermal/thermal_trip.c:18
> > >>> for_each_thermal_trip+0x83/0x90 <4>[   14.106977] Modules linked in:
> > >>> <4>[   14.110017] CPU: 0 PID: 1 Comm: swapper/0 Tainted: G
> > >>> W 6.6.0-rc4-next-20231003-next-20231003-gc9f2baaa18b5+ #1 <4>[
> > >>> 14.121305] Hardware name: Intel Corporation Meteor Lake Client
> > >>> Platform/MTL-P DDR5 SODIMM SBS RVP, BIOS
> > >>> MTLPFWI1.R00.3323.D89.2309110529 09/11/2023 <4>[   14.134478]
> RIP:
> > >>> 0010:for_each_thermal_trip+0x83/0x90
> > >>> <4>[   14.139496] Code: 5c 41 5d c3 cc cc cc cc 5b 31 c0 5d 41 5c
> > >>> 41 5d c3 cc cc cc cc 48 8d bf f0 05 00 00 be ff ff ff ff e8 21 a2
> > >>> 2d 00
> > >>> 85 c0 75 9a <0f> 0b eb 96 66 0f 1f 84 00 00 00 00 00 90 90 90 90
> > >>> 90
> > >>> 90
> > >>> 90 90 90
> > >>>
> > >>> Details log can be found in [3].
> > >>>
> > > >>> After bisecting the tree, the following patch [4] seems to be causing the regression.
> > > >>> commit d5ea889246b112e228433a5f27f57af90ca0c1fb
> > > >>> Author: Rafael J. Wysocki <rafael.j.wysocki at intel.com>
> > > >>> Date:   Thu Sep 21 20:02:59 2023 +0200
> > > >>>
> > > >>>       ACPI: thermal: Do not use trip indices for cooling device binding
> > > >>>
> > > >>>       Rearrange the ACPI thermal driver's callback functions used for cooling
> > > >>>       device binding and unbinding, acpi_thermal_bind_cooling_device() and
> > > >>>       acpi_thermal_unbind_cooling_device(), respectively, so that they use trip
> > > >>>       pointers instead of trip indices which is more straightforward and allows
> > > >>>       the driver to become independent of the ordering of trips in the thermal
> > > >>>       zone structure.
> > > >>>
> > > >>>       The general functionality is not expected to be changed.
> > > >>>
> > > >>>       Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki at intel.com>
> > > >>>       Reviewed-by: Daniel Lezcano <daniel.lezcano at linaro.org>
> > >>>
> > >>> We also verified by moving the head of the tree to the previous commit.
> > >>>
> > > >>> Could you please check why this patch causes the regression and if we can find a solution for it soon?
> > > >>> [1] https://intel-gfx-ci.01.org/tree/linux-next/combined-alt.html?
> > > >>> [2] https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/commit/?h=next-20231003
> > > >>> [3] https://intel-gfx-ci.01.org/tree/linux-next/next-20231003/bat-mtlp-6/boot0.txt
> > > >>> [4] https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/commit/?h=next-20231003&id=d5ea889246b112e228433a5f27f57af90ca0c1fb

