Regression on linux-next (next-20241120) and drm-tip
Lucas De Marchi
lucas.demarchi at intel.com
Tue Dec 3 19:42:21 UTC 2024
On Tue, Dec 03, 2024 at 05:24:27PM +0000, Furquim, Ulisses wrote:
>+Rodrigo
>________________________________
>From: Intel-gfx <intel-gfx-bounces at lists.freedesktop.org> on behalf of Saarinen, Jani <jani.saarinen at intel.com>
>Sent: Tuesday, December 3, 2024 8:22 AM
>To: Thomas Weißschuh <linux at weissschuh.net>; Borah, Chaitanya Kumar <chaitanya.kumar.borah at intel.com>; Krogerus, Heikki <heikki.krogerus at intel.com>
>Cc: Rafael J. Wysocki <rafael at kernel.org>; Kurmi, Suresh Kumar <suresh.kumar.kurmi at intel.com>; Coelho, Luciano <luciano.coelho at intel.com>; Nikula, Jani <jani.nikula at intel.com>; De Marchi, Lucas <lucas.demarchi at intel.com>; intel-gfx at lists.freedesktop.org <intel-gfx at lists.freedesktop.org>; intel-xe at lists.freedesktop.org <intel-xe at lists.freedesktop.org>; linux-pm at vger.kernel.org <linux-pm at vger.kernel.org>; Sebastian Reichel <sebastian.reichel at collabora.com>
>Subject: RE: Regression on linux-next (next-20241120) and drm-tip
>
>+ at Krogerus, Heikki
>
>> -----Original Message-----
>> From: Thomas Weißschuh <linux at weissschuh.net>
>> Sent: Tuesday, 3 December 2024 18.08
>> To: Borah, Chaitanya Kumar <chaitanya.kumar.borah at intel.com>
>> Cc: Rafael J. Wysocki <rafael at kernel.org>; Kurmi, Suresh Kumar
>> <suresh.kumar.kurmi at intel.com>; Coelho, Luciano <luciano.coelho at intel.com>;
>> Saarinen, Jani <jani.saarinen at intel.com>; Nikula, Jani <jani.nikula at intel.com>;
>> De Marchi, Lucas <lucas.demarchi at intel.com>; intel-gfx at lists.freedesktop.org;
>> intel-xe at lists.freedesktop.org; linux-pm at vger.kernel.org; Sebastian Reichel
>> <sebastian.reichel at collabora.com>
>> Subject: Re: Regression on linux-next (next-20241120) and drm-tip
>>
>> On 2024-12-03 16:57:21+0100, Thomas Weißschuh wrote:
>> > On 2024-12-03 15:42:23+0000, Borah, Chaitanya Kumar wrote:
>> > >
>> > >
>> > > > -----Original Message-----
>> > > > From: Thomas Weißschuh <linux at weissschuh.net>
>> > > > Sent: Tuesday, December 3, 2024 8:20 PM
>> > > > To: Rafael J. Wysocki <rafael at kernel.org>
>> > > > Cc: Borah, Chaitanya Kumar <chaitanya.kumar.borah at intel.com>;
>> > > > Kurmi, Suresh Kumar <suresh.kumar.kurmi at intel.com>; Coelho,
>> > > > Luciano <luciano.coelho at intel.com>; Saarinen, Jani
>> > > > <jani.saarinen at intel.com>; Nikula, Jani <jani.nikula at intel.com>;
>> > > > De Marchi, Lucas <lucas.demarchi at intel.com>;
>> > > > intel-gfx at lists.freedesktop.org; intel- xe at lists.freedesktop.org;
>> > > > linux-pm at vger.kernel.org; Sebastian Reichel
>> > > > <sebastian.reichel at collabora.com>
>> > > > Subject: Re: Regression on linux-next (next-20241120) and drm-tip
>> > > >
>> > > > On 2024-12-03 15:33:21+0100, Rafael J. Wysocki wrote:
>> > > > > On Tue, Dec 3, 2024 at 1:04 PM Thomas Weißschuh
>> > > > <linux at weissschuh.net> wrote:
>> > > > > >
>> > > > > > On 2024-12-03 12:54:54+0100, Rafael J. Wysocki wrote:
>> > > > > > > On Tue, Dec 3, 2024 at 7:51 AM Thomas Weißschuh
>> > > > <linux at weissschuh.net> wrote:
>> > > > > > > >
>> > > > > > > > (+Cc Sebastian)
>> > > > > > > >
>> > > > > > > > Hi Chaitanya,
>> > > > > > > >
>> > > > > > > > On 2024-12-03 05:07:47+0000, Borah, Chaitanya Kumar wrote:
>> > > > > > > > > Hope you are doing well. I am Chaitanya from the linux
>> > > > > > > > > graphics team
>> > > > in Intel.
>> > > > > > > > >
>> > > > > > > > > This mail is regarding a regression we are seeing in our
>> > > > > > > > > CI runs[1] on
>> > > > linux-next repository.
>> > > > > > > >
>> > > > > > > > Thanks for the report.
>> > > > > > > >
>> > > > > > > > > Since the version next-20241120 [2], we are seeing the
>> > > > > > > > > following regression
>> > > > > > > > >
>> > > > > > > > > `````````````````````````````````````````````````````````````````````````````````
>> > > > > > > > > <4>[ 19.990743] Oops: general protection fault, probably for non-
>> > > > canonical address 0xb11675ef8d1ccbce: 0000 [#1] PREEMPT SMP NOPTI
>> > > > > > > > > <4>[ 19.990760] CPU: 21 UID: 110 PID: 867 Comm: prometheus-
>> > > > node Not tainted 6.12.0-next-20241120-next-20241120-gac24e26aa08f+
>> > > > #1
>> > > > > > > > > <4>[ 19.990771] Hardware name: Intel Corporation Arrow Lake
>> > > > Client Platform/MTL-S UDIMM 2DPC EVCRB, BIOS
>> > > > MTLSFWI1.R00.4400.D85.2410100007 10/10/2024
>> > > > > > > > > <4>[ 19.990782] RIP:
>> 0010:power_supply_get_property+0x3e/0xe0
>> > > > > > > > > ````````````````````````````````````````````````````````
>> > > > > > > > > `````` ``````````````````` Details log can be found in
>> > > > > > > > > [3].
>> > > > > > > > >
>> > > > > > > > > After bisecting the tree, the following patch [4] seems
>> > > > > > > > > to be the first
>> > > > "bad"
>> > > > > > > > > commit
>> > > > > > > > >
>> > > > > > > > > ````````````````````````````````````````````````````````
>> > > > > > > > > `````` ```````````````````````````````````````````
>> > > > > > > > > Commit 49000fee9e639f62ba1f965ed2ae4c5ad18d19e2
>> > > > > > > > > Author: Thomas Weißschuh <mailto:linux at weissschuh.net>
>> > > > > > > > > AuthorDate: Sat Oct 5 12:05:03 2024 +0200
>> > > > > > > > > Commit: Sebastian Reichel
>> > > > <mailto:sebastian.reichel at collabora.com>
>> > > > > > > > > CommitDate: Tue Oct 15 22:22:20 2024 +0200
>> > > > > > > > > power: supply: core: add wakeup source inhibit by
>> > > > > > > > > power_supply_config
>> > > > > > > > > ````````````````````````````````````````````````````````
>> > > > > > > > > `````` ```````````````````````````````````````````
>> > > > > > > > >
>> > > > > > > > > This is now seen in our drm-tip runs as well. [5]
>> > > > > > > > >
>> > > > > > > > > Could you please check why the patch causes this
>> > > > > > > > > regression and
>> > > > provide a fix if necessary?
>> > > > > > > >
>> > > > > > > > I don't see how this patch can lead to this error.
>> > > > > > >
>> > > > > > > It looks like the cfg->no_wakeup_source access reaches
>> > > > > > > beyond the struct boundary for some reason.
>> > > > > >
>> > > > > > But the access to this field is only done in __power_supply_register().
>> > > > > > The error reports however don't show this function at all,
>> > > > > > they come from power_supply_uevent() and
>> > > > > > power_supply_get_property() by which time the call to
>> __power_supply_register() is long over.
>> > > > > >
>> > > > > > FWIW there is an uninitialized 'struct power_supply_config' in
>> > > > > > drivers/hid/hid-corsair-void.c. But I highly doubt the test
>> > > > > > machines are using that. (I'll send a patch later for it)
>> > > > >
>> > > > > So the only way I can think about in which the commit in
>> > > > > question may lead to the reported issues is that changing the
>> > > > > size of struct power_supply_config or its alignment makes an
>> > > > > unexpected functional difference somewhere.
>> > > >
>> > > > Indeed. I'd really like to see this issue reproduced with KASAN.
>> > > >
>> > > > > AFAICS, this commit cannot be reverted by itself, so which
>> > > > > commits on top of it need to be reverted in order to revert it cleanly?
>> > > >
>> > > > All the other patches from this series:
>> > > > https://lore.kernel.org/lkml/20241005-power-supply-no-wakeup-sourc
>> > > > e-v1-
>> > > > 0-1d62bf9bcb1d at weissschuh.net/
>> > > >
>> > > > Could you point me to the full boot log in the drm-tip CI?
>> > >
>> > > Here is the log for drm-tip CI
>> > >
>> > > https://intel-gfx-ci.01.org/tree/drm-tip/IGT_8136/bat-arls-5/boot0.t
>> > > xt
>> >
>> > Thanks!
>> >
>> > > I carried out another bisect and it points to the following commit
>> > >
>> > > commit 226ff2e681d006eada59a9693aa1976d4c15a7d4
>> > > Author: Heikki Krogerus <heikki.krogerus at linux.intel.com>
>> > > Date: Wed Nov 6 17:06:05 2024 +0200
>> > >
>> > > usb: typec: ucsi: Convert connector specific commands to bitmaps
>> > >
>> > > That allows the fields in those command data structures to
>> > > be easily validated. If an unsupported field is accessed, a
>> > > warning is generated.
>> >
>> > Suspicous: The bitmaps introduced in this commit are right before the
>> > psy and psy_desc members of 'struct ucsi_connector'.
>> > So any out-of-bounds writes into these members would corrupt those
>> > fields.
>> > A corrupted power_supply_desc would fit both reported stacktraces.
>>
>> struct ucsi_connector {
>> ...
>>
>> struct typec_capability typec_cap;
>>
>> /* Cached command responses. */
>> DECLARE_BITMAP(cap, UCSI_GET_CONNECTOR_CAPABILITY_SIZE);
>> DECLARE_BITMAP(status, UCSI_GET_CONNECTOR_STATUS_SIZE);
>>
>> DECLARE_BITMAP() takes the size in number of *bits*
>>
>> struct power_supply *psy;
>> struct power_supply_desc psy_desc;
>> u32 rdo;
>>
>> ...
>> }
>>
>> static int ucsi_get_connector_status(struct ucsi_connector *con, bool conn_ack) {
>> u64 command = UCSI_GET_CONNECTOR_STATUS |
>> UCSI_CONNECTOR_NUMBER(con->num);
>> size_t size = min(UCSI_GET_CONNECTOR_STATUS_SIZE,
>> UCSI_MAX_DATA_LENGTH(con->ucsi));
>> int ret;
>>
>> ret = ucsi_send_command_common(con->ucsi, command, &con->status,
>> size, conn_ack);
>>
>> ucsi_send_command_common() takes the size in number of *bytes*.
>> This call corrupts psy and psy_desc in con.
>>
>> return ret < 0 ? ret : 0;
>> }
>>
>> >
>> > > Reverting it seems to help locally. However, to confirm I have sent out a patch
>> to our "try-bot"
Yeah, I can confirm it fixes it for me too - I reverted both
"usb: typec: ucsi: Convert connector specific commands to bitmaps"
"usb: typec: ucsi: Fix a missing bits to bytes conversion in ucsi_init()"
Looking at the patch and what the second commit is fixing it looks
suspicious.... adding some additional warnings shows e.g.:
ucsi_get_connector_status(). It's passing con->status as arg, but
size can be up to 152.
Considering UCSI_GET_CONNECTOR_STATUS_SIZE is used for DECLARE_BITMAP,
that's also bit-sized. This fixes it for me, although it could also use
the BITS_TO_BYTES:
-----------8<-------------
diff --git a/drivers/usb/typec/ucsi/ucsi.c b/drivers/usb/typec/ucsi/ucsi.c
index c435c0835744a..69caae84879e9 100644
--- a/drivers/usb/typec/ucsi/ucsi.c
+++ b/drivers/usb/typec/ucsi/ucsi.c
@@ -651,7 +651,8 @@ static void ucsi_unregister_altmodes(struct ucsi_connector *con, u8 recipient)
static int ucsi_get_connector_status(struct ucsi_connector *con, bool conn_ack)
{
u64 command = UCSI_GET_CONNECTOR_STATUS | UCSI_CONNECTOR_NUMBER(con->num);
- size_t size = min(UCSI_GET_CONNECTOR_STATUS_SIZE, UCSI_MAX_DATA_LENGTH(con->ucsi));
+ size_t size = min(sizeof(con->status),
+ UCSI_MAX_DATA_LENGTH(con->ucsi));
int ret;
ret = ucsi_send_command_common(con->ucsi, command, &con->status, size, conn_ack);
-----------8<-------------
Let me send that as a patch to the right mailing list.
Lucas De Marchi
More information about the Intel-gfx
mailing list