Regression on linux-next (next-20241120) and drm-tip

Furquim, Ulisses ulisses.furquim at intel.com
Tue Dec 3 17:24:27 UTC 2024


+Rodrigo
________________________________
From: Intel-gfx <intel-gfx-bounces at lists.freedesktop.org> on behalf of Saarinen, Jani <jani.saarinen at intel.com>
Sent: Tuesday, December 3, 2024 8:22 AM
To: Thomas Weißschuh <linux at weissschuh.net>; Borah, Chaitanya Kumar <chaitanya.kumar.borah at intel.com>; Krogerus, Heikki <heikki.krogerus at intel.com>
Cc: Rafael J. Wysocki <rafael at kernel.org>; Kurmi, Suresh Kumar <suresh.kumar.kurmi at intel.com>; Coelho, Luciano <luciano.coelho at intel.com>; Nikula, Jani <jani.nikula at intel.com>; De Marchi, Lucas <lucas.demarchi at intel.com>; intel-gfx at lists.freedesktop.org <intel-gfx at lists.freedesktop.org>; intel-xe at lists.freedesktop.org <intel-xe at lists.freedesktop.org>; linux-pm at vger.kernel.org <linux-pm at vger.kernel.org>; Sebastian Reichel <sebastian.reichel at collabora.com>
Subject: RE: Regression on linux-next (next-20241120) and drm-tip

+ at Krogerus, Heikki

> -----Original Message-----
> From: Thomas Weißschuh <linux at weissschuh.net>
> Sent: Tuesday, 3 December 2024 18.08
> To: Borah, Chaitanya Kumar <chaitanya.kumar.borah at intel.com>
> Cc: Rafael J. Wysocki <rafael at kernel.org>; Kurmi, Suresh Kumar
> <suresh.kumar.kurmi at intel.com>; Coelho, Luciano <luciano.coelho at intel.com>;
> Saarinen, Jani <jani.saarinen at intel.com>; Nikula, Jani <jani.nikula at intel.com>;
> De Marchi, Lucas <lucas.demarchi at intel.com>; intel-gfx at lists.freedesktop.org;
> intel-xe at lists.freedesktop.org; linux-pm at vger.kernel.org; Sebastian Reichel
> <sebastian.reichel at collabora.com>
> Subject: Re: Regression on linux-next (next-20241120) and drm-tip
>
> On 2024-12-03 16:57:21+0100, Thomas Weißschuh wrote:
> > On 2024-12-03 15:42:23+0000, Borah, Chaitanya Kumar wrote:
> > >
> > >
> > > > -----Original Message-----
> > > > From: Thomas Weißschuh <linux at weissschuh.net>
> > > > Sent: Tuesday, December 3, 2024 8:20 PM
> > > > To: Rafael J. Wysocki <rafael at kernel.org>
> > > > Cc: Borah, Chaitanya Kumar <chaitanya.kumar.borah at intel.com>;
> > > > Kurmi, Suresh Kumar <suresh.kumar.kurmi at intel.com>; Coelho,
> > > > Luciano <luciano.coelho at intel.com>; Saarinen, Jani
> > > > <jani.saarinen at intel.com>; Nikula, Jani <jani.nikula at intel.com>;
> > > > De Marchi, Lucas <lucas.demarchi at intel.com>;
> > > > intel-gfx at lists.freedesktop.org; intel- xe at lists.freedesktop.org;
> > > > linux-pm at vger.kernel.org; Sebastian Reichel
> > > > <sebastian.reichel at collabora.com>
> > > > Subject: Re: Regression on linux-next (next-20241120) and drm-tip
> > > >
> > > > On 2024-12-03 15:33:21+0100, Rafael J. Wysocki wrote:
> > > > > On Tue, Dec 3, 2024 at 1:04 PM Thomas Weißschuh
> > > > <linux at weissschuh.net> wrote:
> > > > > >
> > > > > > On 2024-12-03 12:54:54+0100, Rafael J. Wysocki wrote:
> > > > > > > On Tue, Dec 3, 2024 at 7:51 AM Thomas Weißschuh
> > > > <linux at weissschuh.net> wrote:
> > > > > > > >
> > > > > > > > (+Cc Sebastian)
> > > > > > > >
> > > > > > > > Hi Chaitanya,
> > > > > > > >
> > > > > > > > On 2024-12-03 05:07:47+0000, Borah, Chaitanya Kumar wrote:
> > > > > > > > > Hope you are doing well. I am Chaitanya from the linux
> > > > > > > > > graphics team
> > > > in Intel.
> > > > > > > > >
> > > > > > > > > This mail is regarding a regression we are seeing in our
> > > > > > > > > CI runs[1] on
> > > > linux-next repository.
> > > > > > > >
> > > > > > > > Thanks for the report.
> > > > > > > >
> > > > > > > > > Since the version next-20241120 [2], we are seeing the
> > > > > > > > > following regression
> > > > > > > > >
> > > > > > > > > `````````````````````````````````````````````````````````````````````````````````
> > > > > > > > > <4>[   19.990743] Oops: general protection fault, probably for non-
> > > > canonical address 0xb11675ef8d1ccbce: 0000 [#1] PREEMPT SMP NOPTI
> > > > > > > > > <4>[   19.990760] CPU: 21 UID: 110 PID: 867 Comm: prometheus-
> > > > node Not tainted 6.12.0-next-20241120-next-20241120-gac24e26aa08f+
> > > > #1
> > > > > > > > > <4>[   19.990771] Hardware name: Intel Corporation Arrow Lake
> > > > Client Platform/MTL-S UDIMM 2DPC EVCRB, BIOS
> > > > MTLSFWI1.R00.4400.D85.2410100007 10/10/2024
> > > > > > > > > <4>[   19.990782] RIP:
> 0010:power_supply_get_property+0x3e/0xe0
> > > > > > > > > ````````````````````````````````````````````````````````
> > > > > > > > > `````` ``````````````````` Details log can be found in
> > > > > > > > > [3].
> > > > > > > > >
> > > > > > > > > After bisecting the tree, the following patch [4] seems
> > > > > > > > > to be the first
> > > > "bad"
> > > > > > > > > commit
> > > > > > > > >
> > > > > > > > > ````````````````````````````````````````````````````````
> > > > > > > > > `````` ```````````````````````````````````````````
> > > > > > > > > Commit 49000fee9e639f62ba1f965ed2ae4c5ad18d19e2
> > > > > > > > > Author:     Thomas Weißschuh <mailto:linux at weissschuh.net>
> > > > > > > > > AuthorDate: Sat Oct 5 12:05:03 2024 +0200
> > > > > > > > > Commit:     Sebastian Reichel
> > > > <mailto:sebastian.reichel at collabora.com>
> > > > > > > > > CommitDate: Tue Oct 15 22:22:20 2024 +0200
> > > > > > > > >     power: supply: core: add wakeup source inhibit by
> > > > > > > > > power_supply_config
> > > > > > > > > ````````````````````````````````````````````````````````
> > > > > > > > > `````` ```````````````````````````````````````````
> > > > > > > > >
> > > > > > > > > This is now seen in our drm-tip runs as well. [5]
> > > > > > > > >
> > > > > > > > > Could you please check why the patch causes this
> > > > > > > > > regression and
> > > > provide a fix if necessary?
> > > > > > > >
> > > > > > > > I don't see how this patch can lead to this error.
> > > > > > >
> > > > > > > It looks like the cfg->no_wakeup_source access reaches
> > > > > > > beyond the struct boundary for some reason.
> > > > > >
> > > > > > But the access to this field is only done in __power_supply_register().
> > > > > > The error reports however don't show this function at all,
> > > > > > they come from power_supply_uevent() and
> > > > > > power_supply_get_property() by which time the call to
> __power_supply_register() is long over.
> > > > > >
> > > > > > FWIW there is an uninitialized 'struct power_supply_config' in
> > > > > > drivers/hid/hid-corsair-void.c. But I highly doubt the test
> > > > > > machines are using that. (I'll send a patch later for it)
> > > > >
> > > > > So the only way I can think about in which the commit in
> > > > > question may lead to the reported issues is that changing the
> > > > > size of struct power_supply_config or its alignment makes an
> > > > > unexpected functional difference somewhere.
> > > >
> > > > Indeed. I'd really like to see this issue reproduced with KASAN.
> > > >
> > > > > AFAICS, this commit cannot be reverted by itself, so which
> > > > > commits on top of it need to be reverted in order to revert it cleanly?
> > > >
> > > > All the other patches from this series:
> > > > https://lore.kernel.org/lkml/20241005-power-supply-no-wakeup-sourc
> > > > e-v1-
> > > > 0-1d62bf9bcb1d at weissschuh.net/
> > > >
> > > > Could you point me to the full boot log in the drm-tip CI?
> > >
> > > Here is the log for drm-tip CI
> > >
> > > https://intel-gfx-ci.01.org/tree/drm-tip/IGT_8136/bat-arls-5/boot0.t
> > > xt
> >
> > Thanks!
> >
> > > I carried out another bisect and it points to the following commit
> > >
> > > commit 226ff2e681d006eada59a9693aa1976d4c15a7d4
> > > Author: Heikki Krogerus <heikki.krogerus at linux.intel.com>
> > > Date:   Wed Nov 6 17:06:05 2024 +0200
> > >
> > >     usb: typec: ucsi: Convert connector specific commands to bitmaps
> > >
> > >     That allows the fields in those command data structures to
> > >     be easily validated. If an unsupported field is accessed, a
> > >     warning is generated.
> >
> > Suspicous: The bitmaps introduced in this commit are right before the
> > psy and psy_desc members of 'struct ucsi_connector'.
> > So any out-of-bounds writes into these members would corrupt those
> > fields.
> > A corrupted power_supply_desc would fit both reported stacktraces.
>
> struct ucsi_connector {
>        ...
>
>         struct typec_capability typec_cap;
>
>        /* Cached command responses. */
>        DECLARE_BITMAP(cap, UCSI_GET_CONNECTOR_CAPABILITY_SIZE);
>        DECLARE_BITMAP(status, UCSI_GET_CONNECTOR_STATUS_SIZE);
>
> DECLARE_BITMAP() takes the size in number of *bits*
>
>         struct power_supply *psy;
>         struct power_supply_desc psy_desc;
>         u32 rdo;
>
>        ...
> }
>
> static int ucsi_get_connector_status(struct ucsi_connector *con, bool conn_ack) {
>        u64 command = UCSI_GET_CONNECTOR_STATUS |
> UCSI_CONNECTOR_NUMBER(con->num);
>        size_t size = min(UCSI_GET_CONNECTOR_STATUS_SIZE,
> UCSI_MAX_DATA_LENGTH(con->ucsi));
>        int ret;
>
>        ret = ucsi_send_command_common(con->ucsi, command, &con->status,
> size, conn_ack);
>
> ucsi_send_command_common() takes the size in number of *bytes*.
> This call corrupts psy and psy_desc in con.
>
>        return ret < 0 ? ret : 0;
> }
>
> >
> > > Reverting it seems to help locally. However, to confirm I have sent out a patch
> to our "try-bot"
> > >
> > > https://patchwork.freedesktop.org/series/142049/
> > >
> > > We can wait for its results.
> > >
> > > Regards
> > >
> > > Chaitanya
> > >
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://lists.freedesktop.org/archives/intel-xe/attachments/20241203/7e1abfec/attachment-0001.htm>


More information about the Intel-xe mailing list