[PATCH 5/5] drm/xe: disable wa_15015404425 for PTL B0
Matt Roper
matthew.d.roper at intel.com
Mon Jun 23 23:31:30 UTC 2025
On Mon, Jun 23, 2025 at 05:12:05PM -0400, Rodrigo Vivi wrote:
> On Fri, Jun 20, 2025 at 02:49:20PM -0700, Matt Atwood wrote:
> > This workaround only applies to PTL Compute Die A0. However, this
> > information cannot be determined until after the GT is brought up. This
> > means that we will assume that it is required for the initial bring up of
> > the gt. After GT init, the oob workarounds are enabled for the GT. Use
> > this flag to then manually set the bit in the soc oob bit field to 0
> > which will help performance after device bring up.
> >
> > Signed-off-by: Matt Atwood <matthew.s.atwood at intel.com>
> > ---
> > drivers/gpu/drm/xe/xe_pci.c | 6 ++++++
> > drivers/gpu/drm/xe/xe_wa_oob.rules | 1 +
> > 2 files changed, 7 insertions(+)
> >
> > diff --git a/drivers/gpu/drm/xe/xe_pci.c b/drivers/gpu/drm/xe/xe_pci.c
> > index ded0f3dc8d73..a624c3fb9498 100644
> > --- a/drivers/gpu/drm/xe/xe_pci.c
> > +++ b/drivers/gpu/drm/xe/xe_pci.c
> > @@ -34,6 +34,9 @@
> > #include "xe_tile.h"
> > #include "xe_wa.h"
> >
> > +#include "generated/xe_wa_oob.h"
> > +#include "generated/xe_soc_wa_oob.h"
> > +
> > enum toggle_d3cold {
> > D3COLD_DISABLE,
> > D3COLD_ENABLE,
> > @@ -890,6 +893,9 @@ static int xe_pci_probe(struct pci_dev *pdev, const struct pci_device_id *ent)
> > drm_dbg(&xe->drm, "d3cold: capable=%s\n",
> > str_yes_no(xe->d3cold.capable));
> >
> > + if (XE_WA(xe->tiles->media_gt, 15015404425_disable))
> > + xe->oob[XE_SOC_WA_OOB_15015404425] = 0;
>
> We are discussing this offline, but I need to make it very clear here
> that we should not move forward with this as is.
>
> Two unnaceptable points in here:
> \
> 1. _disable. We either enable or we don't. If we need to wait for the gmdid,
> let it be and enable the workaround after that. GMDID should be one of the
> first things and I confirmed this workaround is so rare that in an a0 situation
> you could wait to only enable after you read the gmd-id and confirm the
> media is A0.
This is exactly what we *can't* do for this workaround. The requirement
is that on any impacted platform, every single register read must be
preceded by four extra dummy writes. There's no such thing as "early
enough to ignore" --- it's specifically noted that even pre-OS firmware
and such needs to be careful about doing this as well. So this causes a
bit of a chicken-and-egg issue: we cannot read the GMD_ID register
without having the workaround active on impacted platforms, but we
cannot figure out whether the platform is impacted until after we've
read that register[*]. We also do a bunch of other register reads
during early driver probe before the xe_gt is initialized enough to be
able to service workaround lookup queries. So there are a few options
here:
* Mark the platform as always being active on PTL, then come back and
disable it later if/when we confirm that we're on a stepping that
isn't impacted by the issue. This is pretty simple conceptually, and
is quite likely something we'll need in the future for other
workarounds. That's the approach MattA has taken here.
* Add two separate workarounds in the driver: 15015404425_early and
15015404425. 15015404425_early is an SoC workaround that applies
unconditionally on PTL, and 15015404425 is a GT workaround that
applies only on specific media steppings. Both do exactly the same
thing (4 dummy writes before any register read), but
15015404425_early is checked before all early MMIO accesses and
15015404425 is checked on all others. The downside of this approach
is that we'd need to use a completely different set of MMIO
operations for early driver boot (i.e., no xe_mmio_read32 and such).
* Ignore this workaround completely if we can confirm that it only
impacts pre-production steppings. I don't think we have confirmation
yet that A-step hardware is preprod-only, so we can't take this easy
path yet.
>
> 2. Don't mix SoC with Media. If the Bug is SoC don't wait of the media stepping
> and check directly for the SoC that needs this. So, don't create an infra that
> already has an exception in it. And if possible, avoid the infra at all. This
> might bring even more confusion to the w/a handling.
>
> This w/a in specific here is soc, but it is getting mapped to our media-ip,
> so let's use the media check...
I think more explanation needs to be added to the patches to clarify
exactly what's going on since it's still a bit non-obvious. The reality
is that our modern platforms aren't actually "SOCs" at all anymore;
they're technically MCP's --- Multi Chip Packages with logic (and
sometimes hardware issues) spread across the various dies. The hardware
teams still refer certain things to be "SoC" logic for historical
reasons, even though that logic is technically distributed across
multiple chips these days.
We absolutely need some kind of "device" workaround framework that isn't
tied to the GT and that can be used before GTs are even up; we have some
non-GT workarounds today that we're not really handling properly, and we
know there are more coming. I think when this workaround first came up
I suggested a "device workaround" infrastructure and using the same
general XE_WA() calls, with a _Generic() implementation to lookup the
status of a workaround in either the xe_device or the xe_gt depending on
which is passed (with the xe_device workaround table being initialized
much earlier in the probe sequence).
Issues outside of the GT can live in different places: the GCD die, the
compute (aka CPU) die, the IO die, etc. Each of these have their own
steppings, and the MCP itself has a stepping too. In some of these
cases, the proper way to determine a stepping is to map PCI revid into a
die stepping. In other cases the proper approach is to "fingerprint" a
die's stepping by inspecting some other IP that lives on the same die
(similar to how we used to fingerprint the PCH back in the day by
inspecting the ISA bus device). For this workaround the stepping we
care about is the compute (CPU) die stepping, and since standalone media
happens to live on that same die, the bspec tells us how to figure out
compute die stepping from media stepping (A-step compute die <=> A-step
media IP in this case, although that isn't something that's guaranteed
to always be true).
[*] An alternative to fingerprinting compute die based on media stepping
would probably be to check the CPU stepping through whatever
mechanism is used to print the stepping in /proc/cpuinfo. We'd have
to find separate documentation on how to map the numeric value there
into somthing like "B0;" I don't know off the top of my head where
that documentation would be but the core kernel guys could probably
point us in the right direction.
Matt
>
> > +
> > return 0;
> >
> > err_driver_cleanup:
> > diff --git a/drivers/gpu/drm/xe/xe_wa_oob.rules b/drivers/gpu/drm/xe/xe_wa_oob.rules
> > index 8c2aa48cb33a..822cbff13819 100644
> > --- a/drivers/gpu/drm/xe/xe_wa_oob.rules
> > +++ b/drivers/gpu/drm/xe/xe_wa_oob.rules
> > @@ -71,3 +71,4 @@ no_media_l3 MEDIA_VERSION(3000)
> > # primary GT GMDID
> > 14022085890 GRAPHICS_VERSION(2001)
> > 16026007364 MEDIA_VERSION(3000)
> > +15015404425_disable PLATFORM(PANTHERLAKE), MEDIA_STEP(B0, FOREVER)
> > --
> > 2.49.0
> >
--
Matt Roper
Graphics Software Engineer
Linux GPU Platform Enablement
Intel Corporation
More information about the Intel-xe
mailing list