[PATCH 5/5] drm/xe: disable wa_15015404425 for PTL B0

Tue Jun 24 19:25:37 UTC 2025

On Mon, Jun 23, 2025 at 04:31:30PM -0700, Matt Roper wrote:
> On Mon, Jun 23, 2025 at 05:12:05PM -0400, Rodrigo Vivi wrote:
> > On Fri, Jun 20, 2025 at 02:49:20PM -0700, Matt Atwood wrote:
> > > This workaround only applies to PTL Compute Die A0. However, this
> > > information cannot be determined until after the GT is brought up. This
> > > means that we will assume that it is required for the initial bring up of
> > > the gt. After GT init, the oob workarounds are enabled for the GT. Use
> > > this flag to then manually set the bit in the soc oob bit field to 0
> > > which will help performance after device bring up.
> > > 
> > > Signed-off-by: Matt Atwood <matthew.s.atwood at intel.com>
> > > ---
> > >  drivers/gpu/drm/xe/xe_pci.c        | 6 ++++++
> > >  drivers/gpu/drm/xe/xe_wa_oob.rules | 1 +
> > >  2 files changed, 7 insertions(+)
> > > 
> > > diff --git a/drivers/gpu/drm/xe/xe_pci.c b/drivers/gpu/drm/xe/xe_pci.c
> > > index ded0f3dc8d73..a624c3fb9498 100644
> > > --- a/drivers/gpu/drm/xe/xe_pci.c
> > > +++ b/drivers/gpu/drm/xe/xe_pci.c
> > > @@ -34,6 +34,9 @@
> > >  #include "xe_tile.h"
> > >  #include "xe_wa.h"
> > >  
> > > +#include "generated/xe_wa_oob.h"
> > > +#include "generated/xe_soc_wa_oob.h"
> > > +
> > >  enum toggle_d3cold {
> > >  	D3COLD_DISABLE,
> > >  	D3COLD_ENABLE,
> > > @@ -890,6 +893,9 @@ static int xe_pci_probe(struct pci_dev *pdev, const struct pci_device_id *ent)
> > >  	drm_dbg(&xe->drm, "d3cold: capable=%s\n",
> > >  		str_yes_no(xe->d3cold.capable));
> > >  
> > > +	if (XE_WA(xe->tiles->media_gt, 15015404425_disable))
> > > +		xe->oob[XE_SOC_WA_OOB_15015404425] = 0;
> > 
> > We are discussing this offline, but I need to make it very clear here
> > that we should not move forward with this as is.
> > 
> > Two unnaceptable points in here:
> > \
> > 1. _disable. We either enable or we don't. If we need to wait for the gmdid,
> > let it be and enable the workaround after that. GMDID should be one of the
> > first things and I confirmed this workaround is so rare that in an a0 situation
> > you could wait to only enable after you read the gmd-id and confirm the
> > media is A0.
> 
> This is exactly what we *can't* do for this workaround.  The requirement
> is that on any impacted platform, every single register read must be
> preceded by four extra dummy writes.  There's no such thing as "early
> enough to ignore" ---

What I got from the Architects on this is that it would be safe enough to
read the GMDID without this workaround since the condition for the issue
to manifest is not on a single read like this, although the protection is
global.

But okay, let's work with the assumption that it is better to protect
anyway.

> it's specifically noted that even pre-OS firmware
> and such needs to be careful about doing this as well.  So this causes a
> bit of a chicken-and-egg issue:  we cannot read the GMD_ID register
> without having the workaround active on impacted platforms, but we
> cannot figure out whether the platform is impacted until after we've
> read that register[*].  We also do a bunch of other register reads
> during early driver probe before the xe_gt is initialized enough to be
> able to service workaround lookup queries.  So there are a few options
> here:
> 
>  * Mark the platform as always being active on PTL, then come back and
>    disable it later if/when we confirm that we're on a stepping that
>    isn't impacted by the issue.  This is pretty simple conceptually, and
>    is quite likely something we'll need in the future for other
>    workarounds.  That's the approach MattA has taken here.

ack.

> 
>  * Add two separate workarounds in the driver: 15015404425_early and
>    15015404425.  15015404425_early is an SoC workaround that applies
>    unconditionally on PTL, and 15015404425 is a GT workaround that
>    applies only on specific media steppings.  Both do exactly the same
>    thing (4 dummy writes before any register read), but
>    15015404425_early is checked before all early MMIO accesses and
>    15015404425 is checked on all others.  The downside of this approach
>    is that we'd need to use a completely different set of MMIO
>    operations for early driver boot (i.e., no xe_mmio_read32 and such).
> 
>  * Ignore this workaround completely if we can confirm that it only
>    impacts pre-production steppings.  I don't think we have confirmation
>    yet that A-step hardware is preprod-only, so we can't take this easy
>    path yet.
> 
> > 
> > 2. Don't mix SoC with Media. If the Bug is SoC don't wait of the media stepping
> > and check directly for the SoC that needs this. So, don't create an infra that
> > already has an exception in it. And if possible, avoid the infra at all. This
> > might bring even more confusion to the w/a handling.
> > 
> > This w/a in specific here is soc, but it is getting mapped to our media-ip,
> > so let's use the media check...
> 
> I think more explanation needs to be added to the patches to clarify
> exactly what's going on since it's still a bit non-obvious.  The reality
> is that our modern platforms aren't actually "SOCs" at all anymore;
> they're technically MCP's --- Multi Chip Packages with logic (and
> sometimes hardware issues) spread across the various dies.  The hardware
> teams still refer certain things to be "SoC" logic for historical
> reasons, even though that logic is technically distributed across
> multiple chips these days.

Well, I still see the glue of all the chips as the SoC. It is easier to
understand.

> 
> We absolutely need some kind of "device" workaround framework that isn't
> tied to the GT and that can be used before GTs are even up; we have some
> non-GT workarounds today that we're not really handling properly, and we
> know there are more coming.  I think when this workaround first came up
> I suggested a "device workaround" infrastructure and using the same
> general XE_WA() calls, with a _Generic() implementation to lookup the
> status of a workaround in either the xe_device or the xe_gt depending on
> which is passed (with the xe_device workaround table being initialized
> much earlier in the probe sequence).

If we split in device_wa and gt_wa that would make much more sense with
our Xe design indeed.

> 
> Issues outside of the GT can live in different places:  the GCD die, the
> compute (aka CPU) die, the IO die, etc.  Each of these have their own
> steppings, and the MCP itself has a stepping too.  In some of these
> cases, the proper way to determine a stepping is to map PCI revid into a
> die stepping.  In other cases the proper approach is to "fingerprint" a
> die's stepping by inspecting some other IP that lives on the same die
> (similar to how we used to fingerprint the PCH back in the day by
> inspecting the ISA bus device).  For this workaround the stepping we
> care about is the compute (CPU) die stepping, and since standalone media
> happens to live on that same die, the bspec tells us how to figure out
> compute die stepping from media stepping (A-step compute die <=> A-step
> media IP in this case, although that isn't something that's guaranteed
> to always be true).
> 
> 
> [*] An alternative to fingerprinting compute die based on media stepping
>     would probably be to check the CPU stepping through whatever
>     mechanism is used to print the stepping in /proc/cpuinfo.  We'd have
>     to find separate documentation on how to map the numeric value there
>     into somthing like "B0;" I don't know off the top of my head where
>     that documentation would be but the core kernel guys could probably
>     point us in the right direction.

Yeap, in this case it would be the combination of the pci id + cpu-revid.
This makes much more sense for this w/a, but it is hard to generalize indeed.

But also, any device_wa is hard to generalize anyway, since we won't have
a device stepping or anything like that.

So, okay, 15015404425 in xe_device_wa enables it and
15015404425_disable in xe_gt_wa disables it for MEDIA_STEP(B0, FOREVER)

> 
> 
> Matt
> 
> > 
> > > +
> > >  	return 0;
> > >  
> > >  err_driver_cleanup:
> > > diff --git a/drivers/gpu/drm/xe/xe_wa_oob.rules b/drivers/gpu/drm/xe/xe_wa_oob.rules
> > > index 8c2aa48cb33a..822cbff13819 100644
> > > --- a/drivers/gpu/drm/xe/xe_wa_oob.rules
> > > +++ b/drivers/gpu/drm/xe/xe_wa_oob.rules
> > > @@ -71,3 +71,4 @@ no_media_l3	MEDIA_VERSION(3000)
> > >  # primary GT GMDID
> > >  14022085890	GRAPHICS_VERSION(2001)
> > >  16026007364 	MEDIA_VERSION(3000)
> > > +15015404425_disable	PLATFORM(PANTHERLAKE), MEDIA_STEP(B0, FOREVER)
> > > -- 
> > > 2.49.0
> > > 
> 
> -- 
> Matt Roper
> Graphics Software Engineer
> Linux GPU Platform Enablement
> Intel Corporation