[PATCH v3 5/5] drm/xe: disable wa_15015404425 for PTL B0
Matt Roper
matthew.d.roper at intel.com
Wed Jun 25 20:33:06 UTC 2025
On Wed, Jun 25, 2025 at 12:33:40PM -0700, Matt Atwood wrote:
> This workaround only applies to PTL Compute Die A0. The reality of
Nitpick: it would be better to just say "A-step" rather than A0 since
that's what we're implementing. The workaround is documented as no
longer being necessary from B0 onward, so "A-step" also covers any A1,
A2, etc. steppings (which may or may not exist) and matches how the
code is written.
> modern platforms is we're Multi Chip Packages with logic spread across
> multiple dies. Because this information is not available during PCI
> probe it becomes a bit more complicated.
>
> This workaround needs to be applied on PTL until we prove that we are
> not Compute Die A0 stepping without reading any MMIOs. So use the new
> XE_DEVICE_WA infrastructure to apply early, until we can determine our
> stepping.
It looks like you copied a lot of the text I wrote on an earlier review,
which was just meant for general discussion, not a long-term record of
the change. To be more clear to people reading the commit message and
trying to understand this down the road, it might be better to elaborate
a bit. Maybe something along the lines of:
Wa_15015404425 only needs to be applied on PTL platforms with an
A-step compute die. There's no direct way to map the PCI revid into
a compute die stepping, so the suggested way to figure this out is
by inspecting the media IP's stepping since on PTL the media IP
resides on the compute die. For PTL, the compute die has an A
stepping if and only if the media IP is also A-step (this
relationship isn't something guaranteed to always be true, but it
works out this way on PTL).
Unfortunately there's a bit of a chicken-and-egg problem here;
Wa_15015404425 requires that all register reads be preceded by four
dummy MMIO writes (including during early driver init and even
pre-OS firmware), but the driver needs to perform some MMIO reads
(including the GMD_ID register that holds the media IP version)
before we can determine the media stepping. To handle this safely,
we need to just assume Wa_15015404425 always applies on PTL during
the early parts of device probe, and then go back later and
deactivate the workaround if we determine that we're running on a
later stepping that doesn't require the workaround.
So we implement the overall solution as two workarounds in the
driver:
* 15015404425 - a device OOB workaround that's always active on PTL
* 15015404425_disable - a GT OOB workaround that applies to PTL
platforms with a B0 or later media stepping
The former workaround guards the extra dummy MMIO writes we do when
reading registers, and the latter workaround guards logic that
disables the former once we have the necessary information later in
the probe process.
>
> There are at least two ways to determine Compute Die stepping. This
> patch uses the Media GT stepping to map to Compute Die stepping, in this
> case Compute and Media dies step synchronously.
>
> Since we're using the Media GT information to determine Compute Die
> stepping, use the XE_WA and the oob infrastructure to come back and
> toggle the workaround off when we know its safe, and after GT init.
>
> v2: rename SoC to device, avoid null pointer dereference, update commit
> message.
> v3: rebase
>
> Signed-off-by: Matt Atwood <matthew.s.atwood at intel.com>
> ---
> drivers/gpu/drm/xe/xe_pci.c | 7 +++++++
> drivers/gpu/drm/xe/xe_wa_oob.rules | 2 ++
> 2 files changed, 9 insertions(+)
>
> diff --git a/drivers/gpu/drm/xe/xe_pci.c b/drivers/gpu/drm/xe/xe_pci.c
> index a9f708608f3e..a294a1b26d8f 100644
> --- a/drivers/gpu/drm/xe/xe_pci.c
> +++ b/drivers/gpu/drm/xe/xe_pci.c
> @@ -34,6 +34,9 @@
> #include "xe_tile.h"
> #include "xe_wa.h"
>
> +#include "generated/xe_wa_oob.h"
> +#include "generated/xe_device_wa_oob.h"
> +
> enum toggle_d3cold {
> D3COLD_DISABLE,
> D3COLD_ENABLE,
> @@ -896,6 +899,10 @@ static int xe_pci_probe(struct pci_dev *pdev, const struct pci_device_id *ent)
> drm_dbg(&xe->drm, "d3cold: capable=%s\n",
> str_yes_no(xe->d3cold.capable));
>
> + if (xe->tiles->media_gt != NULL &&
> + XE_WA(xe->tiles->media_gt, 15015404425_disable))
> + xe->oob[XE_DEVICE_WA_OOB_15015404425] = 0;
xe->oob is a bitmask; you're clobbering a whole word here which isn't
what you want to do (and will probably go way past the end of the
bitmask as well). We really shouldn't be mucking with the internals of
the bitmask here anyway; we should have a documented interface from
xe_rtp that allows a workaround to be disabled, and then use that here.
Also note that the condition here can also be simplified slightly if you
use the suggestion below.
> +
> return 0;
>
> err_driver_cleanup:
> diff --git a/drivers/gpu/drm/xe/xe_wa_oob.rules b/drivers/gpu/drm/xe/xe_wa_oob.rules
> index 96cc33da0fb5..255e67113406 100644
> --- a/drivers/gpu/drm/xe/xe_wa_oob.rules
> +++ b/drivers/gpu/drm/xe/xe_wa_oob.rules
> @@ -70,3 +70,5 @@ no_media_l3 MEDIA_VERSION(3000)
> # SoC workaround - currently applies to all platforms with the following
> # primary GT GMDID
> 14022085890 GRAPHICS_VERSION(2001)
> +
> +15015404425_disable PLATFORM(PANTHERLAKE), MEDIA_STEP(B0, FOREVER)
If you use MEDIA_VERSION_ANY_GT() here, then you could make this one
a device workaround as well (since it wouldn't need any specific GT
structures to lookup the media version, as long as we're past the point
where the driver looked those up). It would also have the advantage of
keeping both 15015404425 and 15015404425_disable together in the rules
file. Note that if you do this, you'll need to adjust the suggested
commit message text I wrote above accordingly.
Matt
> --
> 2.49.0
>
--
Matt Roper
Graphics Software Engineer
Linux GPU Platform Enablement
Intel Corporation
More information about the Intel-xe
mailing list