[PATCH v1] drm/xe: Rework throttle ABI
Matt Roper
matthew.d.roper at intel.com
Fri Oct 25 20:45:34 UTC 2024
On Fri, Oct 25, 2024 at 10:04:14PM +0300, Raag Jadav wrote:
> On Fri, Oct 25, 2024 at 10:03:41AM -0700, Matt Roper wrote:
> > On Fri, Oct 25, 2024 at 02:52:38PM +0530, Raag Jadav wrote:
> > > Current implementation adds multiple sysfs entries for getting GT
> > > throttle status and its reasons, which forces the user to read multiple
> > > entries for evaluating the result. Since output of each entry is based
> > > on the same underlying hardware register and considering the access type
> > > of this register is RO/v, the value of this register can change at any
> > > given point, even between subsequent sysfs reads. This makes current
> > > implementation fundamentally flawed which can produce inconsistent results.
> > >
> > > Rework throttle ABI and introduce throttle_status attribute which will
> > > provide throttle status through a oneshot register read, making it
> > > relatively less error prone. The new ABI will provide throttle reasons
> > > in a string based list based on the respective bits which are set in the
> > > hardware. Empty output means no bits are set and hence no throttling.
> >
> > But the old ABI is already released, and we have platforms with
> > force_probe lifted already. Presumably userspace software is already
> > using the existing interface (otherwise it never would have been allowed
> > to land upstream), so that means we can't change it in incompatible ways
> > anymore; we're locked into supporting the current interface forever on
> > these platforms.
> >
> > We can change the ABI for _future_ platforms if it makes sense, but it's
> > too late to make compatibility-breaking changes for LNL and BMG.
>
> It landed upstream pretty recently AFAICT, atleast in xe.
It landed upstream about 10 months ago:
commit 1c8e9019033728093c04608f44c6e87fec6822e1
Author: Sujaritha Sundaresan <sujaritha.sundaresan at intel.com>
AuthorDate: Fri Dec 8 00:11:52 2023 -0500
Commit: Rodrigo Vivi <rodrigo.vivi at intel.com>
CommitDate: Thu Dec 21 11:45:28 2023 -0500
drm/xe: Add frequency throttle reasons sysfs attributes
and has been part of the 6.8 kernel onward. So this isn't a case where
something just recently landed in drm-tip and hasn't made its way into a
formal kernel.org release yet.
> So maybe worth reconsidering.
It's not really a decision that we get to make. One of the absolute
rules of Linux kernel development is that we cannot break
kernel<->userspace ABI that's being used. A user must always be able to
safely upgrade their kernel without updating any other software on the
system without suffering breakage/regressions (with slight exceptions
for special security-related cases). In this case, if you remove the
existing sysfs APIs and try to replace them with something different,
then it will immediately break existing tools that were using the old
interface. That's simply not allowed by Linux kernel rules.
>
> > >
> > > $ cat /sys/devices/.../tile0/gt0/freq0/throttle_status
> > > prochot
> > > thermal
> > > ratl
> > > thermalert
> > > tdc
> > > pl4
> > > pl1
> > > pl2
> > >
> > > Closes: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/2810
> >
> > I'm not sure what this ticket is about, but it was already closed
> > several weeks ago due to no longer reproducing.
>
> I'm sure we all have our "it works on my machine" moment :D
The point is that we can't point at a non-active issue as justification
here. If the failure reported in that issue is still relevant, the
ticket should be re-opened and a proper analysis should be posted there.
But changing the ABI like this wouldn't help anyway since it would just
guarantee that the test breaks (because none of the interfaces it tries
to check would exist anymore). This patch wouldn't actually fix or
close that ticket.
Besides, the failure in that ticket likely isn't a KMD issue at all,
just a bug in the test. By design, these sysfs interfaces represent the
"right now" status of the hardware. If you read several sysfs entries
then the status could change as that's happening. It's just a flaw in
the test that it overlooked this and incorrectly assumed the status
would never change while the test is running.
>
> > The ABI change here doesn't seem to be directly related to this ticket.
>
> We can always create new tickets, and that's not the point.
> The point is we're stuck with unreliable ABI.
Unreliable how? We don't let new userspace interfaces into the kernel
unless there's real (not IGT) userspace software lined up to use them,
and the userspace developers have already signed off on the design and
semantics of the interfaces. Presumably the current users are happy
with the current semantics and understand that the status provided is
live rather than sticky.
Sometimes we realize down the road that we did overlook something
important in an interface. It's unfortunate when that happens, but the
path forward then is that we create a new, alternate API on the side and
let existing userspace migrate over or not as it sees fit (e.g., stuff
like the DRM "addfb2" ioctl being added as a more expressive alternative
to "addfb" usage). We still need to support the old interface though;
we can generally only sunset it for future platforms where there's no
pre-existing software already using the old interface.
Matt
>
> Side note: The logs in the ticket may help connect the dots with commit message.
>
> Raag
--
Matt Roper
Graphics Software Engineer
Linux GPU Platform Enablement
Intel Corporation
More information about the Intel-xe
mailing list