[Intel-xe] [RFC] drm/xe: A minimal assert for some forcewake domain on xe_mmio.

Matthew Brost matthew.brost at intel.com
Tue May 23 23:52:33 UTC 2023


On Tue, May 23, 2023 at 09:38:49AM -0700, Matt Roper wrote:
> On Tue, May 23, 2023 at 01:56:53PM +0000, Matthew Brost wrote:
> > On Mon, May 22, 2023 at 08:28:05PM -0700, Matt Roper wrote:
> > > On Mon, May 22, 2023 at 06:15:27PM -0400, Rodrigo Vivi wrote:
> > > > It is the maximum protection we can do with the current infrastructure.
> > > > 
> > > > Cc: John Harrison <John.C.Harrison at Intel.com>
> > > > Cc: Matthew Auld <matthew.auld at intel.com>
> > > > Cc: Francois Dugast <francois.dugast at intel.com>
> > > > Cc: Lucas De Marchi <lucas.demarchi at intel.com>
> > > > Cc: Jani Nikula <jani.nikula at intel.com>
> > > > Cc: Ville Syrjälä <ville.syrjala at linux.intel.com>
> > > > Cc: Matt Roper <matthew.d.roper at intel.com>
> > > > Cc: Matthew Brost <matthew.brost at intel.com>
> > > > Cc: Maarten Lankhorst <maarten.lankhorst at linux.intel.com>
> > > > Signed-off-by: Rodrigo Vivi <rodrigo.vivi at intel.com>
> > > > ---
> > > > 
> > > > 
> > > > RFC
> > > > ===
> > > > 
> > > > Okay, so, this is more an RFC to brainstorm the future of the force_wake in
> > > > Xe than anything else.
> > > > 
> > > > On i915 the force_wake is built-in the mmio functions at uncore component.
> > > > 
> > > > With that approach we had few historical issues iirc:
> > > > 
> > > > 1. Display performance with vblank evasion that requested uncore to provide
> > > > the 'fw' variantes that are actually the ones that avoid fw (contrary to what
> > > > the name suggests).
> > > 
> > > In i915 there were more differences between fw and non-fw variants
> > > of register functions than just forcewake handling:
> > > 
> > >  * _fw() functions assumed that the caller was already holding
> > >    forcewake, whereas the non-fw functions would obtain it themselves
> > >  * _fw() functions also assumed that the caller was already holding
> > >    uncore->lock, whereas the non-fw functions would obtain and release
> > >    the lock around each register access
> > >  * _fw() functions do no tracing and no debug assertions
> > >  * _fw() functions do not check for unclaimed MMIO
> > > 
> > > I don't think the first bullet there (forcewake) really mattered much
> > > from a performance perspective.  For display registers, a quick binary
> > > search of the FW table (just a handful of CPU cycles) would quickly
> > > determine that no forcewake domains were needed for those register
> > > offsets, so we wouldn't be doing anything with forcewake at the hardware
> > > level at all.  For display registers, the more performance-relevant
> > > aspects of using the _fw() functions was doing all your register
> > > accesses together without contention with other MMIO work.  I.e.,
> > > holding the uncore lock over an entire set of registers rather than
> > > grabbing/releasing it for each one, and (for vblank evasion
> > > specifically) doing it all while interrupts were disabled.  For debug
> > > drivers, there was also a bunch of other stuff that the _fw() functions
> > > bypassed (e.g., tripling of the number of register accesses due to
> > > reading FPGA_DBG before/after each register to look for unclaimed
> > > accesses).
> > > 
> > > > 
> > > > 2. Missed ranges updates. Sometimes we messed up with the ranges, there were
> > > > other times that the spec was updated and we didn't get the notification, and
> > > > there were cases that the BSpec had bugs.
> > > > 
> > > > For these reasons in Xe we have decided to let to the caller the
> > > > responsibility to set the force_wake bits for their domains before doing
> > > > the MMIO.
> > > 
> > > I don't think #2 was a reason to skip forcewake in Xe.  Not noticing a
> > > bspec update (or the bspec itself having incorrect information) would
> > > lead to even more mistakes if the caller has to explicitly grab the
> > > appropriate domains than if the driver does it implicitly.  The explicit
> > > handling only helps in the subset of cases where we blindly grab all of
> > > the domains (as we do during part of init).
> > > 
> > > 
> > 
> > MMIO access that requires forcewakes really only should be in a few
> > places, off the top of my head:
> > 
> > 1. Init
> > 2. Reset
> > 3. Sysfs / Debugfs
> > 
> > In all of these cases I think it fine to blindly grab all forcewakes.
> > 
> > Please chime in if I'm missing something.
> 
> I think as we implement more features in the Xe driver we're going to
> wind up with a lot more places where we need to access GT registers at
> runtime.  A few examples off the top of my head:
> 
>  * EU stall sampling
>  * Perf/OA
>  * EU debugger
>  * Various OOB workarounds that ask us to twiddle a register at certain
>    times in the driver.

Hmm, ok a lot of these I could make the argument that these are not
normal operation so just grab everything and call it a day. If some of
these fall into the category of 'normal enough to be optimized' I can
also make the argument it is no more complex (perhaps less complex) to
explicitly grab the forcewake than having the logic auto-grab the
forcewake.

What a compromise of in a debug mode we auto-generate the asserts based
on a table but we still explicitly grab the forcewake?

> 
> > 
> > > > 
> > > > However I'm seeing many questions and doubts popping up:
> > > > 
> > > > 1. Are we confident that we are not missing any wake-up?
> > > > 2. Are we confident that the domains are set correctly?
> > > > 3. Are we not wasting power if we are waking up ALL the domains instead
> > > >    of some specific one?
> > > 
> > > I think we're only holding ALL domains during init; for most of the
> > > runtime MMIO accesses, I think the code is currently attempting to only
> > > grab the domain(s) that it thinks the registers it's accessing will
> > > need.  Although that might be working right now, I'm a little bit
> > > worried about how that will scale in the long term when a single
> > > register might be in several different domains depending on what
> > > platform you're running on.  There have definitely been cases in the
> > > past where groups of registers migrated from RENDER to GT or vice versa
> > > between platforms, so the exact domain you needed for an operation
> > > varied by platform.  And things are even more complicated if you're
> > > doing any MMIO against the media, since the hardware seems to change
> > > exactly how it splits up the media power wells somewhat often.
> > >
> > > > 4. What about the display disconnection now because i915 and xe have different
> > > >    mmio approaches but reusing i915-display?
> > > > 
> > > > It looks to me that the cons of the current approach are superseding the
> > > > cons of the i915 approach. But I want to hear more thoughts here before
> > > > we decide which route to take.
> > > > 
> > > > Maybe we have that domain as part of the mmio calls themselves? Maybe
> > > > a double approach where the caller is responsible but mmio has the range
> > > > information and double check it?
> > > 
> > > As noted above, part of the challenge with forcewake is that even if the
> > > caller knows it wants to access register FOO, and even if it know FOO is
> > > a GT register that likely needs forcewake, it's sometimes challenging to
> > > make sure it's grabbing the correct domain(s) for every single platform
> > > the Xe driver will eventually support if the power management handling
> > > changes.  I think that was part of the motivation for encoding the
> > > tables into the driver in i915.  It seems like GT power wells don't
> > > change as much these days as they used to, but it's hard to say whether
> > > that will continue in the future or not.  Who knows...maybe they'll
> > > eventually start creating dedicated domains for stuff like blitters,
> > > GuC, etc.  rather than lumping all of those into the "GT" catchall
> > > domain.
> > > 
> > > If we do decide to go back to implicit forcewake handling with the table
> > > encoded into the driver, it might be worth doing something sort of like
> > 
> > Let's not do this intel_uncore.c is an unreadable mess, I don't want
> > anything like this in Xe.
> 
> When you say unreadable, are you referring to the macros that generate
> the table lookup functions?  I don't think that macro magic would be
> needed for Xe since we don't have to support old platforms that have
> very different forcewake behavior, and we also don't need to generate
> separate 8-bit, 16-bit, and 32-bit versions of each operation anymore
> (since I think everything in the GT is 32-bits these days).
> 

Just the entire file is unreadable in general, trying to avoid these
types of files in Xe, avoid traps of the i915 did this so let's do it in
Xe, and over engineering our driver to avoid hypothetical future bugs.

Matt

> The forcewake and shadow tables themselves are pretty clean in i915
> these days, and if we move to autogenerating them from a text file, they
> can become even simpler since the text file will basically be a cleaned
> up copy/paste of part of the bspec table.
> 
> 
> Matt
> 
> > 
> > > what Lucas is doing in the OOB workaround series --- drop the
> > > per-platform tables into a human-readable text file that's more similar
> > > to the format used by the bspec (exact ranges, forcewake domain, MCR
> > > replication type, etc.) and then provide a small parser program that
> > > will convert that into actual code (and do things like consolidating
> > > adjacent ranges).
> > > 
> > > > 
> > > > Any other idea? Thoughts?
> > > 
> > > Once the big GT vs tile refactor stuff I have in flight gets finalized,
> > > I plan to follow up with another series that creates a more appropriate
> > > MMIO target for register operations rather than using "xe_gt" as the
> > > target (even for things completely unrelated to the small GT subset of
> > > hardware).  My idea is that you'd grab an MMIO target structure for MMIO
> > > operations against a specific hardware unit, and then the info inside
> > > the MMIO structure would be able to figure out if there are additional
> > > checks and/or operations it should perform.  E.g.,
> > > 
> > >  * mmio = xe_mmio_for_device(xe_device *xe);
> > >     - Used to submit MMIO operations against the root tile of the PCI
> > >       device.  Only used during init (e.g., to read the registers that
> > >       tell us which tiles exist on the platform) and during top-level
> > >       interrupt enable/disable (since all interrupts are routed through
> > >       the root tile).
> > >     - No forcewake needed for register accesses through a handle of this
> > >       type since you'd only ever be accessing sgunit registers for these
> > >       types of things.
> > >     - Register accesses through mmio can warn on debug builds if the
> > >       register appears to be in a GT-related MMIO range.
> > > 
> > >  * mmio = xe_mmio_for_display(xe_device *xe);
> > >     - Pretty much the same as handles returned by xe_mmio_for_device(),
> > >       but if a handle of this type is used to try to read/write
> > >       registers outside the display range, we could have debug builds
> > >       throw some extra warnings.
> > >     - Unclaimed register detection could be confined to accesses through
> > >       these handles.
> > > 
> > >  * mmio = xe_mmio_for_tile(xe_tile *tile);
> > >     - Used to access non-GT registers that reside in a specific tile.
> > >       I.e., sgunit/soc registers.
> > >     - As above, no forcewake needed, can make MMIO operations warn
> > >       if used to access a GT range.
> > > 
> > >  * mmio = xe_mmio_for_gt(xe_gt *gt);
> > >     - Used to access GT registers in a specific GT
> > >     - Does automatic GSI offset translation for media GTs
> > >     - Can either do automatic forcewake like i915 does, or can do debug
> > >       check+warn like you have here.
> > >     - Can make MMIO operations warn if MMIO offset is outside GT range
> > >     - Can also trigger warnings if a GT non-GSI, non-media engine
> > >       register is accessed from an MMIO obtained from a media GT.
> > >
> > 
> > Ack on this if we keep in managable, worried code / feature bloat those
> > that will result in a similar file to intel_uncore.c in Xe.
> > 
> > Matt
> > 
> > > At some point we'll probably want to add some debug tracepoints for MMIO
> > > access to assist with debugging; the mmio structure can also help
> > > provide meaningful output for each type of MMIO handle as well.
> > > 
> > > 
> > > One other MMIO-related thing:
> > > At the moment Xe does no locking of MMIO access.  I know there were
> > > hardware lockups on old platforms if simultaneous MMIO accesses
> > > happened, but I've never seen that tied to a specific HSD ticket that
> > > would let us know whether modern platforms are still susceptible or not.
> > > If it turns out they are and that we need to bring in the equivalent of
> > > i915's uncore lock (and figure out if any register access is
> > > problematic, or only certain types/ranges).
> > > 
> > > 
> > > Matt
> > > 
> > > > 
> > > > Thanks,
> > > > Rodrigo.
> > > > 
> > > >  drivers/gpu/drm/xe/xe_mmio.h | 26 ++++++++++++++++++++++++++
> > > >  1 file changed, 26 insertions(+)
> > > > 
> > > > diff --git a/drivers/gpu/drm/xe/xe_mmio.h b/drivers/gpu/drm/xe/xe_mmio.h
> > > > index 1407f1189b0d..6302ca85e3f4 100644
> > > > --- a/drivers/gpu/drm/xe/xe_mmio.h
> > > > +++ b/drivers/gpu/drm/xe/xe_mmio.h
> > > > @@ -18,8 +18,26 @@ struct xe_device;
> > > >  
> > > >  int xe_mmio_init(struct xe_device *xe);
> > > >  
> > > > +static inline void xe_mmio_assert_forcewake(struct xe_gt *gt,
> > > > +					    struct xe_reg reg)
> > > > +{
> > > > +	struct xe_force_wake *fw = &gt->mmio.fw;
> > > > +
> > > > +	/* No need for forcewake in this range */
> > > > +	if (reg.addr >= 0x40000 && reg.addr < 0x116000)
> > > > +		return;
> > > > +
> > > > +	/*
> > > > +	 * XXX: Weak. It checks for some domain, but impossible to determine
> > > > +	 * if the right one is selected, or if it was set by the same caller
> > > > +	 */
> > > > +	WARN_ON(!fw->awake_domains);
> > > > +}
> > > > +
> > > >  static inline u8 xe_mmio_read8(struct xe_gt *gt, struct xe_reg reg)
> > > >  {
> > > > +	xe_mmio_assert_forcewake(gt, reg);
> > > > +
> > > >  	if (reg.addr < gt->mmio.adj_limit)
> > > >  		reg.addr += gt->mmio.adj_offset;
> > > >  
> > > > @@ -29,6 +47,8 @@ static inline u8 xe_mmio_read8(struct xe_gt *gt, struct xe_reg reg)
> > > >  static inline void xe_mmio_write32(struct xe_gt *gt,
> > > >  				   struct xe_reg reg, u32 val)
> > > >  {
> > > > +	xe_mmio_assert_forcewake(gt, reg);
> > > > +
> > > >  	if (reg.addr < gt->mmio.adj_limit)
> > > >  		reg.addr += gt->mmio.adj_offset;
> > > >  
> > > > @@ -37,6 +57,8 @@ static inline void xe_mmio_write32(struct xe_gt *gt,
> > > >  
> > > >  static inline u32 xe_mmio_read32(struct xe_gt *gt, struct xe_reg reg)
> > > >  {
> > > > +	xe_mmio_assert_forcewake(gt, reg);
> > > > +
> > > >  	if (reg.addr < gt->mmio.adj_limit)
> > > >  		reg.addr += gt->mmio.adj_offset;
> > > >  
> > > > @@ -58,6 +80,8 @@ static inline u32 xe_mmio_rmw32(struct xe_gt *gt, struct xe_reg reg, u32 clr,
> > > >  static inline void xe_mmio_write64(struct xe_gt *gt,
> > > >  				   struct xe_reg reg, u64 val)
> > > >  {
> > > > +	xe_mmio_assert_forcewake(gt, reg);
> > > > +
> > > >  	if (reg.addr < gt->mmio.adj_limit)
> > > >  		reg.addr += gt->mmio.adj_offset;
> > > >  
> > > > @@ -66,6 +90,8 @@ static inline void xe_mmio_write64(struct xe_gt *gt,
> > > >  
> > > >  static inline u64 xe_mmio_read64(struct xe_gt *gt, struct xe_reg reg)
> > > >  {
> > > > +	xe_mmio_assert_forcewake(gt, reg);
> > > > +
> > > >  	if (reg.addr < gt->mmio.adj_limit)
> > > >  		reg.addr += gt->mmio.adj_offset;
> > > >  
> > > > -- 
> > > > 2.39.2
> > > > 
> > > 
> > > -- 
> > > Matt Roper
> > > Graphics Software Engineer
> > > Linux GPU Platform Enablement
> > > Intel Corporation
> 
> -- 
> Matt Roper
> Graphics Software Engineer
> Linux GPU Platform Enablement
> Intel Corporation


More information about the Intel-xe mailing list