[Intel-xe] [RFC] drm/xe: A minimal assert for some forcewake domain on xe_mmio.

Luca Coelho luca at coelho.fi
Mon Oct 16 10:22:32 UTC 2023


On Thu, 2023-10-12 at 16:04 -0400, Rodrigo Vivi wrote:
> On Tue, Oct 10, 2023 at 10:00:06AM +0300, Luca Coelho wrote:
> > On Mon, 2023-10-09 at 17:15 -0400, Rodrigo Vivi wrote:
> > > On Mon, Oct 09, 2023 at 12:22:44PM +0300, Luca Coelho wrote:
> > > > Hi everyone,
> > > > 
> > > > On Tue, 2023-05-23 at 23:52 +0000, Matthew Brost wrote:
> > > > > On Tue, May 23, 2023 at 09:38:49AM -0700, Matt Roper wrote:
> > > > > > On Tue, May 23, 2023 at 01:56:53PM +0000, Matthew Brost wrote:
> > > > > > > On Mon, May 22, 2023 at 08:28:05PM -0700, Matt Roper wrote:
> > > > > > > > On Mon, May 22, 2023 at 06:15:27PM -0400, Rodrigo Vivi wrote:
> > > > > > > > > It is the maximum protection we can do with the current infrastructure.
> > > > > > > > > 
> > > > > > > > > Cc: John Harrison <John.C.Harrison at Intel.com>
> > > > > > > > > Cc: Matthew Auld <matthew.auld at intel.com>
> > > > > > > > > Cc: Francois Dugast <francois.dugast at intel.com>
> > > > > > > > > Cc: Lucas De Marchi <lucas.demarchi at intel.com>
> > > > > > > > > Cc: Jani Nikula <jani.nikula at intel.com>
> > > > > > > > > Cc: Ville Syrjälä <ville.syrjala at linux.intel.com>
> > > > > > > > > Cc: Matt Roper <matthew.d.roper at intel.com>
> > > > > > > > > Cc: Matthew Brost <matthew.brost at intel.com>
> > > > > > > > > Cc: Maarten Lankhorst <maarten.lankhorst at linux.intel.com>
> > > > > > > > > Signed-off-by: Rodrigo Vivi <rodrigo.vivi at intel.com>
> > > > > > > > > ---
> > > > > > > > > 
> > > > > > > > > 
> > > > > > > > > RFC
> > > > > > > > > ===
> > > > > > > > > 
> > > > > > > > > Okay, so, this is more an RFC to brainstorm the future of the force_wake in
> > > > > > > > > Xe than anything else.
> > > > > > > > > 
> > > > > > > > > On i915 the force_wake is built-in the mmio functions at uncore component.
> > > > > > > > > 
> > > > > > > > > With that approach we had few historical issues iirc:
> > > > > > > > > 
> > > > > > > > > 1. Display performance with vblank evasion that requested uncore to provide
> > > > > > > > > the 'fw' variantes that are actually the ones that avoid fw (contrary to what
> > > > > > > > > the name suggests).
> > > > > > > > 
> > > > > > > > In i915 there were more differences between fw and non-fw variants
> > > > > > > > of register functions than just forcewake handling:
> > > > > > > > 
> > > > > > > >  * _fw() functions assumed that the caller was already holding
> > > > > > > >    forcewake, whereas the non-fw functions would obtain it themselves
> > > > > > > >  * _fw() functions also assumed that the caller was already holding
> > > > > > > >    uncore->lock, whereas the non-fw functions would obtain and release
> > > > > > > >    the lock around each register access
> > > > > > > >  * _fw() functions do no tracing and no debug assertions
> > > > > > > >  * _fw() functions do not check for unclaimed MMIO
> > > > > > > > 
> > > > > > > > I don't think the first bullet there (forcewake) really mattered much
> > > > > > > > from a performance perspective.  For display registers, a quick binary
> > > > > > > > search of the FW table (just a handful of CPU cycles) would quickly
> > > > > > > > determine that no forcewake domains were needed for those register
> > > > > > > > offsets, so we wouldn't be doing anything with forcewake at the hardware
> > > > > > > > level at all.  For display registers, the more performance-relevant
> > > > > > > > aspects of using the _fw() functions was doing all your register
> > > > > > > > accesses together without contention with other MMIO work.  I.e.,
> > > > > > > > holding the uncore lock over an entire set of registers rather than
> > > > > > > > grabbing/releasing it for each one, and (for vblank evasion
> > > > > > > > specifically) doing it all while interrupts were disabled.  For debug
> > > > > > > > drivers, there was also a bunch of other stuff that the _fw() functions
> > > > > > > > bypassed (e.g., tripling of the number of register accesses due to
> > > > > > > > reading FPGA_DBG before/after each register to look for unclaimed
> > > > > > > > accesses).
> > > > > > > > 
> > > > > > > > > 
> > > > > > > > > 2. Missed ranges updates. Sometimes we messed up with the ranges, there were
> > > > > > > > > other times that the spec was updated and we didn't get the notification, and
> > > > > > > > > there were cases that the BSpec had bugs.
> > > > > > > > > 
> > > > > > > > > For these reasons in Xe we have decided to let to the caller the
> > > > > > > > > responsibility to set the force_wake bits for their domains before doing
> > > > > > > > > the MMIO.
> > > > > > > > 
> > > > > > > > I don't think #2 was a reason to skip forcewake in Xe.  Not noticing a
> > > > > > > > bspec update (or the bspec itself having incorrect information) would
> > > > > > > > lead to even more mistakes if the caller has to explicitly grab the
> > > > > > > > appropriate domains than if the driver does it implicitly.  The explicit
> > > > > > > > handling only helps in the subset of cases where we blindly grab all of
> > > > > > > > the domains (as we do during part of init).
> > > > > > > > 
> > > > > > > > 
> > > > > > > 
> > > > > > > MMIO access that requires forcewakes really only should be in a few
> > > > > > > places, off the top of my head:
> > > > > > > 
> > > > > > > 1. Init
> > > > > > > 2. Reset
> > > > > > > 3. Sysfs / Debugfs
> > > > > > > 
> > > > > > > In all of these cases I think it fine to blindly grab all forcewakes.
> > > > > > > 
> > > > > > > Please chime in if I'm missing something.
> > > > > > 
> > > > > > I think as we implement more features in the Xe driver we're going to
> > > > > > wind up with a lot more places where we need to access GT registers at
> > > > > > runtime.  A few examples off the top of my head:
> > > > > > 
> > > > > >  * EU stall sampling
> > > > > >  * Perf/OA
> > > > > >  * EU debugger
> > > > > >  * Various OOB workarounds that ask us to twiddle a register at certain
> > > > > >    times in the driver.
> > > > > 
> > > > > Hmm, ok a lot of these I could make the argument that these are not
> > > > > normal operation so just grab everything and call it a day. If some of
> > > > > these fall into the category of 'normal enough to be optimized' I can
> > > > > also make the argument it is no more complex (perhaps less complex) to
> > > > > explicitly grab the forcewake than having the logic auto-grab the
> > > > > forcewake.
> > > > > 
> > > > > What a compromise of in a debug mode we auto-generate the asserts based
> > > > > on a table but we still explicitly grab the forcewake?
> > > > > 
> > > > > > 
> > > > > > > 
> > > > > > > > > 
> > > > > > > > > However I'm seeing many questions and doubts popping up:
> > > > > > > > > 
> > > > > > > > > 1. Are we confident that we are not missing any wake-up?
> > > > > > > > > 2. Are we confident that the domains are set correctly?
> > > > > > > > > 3. Are we not wasting power if we are waking up ALL the domains instead
> > > > > > > > >    of some specific one?
> > > > > > > > 
> > > > > > > > I think we're only holding ALL domains during init; for most of the
> > > > > > > > runtime MMIO accesses, I think the code is currently attempting to only
> > > > > > > > grab the domain(s) that it thinks the registers it's accessing will
> > > > > > > > need.  Although that might be working right now, I'm a little bit
> > > > > > > > worried about how that will scale in the long term when a single
> > > > > > > > register might be in several different domains depending on what
> > > > > > > > platform you're running on.  There have definitely been cases in the
> > > > > > > > past where groups of registers migrated from RENDER to GT or vice versa
> > > > > > > > between platforms, so the exact domain you needed for an operation
> > > > > > > > varied by platform.  And things are even more complicated if you're
> > > > > > > > doing any MMIO against the media, since the hardware seems to change
> > > > > > > > exactly how it splits up the media power wells somewhat often.
> > > > > > > > 
> > > > > > > > > 4. What about the display disconnection now because i915 and xe have different
> > > > > > > > >    mmio approaches but reusing i915-display?
> > > > > > > > > 
> > > > > > > > > It looks to me that the cons of the current approach are superseding the
> > > > > > > > > cons of the i915 approach. But I want to hear more thoughts here before
> > > > > > > > > we decide which route to take.
> > > > > > > > > 
> > > > > > > > > Maybe we have that domain as part of the mmio calls themselves? Maybe
> > > > > > > > > a double approach where the caller is responsible but mmio has the range
> > > > > > > > > information and double check it?
> > > > > > > > 
> > > > > > > > As noted above, part of the challenge with forcewake is that even if the
> > > > > > > > caller knows it wants to access register FOO, and even if it know FOO is
> > > > > > > > a GT register that likely needs forcewake, it's sometimes challenging to
> > > > > > > > make sure it's grabbing the correct domain(s) for every single platform
> > > > > > > > the Xe driver will eventually support if the power management handling
> > > > > > > > changes.  I think that was part of the motivation for encoding the
> > > > > > > > tables into the driver in i915.  It seems like GT power wells don't
> > > > > > > > change as much these days as they used to, but it's hard to say whether
> > > > > > > > that will continue in the future or not.  Who knows...maybe they'll
> > > > > > > > eventually start creating dedicated domains for stuff like blitters,
> > > > > > > > GuC, etc.  rather than lumping all of those into the "GT" catchall
> > > > > > > > domain.
> > > > > > > > 
> > > > > > > > If we do decide to go back to implicit forcewake handling with the table
> > > > > > > > encoded into the driver, it might be worth doing something sort of like
> > > > > > > 
> > > > > > > Let's not do this intel_uncore.c is an unreadable mess, I don't want
> > > > > > > anything like this in Xe.
> > > > > > 
> > > > > > When you say unreadable, are you referring to the macros that generate
> > > > > > the table lookup functions?  I don't think that macro magic would be
> > > > > > needed for Xe since we don't have to support old platforms that have
> > > > > > very different forcewake behavior, and we also don't need to generate
> > > > > > separate 8-bit, 16-bit, and 32-bit versions of each operation anymore
> > > > > > (since I think everything in the GT is 32-bits these days).
> > > > > > 
> > > > > 
> > > > > Just the entire file is unreadable in general, trying to avoid these
> > > > > types of files in Xe, avoid traps of the i915 did this so let's do it in
> > > > > Xe, and over engineering our driver to avoid hypothetical future bugs.
> > > > > 
> > > > > Matt
> > > > > 
> > > > > > The forcewake and shadow tables themselves are pretty clean in i915
> > > > > > these days, and if we move to autogenerating them from a text file, they
> > > > > > can become even simpler since the text file will basically be a cleaned
> > > > > > up copy/paste of part of the bspec table.
> > > > > > 
> > > > > > 
> > > > > > Matt
> > > > > > 
> > > > > > > 
> > > > > > > > what Lucas is doing in the OOB workaround series --- drop the
> > > > > > > > per-platform tables into a human-readable text file that's more similar
> > > > > > > > to the format used by the bspec (exact ranges, forcewake domain, MCR
> > > > > > > > replication type, etc.) and then provide a small parser program that
> > > > > > > > will convert that into actual code (and do things like consolidating
> > > > > > > > adjacent ranges).
> > > > > > > > 
> > > > > > > > > 
> > > > > > > > > Any other idea? Thoughts?
> > > > > > > > 
> > > > > > > > Once the big GT vs tile refactor stuff I have in flight gets finalized,
> > > > > > > > I plan to follow up with another series that creates a more appropriate
> > > > > > > > MMIO target for register operations rather than using "xe_gt" as the
> > > > > > > > target (even for things completely unrelated to the small GT subset of
> > > > > > > > hardware).  My idea is that you'd grab an MMIO target structure for MMIO
> > > > > > > > operations against a specific hardware unit, and then the info inside
> > > > > > > > the MMIO structure would be able to figure out if there are additional
> > > > > > > > checks and/or operations it should perform.  E.g.,
> > > > > > > > 
> > > > > > > >  * mmio = xe_mmio_for_device(xe_device *xe);
> > > > > > > >     - Used to submit MMIO operations against the root tile of the PCI
> > > > > > > >       device.  Only used during init (e.g., to read the registers that
> > > > > > > >       tell us which tiles exist on the platform) and during top-level
> > > > > > > >       interrupt enable/disable (since all interrupts are routed through
> > > > > > > >       the root tile).
> > > > > > > >     - No forcewake needed for register accesses through a handle of this
> > > > > > > >       type since you'd only ever be accessing sgunit registers for these
> > > > > > > >       types of things.
> > > > > > > >     - Register accesses through mmio can warn on debug builds if the
> > > > > > > >       register appears to be in a GT-related MMIO range.
> > > > > > > > 
> > > > > > > >  * mmio = xe_mmio_for_display(xe_device *xe);
> > > > > > > >     - Pretty much the same as handles returned by xe_mmio_for_device(),
> > > > > > > >       but if a handle of this type is used to try to read/write
> > > > > > > >       registers outside the display range, we could have debug builds
> > > > > > > >       throw some extra warnings.
> > > > > > > >     - Unclaimed register detection could be confined to accesses through
> > > > > > > >       these handles.
> > > > > > > > 
> > > > > > > >  * mmio = xe_mmio_for_tile(xe_tile *tile);
> > > > > > > >     - Used to access non-GT registers that reside in a specific tile.
> > > > > > > >       I.e., sgunit/soc registers.
> > > > > > > >     - As above, no forcewake needed, can make MMIO operations warn
> > > > > > > >       if used to access a GT range.
> > > > > > > > 
> > > > > > > >  * mmio = xe_mmio_for_gt(xe_gt *gt);
> > > > > > > >     - Used to access GT registers in a specific GT
> > > > > > > >     - Does automatic GSI offset translation for media GTs
> > > > > > > >     - Can either do automatic forcewake like i915 does, or can do debug
> > > > > > > >       check+warn like you have here.
> > > > > > > >     - Can make MMIO operations warn if MMIO offset is outside GT range
> > > > > > > >     - Can also trigger warnings if a GT non-GSI, non-media engine
> > > > > > > >       register is accessed from an MMIO obtained from a media GT.
> > > > > > > > 
> > > > > > > 
> > > > > > > Ack on this if we keep in managable, worried code / feature bloat those
> > > > > > > that will result in a similar file to intel_uncore.c in Xe.
> > > > 
> > > > Just to revive this thread.
> > > > 
> > > > I've been discussing this proposal in the context of a change we need
> > > > to make in the display, which is to introduce a wakelock for some
> > > > registers access in order to be able to remove the DMC trap.
> > > > 
> > > > We came to the conclusion that this new implementation is very much the
> > > > same as the forcewake proposal here.  From the display point-of-view,
> > > > we could simply have a new domain and range of registers to protect.
> > > > 
> > > > We _could_ implement a separate "wakelock" mechanism for the display
> > > > part, but that would be mostly duplicating the entire forcewake
> > > > implementation.
> > > > 
> > > > So, are there any plans to implement the current proposal? Or any other
> > > > plans related to the forcewake implementation for Xe? As I see it, the
> > > > "wakelock" implementation in the display depends on this.
> > > > 
> > > > Any thoughts?
> > > 
> > > Hi Luca, thanks for raising this again here.
> > 
> > Hi Lucas! Thanks for your comments.
> > 
> > 
> > > So, as I said in the past, I don't have strong opinions regarding the
> > > overall forcewake approach.
> > > 
> > > The since GT MMIO is not used very frequently, I don't see a problem of
> > > leaving that up to the caller to take the right domain when needed, or even
> > > the FORCEWAKE_ALL domain. Instead of forcing all the callers to
> > > go through this extra steps and then have to opt-out with the '_fw' [sic]
> > > alternatives for the cases where the forcewake cannot be checked underneath.
> > > 
> > > So, for the new display wakelocks, that's the problem of adding them to
> > > the drivers/gpu/drm/xe/compat-i915-headers/intel_uncore.h where the converions
> > > from i915's intel uncore are happening to xe_mmio? So, when using the
> > > intel_uncore's '_fw' variant you skip the wakelock and when doing the regular
> > > mmio calls you just add a wakelock_get/put around the xe_mmio call with
> > > the display domain. And in i915 you implement inside the intel_uncore.
> > > 
> > > What's the downside of this approach?
> > 
> > That's more or less what I was thinking.  I don't think there's any
> > problem on the display side in calling the same framework.
> > 
> > 
> > > Trying to understand all the pros and cons of different approaches, and
> > > bringing some people to the discussion.
> > > 
> > > If we implement the forwareke underneath the mmio call, display needs
> > > to anyway implement it twice on the i915's intel_uncore and on the
> > > xe_forcewake, no?!
> > 
> > Yes, IMHO we should start making Xe the base implementation and
> > backporting the new implementation into i915.
> > 
> > So,for this specific case, I think we can just make i915 call the new
> > functions in xe and have them backported in intel_uncore.h for i915.
> > 
> > What I was thinking was basically this:
> > 
> > 1. Implement the xe_mmio_for_<hw_unit>() proposal in Xe​
> 
> hmm probably a new component then?! xe_display_wakelock that
> implements and export the get/put domain in a similar way
> of xe_forcewake, and then on the wrapper you add the get/put
> around the existing xe_mmio calls?!

Yes, I guess a new component, but part of Xe.  I don't exactly know
what you mean with "xe_forcewake", but it's the same as Matt proposed
with the different functions for different domains.  These functions
would do everything that is needed for the specific domain, such as
keeping it awake, checking if the registers are in the right range etc.


> > 2. Display code uses new APIs (xe_mmio_for_display ops)​
> > 
> > 3. Update uncore to handle "wakelock domain"​
> 
> then on uncore you would need to parse the mmio range and then
> call the get and puts, that in that case would be static inside
> the uncore itself like the forcewakes ?!

The callers will use the new Xe APIs, so in i915, the display code (and
probably GT and others too) will be the same as in Xe.  I see it as a
backport.  We can create wrappers that call the uncore APIs to achieve
the same functionality as before.

There are things that we may not need to be backported, for instance,
we don't really need the new wakelock that is needed in LNL (and future
HW), because i915 may not need to support it.  So we can just make a
dummy wrapper for this.  And it should still be easy to properly
backport if needed.

Do you see any problem with this approach?

--
Cheers,
Luca.


More information about the Intel-xe mailing list