[Intel-xe] [RFC] drm/xe: A minimal assert for some forcewake domain on xe_mmio.

Rodrigo Vivi rodrigo.vivi at intel.com
Mon Oct 16 18:09:05 UTC 2023


On Mon, Oct 16, 2023 at 01:22:32PM +0300, Luca Coelho wrote:
> On Thu, 2023-10-12 at 16:04 -0400, Rodrigo Vivi wrote:
> > On Tue, Oct 10, 2023 at 10:00:06AM +0300, Luca Coelho wrote:
> > > On Mon, 2023-10-09 at 17:15 -0400, Rodrigo Vivi wrote:
> > > > On Mon, Oct 09, 2023 at 12:22:44PM +0300, Luca Coelho wrote:
> > > > > Hi everyone,
> > > > > 
> > > > > On Tue, 2023-05-23 at 23:52 +0000, Matthew Brost wrote:
> > > > > > On Tue, May 23, 2023 at 09:38:49AM -0700, Matt Roper wrote:
> > > > > > > On Tue, May 23, 2023 at 01:56:53PM +0000, Matthew Brost wrote:
> > > > > > > > On Mon, May 22, 2023 at 08:28:05PM -0700, Matt Roper wrote:
> > > > > > > > > On Mon, May 22, 2023 at 06:15:27PM -0400, Rodrigo Vivi wrote:
> > > > > > > > > > It is the maximum protection we can do with the current infrastructure.
> > > > > > > > > > 
> > > > > > > > > > Cc: John Harrison <John.C.Harrison at Intel.com>
> > > > > > > > > > Cc: Matthew Auld <matthew.auld at intel.com>
> > > > > > > > > > Cc: Francois Dugast <francois.dugast at intel.com>
> > > > > > > > > > Cc: Lucas De Marchi <lucas.demarchi at intel.com>
> > > > > > > > > > Cc: Jani Nikula <jani.nikula at intel.com>
> > > > > > > > > > Cc: Ville Syrjälä <ville.syrjala at linux.intel.com>
> > > > > > > > > > Cc: Matt Roper <matthew.d.roper at intel.com>
> > > > > > > > > > Cc: Matthew Brost <matthew.brost at intel.com>
> > > > > > > > > > Cc: Maarten Lankhorst <maarten.lankhorst at linux.intel.com>
> > > > > > > > > > Signed-off-by: Rodrigo Vivi <rodrigo.vivi at intel.com>
> > > > > > > > > > ---
> > > > > > > > > > 
> > > > > > > > > > 
> > > > > > > > > > RFC
> > > > > > > > > > ===
> > > > > > > > > > 
> > > > > > > > > > Okay, so, this is more an RFC to brainstorm the future of the force_wake in
> > > > > > > > > > Xe than anything else.
> > > > > > > > > > 
> > > > > > > > > > On i915 the force_wake is built-in the mmio functions at uncore component.
> > > > > > > > > > 
> > > > > > > > > > With that approach we had few historical issues iirc:
> > > > > > > > > > 
> > > > > > > > > > 1. Display performance with vblank evasion that requested uncore to provide
> > > > > > > > > > the 'fw' variantes that are actually the ones that avoid fw (contrary to what
> > > > > > > > > > the name suggests).
> > > > > > > > > 
> > > > > > > > > In i915 there were more differences between fw and non-fw variants
> > > > > > > > > of register functions than just forcewake handling:
> > > > > > > > > 
> > > > > > > > >  * _fw() functions assumed that the caller was already holding
> > > > > > > > >    forcewake, whereas the non-fw functions would obtain it themselves
> > > > > > > > >  * _fw() functions also assumed that the caller was already holding
> > > > > > > > >    uncore->lock, whereas the non-fw functions would obtain and release
> > > > > > > > >    the lock around each register access
> > > > > > > > >  * _fw() functions do no tracing and no debug assertions
> > > > > > > > >  * _fw() functions do not check for unclaimed MMIO
> > > > > > > > > 
> > > > > > > > > I don't think the first bullet there (forcewake) really mattered much
> > > > > > > > > from a performance perspective.  For display registers, a quick binary
> > > > > > > > > search of the FW table (just a handful of CPU cycles) would quickly
> > > > > > > > > determine that no forcewake domains were needed for those register
> > > > > > > > > offsets, so we wouldn't be doing anything with forcewake at the hardware
> > > > > > > > > level at all.  For display registers, the more performance-relevant
> > > > > > > > > aspects of using the _fw() functions was doing all your register
> > > > > > > > > accesses together without contention with other MMIO work.  I.e.,
> > > > > > > > > holding the uncore lock over an entire set of registers rather than
> > > > > > > > > grabbing/releasing it for each one, and (for vblank evasion
> > > > > > > > > specifically) doing it all while interrupts were disabled.  For debug
> > > > > > > > > drivers, there was also a bunch of other stuff that the _fw() functions
> > > > > > > > > bypassed (e.g., tripling of the number of register accesses due to
> > > > > > > > > reading FPGA_DBG before/after each register to look for unclaimed
> > > > > > > > > accesses).
> > > > > > > > > 
> > > > > > > > > > 
> > > > > > > > > > 2. Missed ranges updates. Sometimes we messed up with the ranges, there were
> > > > > > > > > > other times that the spec was updated and we didn't get the notification, and
> > > > > > > > > > there were cases that the BSpec had bugs.
> > > > > > > > > > 
> > > > > > > > > > For these reasons in Xe we have decided to let to the caller the
> > > > > > > > > > responsibility to set the force_wake bits for their domains before doing
> > > > > > > > > > the MMIO.
> > > > > > > > > 
> > > > > > > > > I don't think #2 was a reason to skip forcewake in Xe.  Not noticing a
> > > > > > > > > bspec update (or the bspec itself having incorrect information) would
> > > > > > > > > lead to even more mistakes if the caller has to explicitly grab the
> > > > > > > > > appropriate domains than if the driver does it implicitly.  The explicit
> > > > > > > > > handling only helps in the subset of cases where we blindly grab all of
> > > > > > > > > the domains (as we do during part of init).
> > > > > > > > > 
> > > > > > > > > 
> > > > > > > > 
> > > > > > > > MMIO access that requires forcewakes really only should be in a few
> > > > > > > > places, off the top of my head:
> > > > > > > > 
> > > > > > > > 1. Init
> > > > > > > > 2. Reset
> > > > > > > > 3. Sysfs / Debugfs
> > > > > > > > 
> > > > > > > > In all of these cases I think it fine to blindly grab all forcewakes.
> > > > > > > > 
> > > > > > > > Please chime in if I'm missing something.
> > > > > > > 
> > > > > > > I think as we implement more features in the Xe driver we're going to
> > > > > > > wind up with a lot more places where we need to access GT registers at
> > > > > > > runtime.  A few examples off the top of my head:
> > > > > > > 
> > > > > > >  * EU stall sampling
> > > > > > >  * Perf/OA
> > > > > > >  * EU debugger
> > > > > > >  * Various OOB workarounds that ask us to twiddle a register at certain
> > > > > > >    times in the driver.
> > > > > > 
> > > > > > Hmm, ok a lot of these I could make the argument that these are not
> > > > > > normal operation so just grab everything and call it a day. If some of
> > > > > > these fall into the category of 'normal enough to be optimized' I can
> > > > > > also make the argument it is no more complex (perhaps less complex) to
> > > > > > explicitly grab the forcewake than having the logic auto-grab the
> > > > > > forcewake.
> > > > > > 
> > > > > > What a compromise of in a debug mode we auto-generate the asserts based
> > > > > > on a table but we still explicitly grab the forcewake?
> > > > > > 
> > > > > > > 
> > > > > > > > 
> > > > > > > > > > 
> > > > > > > > > > However I'm seeing many questions and doubts popping up:
> > > > > > > > > > 
> > > > > > > > > > 1. Are we confident that we are not missing any wake-up?
> > > > > > > > > > 2. Are we confident that the domains are set correctly?
> > > > > > > > > > 3. Are we not wasting power if we are waking up ALL the domains instead
> > > > > > > > > >    of some specific one?
> > > > > > > > > 
> > > > > > > > > I think we're only holding ALL domains during init; for most of the
> > > > > > > > > runtime MMIO accesses, I think the code is currently attempting to only
> > > > > > > > > grab the domain(s) that it thinks the registers it's accessing will
> > > > > > > > > need.  Although that might be working right now, I'm a little bit
> > > > > > > > > worried about how that will scale in the long term when a single
> > > > > > > > > register might be in several different domains depending on what
> > > > > > > > > platform you're running on.  There have definitely been cases in the
> > > > > > > > > past where groups of registers migrated from RENDER to GT or vice versa
> > > > > > > > > between platforms, so the exact domain you needed for an operation
> > > > > > > > > varied by platform.  And things are even more complicated if you're
> > > > > > > > > doing any MMIO against the media, since the hardware seems to change
> > > > > > > > > exactly how it splits up the media power wells somewhat often.
> > > > > > > > > 
> > > > > > > > > > 4. What about the display disconnection now because i915 and xe have different
> > > > > > > > > >    mmio approaches but reusing i915-display?
> > > > > > > > > > 
> > > > > > > > > > It looks to me that the cons of the current approach are superseding the
> > > > > > > > > > cons of the i915 approach. But I want to hear more thoughts here before
> > > > > > > > > > we decide which route to take.
> > > > > > > > > > 
> > > > > > > > > > Maybe we have that domain as part of the mmio calls themselves? Maybe
> > > > > > > > > > a double approach where the caller is responsible but mmio has the range
> > > > > > > > > > information and double check it?
> > > > > > > > > 
> > > > > > > > > As noted above, part of the challenge with forcewake is that even if the
> > > > > > > > > caller knows it wants to access register FOO, and even if it know FOO is
> > > > > > > > > a GT register that likely needs forcewake, it's sometimes challenging to
> > > > > > > > > make sure it's grabbing the correct domain(s) for every single platform
> > > > > > > > > the Xe driver will eventually support if the power management handling
> > > > > > > > > changes.  I think that was part of the motivation for encoding the
> > > > > > > > > tables into the driver in i915.  It seems like GT power wells don't
> > > > > > > > > change as much these days as they used to, but it's hard to say whether
> > > > > > > > > that will continue in the future or not.  Who knows...maybe they'll
> > > > > > > > > eventually start creating dedicated domains for stuff like blitters,
> > > > > > > > > GuC, etc.  rather than lumping all of those into the "GT" catchall
> > > > > > > > > domain.
> > > > > > > > > 
> > > > > > > > > If we do decide to go back to implicit forcewake handling with the table
> > > > > > > > > encoded into the driver, it might be worth doing something sort of like
> > > > > > > > 
> > > > > > > > Let's not do this intel_uncore.c is an unreadable mess, I don't want
> > > > > > > > anything like this in Xe.
> > > > > > > 
> > > > > > > When you say unreadable, are you referring to the macros that generate
> > > > > > > the table lookup functions?  I don't think that macro magic would be
> > > > > > > needed for Xe since we don't have to support old platforms that have
> > > > > > > very different forcewake behavior, and we also don't need to generate
> > > > > > > separate 8-bit, 16-bit, and 32-bit versions of each operation anymore
> > > > > > > (since I think everything in the GT is 32-bits these days).
> > > > > > > 
> > > > > > 
> > > > > > Just the entire file is unreadable in general, trying to avoid these
> > > > > > types of files in Xe, avoid traps of the i915 did this so let's do it in
> > > > > > Xe, and over engineering our driver to avoid hypothetical future bugs.
> > > > > > 
> > > > > > Matt
> > > > > > 
> > > > > > > The forcewake and shadow tables themselves are pretty clean in i915
> > > > > > > these days, and if we move to autogenerating them from a text file, they
> > > > > > > can become even simpler since the text file will basically be a cleaned
> > > > > > > up copy/paste of part of the bspec table.
> > > > > > > 
> > > > > > > 
> > > > > > > Matt
> > > > > > > 
> > > > > > > > 
> > > > > > > > > what Lucas is doing in the OOB workaround series --- drop the
> > > > > > > > > per-platform tables into a human-readable text file that's more similar
> > > > > > > > > to the format used by the bspec (exact ranges, forcewake domain, MCR
> > > > > > > > > replication type, etc.) and then provide a small parser program that
> > > > > > > > > will convert that into actual code (and do things like consolidating
> > > > > > > > > adjacent ranges).
> > > > > > > > > 
> > > > > > > > > > 
> > > > > > > > > > Any other idea? Thoughts?
> > > > > > > > > 
> > > > > > > > > Once the big GT vs tile refactor stuff I have in flight gets finalized,
> > > > > > > > > I plan to follow up with another series that creates a more appropriate
> > > > > > > > > MMIO target for register operations rather than using "xe_gt" as the
> > > > > > > > > target (even for things completely unrelated to the small GT subset of
> > > > > > > > > hardware).  My idea is that you'd grab an MMIO target structure for MMIO
> > > > > > > > > operations against a specific hardware unit, and then the info inside
> > > > > > > > > the MMIO structure would be able to figure out if there are additional
> > > > > > > > > checks and/or operations it should perform.  E.g.,
> > > > > > > > > 
> > > > > > > > >  * mmio = xe_mmio_for_device(xe_device *xe);
> > > > > > > > >     - Used to submit MMIO operations against the root tile of the PCI
> > > > > > > > >       device.  Only used during init (e.g., to read the registers that
> > > > > > > > >       tell us which tiles exist on the platform) and during top-level
> > > > > > > > >       interrupt enable/disable (since all interrupts are routed through
> > > > > > > > >       the root tile).
> > > > > > > > >     - No forcewake needed for register accesses through a handle of this
> > > > > > > > >       type since you'd only ever be accessing sgunit registers for these
> > > > > > > > >       types of things.
> > > > > > > > >     - Register accesses through mmio can warn on debug builds if the
> > > > > > > > >       register appears to be in a GT-related MMIO range.
> > > > > > > > > 
> > > > > > > > >  * mmio = xe_mmio_for_display(xe_device *xe);
> > > > > > > > >     - Pretty much the same as handles returned by xe_mmio_for_device(),
> > > > > > > > >       but if a handle of this type is used to try to read/write
> > > > > > > > >       registers outside the display range, we could have debug builds
> > > > > > > > >       throw some extra warnings.
> > > > > > > > >     - Unclaimed register detection could be confined to accesses through
> > > > > > > > >       these handles.
> > > > > > > > > 
> > > > > > > > >  * mmio = xe_mmio_for_tile(xe_tile *tile);
> > > > > > > > >     - Used to access non-GT registers that reside in a specific tile.
> > > > > > > > >       I.e., sgunit/soc registers.
> > > > > > > > >     - As above, no forcewake needed, can make MMIO operations warn
> > > > > > > > >       if used to access a GT range.
> > > > > > > > > 
> > > > > > > > >  * mmio = xe_mmio_for_gt(xe_gt *gt);
> > > > > > > > >     - Used to access GT registers in a specific GT
> > > > > > > > >     - Does automatic GSI offset translation for media GTs
> > > > > > > > >     - Can either do automatic forcewake like i915 does, or can do debug
> > > > > > > > >       check+warn like you have here.
> > > > > > > > >     - Can make MMIO operations warn if MMIO offset is outside GT range
> > > > > > > > >     - Can also trigger warnings if a GT non-GSI, non-media engine
> > > > > > > > >       register is accessed from an MMIO obtained from a media GT.
> > > > > > > > > 
> > > > > > > > 
> > > > > > > > Ack on this if we keep in managable, worried code / feature bloat those
> > > > > > > > that will result in a similar file to intel_uncore.c in Xe.
> > > > > 
> > > > > Just to revive this thread.
> > > > > 
> > > > > I've been discussing this proposal in the context of a change we need
> > > > > to make in the display, which is to introduce a wakelock for some
> > > > > registers access in order to be able to remove the DMC trap.
> > > > > 
> > > > > We came to the conclusion that this new implementation is very much the
> > > > > same as the forcewake proposal here.  From the display point-of-view,
> > > > > we could simply have a new domain and range of registers to protect.
> > > > > 
> > > > > We _could_ implement a separate "wakelock" mechanism for the display
> > > > > part, but that would be mostly duplicating the entire forcewake
> > > > > implementation.
> > > > > 
> > > > > So, are there any plans to implement the current proposal? Or any other
> > > > > plans related to the forcewake implementation for Xe? As I see it, the
> > > > > "wakelock" implementation in the display depends on this.
> > > > > 
> > > > > Any thoughts?
> > > > 
> > > > Hi Luca, thanks for raising this again here.
> > > 
> > > Hi Lucas! Thanks for your comments.
> > > 
> > > 
> > > > So, as I said in the past, I don't have strong opinions regarding the
> > > > overall forcewake approach.
> > > > 
> > > > The since GT MMIO is not used very frequently, I don't see a problem of
> > > > leaving that up to the caller to take the right domain when needed, or even
> > > > the FORCEWAKE_ALL domain. Instead of forcing all the callers to
> > > > go through this extra steps and then have to opt-out with the '_fw' [sic]
> > > > alternatives for the cases where the forcewake cannot be checked underneath.
> > > > 
> > > > So, for the new display wakelocks, that's the problem of adding them to
> > > > the drivers/gpu/drm/xe/compat-i915-headers/intel_uncore.h where the converions
> > > > from i915's intel uncore are happening to xe_mmio? So, when using the
> > > > intel_uncore's '_fw' variant you skip the wakelock and when doing the regular
> > > > mmio calls you just add a wakelock_get/put around the xe_mmio call with
> > > > the display domain. And in i915 you implement inside the intel_uncore.
> > > > 
> > > > What's the downside of this approach?
> > > 
> > > That's more or less what I was thinking.  I don't think there's any
> > > problem on the display side in calling the same framework.
> > > 
> > > 
> > > > Trying to understand all the pros and cons of different approaches, and
> > > > bringing some people to the discussion.
> > > > 
> > > > If we implement the forwareke underneath the mmio call, display needs
> > > > to anyway implement it twice on the i915's intel_uncore and on the
> > > > xe_forcewake, no?!
> > > 
> > > Yes, IMHO we should start making Xe the base implementation and
> > > backporting the new implementation into i915.
> > > 
> > > So,for this specific case, I think we can just make i915 call the new
> > > functions in xe and have them backported in intel_uncore.h for i915.
> > > 
> > > What I was thinking was basically this:
> > > 
> > > 1. Implement the xe_mmio_for_<hw_unit>() proposal in Xe​
> > 
> > hmm probably a new component then?! xe_display_wakelock that
> > implements and export the get/put domain in a similar way
> > of xe_forcewake, and then on the wrapper you add the get/put
> > around the existing xe_mmio calls?!
> 
> Yes, I guess a new component, but part of Xe.  I don't exactly know
> what you mean with "xe_forcewake", but it's the same as Matt proposed
> with the different functions for different domains.  These functions
> would do everything that is needed for the specific domain, such as
> keeping it awake, checking if the registers are in the right range etc.
> 
> 
> > > 2. Display code uses new APIs (xe_mmio_for_display ops)​
> > > 
> > > 3. Update uncore to handle "wakelock domain"​
> > 
> > then on uncore you would need to parse the mmio range and then
> > call the get and puts, that in that case would be static inside
> > the uncore itself like the forcewakes ?!
> 
> The callers will use the new Xe APIs, so in i915, the display code (and
> probably GT and others too) will be the same as in Xe. 

I confess a got a bit lost now. What's GT need here? Does GT need this
new wakelock or is it a display only thing?

if it is a display only thing, we have a new xe_display_wakelock component.
The mmio wrappers that translate intel_de mmio calls to xe_mmio would make
usage of the xe_display_wakelock functions. And no body else. no need to
disrupt the working GT code.

But maybe I'm completely misunderstanding what the wakelock is about.

>  I see it as a
> backport.  We can create wrappers that call the uncore APIs to achieve
> the same functionality as before.
> 
> There are things that we may not need to be backported, for instance,
> we don't really need the new wakelock that is needed in LNL (and future
> HW), because i915 may not need to support it.  So we can just make a
> dummy wrapper for this.  And it should still be easy to properly
> backport if needed.
> 
> Do you see any problem with this approach?
> 
> --
> Cheers,
> Luca.


More information about the Intel-xe mailing list