[Intel-xe] [PATCH 11/17] drm/xe: Introduce xe_reg_t

Matt Roper matthew.d.roper at intel.com
Wed Apr 19 23:14:45 UTC 2023


On Wed, Apr 19, 2023 at 02:09:20PM -0700, Lucas De Marchi wrote:
> On Wed, Apr 19, 2023 at 01:24:31PM -0700, Matt Roper wrote:
> > On Wed, Apr 19, 2023 at 01:19:49PM -0700, Lucas De Marchi wrote:
> > > On Wed, Apr 19, 2023 at 03:30:17PM -0400, Rodrigo Vivi wrote:
> > > > On Wed, Apr 19, 2023 at 11:49:56AM -0700, Lucas De Marchi wrote:
> > > > > On Wed, Apr 19, 2023 at 10:33:45AM -0700, Matt Roper wrote:
> > > > > > On Wed, Apr 19, 2023 at 12:44:34AM -0700, Lucas De Marchi wrote:
> > > > > > > Stop using i915 types for register our own xe_reg_t. Differently from
> > > > > > > i915, this will keep under this will keep under the register definition
> > > > > > > the knowledge for the different types of registers. For now, the "flags"
> > > > > > > are mcr and masked, although only the former is being used.
> > > > > > >
> > > > > > > Most of the driver is agnostic to the register differences. Convert the
> > > > > > > few places that care about that, namely xe_gt_mcr.c, to take the generic
> > > > > > > type and warn if the wrong register is used.
> > > > > >
> > > > > > The disadvantage of this approach is that we don't get the nice
> > > > > > type-checking that we have in i915 to catch register misuse at build
> > > > > > time.  Instead we wind up with a bunch of run-time checks that only tell
> > > > > > you that you used the wrong register semantics after the fact.  Wouldn't
> > > > > > it be better to keep the strict types and let the compiler find mistakes
> > > > > > for us at build time?
> > > > > >
> > > > > > The only real problem with using strict types is that there are a few
> > > > > > places where the driver is just trying to refer to registers/offsets
> > > > > > without actually accessing them (e.g., whitelists, OA register groups,
> > > > >
> > > > > that is not true.
> > > > >
> > > > > > etc.) and the strict typing makes it harder to provide a heterogeneous
> > > > > > list of those registers.  But I still feel the benefits of compile-time
> > > > > > checking outweighs the extra complexity we wind up with in a handful of
> > > > > > places.
> > > > >
> > > > > I disagree here because the strict type checking is only checking for
> > > > > "regular vs mcr". Add "masked" in the mix (that has been a huge source
> > > > > of mistakes in i915) and you already have 4 types(?) to deal with.
> > > > >
> > > > > There are also places in the code where we want that "this is an
> > > > > mcr/regular/masked register" info to be used later. See how the reg-sr is
> > > > > handling that.
> > > > >
> > > > > In the end I believe having this info encoded in the regs/*.h is
> > > > > sufficient to avoid the mistakes we had in the past.
> > > > >
> > > > > Any change in the code triggering these warnings has a very easy an
> > > > > actionable fix.
> > > > >
> > > > > If for some reason we want to only differentiate mcr/normal as strict
> > > > > type checks,  then I think we need to provide a better struct for the
> > > > > places that don't care about that. Something like below maybe
> > > > >
> > > > > 	typedef {
> > > > > 		xe_reg_t reg;
> > > > > 	} xe_reg_mcr_t;
> > > > >
> > > > > but reg.mcr still being there. Or going the other way around:
> > > > >
> > > > > 	typedef union {
> > > > > 		xe_reg_normal_t normal;
> > > > > 		xe_reg_mcr_t	mcr;
> > > > > 	} xe_reg_t;
> > > > >
> > > > > but xe_reg_normal_t and xe_reg_mcr_t actually having the same defintion.
> > > > >
> > > > > Lucas De Marchi
> > > >
> > > > I honestly liked the approach you took on this patch here. Getting rid of
> > > > the i915_mcr is already a win by itself.
> > > >
> > > > One case that came to my mind though is some time sensitive mmio operations
> > > > like display inside vblank areas. any runtime check could potentially
> > > > increase latency, no?! If that's irrelevant time, than let's move with this
> > > > but if it is really a problem, then explore your latest suggestion here
> > > > with the compiler doing the checks for us seems a good alternative.
> > > 
> > > the additional check is inside xe_gt_mcr.c... I think the additional
> > > check is negligible compared to the other operations there, with each
> > > access to reg X requiring other 2 mmio accesses to take place.
> > 
> > The super critical checks (which don't exist yet in this series; they'd
> > presumably be added in a follow-up series) are the ones to make sure you
> > don't try to call xe_mmio_{read,write}32() on an MCR register.  Doing
> > that can lead to all kinds of subtle bugs (which become even more
> > problematic on the more recent platforms), and unfortunately that kind
> > of mistake is what people are going to write in the code 99% of the time
> > out of habit.
> 
> I'm not sure when that would happen... when the next GPU changes a register
> from normal to MCR, we rightfully updated the header to update its type
> and forgot to change the calls using that register?

It happened really frequently on i915 until we finally added the strict
typing.  It's not a matter of adding new platforms, it's a matter of
just writing new code in the driver (generally for platforms we already
have and with register definitions that already exist).  When someone
needs to add a new register read/write operation for something they're
working on, they'll generally reach for xe_mmio_*() out of habit (or
intel_uncore_*() on i915) since that's what gets used for 90% of
register accesses.  If the register is actually an MCR register, and the
compiler doesn't catch this for you, then you just wind up with
incorrect semantics at runtime.

Of course nothing is foolproof.  If the read/write is against a
completely new register, and the patch is adding both an incorrect
register definition _and_ an incorrect usage at the same time, then the
compiler won't help there.  Although the fact that a completely new
register is being defined makes it slightly more likely that the
reviewers will scrutinize the register details more carefully and catch
the incorrect definition.  It's a shame that the MCR nature of registers
isn't documented more clearly in the bspec...

Cases where the same register changed type between platforms is also a
problematic case that will never be foolproof since we wind up with
multiple definitions for the same register offset; if you pick the wrong
definition _and_ the wrong access semantics, then the compiler won't be
able to catch that either.  But again, the multiple definitions with
different types will still make it more likely that the developer and/or
reviewer take a closer look and realize the mistake.

> 
> Currently xe_mmio_* receives u32 and we just do FOO.reg, with reg being the

Right, this is obviously completely wrong and was something I raised as
a concern months ago when the first Xe megapatch got posted.  I started
typing up a fix for that a few weeks ago but got sidetracked and didn't
finish it.  xe_mmio_* should obviously be accepting i915_reg_t (or a new
Xe-specific equivalent) rather than a u32.  That will protect against
both the 'misordered parameters' problem Jani mentioned, as well as the
'used wrong function on an MCR register' problem that was problematic on
i915.

> name both in i915_reg_t and i915_mcr_reg_t. So, right now the different
> types are not helping anything.  I'm planning to add xe_reg_t to
> xe_mmio_* and have a companion xe_mmio_raw_* for the cases we use a u32
> (one of the reasons being because xe_gt_mcr.c itself also needs to call
> xe_mmio_*)
> 
> How does the type protect us against "daily task of enabling the next
> generation of GPUs?":
> 
> 	- gpu version N+1 changed register FOO from normal to MCR.
> 	  we didn't notice the register changed. Type doesn't help
> 	  because we are still using the old one.  If we did notice,
> 	  the type would only help if adopted the approach "change the
> 	  type", build and see what breaks to fix it, which is IMO
> 	  worse than a git-grep approach.
> 
> That would mean the hw changed a register from being MCR to being a
> normal register which is more rare than the oppposite: a normal register
> becoming MCR.

The strict type safety protects us in both directions.  xe_mmio_* only
takes unicast registers (after the obvious fixes to replace the u32
parameters), xe_gt_mcr_* only takes MCR registers.  If you pass the
wrong register type to either type of function, the compiler catches it
immediately.  As you noted, it's still possible to combine two bugs
(incorrect register definition + wrong usage semantics) to sneak by the
compiler; that will surely still happen, but we've raised the bar and
made it harder.

Having registers change between MCR and non-MCR between platforms is
still a problem, but not the most common one based on past history.  The
much more common source of problems is people just writing new code that
uses registers we already have in the driver.  Assuming the definition
is correct, the new code has to be correct too to get past the compiler.
If all registers become the same type, as you're moving toward in this
series, you could also probably get similar guardrails by turning
xe_mmio_*() into macros that do a BUILD_BUG_ON() before calling the a
real __xe_mmio_*() function, although then you run into issues anywhere
that non-constant register variables get passed up/down the callstack.

> 
> Even if we do the strict type, it doesn't protect us from such mistakes
> when we have a normal register becoming MCR. We will have this:
> 
> 	#define FOO	XE_REG(10)
> 
> and will continue to happily call xe_mmio_* even if in version X of the
> GPU the register is now mcr.

When enabling a new platform and implementing the MCR stuff for the
first time, we do need to be careful to not only add the new ranges to
the driver, but also to go analyze any registers that live in a range
that changed.  If there's code that will still be executed that works on
those registers, then we need to figure out how that code needs to
change.  We can't automatically assume that writes should become
broadcast_writes and that reads should become read_any's (although
that's the most common outcome); in some situations we actually need to
re-write the code to loop through the various hardware instances and
handle them all independently.

> 
> Hence my point: the header with the register definition must be the
> source of truth on the different register behaviors*.  The additional
> type checks only help in some cases, and make it harder to handle
> others.

With xe it seems like we have a lot fewer areas where multiple types
become harder.  IIRC, the biggest pain points on i915 were GVT and
cmd_parser since both of those needed to provide giant lists of
registers with MCR and non-MCR mixed together.  Both of those are no
longer relevant to Xe, so I think the main areas that will still have a
bit of pain are:
 * OA/perf register lists
 * userspace whitelist (although I'm not sure if we ever whitelist MCR
   registers...userspace would have difficult steering to specific
   instances)

> 
> On the other hand, I think the different types would be handy when/if we
> have the following:
> 
> 	#define xe_mmio_read32(reg) _Generic((reg),			\
> 		xe_reg_t: xe_mmio_raw_read32((reg).reg),		\
> 		xe_reg_mcr_t: xe_gt_mcr_read32(reg),

Unfortunately there's a trap here.  There are multiple ways to read (and
write) a multicast register and it's very important that you explicitly
pick the proper semantics to use.  Sometimes you want just want to read
any random instance's value (e.g., when verifying that workarounds
stuck), but in other cases you need to read and process every single
instance independently (various error reporting, INSTDONE, etc.
registers).  Likewise, sometimes it's okay to broadcast a single value
into every instance of the register, but other times we potentially need
to write different values into each instance, or even do per-instance
RMW's to avoid clobbering other bitfields.


Matt

> 	)
> 
> because that would eliminate an entire class of bugs, encoding in each
> call place what is the right function to be called based on the
> declaration in the header, with no additional overhead...
> 
> Lucas De Marchi
> 
> * https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/204 and ideally
>   be generated from the spec with proper filters in place so we don't
>   blow out the size of our headers.  My goal is to clean our headers and
>   have them in a state that would be 90% identical to something
>   generated by a script parsing the spec.
> 
> > 
> > 
> > Matt
> > 
> > > 
> > > Besides, there is no such user in display/
> > > 
> > > 	$ git grep xe_gt_mcr -- drivers/gpu/drm/xe/display/
> > > 
> > > 
> > > Lucas De Marchi
> > 
> > -- 
> > Matt Roper
> > Graphics Software Engineer
> > Linux GPU Platform Enablement
> > Intel Corporation

-- 
Matt Roper
Graphics Software Engineer
Linux GPU Platform Enablement
Intel Corporation


More information about the Intel-xe mailing list