[Mesa-dev] i965 implementation of the ARB_shader_image_load_store built-ins. (v3)

Fri May 15 09:51:37 PDT 2015

Jason Ekstrand <jason at jlekstrand.net> writes:

> I haven't said much about this series up until now.  I've mostly sat
> and watched and focused my time on other things.  As I said in the
> meeting today, I think that part of the problem here is that there are
> at least 3 "refactors" in here besides the new feature.  By
> "refactors", I mean "new ways of solving problem X".  Part of the
> problem with the discussion is that all of this has been conflated and
> we aren't actually talking about how to solve each of those problems.
> I'm going to try and break it out here to try and aid in the
> discussion.
>
Thanks a lot Jason for giving a more constructive turn to this
discussion. :)

> 1) The builder.  I think *everyone* likes the builder.  The argument
> here is not over whether or not we want it.  People have expressed
> concern that adding the builder now without actually doing the
> refactor will result in duplicated code that will get out-of-sync.  I
> think the best solution here is to go ahead and do the builder stuff
> immediately after the branch point.
>
Agreed.

> 2) SIMD8 instruction splitting.  As I said in the call, we've done
> this a variety of ways in the past usually by emitting the split
> instructions in the visitor and manually using half() and
> inst->force_sechalf.  We also have a few places that I think still
> split the code in the generator.  What you're doing is a somewhat more
> elegant version of the "emit it in the visitor" stuff that we've done
> in the past.  You've got your zip/unzip functions etc. to handle
> splitting and combining and then you have a loop to actually emit the
> sends.  Really, it's fairly nice and concise.  The question is whether
> or not it's really what we want to do (more on this in a bit).
>
There is another closely related problem that this infrastructure solves
and may not be obvious in this series (it was in my original branch
though [1]), but I think it's worth keeping in mind, I'll call it 2.1:
Some of the surface messages lacked SIMD4x2 variants, so a strided copy
is required to transpose the original AoS vector into SoA form.  This
turns out to be algorithmically identical to the strided copy required
to "unzip" a SIMD16 vector, so it can be accomplished without additional
effort.

These transformations are largely transparent for the functions building
typed/untyped surface messages, they are hidden behind an interface that
doesn't seem to have received any attention yet:

 - struct vector_layout: This represents the layout of a vector in a
   payload.  It's initialized based on the restrictions of the hardware
   for the specific message we want to send, and the current code
   generation parameters (i.e. SIMD mode).

 - emit_insert(): This takes a vector with the layout native to the EU
   (e.g. an fs_reg) and performs the required transformations to turn it
   into a form suitable for the shared unit according to a vector_layout
   struct (what may involve unzipping, a strided copy or no work at
   all).

 - emit_extract(): The converse of emit_insert().  This takes a payload
   fragment (array_reg) and transforms it into a vector of native
   layout, according to a vector_layout struct.

My intention was to come up with an interface more general than SIMD16
vector splitting and SIMD4x2 to SIMD8 conversion alone, I was also
thinking texturing instruction payload quirks (which might require
padding or different permutations of the same values depending on the
generation even if the SIMD mode is natively supported -- let's call
this 2.2) and framebuffer writes (2.3).

> 3) Array reg.  This is something of a helper to (2).

In fact the primary motivation for array_reg wasn't instruction
splitting, but emit_collect(), more on that later.

> I agree that it's nice to be able to have access to the size of a
> register without having to look at the allocator.  The problem here
> (and why we don't have a size in fs_reg) is that we copy the reigster
> around.  Every register does have some sort of "underlying storage"
> that it shares with other registers with the same number.  But that
> storage is represented by the size in the allocator and nothing else.
> Is that bad?  Maybe.  Should we have a way of passing the size around
> with it?  Maybe.  In any case, I think it's best to table this
> discussion until we've talked a bit about (2) because I think the
> resolution there will make (3) more clear.
>
I think the mechanism we use to determine the size of a register hardly
ever involves the allocator, and there are good reasons not to rely on
it: It only works for VGRFs, can easily break with register coalescing,
and it makes "slicing" and other zero-copy transformations of registers
difficult (e.g. taking an individual element out of a VGRF array without
copying it into a one-component register).  Typically the size of a
register is implicitly assumed by the instruction using it, and where
it's not (because the instruction expects a variable length payload) it
is provided separately from the register via "mlen".  Neither of these
two options works here as will be obvious shortly.

> One note on 3 before we table it.  People have been pulling
> emit_collect out as a straw-man and beating on it but I really do like
> it in principle.  There are piles of times where you want a payload
> and you have to allocate an array, do stuff, and then put it into the
> payload and it's really annoying.  It would be much easier if we had a
> helper that just did it all in one function and that's what
> emit_collect tries to do.  *Thank you* for finally trying to solve
> that problem and make payloads less painful!  However, I do wish it
> were implemented a bit differently but I'll make those comments on the
> patch itself.
>

Yeah.  I think it was Matt who mentioned in the call that emit_collect()
is really complex.  I completely agree with that statement.  It needs to
take care of a bunch of things to build the payload correctly:

 - Allocate a register that will hold the result, calculating the
   correct total size based on the amount of data we want to
   concatenate.

 - Calculate the correct header size.

 - Allocate and release memory of the correct size to hold the array of
   LOAD_PAYLOAD sources, also based on the amount of data we want to
   concatenate.

 - Split up the arguments to be passed to LOAD_PAYLOAD as sources in
   scalars of the correct width (which depends on whether the argument
   is part of the header or not, and what the dispatch width is), and
   initialize the array.

 - Create a LOAD_PAYLOAD instruction with the result of the previous
   steps.

Each one of these tasks is non-trivial, repetitive, can conceivably go
wrong, and until now duplicated wherever we had to build a payload --
obscuring the actually meaningful work being done.  This is why I think
that the assertion that emit_collect() makes code difficult to
understand is backwards.  Thanks to emit_collect() the task of building
a payload is tremendously simplified.

For emit_collect() to be possible there has to be some way to represent
a register+size pair, because it needs to know the size of each
argument, and because its result is itself a register+size pair -- The
caller cannot do anything useful with a payload of unknown size, unless
it re-calculates the size of the result what would defeat much of the
purpose of emit_collect().

The biggest benefit from being able to represent a register+size pair as
a value is that emit_collect() can easily be composed with other
functions (including with itself), so the construction of specific
chunks of the payload can easily be decoupled and re-used.  My code
takes advantage of this feature extensively in order to share code that
constructs surface message headers or transforms some value into the
form expected by the shared unit (doing vector splitting or a strided
copy).

> ### SIMD16 Instruction Splitting ###
>
> SIMD16 instruction splitting is an unfortunate fact of our hardware.
> There are a variety of times we have to do it including dual-source FB
> writes, some texturing, math ops on older gens and maybe another place
> or two.  Classically, this has been done in one of two places: The
> visitor as we emit the code, or the generator.  The problem with doing
> it in the generator is that we can't schedule it and, if it involves a
> payload, it's not really possible.  The result is that we usually do
> it in the visitor these days.
>
> Unfortunately, even in the visitor, it's gen-specific and annoying.
> It gets even worse when you're working with something such as the
> untyped surface read/write messages that work with multi-component
> values that have to be zipped/unzipped to use Curro's terminology.
> Curro came up with some helper functions to make it substantially less
> annoying but it still involves nasty looping.
>
> At some point in the past I proposed a completely different and more
> solution to this problem.  Unfortunately, while I've talked to Matt &
> Ken about it, it's never really been discussed all that publicly so
> Curro may not be aware of it.  I'm going to lay it out here for
> Curro's sake as well as the sake of public record.
>
> The solution involves first changing the way we handle sends into a
> two step process.  First, we emit a logical instruction that contains
> all of the data needed for the actual instruction.  Then, we convert
> from the logical to the actual in a lowering pass.  Take, for example,
> FB writes with which I am fairly familiar.  We would first emit a
> logical FS_FB_WRITE_# instruction that has separate sources for color,
> depth, replicated alpha, etc.   Then, in the lower_fb_writes pass
> (which we would have to implement), we would construct the payload
> from the sources provided on the logical instruction and emit the
> actual LOAD_PAYLOAD and FB_WRITE instructions.  This lower_fb_writes
> function would then get called before the optimization loop so that
> the rest of the optimization could would never see it.
>
> Second, we add a split_instruction helper that would take a SIMD16
> instruction and naively split it into two SIMD8 instructions.  Such a
> helper really shouldn't be that hard to write.  It would have to know
> how to take a SIMD16 vec4 and unzip it into two SIMD8 vec4's but that
> shouldn't be bad.  Any of these new logical send instructions would
> have their values as separate sources so they should be safe to split.
>
> Third, we add a lower_simd16_to_simd8 pass that walks the
> instructions, picks out the ones that need splitting, and calls
> split_instruction on them.  All of the gen-specific SIMD8 vs. SIMD16
> knowledge would be contained in this one pass.  This pass would happen
> between actually emitting code and running lower_fb_writes (or
> whatever other send lowering passes we have).
>
> Finally, and this is the icing on the cake, we write a
> lower_simd32_to_simd16 pass that goes through and lowers all SIMD32
> instructions (that act on full-sized data-types) to SIMD16
> instructions.  Once the rest of the work is done, we get this pass,
> and with it SIMD32 mode, almost for free.
>
> I know this approach looks like more work and, to be honest, it may
> be.  However, I think it makes a lot of things far more
> straightforward.  In particular, it means that people working on the
> visitor code don't have to think about whether or not an instruction
> needs splitting.  You also don't have to deal with the complexity of
> zipping/unzipping sources every time.  Instead, we put all that code
> in one place and get to stop thinking about it.  Also, if we *ever*
> want to get SIMD32, we will need some sort of automatic instruction
> splitting and this seems like a reasonable way to do it.
>
> I've talked to Ken about this approach and he's 100% on-board.  I
> don't remember what Matt thinks of it.  If we like the approach, then
> we should just split the tasks up and make it happen.  It's a bit of
> refactoring but it shouldn't be terrible.  If we wanted to demo it, I
> would probably suggest starting with FB writes as those are fairly
> complex but yet self-contained.  They also have a case where we do
> split an instruction so it would be interesting to see what the code
> looks like before and after.
>

I generally like your proposal.  I guess the question we need to answer
is whether we want this complexity to be in a lowering pass or in a
helper function used to build the send-like instruction -- In either
case we need code to handle zipping and unzipping of SIMD16 vectors,
it's just about whether this code is called by a lowering pass or higher
up in the visitor.

I can think of several benefits of the approach you propose over mine:

 - It's more transparent for the visitor code emitting the message -- I
   completely agree with you that the explicit loops are rather ugly.

 - Instructions with explicit separate sources are likely to be more
   suitable for certain optimization passes.  Pull constant loads use a
   similar approach with an expression-style opcode which is at some
   point lowered to a load payload and send message.  This may not be
   terribly important at this point because of the optimizations already
   performed in GLSL IR and NIR and due to the nature of the majority of
   opcodes that don't support SIMD16, but it still seems appealing.

Some disadvantages come to my mind too:

 - It increases the amount of work required to add a new send-like
   instruction because you need lowered and unlowered variants for each.

 - It seems tricky to get right when splitting an instruction in halves
   involves changing the actual contents of the payload beyond zipping
   and unzipping its arguments -- This might not seem like a big deal
   right now, but it will be a problem when we implement SIMD32.  The
   surface messages that take a sample mask as argument are a good
   example, because they only have 16 bits of space for it so you
   actually need to provide different values depending on the "slot
   group" the message is meant for.  This can be worked around easily in
   the visitor by shifting the sample mask register but it seems harder
   to fix up later.

 - My intention was to handle vector layout peculiarities beyond
   SIMDN-to-SIMD8 splitting under a consistent framework, leaving it
   under the control of the helper function that builds the message
   which quirks to apply.  With the lowering pass approach you'd either
   need message-specific lowering passes, message-specific code in a
   general lowering pass, or some way for the builder code to pass
   vector_layout-like metadata to the lowering pass -- In the first two
   cases the policy on how to format each source in the payload would be
   "delocalized" between the visitor and lowering pass, what I don't
   particularly like.

> Thoughts?
>
> Question for Curro: Supposing for the moment that we decide to do
> SIMD16 -> SIMD8 splitting this way, how much of the array_reg stuff
> would still be needed for image_load_store?
>
I think I've already answered this question, but I'll answer it again
here for clarity -- It would still help with building and passing around
payloads concisely using emit_collect(), regardless of how we implement
SIMD16 instruction splitting.

> I hope this e-mail provides us with some good talking points and a way forward.
> --Jason

Thanks Jason!

[1] http://cgit.freedesktop.org/~currojerez/mesa/log/?h=image-load-store
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 212 bytes
Desc: not available
URL: <http://lists.freedesktop.org/archives/mesa-dev/attachments/20150515/059c6d6e/attachment.sig>