[Mesa-dev] i965 implementation of the ARB_shader_image_load_store built-ins. (v3)

Wed May 13 12:00:09 PDT 2015

I haven't said much about this series up until now.  I've mostly sat
and watched and focused my time on other things.  As I said in the
meeting today, I think that part of the problem here is that there are
at least 3 "refactors" in here besides the new feature.  By
"refactors", I mean "new ways of solving problem X".  Part of the
problem with the discussion is that all of this has been conflated and
we aren't actually talking about how to solve each of those problems.
I'm going to try and break it out here to try and aid in the
discussion.

1) The builder.  I think *everyone* likes the builder.  The argument
here is not over whether or not we want it.  People have expressed
concern that adding the builder now without actually doing the
refactor will result in duplicated code that will get out-of-sync.  I
think the best solution here is to go ahead and do the builder stuff
immediately after the branch point.

2) SIMD8 instruction splitting.  As I said in the call, we've done
this a variety of ways in the past usually by emitting the split
instructions in the visitor and manually using half() and
inst->force_sechalf.  We also have a few places that I think still
split the code in the generator.  What you're doing is a somewhat more
elegant version of the "emit it in the visitor" stuff that we've done
in the past.  You've got your zip/unzip functions etc. to handle
splitting and combining and then you have a loop to actually emit the
sends.  Really, it's fairly nice and concise.  The question is whether
or not it's really what we want to do (more on this in a bit).

3) Array reg.  This is something of a helper to (2).  I agree that
it's nice to be able to have access to the size of a register without
having to look at the allocator.  The problem here (and why we don't
have a size in fs_reg) is that we copy the reigster around.  Every
register does have some sort of "underlying storage" that it shares
with other registers with the same number.  But that storage is
represented by the size in the allocator and nothing else.  Is that
bad?  Maybe.  Should we have a way of passing the size around with it?
 Maybe.  In any case, I think it's best to table this discussion until
we've talked a bit about (2) because I think the resolution there will
make (3) more clear.

One note on 3 before we table it.  People have been pulling
emit_collect out as a straw-man and beating on it but I really do like
it in principle.  There are piles of times where you want a payload
and you have to allocate an array, do stuff, and then put it into the
payload and it's really annoying.  It would be much easier if we had a
helper that just did it all in one function and that's what
emit_collect tries to do.  *Thank you* for finally trying to solve
that problem and make payloads less painful!  However, I do wish it
were implemented a bit differently but I'll make those comments on the
patch itself.

### SIMD16 Instruction Splitting ###

SIMD16 instruction splitting is an unfortunate fact of our hardware.
There are a variety of times we have to do it including dual-source FB
writes, some texturing, math ops on older gens and maybe another place
or two.  Classically, this has been done in one of two places: The
visitor as we emit the code, or the generator.  The problem with doing
it in the generator is that we can't schedule it and, if it involves a
payload, it's not really possible.  The result is that we usually do
it in the visitor these days.

Unfortunately, even in the visitor, it's gen-specific and annoying.
It gets even worse when you're working with something such as the
untyped surface read/write messages that work with multi-component
values that have to be zipped/unzipped to use Curro's terminology.
Curro came up with some helper functions to make it substantially less
annoying but it still involves nasty looping.

At some point in the past I proposed a completely different and more
solution to this problem.  Unfortunately, while I've talked to Matt &
Ken about it, it's never really been discussed all that publicly so
Curro may not be aware of it.  I'm going to lay it out here for
Curro's sake as well as the sake of public record.

The solution involves first changing the way we handle sends into a
two step process.  First, we emit a logical instruction that contains
all of the data needed for the actual instruction.  Then, we convert
from the logical to the actual in a lowering pass.  Take, for example,
FB writes with which I am fairly familiar.  We would first emit a
logical FS_FB_WRITE_# instruction that has separate sources for color,
depth, replicated alpha, etc.   Then, in the lower_fb_writes pass
(which we would have to implement), we would construct the payload
from the sources provided on the logical instruction and emit the
actual LOAD_PAYLOAD and FB_WRITE instructions.  This lower_fb_writes
function would then get called before the optimization loop so that
the rest of the optimization could would never see it.

Second, we add a split_instruction helper that would take a SIMD16
instruction and naively split it into two SIMD8 instructions.  Such a
helper really shouldn't be that hard to write.  It would have to know
how to take a SIMD16 vec4 and unzip it into two SIMD8 vec4's but that
shouldn't be bad.  Any of these new logical send instructions would
have their values as separate sources so they should be safe to split.

Third, we add a lower_simd16_to_simd8 pass that walks the
instructions, picks out the ones that need splitting, and calls
split_instruction on them.  All of the gen-specific SIMD8 vs. SIMD16
knowledge would be contained in this one pass.  This pass would happen
between actually emitting code and running lower_fb_writes (or
whatever other send lowering passes we have).

Finally, and this is the icing on the cake, we write a
lower_simd32_to_simd16 pass that goes through and lowers all SIMD32
instructions (that act on full-sized data-types) to SIMD16
instructions.  Once the rest of the work is done, we get this pass,
and with it SIMD32 mode, almost for free.

I know this approach looks like more work and, to be honest, it may
be.  However, I think it makes a lot of things far more
straightforward.  In particular, it means that people working on the
visitor code don't have to think about whether or not an instruction
needs splitting.  You also don't have to deal with the complexity of
zipping/unzipping sources every time.  Instead, we put all that code
in one place and get to stop thinking about it.  Also, if we *ever*
want to get SIMD32, we will need some sort of automatic instruction
splitting and this seems like a reasonable way to do it.

I've talked to Ken about this approach and he's 100% on-board.  I
don't remember what Matt thinks of it.  If we like the approach, then
we should just split the tasks up and make it happen.  It's a bit of
refactoring but it shouldn't be terrible.  If we wanted to demo it, I
would probably suggest starting with FB writes as those are fairly
complex but yet self-contained.  They also have a case where we do
split an instruction so it would be interesting to see what the code
looks like before and after.

Thoughts?

Question for Curro: Supposing for the moment that we decide to do
SIMD16 -> SIMD8 splitting this way, how much of the array_reg stuff
would still be needed for image_load_store?

I hope this e-mail provides us with some good talking points and a way forward.
--Jason