[Mesa-dev] i965 implementation of the ARB_shader_image_load_store built-ins. (v3)

Mon May 18 10:34:01 PDT 2015

Francisco Jerez <currojerez at riseup.net> writes:

> Jason Ekstrand <jason at jlekstrand.net> writes:
>
>> On May 15, 2015 2:40 PM, "Francisco Jerez" <currojerez at riseup.net> wrote:
>>>
>>> Jason Ekstrand <jason at jlekstrand.net> writes:
>>>
>>> > On Fri, May 15, 2015 at 9:51 AM, Francisco Jerez <currojerez at riseup.net>
>> wrote:
>>> >> Jason Ekstrand <jason at jlekstrand.net> writes:
>>> >>
>>> >>> I haven't said much about this series up until now.  I've mostly sat
>>> >>> and watched and focused my time on other things.  As I said in the
>>> >>> meeting today, I think that part of the problem here is that there are
>>> >>> at least 3 "refactors" in here besides the new feature.  By
>>> >>> "refactors", I mean "new ways of solving problem X".  Part of the
>>> >>> problem with the discussion is that all of this has been conflated and
>>> >>> we aren't actually talking about how to solve each of those problems.
>>> >>> I'm going to try and break it out here to try and aid in the
>>> >>> discussion.
>>> >>>
>>> >> Thanks a lot Jason for giving a more constructive turn to this
>>> >> discussion. :)
>>> >>
>>> >>> 1) The builder.  I think *everyone* likes the builder.  The argument
>>> >>> here is not over whether or not we want it.  People have expressed
>>> >>> concern that adding the builder now without actually doing the
>>> >>> refactor will result in duplicated code that will get out-of-sync.  I
>>> >>> think the best solution here is to go ahead and do the builder stuff
>>> >>> immediately after the branch point.
>>> >>>
>>> >> Agreed.
>>> >>
>>> >>> 2) SIMD8 instruction splitting.  As I said in the call, we've done
>>> >>> this a variety of ways in the past usually by emitting the split
>>> >>> instructions in the visitor and manually using half() and
>>> >>> inst->force_sechalf.  We also have a few places that I think still
>>> >>> split the code in the generator.  What you're doing is a somewhat more
>>> >>> elegant version of the "emit it in the visitor" stuff that we've done
>>> >>> in the past.  You've got your zip/unzip functions etc. to handle
>>> >>> splitting and combining and then you have a loop to actually emit the
>>> >>> sends.  Really, it's fairly nice and concise.  The question is whether
>>> >>> or not it's really what we want to do (more on this in a bit).
>>> >>>
>>> >> There is another closely related problem that this infrastructure
>> solves
>>> >> and may not be obvious in this series (it was in my original branch
>>> >> though [1]), but I think it's worth keeping in mind, I'll call it 2.1:
>>> >> Some of the surface messages lacked SIMD4x2 variants, so a strided copy
>>> >> is required to transpose the original AoS vector into SoA form.  This
>>> >> turns out to be algorithmically identical to the strided copy required
>>> >> to "unzip" a SIMD16 vector, so it can be accomplished without
>> additional
>>> >> effort.
>>> >>
>>> >> These transformations are largely transparent for the functions
>> building
>>> >> typed/untyped surface messages, they are hidden behind an interface
>> that
>>> >> doesn't seem to have received any attention yet:
>>> >>
>>> >>  - struct vector_layout: This represents the layout of a vector in a
>>> >>    payload.  It's initialized based on the restrictions of the hardware
>>> >>    for the specific message we want to send, and the current code
>>> >>    generation parameters (i.e. SIMD mode).
>>> >>
>>> >>  - emit_insert(): This takes a vector with the layout native to the EU
>>> >>    (e.g. an fs_reg) and performs the required transformations to turn
>> it
>>> >>    into a form suitable for the shared unit according to a
>> vector_layout
>>> >>    struct (what may involve unzipping, a strided copy or no work at
>>> >>    all).
>>> >>
>>> >>  - emit_extract(): The converse of emit_insert().  This takes a payload
>>> >>    fragment (array_reg) and transforms it into a vector of native
>>> >>    layout, according to a vector_layout struct.
>>> >>
>>> >> My intention was to come up with an interface more general than SIMD16
>>> >> vector splitting and SIMD4x2 to SIMD8 conversion alone, I was also
>>> >> thinking texturing instruction payload quirks (which might require
>>> >> padding or different permutations of the same values depending on the
>>> >> generation even if the SIMD mode is natively supported -- let's call
>>> >> this 2.2) and framebuffer writes (2.3).
>>> >
>>> > Thank you for pointing out what else you were trying to do with it.  I
>>> > understand that you're trying to solve a dozen different problems and
>>> > that the infrastructure you've created is one that you think solves
>>> > all of them at the same time.  It helps to know what all problems you
>>> > were trying to solve.  However, since we are putting SIMD4x2 aside for
>>> > now, in the interest of keeping us from getting side-tracked, let's
>>> > table this for now.  It's a problem that may need solving, but it's a
>>> > *different* problem.
>>> >
>>> If it can be solved with the same approach, it is a closely related
>>> problem, isn't it?  And wouldn't it be a bit silly to knowingly
>>> implement a less general solution when the more general solution
>>> involves *less* work and while we know we'll want to address the
>>> remaining problems solved by the more general solution eventually? ;)
>>
>> Perhaps, but trying to solve all the problems at the same time is why the
>> conversation has gone nowhere on the past.  We need to keep things split
>> out.
>>
>
>
>>> >>> 3) Array reg.  This is something of a helper to (2).
>>> >>
>>> >> In fact the primary motivation for array_reg wasn't instruction
>>> >> splitting, but emit_collect(), more on that later.
>>> >>
>>> >>> I agree that it's nice to be able to have access to the size of a
>>> >>> register without having to look at the allocator.  The problem here
>>> >>> (and why we don't have a size in fs_reg) is that we copy the reigster
>>> >>> around.  Every register does have some sort of "underlying storage"
>>> >>> that it shares with other registers with the same number.  But that
>>> >>> storage is represented by the size in the allocator and nothing else.
>>> >>> Is that bad?  Maybe.  Should we have a way of passing the size around
>>> >>> with it?  Maybe.  In any case, I think it's best to table this
>>> >>> discussion until we've talked a bit about (2) because I think the
>>> >>> resolution there will make (3) more clear.
>>> >>>
>>> >> I think the mechanism we use to determine the size of a register hardly
>>> >> ever involves the allocator, and there are good reasons not to rely on
>>> >> it: It only works for VGRFs, can easily break with register coalescing,
>>> >> and it makes "slicing" and other zero-copy transformations of registers
>>> >> difficult (e.g. taking an individual element out of a VGRF array
>> without
>>> >> copying it into a one-component register).  Typically the size of a
>>> >> register is implicitly assumed by the instruction using it, and where
>>> >> it's not (because the instruction expects a variable length payload) it
>>> >> is provided separately from the register via "mlen".  Neither of these
>>> >> two options works here as will be obvious shortly.
>>> >
>>> > Understood.
>>> >
>>> >>> One note on 3 before we table it.  People have been pulling
>>> >>> emit_collect out as a straw-man and beating on it but I really do like
>>> >>> it in principle.  There are piles of times where you want a payload
>>> >>> and you have to allocate an array, do stuff, and then put it into the
>>> >>> payload and it's really annoying.  It would be much easier if we had a
>>> >>> helper that just did it all in one function and that's what
>>> >>> emit_collect tries to do.  *Thank you* for finally trying to solve
>>> >>> that problem and make payloads less painful!  However, I do wish it
>>> >>> were implemented a bit differently but I'll make those comments on the
>>> >>> patch itself.
>>> >>>
>>> >>
>>> >> Yeah.  I think it was Matt who mentioned in the call that
>> emit_collect()
>>> >> is really complex.  I completely agree with that statement.  It needs
>> to
>>> >> take care of a bunch of things to build the payload correctly:
>>> >>
>>> >>  - Allocate a register that will hold the result, calculating the
>>> >>    correct total size based on the amount of data we want to
>>> >>    concatenate.
>>> >>
>>> >>  - Calculate the correct header size.
>>> >>
>>> >>  - Allocate and release memory of the correct size to hold the array of
>>> >>    LOAD_PAYLOAD sources, also based on the amount of data we want to
>>> >>    concatenate.
>>> >>
>>> >>  - Split up the arguments to be passed to LOAD_PAYLOAD as sources in
>>> >>    scalars of the correct width (which depends on whether the argument
>>> >>    is part of the header or not, and what the dispatch width is), and
>>> >>    initialize the array.
>>> >>
>>> >>  - Create a LOAD_PAYLOAD instruction with the result of the previous
>>> >>    steps.
>>> >>
>>> >> Each one of these tasks is non-trivial, repetitive, can conceivably go
>>> >> wrong, and until now duplicated wherever we had to build a payload --
>>> >> obscuring the actually meaningful work being done.  This is why I think
>>> >> that the assertion that emit_collect() makes code difficult to
>>> >> understand is backwards.  Thanks to emit_collect() the task of building
>>> >> a payload is tremendously simplified.
>>> >
>>> > Yes.  I really don't think people are disagreeing with you too badly
>>> > about emit_collect().  I also don't think it's as complicated as you
>>> > make it sound.
>>> >
>>> Heh, maybe "tremendously" wasn't the word I was looking for, but it's
>>> appreciably simplified. :P
>>>
>>> >> For emit_collect() to be possible there has to be some way to represent
>>> >> a register+size pair, because it needs to know the size of each
>>> >> argument, and because its result is itself a register+size pair -- The
>>> >> caller cannot do anything useful with a payload of unknown size, unless
>>> >> it re-calculates the size of the result what would defeat much of the
>>> >> purpose of emit_collect().
>>> >>
>>> >> The biggest benefit from being able to represent a register+size pair
>> as
>>> >> a value is that emit_collect() can easily be composed with other
>>> >> functions (including with itself), so the construction of specific
>>> >> chunks of the payload can easily be decoupled and re-used.  My code
>>> >> takes advantage of this feature extensively in order to share code that
>>> >> constructs surface message headers or transforms some value into the
>>> >> form expected by the shared unit (doing vector splitting or a strided
>>> >> copy).
>>> >
>>> > Yes, I understand that it's useful for that.  Hence the suggestion of
>>> > either a reg/size pair struct or passing the size as an argument.  The
>>> > struct has the advantage, as you said, of allowing chaining.
>>> >
>>> >>> ### SIMD16 Instruction Splitting ###
>>> >>>
>>> >>> SIMD16 instruction splitting is an unfortunate fact of our hardware.
>>> >>> There are a variety of times we have to do it including dual-source FB
>>> >>> writes, some texturing, math ops on older gens and maybe another place
>>> >>> or two.  Classically, this has been done in one of two places: The
>>> >>> visitor as we emit the code, or the generator.  The problem with doing
>>> >>> it in the generator is that we can't schedule it and, if it involves a
>>> >>> payload, it's not really possible.  The result is that we usually do
>>> >>> it in the visitor these days.
>>> >>>
>>> >>> Unfortunately, even in the visitor, it's gen-specific and annoying.
>>> >>> It gets even worse when you're working with something such as the
>>> >>> untyped surface read/write messages that work with multi-component
>>> >>> values that have to be zipped/unzipped to use Curro's terminology.
>>> >>> Curro came up with some helper functions to make it substantially less
>>> >>> annoying but it still involves nasty looping.
>>> >>>
>>> >>> At some point in the past I proposed a completely different and more
>>> >>> solution to this problem.  Unfortunately, while I've talked to Matt &
>>> >>> Ken about it, it's never really been discussed all that publicly so
>>> >>> Curro may not be aware of it.  I'm going to lay it out here for
>>> >>> Curro's sake as well as the sake of public record.
>>> >>>
>>> >>> The solution involves first changing the way we handle sends into a
>>> >>> two step process.  First, we emit a logical instruction that contains
>>> >>> all of the data needed for the actual instruction.  Then, we convert
>>> >>> from the logical to the actual in a lowering pass.  Take, for example,
>>> >>> FB writes with which I am fairly familiar.  We would first emit a
>>> >>> logical FS_FB_WRITE_# instruction that has separate sources for color,
>>> >>> depth, replicated alpha, etc.   Then, in the lower_fb_writes pass
>>> >>> (which we would have to implement), we would construct the payload
>>> >>> from the sources provided on the logical instruction and emit the
>>> >>> actual LOAD_PAYLOAD and FB_WRITE instructions.  This lower_fb_writes
>>> >>> function would then get called before the optimization loop so that
>>> >>> the rest of the optimization could would never see it.
>>> >>>
>>> >>> Second, we add a split_instruction helper that would take a SIMD16
>>> >>> instruction and naively split it into two SIMD8 instructions.  Such a
>>> >>> helper really shouldn't be that hard to write.  It would have to know
>>> >>> how to take a SIMD16 vec4 and unzip it into two SIMD8 vec4's but that
>>> >>> shouldn't be bad.  Any of these new logical send instructions would
>>> >>> have their values as separate sources so they should be safe to split.
>>> >>>
>>> >>> Third, we add a lower_simd16_to_simd8 pass that walks the
>>> >>> instructions, picks out the ones that need splitting, and calls
>>> >>> split_instruction on them.  All of the gen-specific SIMD8 vs. SIMD16
>>> >>> knowledge would be contained in this one pass.  This pass would happen
>>> >>> between actually emitting code and running lower_fb_writes (or
>>> >>> whatever other send lowering passes we have).
>>> >>>
>>> >>> Finally, and this is the icing on the cake, we write a
>>> >>> lower_simd32_to_simd16 pass that goes through and lowers all SIMD32
>>> >>> instructions (that act on full-sized data-types) to SIMD16
>>> >>> instructions.  Once the rest of the work is done, we get this pass,
>>> >>> and with it SIMD32 mode, almost for free.
>>> >>>
>>> >>> I know this approach looks like more work and, to be honest, it may
>>> >>> be.  However, I think it makes a lot of things far more
>>> >>> straightforward.  In particular, it means that people working on the
>>> >>> visitor code don't have to think about whether or not an instruction
>>> >>> needs splitting.  You also don't have to deal with the complexity of
>>> >>> zipping/unzipping sources every time.  Instead, we put all that code
>>> >>> in one place and get to stop thinking about it.  Also, if we *ever*
>>> >>> want to get SIMD32, we will need some sort of automatic instruction
>>> >>> splitting and this seems like a reasonable way to do it.
>>> >>>
>>> >>> I've talked to Ken about this approach and he's 100% on-board.  I
>>> >>> don't remember what Matt thinks of it.  If we like the approach, then
>>> >>> we should just split the tasks up and make it happen.  It's a bit of
>>> >>> refactoring but it shouldn't be terrible.  If we wanted to demo it, I
>>> >>> would probably suggest starting with FB writes as those are fairly
>>> >>> complex but yet self-contained.  They also have a case where we do
>>> >>> split an instruction so it would be interesting to see what the code
>>> >>> looks like before and after.
>>> >>>
>>> >>
>>> >> I generally like your proposal.  I guess the question we need to answer
>>> >> is whether we want this complexity to be in a lowering pass or in a
>>> >> helper function used to build the send-like instruction -- In either
>>> >> case we need code to handle zipping and unzipping of SIMD16 vectors,
>>> >> it's just about whether this code is called by a lowering pass or
>> higher
>>> >> up in the visitor.
>>> >>
>>> >> I can think of several benefits of the approach you propose over mine:
>>> >>
>>> >>  - It's more transparent for the visitor code emitting the message -- I
>>> >>    completely agree with you that the explicit loops are rather ugly.
>>> >>
>>> >>  - Instructions with explicit separate sources are likely to be more
>>> >>    suitable for certain optimization passes.  Pull constant loads use a
>>> >>    similar approach with an expression-style opcode which is at some
>>> >>    point lowered to a load payload and send message.  This may not be
>>> >>    terribly important at this point because of the optimizations
>> already
>>> >>    performed in GLSL IR and NIR and due to the nature of the majority
>> of
>>> >>    opcodes that don't support SIMD16, but it still seems appealing.
>>> >
>>> > There's one more really big one that you missed:
>>> >
>>> > It scales!  We can't afford to have a for loop for every ADD and MUL
>>> > instruction.  Sure, we might be able to afford it on sends, but not
>>> > for everything.
>>> >
>>> Well, I doubt you'd want to implement your proposal to the letter for
>>> non-send instructions: For those we don't need separate lowered and
>>> non-lowered instructions because there's no payload to assemble, so we
>>> can just do with a single opcode with different execution widths.  When
>>> you take payloads out of the picture all the disadvantages mentioned of
>>> the lowering pass approach no longer apply, so I totally agree with you
>>> that we want a general lowering pass to lower instructions that expect
>>> their arguments as separate sources in their final form.  In any case
>>> send-like instructions need a somewhat different (and less scalable)
>>> treatment.
>>>
>>> >> Some disadvantages come to my mind too:
>>> >>
>>> >>  - It increases the amount of work required to add a new send-like
>>> >>    instruction because you need lowered and unlowered variants for
>> each.
>>> >>
>>> >>  - It seems tricky to get right when splitting an instruction in halves
>>> >>    involves changing the actual contents of the payload beyond zipping
>>> >>    and unzipping its arguments -- This might not seem like a big deal
>>> >>    right now, but it will be a problem when we implement SIMD32.  The
>>> >>    surface messages that take a sample mask as argument are a good
>>> >>    example, because they only have 16 bits of space for it so you
>>> >>    actually need to provide different values depending on the "slot
>>> >>    group" the message is meant for.  This can be worked around easily
>> in
>>> >>    the visitor by shifting the sample mask register but it seems harder
>>> >>    to fix up later.
>>> >
>>> > Why do sample masks need to be part of the logical instruction?  Can't
>>> > we figure that out when we lower from logical to physical based on the
>>> > quarter control?
>>> >
>>> You can surely do anything you want during the logical-to-physical
>>> conversion, including rewriting the header, the problem is that it that
>>> forces you to have a pile of message-specific handling code in the
>>> lowering pass.  How are you planning to address that?  With a separate
>>> lowering pass for each message opcode or a general one with
>>> message-specific knowledge?
>>
>> Yes, the lowering pass will have *all* of the message-specific
>> information.  Probably broken out into helper functions exactly the way the
>> message emit code is broken out now.  The lowering pass then just knows
>> what helper to call for what instruction.
>>
> I think I only buy your proposal if it saves us more work than it
> creates in the long term, e.g. by using general splitting and payload
> assembly algorithms shared among all opcodes with minimal
> message-specific information.  Otherwise what you are describing sounds
> like a "bureaucratic" variant of my proposal, with lowered and unlowered
> versions of each opcode and with the payload assembly code (functionally
> almost the same as mine) hidden behind a lowering pass under a
> switch-case statement instead of being called up front.
>

I've given this idea a shot.  Can you have a look at the
image-load-store-lower branch of my tree [1]?  It's just a quick and
dirty proof of concept, so don't bother to review it carefully, just let
me know if you agree with the general design before I spend more time on
it.

[1] http://cgit.freedesktop.org/~currojerez/mesa/log/?h=image-load-store-lower

>>> >>  - My intention was to handle vector layout peculiarities beyond
>>> >>    SIMDN-to-SIMD8 splitting under a consistent framework, leaving it
>>> >>    under the control of the helper function that builds the message
>>> >>    which quirks to apply.  With the lowering pass approach you'd either
>>> >>    need message-specific lowering passes, message-specific code in a
>>> >>    general lowering pass, or some way for the builder code to pass
>>> >>    vector_layout-like metadata to the lowering pass -- In the first two
>>> >>    cases the policy on how to format each source in the payload would
>> be
>>> >>    "delocalized" between the visitor and lowering pass, what I don't
>>> >>    particularly like.
>>> >
>>> > We already do that and we would continue to.  The logical instruction
>>> > wouldn't be much more than a serialized version of the helper
>>> > function.
>>> >
>>> > Thanks for your reply.
>>> > --Jason
>>> >
>>> >>> Thoughts?
>>> >>>
>>> >>> Question for Curro: Supposing for the moment that we decide to do
>>> >>> SIMD16 -> SIMD8 splitting this way, how much of the array_reg stuff
>>> >>> would still be needed for image_load_store?
>>> >>>
>>> >> I think I've already answered this question, but I'll answer it again
>>> >> here for clarity -- It would still help with building and passing
>> around
>>> >> payloads concisely using emit_collect(), regardless of how we implement
>>> >> SIMD16 instruction splitting.
>>> >>
>>> >>> I hope this e-mail provides us with some good talking points and a
>> way forward.
>>> >>> --Jason
>>> >>
>>> >> Thanks Jason!
>>> >>
>>> >> [1]
>> http://cgit.freedesktop.org/~currojerez/mesa/log/?h=image-load-store
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 212 bytes
Desc: not available
URL: <http://lists.freedesktop.org/archives/mesa-dev/attachments/20150518/27a0f431/attachment.sig>