[Mesa-dev] i965 implementation of the ARB_shader_image_load_store built-ins. (v3)
Francisco Jerez
currojerez at riseup.net
Fri May 15 14:39:26 PDT 2015
Jason Ekstrand <jason at jlekstrand.net> writes:
> On Fri, May 15, 2015 at 9:51 AM, Francisco Jerez <currojerez at riseup.net> wrote:
>> Jason Ekstrand <jason at jlekstrand.net> writes:
>>
>>> I haven't said much about this series up until now. I've mostly sat
>>> and watched and focused my time on other things. As I said in the
>>> meeting today, I think that part of the problem here is that there are
>>> at least 3 "refactors" in here besides the new feature. By
>>> "refactors", I mean "new ways of solving problem X". Part of the
>>> problem with the discussion is that all of this has been conflated and
>>> we aren't actually talking about how to solve each of those problems.
>>> I'm going to try and break it out here to try and aid in the
>>> discussion.
>>>
>> Thanks a lot Jason for giving a more constructive turn to this
>> discussion. :)
>>
>>> 1) The builder. I think *everyone* likes the builder. The argument
>>> here is not over whether or not we want it. People have expressed
>>> concern that adding the builder now without actually doing the
>>> refactor will result in duplicated code that will get out-of-sync. I
>>> think the best solution here is to go ahead and do the builder stuff
>>> immediately after the branch point.
>>>
>> Agreed.
>>
>>> 2) SIMD8 instruction splitting. As I said in the call, we've done
>>> this a variety of ways in the past usually by emitting the split
>>> instructions in the visitor and manually using half() and
>>> inst->force_sechalf. We also have a few places that I think still
>>> split the code in the generator. What you're doing is a somewhat more
>>> elegant version of the "emit it in the visitor" stuff that we've done
>>> in the past. You've got your zip/unzip functions etc. to handle
>>> splitting and combining and then you have a loop to actually emit the
>>> sends. Really, it's fairly nice and concise. The question is whether
>>> or not it's really what we want to do (more on this in a bit).
>>>
>> There is another closely related problem that this infrastructure solves
>> which may not be obvious in this series (it was addressed in my original
>> branch [1]), but I think it's worth keeping in mind. I'll call it 2.1:
>> Some of the surface messages lacked SIMD4x2 variants, so a strided copy
>> is required to transpose the original AoS vector into SoA form. This
>> turns out to be algorithmically identical to the strided copy required
>> to "unzip" a SIMD16 vector, so it can be accomplished without additional
>> effort.
>>
>> These transformations are largely transparent for the functions building
>> typed/untyped surface messages, they are hidden behind an interface that
>> doesn't seem to have received any attention yet:
>>
>> - struct vector_layout: This represents the layout of a vector in a
>> payload. It's initialized based on the restrictions of the hardware
>> for the specific message we want to send, and the current code
>> generation parameters (i.e. SIMD mode).
>>
>> - emit_insert(): This takes a vector with the layout native to the EU
>> (e.g. an fs_reg) and performs the required transformations to turn it
>> into a form suitable for the shared unit according to a vector_layout
>> struct (which may involve unzipping, a strided copy or no work at
>> all).
>>
>> - emit_extract(): The converse of emit_insert(). This takes a payload
>> fragment (array_reg) and transforms it into a vector of native
>> layout, according to a vector_layout struct.
>>
>> My intention was to come up with an interface more general than SIMD16
>> vector splitting and SIMD4x2 to SIMD8 conversion alone; I was also
>> thinking of texturing instruction payload quirks (which might require
>> padding or different permutations of the same values depending on the
>> generation even if the SIMD mode is natively supported -- let's call
>> this 2.2) and framebuffer writes (2.3).
>
> Thank you for pointing out what else you were trying to do with it. I
> understand that you're trying to solve a dozen different problems and
> that the infrastructure you've created is one that you think solves
> all of them at the same time. It helps to know what all problems you
> were trying to solve. However, since we are putting SIMD4x2 aside for
> now, in the interest of keeping us from getting side-tracked, let's
> table this for now. It's a problem that may need solving, but it's a
> *different* problem.
>
If it can be solved with the same approach, it is a closely related
problem, isn't it? And wouldn't it be a bit silly to knowingly
implement a less general solution when the more general solution
involves *less* work, and we know we'll want to address the remaining
problems solved by the more general solution eventually? ;)
>>> 3) Array reg. This is something of a helper to (2).
>>
>> In fact the primary motivation for array_reg wasn't instruction
>> splitting, but emit_collect(), more on that later.
>>
>>> I agree that it's nice to be able to have access to the size of a
>>> register without having to look at the allocator. The problem here
>>> (and why we don't have a size in fs_reg) is that we copy the register
>>> around. Every register does have some sort of "underlying storage"
>>> that it shares with other registers with the same number. But that
>>> storage is represented by the size in the allocator and nothing else.
>>> Is that bad? Maybe. Should we have a way of passing the size around
>>> with it? Maybe. In any case, I think it's best to table this
>>> discussion until we've talked a bit about (2) because I think the
>>> resolution there will make (3) more clear.
>>>
>> I think the mechanism we use to determine the size of a register hardly
>> ever involves the allocator, and there are good reasons not to rely on
>> it: It only works for VGRFs, can easily break with register coalescing,
>> and it makes "slicing" and other zero-copy transformations of registers
>> difficult (e.g. taking an individual element out of a VGRF array without
>> copying it into a one-component register). Typically the size of a
>> register is implicitly assumed by the instruction using it, and where
>> it's not (because the instruction expects a variable length payload) it
>> is provided separately from the register via "mlen". Neither of these
>> two options works here as will be obvious shortly.
>
> Understood.
>
>>> One note on 3 before we table it. People have been pulling
>>> emit_collect out as a straw-man and beating on it but I really do like
>>> it in principle. There are piles of times where you want a payload
>>> and you have to allocate an array, do stuff, and then put it into the
>>> payload and it's really annoying. It would be much easier if we had a
>>> helper that just did it all in one function and that's what
>>> emit_collect tries to do. *Thank you* for finally trying to solve
>>> that problem and make payloads less painful! However, I do wish it
>>> were implemented a bit differently but I'll make those comments on the
>>> patch itself.
>>>
>>
>> Yeah. I think it was Matt who mentioned in the call that emit_collect()
>> is really complex. I completely agree with that statement. It needs to
>> take care of a bunch of things to build the payload correctly:
>>
>> - Allocate a register that will hold the result, calculating the
>> correct total size based on the amount of data we want to
>> concatenate.
>>
>> - Calculate the correct header size.
>>
>> - Allocate and release memory of the correct size to hold the array of
>> LOAD_PAYLOAD sources, also based on the amount of data we want to
>> concatenate.
>>
>> - Split up the arguments into registers of the correct width to be
>> passed to LOAD_PAYLOAD as sources (the width depends on whether the
>> argument is part of the header or not, and what the dispatch width
>> is), and initialize the source array.
>>
>> - Create a LOAD_PAYLOAD instruction with the result of the previous
>> steps.
>>
>> Each one of these tasks is non-trivial, repetitive, can conceivably go
>> wrong, and was until now duplicated wherever we had to build a payload --
>> obscuring the actually meaningful work being done. This is why I think
>> that the assertion that emit_collect() makes code difficult to
>> understand is backwards. Thanks to emit_collect() the task of building
>> a payload is tremendously simplified.
>
> Yes. I really don't think people are disagreeing with you too badly
> about emit_collect(). I also don't think it's as complicated as you
> make it sound.
>
Heh, maybe "tremendously" wasn't the word I was looking for, but it's
appreciably simplified. :P
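To make the claimed simplification concrete, here is a hypothetical
usage sketch, reusing the stand-in types from the interface sketch
earlier in this mail; the helper names and signatures are approximate
rather than the exact code from my branch:

  /* Approximate shape of the helper: concatenate a header and a number
   * of register+size arguments into a single payload, emitting the
   * LOAD_PAYLOAD behind the scenes and returning the resulting
   * register+size pair. */
  array_reg emit_collect(fs_visitor *v, const array_reg &header,
                         const array_reg &addr, const array_reg &src);

  /* Hypothetical helper that builds a surface message header. */
  array_reg emit_surface_header(fs_visitor *v);

  /* Building e.g. an untyped surface write payload then becomes: */
  array_reg
  emit_untyped_write_payload(fs_visitor *v, const vector_layout &layout,
                             const fs_reg &addr, const fs_reg &src,
                             unsigned dims, unsigned components)
  {
     return emit_collect(v, emit_surface_header(v),
                         emit_insert(v, layout, addr, dims),
                         emit_insert(v, layout, src, components));
  }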
>> For emit_collect() to be possible there has to be some way to represent
>> a register+size pair, because it needs to know the size of each
>> argument, and because its result is itself a register+size pair -- The
>> caller cannot do anything useful with a payload of unknown size, unless
>> it re-calculates the size of the result, which would defeat much of the
>> purpose of emit_collect().
>>
>> The biggest benefit from being able to represent a register+size pair as
>> a value is that emit_collect() can easily be composed with other
>> functions (including with itself), so the construction of specific
>> chunks of the payload can easily be decoupled and re-used. My code
>> takes advantage of this feature extensively in order to share code that
>> constructs surface message headers or transforms some value into the
>> form expected by the shared unit (doing vector splitting or a strided
>> copy).
>
> Yes, I understand that it's useful for that. Hence the suggestion of
> either a reg/size pair struct or passing the size as an argument. The
> struct has the advantage, as you said, of allowing chaining.
>
>>> ### SIMD16 Instruction Splitting ###
>>>
>>> SIMD16 instruction splitting is an unfortunate fact of our hardware.
>>> There are a variety of times we have to do it including dual-source FB
>>> writes, some texturing, math ops on older gens and maybe another place
>>> or two. Classically, this has been done in one of two places: The
>>> visitor as we emit the code, or the generator. The problem with doing
>>> it in the generator is that we can't schedule it and, if it involves a
>>> payload, it's not really possible. The result is that we usually do
>>> it in the visitor these days.
>>>
>>> Unfortunately, even in the visitor, it's gen-specific and annoying.
>>> It gets even worse when you're working with something such as the
>>> untyped surface read/write messages that work with multi-component
>>> values that have to be zipped/unzipped to use Curro's terminology.
>>> Curro came up with some helper functions to make it substantially less
>>> annoying but it still involves nasty looping.
>>>
>>> At some point in the past I proposed a completely different solution
>>> to this problem. Unfortunately, while I've talked to Matt &
>>> Ken about it, it's never really been discussed all that publicly so
>>> Curro may not be aware of it. I'm going to lay it out here for
>>> Curro's sake as well as the sake of public record.
>>>
>>> The solution involves first changing the way we handle sends into a
>>> two step process. First, we emit a logical instruction that contains
>>> all of the data needed for the actual instruction. Then, we convert
>>> from the logical to the actual in a lowering pass. Take, for example,
>>> FB writes with which I am fairly familiar. We would first emit a
>>> logical FS_FB_WRITE_# instruction that has separate sources for color,
>>> depth, replicated alpha, etc. Then, in the lower_fb_writes pass
>>> (which we would have to implement), we would construct the payload
>>> from the sources provided on the logical instruction and emit the
>>> actual LOAD_PAYLOAD and FB_WRITE instructions. This lower_fb_writes
>>> function would then get called before the optimization loop so that
>>> the rest of the optimization code would never see it.
>>>
>>> Second, we add a split_instruction helper that would take a SIMD16
>>> instruction and naively split it into two SIMD8 instructions. Such a
>>> helper really shouldn't be that hard to write. It would have to know
>>> how to take a SIMD16 vec4 and unzip it into two SIMD8 vec4's but that
>>> shouldn't be bad. Any of these new logical send instructions would
>>> have their values as separate sources so they should be safe to split.
>>>
>>> Third, we add a lower_simd16_to_simd8 pass that walks the
>>> instructions, picks out the ones that need splitting, and calls
>>> split_instruction on them. All of the gen-specific SIMD8 vs. SIMD16
>>> knowledge would be contained in this one pass. This pass would happen
>>> between actually emitting code and running lower_fb_writes (or
>>> whatever other send lowering passes we have).
>>>
>>> Finally, and this is the icing on the cake, we write a
>>> lower_simd32_to_simd16 pass that goes through and lowers all SIMD32
>>> instructions (that act on full-sized data-types) to SIMD16
>>> instructions. Once the rest of the work is done, we get this pass,
>>> and with it SIMD32 mode, almost for free.
>>>
>>> I know this approach looks like more work and, to be honest, it may
>>> be. However, I think it makes a lot of things far more
>>> straightforward. In particular, it means that people working on the
>>> visitor code don't have to think about whether or not an instruction
>>> needs splitting. You also don't have to deal with the complexity of
>>> zipping/unzipping sources every time. Instead, we put all that code
>>> in one place and get to stop thinking about it. Also, if we *ever*
>>> want to get SIMD32, we will need some sort of automatic instruction
>>> splitting and this seems like a reasonable way to do it.
>>>
>>> I've talked to Ken about this approach and he's 100% on-board. I
>>> don't remember what Matt thinks of it. If we like the approach, then
>>> we should just split the tasks up and make it happen. It's a bit of
>>> refactoring but it shouldn't be terrible. If we wanted to demo it, I
>>> would probably suggest starting with FB writes as those are fairly
>>> complex but yet self-contained. They also have a case where we do
>>> split an instruction so it would be interesting to see what the code
>>> looks like before and after.
>>>
>>
>> I generally like your proposal. I guess the question we need to answer
>> is whether we want this complexity to be in a lowering pass or in a
>> helper function used to build the send-like instruction -- In either
>> case we need code to handle zipping and unzipping of SIMD16 vectors,
>> it's just about whether this code is called by a lowering pass or higher
>> up in the visitor.
>>
>> I can think of several benefits of the approach you propose over mine:
>>
>> - It's more transparent for the visitor code emitting the message -- I
>> completely agree with you that the explicit loops are rather ugly.
>>
>> - Instructions with explicit separate sources are likely to be more
>> suitable for certain optimization passes. Pull constant loads use a
>> similar approach with an expression-style opcode which is at some
>> point lowered to a load payload and send message. This may not be
>> terribly important at this point because of the optimizations already
>> performed in GLSL IR and NIR and due to the nature of the majority of
>> opcodes that don't support SIMD16, but it still seems appealing.
>
> There's one more really big one that you missed:
>
> It scales! We can't afford to have a for loop for every ADD and MUL
> instruction. Sure, we might be able to afford it on sends, but not
> for everything.
>
Well, I doubt you'd want to implement your proposal to the letter for
non-send instructions: For those we don't need separate lowered and
non-lowered instructions because there's no payload to assemble, so we
can make do with a single opcode used at different execution widths.
When you take payloads out of the picture, all the disadvantages of the
lowering pass approach I mentioned no longer apply, so I totally agree
with you that we want a general lowering pass for instructions that
already expect their arguments as separate sources in their final form.
In any case, send-like instructions need a somewhat different (and less
scalable) treatment.
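Something along these lines is what I have in mind for the non-send
case -- a purely hypothetical skeleton, not code from any existing
branch; required_exec_width() and emit_narrowed_copy() are made-up
helpers:

  /* Hypothetical SIMD-width lowering pass for instructions whose
   * sources are already in their final (non-payload) form.
   * required_exec_width() would encapsulate the gen-specific SIMD8
   * vs. SIMD16 (and eventually SIMD32) restrictions, and
   * emit_narrowed_copy() would emit one copy of the instruction acting
   * on the i-th group of channels, taking the corresponding halves of
   * its sources and destination. */
  bool
  lower_simd_width(fs_visitor *v)
  {
     bool progress = false;

     foreach_block_and_inst_safe(block, fs_inst, inst, v->cfg) {
        const unsigned max_width = required_exec_width(v->devinfo, inst);

        if (inst->exec_size <= max_width)
           continue;

        for (unsigned i = 0; i < inst->exec_size / max_width; i++)
           emit_narrowed_copy(v, block, inst, i, max_width);

        inst->remove(block);
        progress = true;
     }

     return progress;
  }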
>> Some disadvantages come to my mind too:
>>
>> - It increases the amount of work required to add a new send-like
>> instruction because you need lowered and unlowered variants for each.
>>
>> - It seems tricky to get right when splitting an instruction in half
>> involves changing the actual contents of the payload beyond zipping
>> and unzipping its arguments -- This might not seem like a big deal
>> right now, but it will be a problem when we implement SIMD32. The
>> surface messages that take a sample mask as argument are a good
>> example, because they only have 16 bits of space for it so you
>> actually need to provide different values depending on the "slot
>> group" the message is meant for. This can be worked around easily in
>> the visitor by shifting the sample mask register but it seems harder
>> to fix up later.
>
> Why do sample masks need to be part of the logical instruction? Can't
> we figure that out when we lower from logical to physical based on the
> quarter control?
>
You can surely do anything you want during the logical-to-physical
conversion, including rewriting the header; the problem is that it
forces you to have a pile of message-specific handling code in the
lowering pass. How are you planning to address that? With a separate
lowering pass for each message opcode, or a general one with
message-specific knowledge?
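Just to illustrate the kind of message-specific fix-up I mean, assuming
a builder interface along the lines of the one proposed in this series
(the helper below is hypothetical, not actual code from my branch): in
the visitor it's easy to pick the right half of the sample mask for the
slot group a given half of the message is meant for, e.g.:

  /* Hypothetical helper: the header only has 16 bits of space for the
   * sample mask, so the second slot group (channels 16-31) of a SIMD32
   * program would need the upper half of the mask shifted down before
   * it's written into the header. */
  fs_reg
  sample_mask_for_group(const fs_builder &bld, const fs_reg &mask,
                        unsigned group)
  {
     const fs_reg tmp = bld.vgrf(BRW_REGISTER_TYPE_UD);

     if (group == 0)
        bld.MOV(tmp, mask);
     else
        bld.SHR(tmp, mask, fs_reg(16u)); /* select channels 16..31 */

     return tmp;
  }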
>> - My intention was to handle vector layout peculiarities beyond
>> SIMDN-to-SIMD8 splitting under a consistent framework, leaving the
>> decision of which quirks to apply under the control of the helper
>> function that builds the message. With the lowering pass approach
>> you'd either need message-specific lowering passes, message-specific
>> code in a general lowering pass, or some way for the builder code to
>> pass vector_layout-like metadata to the lowering pass -- in the first
>> two cases the policy on how to format each source in the payload
>> would be "delocalized" between the visitor and the lowering pass,
>> which I don't particularly like.
>
> We already do that and we would continue to. The logical instruction
> wouldn't be much more than a serialized version of the helper
> function.
>
> Thanks for your reply.
> --Jason
>
>>> Thoughts?
>>>
>>> Question for Curro: Supposing for the moment that we decide to do
>>> SIMD16 -> SIMD8 splitting this way, how much of the array_reg stuff
>>> would still be needed for image_load_store?
>>>
>> I think I've already answered this question, but I'll answer it again
>> here for clarity -- It would still help with building and passing around
>> payloads concisely using emit_collect(), regardless of how we implement
>> SIMD16 instruction splitting.
>>
>>> I hope this e-mail provides us with some good talking points and a way forward.
>>> --Jason
>>
>> Thanks Jason!
>>
>> [1] http://cgit.freedesktop.org/~currojerez/mesa/log/?h=image-load-store