[Mesa-dev] i965 implementation of the ARB_shader_image_load_store built-ins. (v3)
Francisco Jerez
currojerez at riseup.net
Sat May 16 03:39:20 PDT 2015
Jason Ekstrand <jason at jlekstrand.net> writes:
> On May 15, 2015 2:40 PM, "Francisco Jerez" <currojerez at riseup.net> wrote:
>>
>> Jason Ekstrand <jason at jlekstrand.net> writes:
>>
>> > On Fri, May 15, 2015 at 9:51 AM, Francisco Jerez <currojerez at riseup.net> wrote:
>> >> Jason Ekstrand <jason at jlekstrand.net> writes:
>> >>
>> >>> I haven't said much about this series up until now. I've mostly sat
>> >>> and watched and focused my time on other things. As I said in the
>> >>> meeting today, I think that part of the problem here is that there are
>> >>> at least 3 "refactors" in here besides the new feature. By
>> >>> "refactors", I mean "new ways of solving problem X". Part of the
>> >>> problem with the discussion is that all of this has been conflated and
>> >>> we aren't actually talking about how to solve each of those problems.
>> >>> I'm going to try and break it out here to try and aid in the
>> >>> discussion.
>> >>>
>> >> Thanks a lot Jason for giving a more constructive turn to this
>> >> discussion. :)
>> >>
>> >>> 1) The builder. I think *everyone* likes the builder. The argument
>> >>> here is not over whether or not we want it. People have expressed
>> >>> concern that adding the builder now without actually doing the
>> >>> refactor will result in duplicated code that will get out-of-sync. I
>> >>> think the best solution here is to go ahead and do the builder stuff
>> >>> immediately after the branch point.
>> >>>
>> >> Agreed.
>> >>
>> >>> 2) SIMD8 instruction splitting. As I said in the call, we've done
>> >>> this a variety of ways in the past usually by emitting the split
>> >>> instructions in the visitor and manually using half() and
>> >>> inst->force_sechalf. We also have a few places that I think still
>> >>> split the code in the generator. What you're doing is a somewhat more
>> >>> elegant version of the "emit it in the visitor" stuff that we've done
>> >>> in the past. You've got your zip/unzip functions etc. to handle
>> >>> splitting and combining and then you have a loop to actually emit the
>> >>> sends. Really, it's fairly nice and concise. The question is whether
>> >>> or not it's really what we want to do (more on this in a bit).
>> >>>
>> >> There is another closely related problem that this infrastructure solves
>> >> and may not be obvious in this series (it was in my original branch
>> >> though [1]), but I think it's worth keeping in mind; I'll call it 2.1:
>> >> Some of the surface messages lacked SIMD4x2 variants, so a strided copy
>> >> is required to transpose the original AoS vector into SoA form. This
>> >> turns out to be algorithmically identical to the strided copy required
>> >> to "unzip" a SIMD16 vector, so it can be accomplished without
> additional
>> >> effort.
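
(For the record, here is a stand-alone model of why the two cases above
reduce to the same operation.  Plain float arrays stand in for registers and
the helper names are made up, so this is only an illustration, not driver
code.)

/* One SIMD8 GRF is modelled as 8 consecutive floats.  Copy 'count' chunks
 * of 'chunk' floats each, reading every 'stride'-th chunk starting at
 * chunk index 'offset'. */
static void
strided_copy(float *dst, const float *src, unsigned chunk,
             unsigned count, unsigned stride, unsigned offset)
{
   for (unsigned i = 0; i < count; i++)
      for (unsigned c = 0; c < chunk; c++)
         dst[i * chunk + c] = src[(offset + i * stride) * chunk + c];
}

/* "Unzipping" half 0 or 1 of a SIMD16 vec4: the vector occupies 8 GRFs with
 * each component taking a pair of GRFs, so a half is every second GRF
 * starting at 'half'. */
static void
unzip_simd16_vec4(float *dst, const float *src, unsigned half)
{
   strided_copy(dst, src, 8, 4, 2, half);
}

/* Transposing an AoS vec4 stream into the SoA form of one component is the
 * same loop at element granularity: stride 4, offset = component index. */
static void
aos_to_soa_component(float *dst, const float *src, unsigned num_vectors,
                     unsigned component)
{
   strided_copy(dst, src, 1, num_vectors, 4, component);
}
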
>> >>
>> >> These transformations are largely transparent for the functions building
>> >> typed/untyped surface messages; they are hidden behind an interface that
>> >> doesn't seem to have received any attention yet:
>> >>
>> >> - struct vector_layout: This represents the layout of a vector in a
>> >> payload. It's initialized based on the restrictions of the hardware
>> >> for the specific message we want to send, and the current code
>> >> generation parameters (i.e. SIMD mode).
>> >>
>> >> - emit_insert(): This takes a vector with the layout native to the EU
>> >>   (e.g. an fs_reg) and performs the required transformations to turn it
>> >>   into a form suitable for the shared unit according to a vector_layout
>> >>   struct (which may involve unzipping, a strided copy or no work at
>> >>   all).
>> >>
>> >> - emit_extract(): The converse of emit_insert(). This takes a payload
>> >> fragment (array_reg) and transforms it into a vector of native
>> >> layout, according to a vector_layout struct.
>> >>
>> >> My intention was to come up with an interface more general than SIMD16
>> >> vector splitting and SIMD4x2 to SIMD8 conversion alone; I was also
>> >> thinking of texturing instruction payload quirks (which might require
>> >> padding or different permutations of the same values depending on the
>> >> generation even if the SIMD mode is natively supported -- let's call
>> >> this 2.2) and framebuffer writes (2.3).
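
(To make the interface above a bit more concrete, roughly what the
declarations look like -- the exact names and signatures in the branch may
differ, so treat these as illustrative only:)

/* Layout a vector must have in a message payload; filled in from the
 * restrictions of the target shared unit and the current SIMD mode. */
struct vector_layout {
   unsigned simd_width;   /* Width the shared unit expects. */
   unsigned halves;       /* How many slices the EU-native vector splits into. */
   bool transpose;        /* Whether an AoS -> SoA strided copy is required. */
};

/* EU-native vector (fs_reg) -> payload fragment (array_reg), applying
 * whatever unzipping or strided copy 'layout' calls for. */
array_reg
emit_insert(const vector_layout &layout, const fs_reg &src, unsigned size);

/* The converse: payload fragment (array_reg) -> EU-native vector. */
fs_reg
emit_extract(const vector_layout &layout, const array_reg &src, unsigned size);
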
>> >
>> > Thank you for pointing out what else you were trying to do with it. I
>> > understand that you're trying to solve a dozen different problems and
>> > that the infrastructure you've created is one that you think solves
>> > all of them at the same time. It helps to know what all problems you
>> > were trying to solve. However, since we are putting SIMD4x2 aside for
>> > now, in the interest of keeping us from getting side-tracked, let's
>> > table this for now. It's a problem that may need solving, but it's a
>> > *different* problem.
>> >
>> If it can be solved with the same approach, it is a closely related
>> problem, isn't it? And wouldn't it be a bit silly to knowingly
>> implement a less general solution when the more general solution
>> involves *less* work and when we know we'll want to address the
>> remaining problems solved by the more general solution eventually? ;)
>
> Perhaps, but trying to solve all the problems at the same time is why the
> conversation has gone nowhere in the past. We need to keep things split
> out.
>
>> >>> 3) Array reg. This is something of a helper to (2).
>> >>
>> >> In fact the primary motivation for array_reg wasn't instruction
>> >> splitting, but emit_collect(), more on that later.
>> >>
>> >>> I agree that it's nice to be able to have access to the size of a
>> >>> register without having to look at the allocator. The problem here
>> >>> (and why we don't have a size in fs_reg) is that we copy the register
>> >>> around. Every register does have some sort of "underlying storage"
>> >>> that it shares with other registers with the same number. But that
>> >>> storage is represented by the size in the allocator and nothing else.
>> >>> Is that bad? Maybe. Should we have a way of passing the size around
>> >>> with it? Maybe. In any case, I think it's best to table this
>> >>> discussion until we've talked a bit about (2) because I think the
>> >>> resolution there will make (3) more clear.
>> >>>
>> >> I think the mechanism we use to determine the size of a register hardly
>> >> ever involves the allocator, and there are good reasons not to rely on
>> >> it: It only works for VGRFs, can easily break with register coalescing,
>> >> and it makes "slicing" and other zero-copy transformations of registers
>> >> difficult (e.g. taking an individual element out of a VGRF array without
>> >> copying it into a one-component register). Typically the size of a
>> >> register is implicitly assumed by the instruction using it, and where
>> >> it's not (because the instruction expects a variable length payload) it
>> >> is provided separately from the register via "mlen". Neither of these
>> >> two options works here as will be obvious shortly.
>> >
>> > Understood.
>> >
>> >>> One note on 3 before we table it. People have been pulling
>> >>> emit_collect out as a straw-man and beating on it but I really do like
>> >>> it in principle. There are piles of times where you want a payload
>> >>> and you have to allocate an array, do stuff, and then put it into the
>> >>> payload and it's really annoying. It would be much easier if we had a
>> >>> helper that just did it all in one function and that's what
>> >>> emit_collect tries to do. *Thank you* for finally trying to solve
>> >>> that problem and make payloads less painful! However, I do wish it
>> >>> were implemented a bit differently but I'll make those comments on the
>> >>> patch itself.
>> >>>
>> >>
>> >> Yeah.  I think it was Matt who mentioned in the call that emit_collect()
>> >> is really complex.  I completely agree with that statement.  It needs to
>> >> take care of a bunch of things to build the payload correctly:
>> >>
>> >> - Allocate a register that will hold the result, calculating the
>> >> correct total size based on the amount of data we want to
>> >> concatenate.
>> >>
>> >> - Calculate the correct header size.
>> >>
>> >> - Allocate and release memory of the correct size to hold the array of
>> >> LOAD_PAYLOAD sources, also based on the amount of data we want to
>> >> concatenate.
>> >>
>> >> - Split up the arguments to be passed to LOAD_PAYLOAD as sources in
>> >> scalars of the correct width (which depends on whether the argument
>> >> is part of the header or not, and what the dispatch width is), and
>> >> initialize the array.
>> >>
>> >> - Create a LOAD_PAYLOAD instruction with the result of the previous
>> >> steps.
>> >>
>> >> Each one of these tasks is non-trivial, repetitive, can conceivably go
>> >> wrong, and until now was duplicated wherever we had to build a payload --
>> >> obscuring the actually meaningful work being done. This is why I think
>> >> that the assertion that emit_collect() makes code difficult to
>> >> understand is backwards. Thanks to emit_collect() the task of building
>> >> a payload is tremendously simplified.
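
(A stand-alone model of those steps, with made-up stand-in types -- a
register is just a number here and an array_reg a number plus a size in
GRFs -- so this only illustrates the bookkeeping, it isn't the actual
fs_visitor code:)

#include <vector>

struct array_reg { unsigned nr; unsigned size; };

struct builder {
   unsigned next_vgrf = 0;
   array_reg alloc(unsigned size) {
      array_reg r = { next_vgrf, size };
      next_vgrf += size;
      return r;
   }
   void load_payload(const array_reg &dst, const std::vector<array_reg> &srcs,
                     unsigned header_size) {
      /* Would emit a single LOAD_PAYLOAD writing all of 'dst'. */
   }
};

static array_reg
collect(builder &bld, const std::vector<array_reg> &args, unsigned header_size)
{
   /* Total payload size is the sum of the argument sizes. */
   unsigned total = 0;
   for (const array_reg &a : args)
      total += a.size;
   const array_reg dst = bld.alloc(total);

   /* Split each argument into single-GRF LOAD_PAYLOAD sources; the first
    * 'header_size' of them get treated as header. */
   std::vector<array_reg> sources;
   for (const array_reg &a : args)
      for (unsigned i = 0; i < a.size; i++)
         sources.push_back(array_reg{ a.nr + i, 1 });

   bld.load_payload(dst, sources, header_size);
   return dst;
}
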
>> >
>> > Yes. I really don't think people are disagreeing with you too badly
>> > about emit_collect(). I also don't think it's as complicated as you
>> > make it sound.
>> >
>> Heh, maybe "tremendously" wasn't the word I was looking for, but it's
>> appreciably simplified. :P
>>
>> >> For emit_collect() to be possible there has to be some way to represent
>> >> a register+size pair, because it needs to know the size of each
>> >> argument, and because its result is itself a register+size pair -- The
>> >> caller cannot do anything useful with a payload of unknown size, unless
>> >> it re-calculates the size of the result, which would defeat much of the
>> >> purpose of emit_collect().
>> >>
>> >> The biggest benefit of being able to represent a register+size pair as
>> >> a value is that emit_collect() can easily be composed with other
>> >> functions (including with itself), so the construction of specific
>> >> chunks of the payload can easily be decoupled and re-used. My code
>> >> takes advantage of this feature extensively in order to share code that
>> >> constructs surface message headers or transforms some value into the
>> >> form expected by the shared unit (doing vector splitting or a strided
>> >> copy).
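
(E.g., with hypothetical helper names, building a complete surface message
payload composes directly from smaller chunks because every value carries
its own size:)

const array_reg payload =
   emit_collect(bld,
                emit_surface_header(bld, sample_mask),   /* header chunk  */
                emit_insert(bld, layout, addr, dims),    /* coordinates   */
                emit_insert(bld, layout, src, size));    /* data to write */
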
>> >
>> > Yes, I understand that it's useful for that. Hence the suggestion of
>> > either a reg/size pair struct or passing the size as an argument. The
>> > struct has the advantage, as you said, of allowing chaining.
>> >
>> >>> ### SIMD16 Instruction Splitting ###
>> >>>
>> >>> SIMD16 instruction splitting is an unfortunate fact of our hardware.
>> >>> There are a variety of times we have to do it including dual-source FB
>> >>> writes, some texturing, math ops on older gens and maybe another place
>> >>> or two. Classically, this has been done in one of two places: The
>> >>> visitor as we emit the code, or the generator. The problem with doing
>> >>> it in the generator is that we can't schedule it and, if it involves a
>> >>> payload, it's not really possible. The result is that we usually do
>> >>> it in the visitor these days.
>> >>>
>> >>> Unfortunately, even in the visitor, it's gen-specific and annoying.
>> >>> It gets even worse when you're working with something such as the
>> >>> untyped surface read/write messages that work with multi-component
>> >>> values that have to be zipped/unzipped to use Curro's terminology.
>> >>> Curro came up with some helper functions to make it substantially less
>> >>> annoying but it still involves nasty looping.
>> >>>
>> >>> At some point in the past I proposed a completely different and more
>> >>> general solution to this problem.  Unfortunately, while I've talked to Matt &
>> >>> Ken about it, it's never really been discussed all that publicly so
>> >>> Curro may not be aware of it. I'm going to lay it out here for
>> >>> Curro's sake as well as the sake of public record.
>> >>>
>> >>> The solution involves first changing the way we handle sends into a
>> >>> two step process. First, we emit a logical instruction that contains
>> >>> all of the data needed for the actual instruction. Then, we convert
>> >>> from the logical to the actual in a lowering pass. Take, for example,
>> >>> FB writes with which I am fairly familiar. We would first emit a
>> >>> logical FS_FB_WRITE_# instruction that has separate sources for color,
>> >>> depth, replicated alpha, etc. Then, in the lower_fb_writes pass
>> >>> (which we would have to implement), we would construct the payload
>> >>> from the sources provided on the logical instruction and emit the
>> >>> actual LOAD_PAYLOAD and FB_WRITE instructions. This lower_fb_writes
>> >>> function would then get called before the optimization loop so that
>> >>> the rest of the optimization code would never see it.
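
(A rough sketch of how I read that first step -- the opcode and helper names
below are invented for illustration, this isn't existing code:)

/* In the visitor, something along the lines of (names invented):
 *
 *    emit(FS_OPCODE_FB_WRITE_LOGICAL, reg_undef, color0, color1, src_depth);
 *
 * and then the lowering pass turns it into a payload plus the real send: */
void
lower_fb_writes(fs_visitor *v)
{
   foreach_inst(inst, v) {   /* hypothetical instruction-walking helper */
      if (inst->opcode != FS_OPCODE_FB_WRITE_LOGICAL)
         continue;

      /* Assemble the payload from the logical sources; this is where the
       * LOAD_PAYLOAD would be emitted. */
      fs_reg payload = build_fb_write_payload(v, inst);   /* hypothetical */

      inst->opcode = FS_OPCODE_FB_WRITE;
      inst->resize_sources(1);
      inst->src[0] = payload;
   }
}
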
>> >>>
>> >>> Second, we add a split_instruction helper that would take a SIMD16
>> >>> instruction and naively split it into two SIMD8 instructions. Such a
>> >>> helper really shouldn't be that hard to write. It would have to know
>> >>> how to take a SIMD16 vec4 and unzip it into two SIMD8 vec4's but that
>> >>> shouldn't be bad. Any of these new logical send instructions would
>> >>> have their values as separate sources so they should be safe to split.
>> >>>
>> >>> Third, we add a lower_simd16_to_simd8 pass that walks the
>> >>> instructions, picks out the ones that need splitting, and calls
>> >>> split_instruction on them. All of the gen-specific SIMD8 vs. SIMD16
>> >>> knowledge would be contained in this one pass. This pass would happen
>> >>> between actually emitting code and running lower_fb_writes (or
>> >>> whatever other send lowering passes we have).
>> >>>
>> >>> Finally, and this is the icing on the cake, we write a
>> >>> lower_simd32_to_simd16 pass that goes through and lowers all SIMD32
>> >>> instructions (that act on full-sized data-types) to SIMD16
>> >>> instructions. Once the rest of the work is done, we get this pass,
>> >>> and with it SIMD32 mode, almost for free.
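
(And the splitting side as I understand it, again with invented helper
names rather than real API:)

void
lower_simd16_to_simd8(fs_visitor *v)
{
   foreach_inst_safe(inst, v) {   /* hypothetical iteration helper */
      /* All the gen-specific SIMD8 vs. SIMD16 knowledge lives here. */
      if (inst->exec_size <= 8 || !needs_simd8_splitting(v->devinfo, inst))
         continue;

      for (unsigned half = 0; half < 2; half++) {
         fs_inst *sub = clone_instruction(v, inst);
         sub->exec_size = 8;
         sub->force_sechalf = (half == 1);

         /* Destination and sources reference the appropriate half of the
          * original SIMD16 registers ("unzipping" vec4 operands as needed). */
         sub->dst = half_of(sub->dst, half);
         for (int i = 0; i < sub->sources; i++)
            sub->src[i] = half_of(sub->src[i], half);

         insert_before(v, inst, sub);
      }
      remove_instruction(v, inst);
   }
}
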
>> >>>
>> >>> I know this approach looks like more work and, to be honest, it may
>> >>> be. However, I think it makes a lot of things far more
>> >>> straightforward. In particular, it means that people working on the
>> >>> visitor code don't have to think about whether or not an instruction
>> >>> needs splitting. You also don't have to deal with the complexity of
>> >>> zipping/unzipping sources every time. Instead, we put all that code
>> >>> in one place and get to stop thinking about it. Also, if we *ever*
>> >>> want to get SIMD32, we will need some sort of automatic instruction
>> >>> splitting and this seems like a reasonable way to do it.
>> >>>
>> >>> I've talked to Ken about this approach and he's 100% on-board. I
>> >>> don't remember what Matt thinks of it. If we like the approach, then
>> >>> we should just split the tasks up and make it happen. It's a bit of
>> >>> refactoring but it shouldn't be terrible. If we wanted to demo it, I
>> >>> would probably suggest starting with FB writes as those are fairly
>> >>> complex but yet self-contained. They also have a case where we do
>> >>> split an instruction so it would be interesting to see what the code
>> >>> looks like before and after.
>> >>>
>> >>
>> >> I generally like your proposal. I guess the question we need to answer
>> >> is whether we want this complexity to be in a lowering pass or in a
>> >> helper function used to build the send-like instruction -- In either
>> >> case we need code to handle zipping and unzipping of SIMD16 vectors,
>> >> it's just about whether this code is called by a lowering pass or higher
>> >> up in the visitor.
>> >>
>> >> I can think of several benefits of the approach you propose over mine:
>> >>
>> >> - It's more transparent for the visitor code emitting the message -- I
>> >> completely agree with you that the explicit loops are rather ugly.
>> >>
>> >> - Instructions with explicit separate sources are likely to be more
>> >> suitable for certain optimization passes. Pull constant loads use a
>> >> similar approach with an expression-style opcode which is at some
>> >> point lowered to a load payload and send message. This may not be
>> >>   terribly important at this point because of the optimizations already
>> >>   performed in GLSL IR and NIR and due to the nature of the majority of
>> >> opcodes that don't support SIMD16, but it still seems appealing.
>> >
>> > There's one more really big one that you missed:
>> >
>> > It scales! We can't afford to have a for loop for every ADD and MUL
>> > instruction. Sure, we might be able to afford it on sends, but not
>> > for everything.
>> >
>> Well, I doubt you'd want to implement your proposal to the letter for
>> non-send instructions: For those we don't need separate lowered and
>> non-lowered instructions because there's no payload to assemble, so we
>> can make do with a single opcode with different execution widths.  When
>> you take payloads out of the picture all the disadvantages mentioned of
>> the lowering pass approach no longer apply, so I totally agree with you
>> that we want a general lowering pass to lower instructions that expect
>> their arguments as separate sources in their final form. In any case
>> send-like instructions need a somewhat different (and less scalable)
>> treatment.
>>
>> >> Some disadvantages come to my mind too:
>> >>
>> >> - It increases the amount of work required to add a new send-like
>> >>   instruction because you need lowered and unlowered variants for each.
>> >>
>> >> - It seems tricky to get right when splitting an instruction in halves
>> >> involves changing the actual contents of the payload beyond zipping
>> >> and unzipping its arguments -- This might not seem like a big deal
>> >> right now, but it will be a problem when we implement SIMD32. The
>> >> surface messages that take a sample mask as argument are a good
>> >> example, because they only have 16 bits of space for it so you
>> >> actually need to provide different values depending on the "slot
>> >>   group" the message is meant for. This can be worked around easily in
>> >> the visitor by shifting the sample mask register but it seems harder
>> >> to fix up later.
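
(What I mean by the visitor-side workaround, assuming the builder interface
from this series -- the exact immediate/helper spellings are approximate:)

/* Pick the 16 sample-mask bits corresponding to the slot group a surface
 * message is meant for; 'group' is 0 for channels 0-15 and 1 for channels
 * 16-31 of a SIMD32 shader. */
fs_reg
sample_mask_for_group(const fs_builder &bld, const fs_reg &mask, unsigned group)
{
   const fs_reg result = bld.vgrf(BRW_REGISTER_TYPE_UD);
   /* The message only has room for 16 bits, so the second slot group needs
    * the upper half of the mask shifted down. */
   bld.SHR(result, mask, fs_reg(group * 16u));
   return result;
}
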
>> >
>> > Why do sample masks need to be part of the logical instruction? Can't
>> > we figure that out when we lower from logical to physical based on the
>> > quarter control?
>> >
>> You can surely do anything you want during the logical-to-physical
>> conversion, including rewriting the header; the problem is that it
>> forces you to have a pile of message-specific handling code in the
>> lowering pass. How are you planning to address that? With a separate
>> lowering pass for each message opcode or a general one with
>> message-specific knowledge?
>
> Yes, the lowering pass will have *all* of the message-specific
> information. Probably broken out into helper functions exactly the way the
> message emit code is broken out now. The lowering pass then just knows
> what helper to call for what instruction.
>
I think I only buy your proposal if it saves us more work than it
creates in the long term, e.g. by using general splitting and payload
assembly algorithms shared among all opcodes with minimal
message-specific information. Otherwise what you are describing sounds
like a "bureaucratic" variant of my proposal, with lowered and unlowered
versions of each opcode and with the payload assembly code (functionally
almost the same as mine) hidden behind a lowering pass under a
switch-case statement instead of being called up front.
>> >> - My intention was to handle vector layout peculiarities beyond
>> >> SIMDN-to-SIMD8 splitting under a consistent framework, leaving it
>> >> under the control of the helper function that builds the message
>> >> which quirks to apply. With the lowering pass approach you'd either
>> >> need message-specific lowering passes, message-specific code in a
>> >> general lowering pass, or some way for the builder code to pass
>> >> vector_layout-like metadata to the lowering pass -- In the first two
>> >> cases the policy on how to format each source in the payload would
> be
>> >> "delocalized" between the visitor and lowering pass, what I don't
>> >> particularly like.
>> >
>> > We already do that and we would continue to. The logical instruction
>> > wouldn't be much more than a serialized version of the helper
>> > function.
>> >
>> > Thanks for your reply.
>> > --Jason
>> >
>> >>> Thoughts?
>> >>>
>> >>> Question for Curro: Supposing for the moment that we decide to do
>> >>> SIMD16 -> SIMD8 splitting this way, how much of the array_reg stuff
>> >>> would still be needed for image_load_store?
>> >>>
>> >> I think I've already answered this question, but I'll answer it again
>> >> here for clarity -- It would still help with building and passing around
>> >> payloads concisely using emit_collect(), regardless of how we implement
>> >> SIMD16 instruction splitting.
>> >>
>> >>> I hope this e-mail provides us with some good talking points and a way forward.
>> >>> --Jason
>> >>
>> >> Thanks Jason!
>> >>
>> >> [1] http://cgit.freedesktop.org/~currojerez/mesa/log/?h=image-load-store