[Mesa-dev] i965 implementation of the ARB_shader_image_load_store built-ins. (v3)
Francisco Jerez
currojerez at riseup.net
Fri May 15 14:39:26 PDT 2015
Jason Ekstrand <jason at jlekstrand.net> writes:
> On Fri, May 15, 2015 at 9:51 AM, Francisco Jerez <currojerez at riseup.net> wrote:
>> Jason Ekstrand <jason at jlekstrand.net> writes:
>>
>>> I haven't said much about this series up until now. I've mostly sat
>>> and watched and focused my time on other things. As I said in the
>>> meeting today, I think that part of the problem here is that there are
>>> at least 3 "refactors" in here besides the new feature. By
>>> "refactors", I mean "new ways of solving problem X". Part of the
>>> problem with the discussion is that all of this has been conflated and
>>> we aren't actually talking about how to solve each of those problems.
>>> I'm going to try and break it out here to try and aid in the
>>> discussion.
>>>
>> Thanks a lot Jason for giving a more constructive turn to this
>> discussion. :)
>>
>>> 1) The builder. I think *everyone* likes the builder. The argument
>>> here is not over whether or not we want it. People have expressed
>>> concern that adding the builder now without actually doing the
>>> refactor will result in duplicated code that will get out-of-sync. I
>>> think the best solution here is to go ahead and do the builder stuff
>>> immediately after the branch point.
>>>
>> Agreed.
>>
>>> 2) SIMD8 instruction splitting. As I said in the call, we've done
>>> this a variety of ways in the past usually by emitting the split
>>> instructions in the visitor and manually using half() and
>>> inst->force_sechalf. We also have a few places that I think still
>>> split the code in the generator. What you're doing is a somewhat more
>>> elegant version of the "emit it in the visitor" stuff that we've done
>>> in the past. You've got your zip/unzip functions etc. to handle
>>> splitting and combining and then you have a loop to actually emit the
>>> sends. Really, it's fairly nice and concise. The question is whether
>>> or not it's really what we want to do (more on this in a bit).
>>>
>> There is another closely related problem that this infrastructure solves
>> which may not be obvious in this series (it was addressed in my original
>> branch [1]), but I think it's worth keeping in mind. I'll call it 2.1:
>> Some of the surface messages lacked SIMD4x2 variants, so a strided copy
>> is required to transpose the original AoS vector into SoA form. This
>> turns out to be algorithmically identical to the strided copy required
>> to "unzip" a SIMD16 vector, so it can be accomplished without additional
>> effort.
>>
>> These transformations are largely transparent for the functions building
>> typed/untyped surface messages, they are hidden behind an interface that
>> doesn't seem to have received any attention yet:
>>
>> - struct vector_layout: This represents the layout of a vector in a
>> payload. It's initialized based on the restrictions of the hardware
>> for the specific message we want to send, and the current code
>> generation parameters (i.e. SIMD mode).
>>
>> - emit_insert(): This takes a vector with the layout native to the EU
>> (e.g. an fs_reg) and performs the required transformations to turn it
>> into a form suitable for the shared unit according to a vector_layout
>> struct (which may involve unzipping, a strided copy or no work at
>> all).
>>
>> - emit_extract(): The converse of emit_insert(). This takes a payload
>> fragment (array_reg) and transforms it into a vector of native
>> layout, according to a vector_layout struct.
>>
>> My intention was to come up with an interface more general than SIMD16
>> vector splitting and SIMD4x2 to SIMD8 conversion alone; I was also
>> thinking of texturing instruction payload quirks (which might require
>> padding or different permutations of the same values depending on the
>> generation even if the SIMD mode is natively supported -- let's call
>> this 2.2) and framebuffer writes (2.3).
>
> Thank you for pointing out what else you were trying to do with it. I
> understand that you're trying to solve a dozen different problems and
> that the infrastructure you've created is one that you think solves
> all of them at the same time. It helps to know what all problems you
> were trying to solve. However, since we are putting SIMD4x2 aside for
> now, in the interest of keeping us from getting side-tracked, let's
> table this for now. It's a problem that may need solving, but it's a
> *different* problem.
>
If it can be solved with the same approach, it is a closely related
problem, isn't it? And wouldn't it be a bit silly to knowingly
implement a less general solution when the more general solution
involves *less* work, and we know we'll want to address the remaining
problems solved by the more general solution eventually? ;)
>>> 3) Array reg. This is something of a helper to (2).
>>
>> In fact the primary motivation for array_reg wasn't instruction
>> splitting, but emit_collect(), more on that later.
>>
>>> I agree that it's nice to be able to have access to the size of a
>>> register without having to look at the allocator. The problem here
>>> (and why we don't have a size in fs_reg) is that we copy the register
>>> around. Every register does have some sort of "underlying storage"
>>> that it shares with other registers with the same number. But that
>>> storage is represented by the size in the allocator and nothing else.
>>> Is that bad? Maybe. Should we have a way of passing the size around
>>> with it? Maybe. In any case, I think it's best to table this
>>> discussion until we've talked a bit about (2) because I think the
>>> resolution there will make (3) more clear.
>>>
>> I think the mechanism we use to determine the size of a register hardly
>> ever involves the allocator, and there are good reasons not to rely on
>> it: It only works for VGRFs, can easily break with register coalescing,
>> and it makes "slicing" and other zero-copy transformations of registers
>> difficult (e.g. taking an individual element out of a VGRF array without
>> copying it into a one-component register). Typically the size of a
>> register is implicitly assumed by the instruction using it, and where
>> it's not (because the instruction expects a variable length payload) it
>> is provided separately from the register via "mlen". Neither of these
>> two options works here as will be obvious shortly.
>
> Understood.
>
>>> One note on 3 before we table it. People have been pulling
>>> emit_collect out as a straw-man and beating on it but I really do like
>>> it in principle. There are piles of times where you want a payload
>>> and you have to allocate an array, do stuff, and then put it into the
>>> payload and it's really annoying. It would be much easier if we had a
>>> helper that just did it all in one function and that's what
>>> emit_collect tries to do. *Thank you* for finally trying to solve
>>> that problem and make payloads less painful! However, I do wish it
>>> were implemented a bit differently but I'll make those comments on the
>>> patch itself.
>>>
>>
>> Yeah. I think it was Matt who mentioned in the call that emit_collect()
>> is really complex. I completely agree with that statement. It needs to
>> take care of a bunch of things to build the payload correctly:
>>
>> - Allocate a register that will hold the result, calculating the
>> correct total size based on the amount of data we want to
>> concatenate.
>>
>> - Calculate the correct header size.
>>
>> - Allocate and release memory of the correct size to hold the array of
>> LOAD_PAYLOAD sources, also based on the amount of data we want to
>> concatenate.
>>
>> - Split up the arguments into registers of the correct width to be
>> passed to LOAD_PAYLOAD as sources (the width depends on whether the
>> argument is part of the header or not, and what the dispatch width
>> is), and initialize the source array.
>>
>> - Create a LOAD_PAYLOAD instruction with the result of the previous
>> steps.
>>
>> Each one of these tasks is non-trivial, repetitive, can conceivably go
>> wrong, and was until now duplicated wherever we had to build a payload --
>> obscuring the actually meaningful work being done. This is why I think
>> that the assertion that emit_collect() makes code difficult to
>> understand is backwards. Thanks to emit_collect() the task of building
>> a payload is tremendously simplified.
>
> Yes. I really don't think people are disagreeing with you too badly
> about emit_collect(). I also don't think it's as complicated as you
> make it sound.
>
Heh, maybe "tremendously" wasn't the word I was looking for, but it's
appreciably simplified. :P
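To make the claimed simplification concrete, here is a hypothetical
usage sketch, reusing the stand-in types from the interface sketch
earlier in this mail; the helper names and signatures are approximate
rather than the exact code from my branch:

  /* Approximate shape of the helper: concatenate a header and a number
   * of register+size arguments into a single payload, emitting the
   * LOAD_PAYLOAD behind the scenes and returning the resulting
   * register+size pair. */
  array_reg emit_collect(fs_visitor *v, const array_reg &header,
                         const array_reg &addr, const array_reg &src);

  /* Hypothetical helper that builds a surface message header. */
  array_reg emit_surface_header(fs_visitor *v);

  /* Building e.g. an untyped surface write payload then becomes: */
  array_reg
  emit_untyped_write_payload(fs_visitor *v, const vector_layout &layout,
                             const fs_reg &addr, const fs_reg &src,
                             unsigned dims, unsigned components)
  {
     return emit_collect(v, emit_surface_header(v),
                         emit_insert(v, layout, addr, dims),
                         emit_insert(v, layout, src, components));
  }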
>> For emit_collect() to be possible there has to be some way to represent
>> a register+size pair, because it needs to know the size of each
>> argument, and because its result is itself a register+size pair -- The
>> caller cannot do anything useful with a payload of unknown size, unless
>> it re-calculates the size of the result, which would defeat much of the
>> purpose of emit_collect().
>>
>> The biggest benefit from being able to represent a register+size pair as
>> a value is that emit_collect() can easily be composed with other
>> functions (including with itself), so the construction of specific
>> chunks of the payload can easily be decoupled and re-used. My code
>> takes advantage of this feature extensively in order to share code that
>> constructs surface message headers or transforms some value into the
>> form expected by the shared unit (doing vector splitting or a strided
>> copy).
>
> Yes, I understand that it's useful for that. Hence the suggestion of
> either a reg/size pair struct or passing the size as an argument. The
> struct has the advantage, as you said, of allowing chaining.
>
>>> ### SIMD16 Instruction Splitting ###
>>>
>>> SIMD16 instruction splitting is an unfortunate fact of our hardware.
>>> There are a variety of times we have to do it including dual-source FB
>>> writes, some texturing, math ops on older gens and maybe another place
>>> or two. Classically, this has been done in one of two places: The
>>> visitor as we emit the code, or the generator. The problem with doing
>>> it in the generator is that we can't schedule it and, if it involves a
>>> payload, it's not really possible. The result is that we usually do
>>> it in the visitor these days.
>>>
>>> Unfortunately, even in the visitor, it's gen-specific and annoying.
>>> It gets even worse when you're working with something such as the
>>> untyped surface read/write messages that work with multi-component
>>> values that have to be zipped/unzipped to use Curro's terminology.
>>> Curro came up with some helper functions to make it substantially less
>>> annoying but it still involves nasty looping.
>>>
>>> At some point in the past I proposed a completely different solution
>>> to this problem. Unfortunately, while I've talked to Matt &
>>> Ken about it, it's never really been discussed all that publicly so
>>> Curro may not be aware of it. I'm going to lay it out here for
>>> Curro's sake as well as the sake of public record.
>>>
>>> The solution involves first changing the way we handle sends into a
>>> two step process. First, we emit a logical instruction that contains
>>> all of the data needed for the actual instruction. Then, we convert
>>> from the logical to the actual in a lowering pass. Take, for example,
>>> FB writes with which I am fairly familiar. We would first emit a
>>> logical FS_FB_WRITE_# instruction that has separate sources for color,
>>> depth, replicated alpha, etc. Then, in the lower_fb_writes pass
>>> (which we would have to implement), we would construct the payload
>>> from the sources provided on the logical instruction and emit the
>>> actual LOAD_PAYLOAD and FB_WRITE instructions. This lower_fb_writes
>>> function would then get called before the optimization loop so that
>>> the rest of the optimization code would never see it.
>>>
>>> Second, we add a split_instruction helper that would take a SIMD16
>>> instruction and naively split it into two SIMD8 instructions. Such a
>>> helper really shouldn't be that hard to write. It would have to know
>>> how to take a SIMD16 vec4 and unzip it into two SIMD8 vec4's but that
>>> shouldn't be bad. Any of these new logical send instructions would
>>> have their values as separate sources so they should be safe to split.
>>>
>>> Third, we add a lower_simd16_to_simd8 pass that walks the
>>> instructions, picks out the ones that need splitting, and calls
>>> split_instruction on them. All of the gen-specific SIMD8 vs. SIMD16
>>> knowledge would be contained in this one pass. This pass would happen
>>> between actually emitting code and running lower_fb_writes (or
>>> whatever other send lowering passes we have).
>>>
>>> Finally, and this is the icing on the cake, we write a
>>> lower_simd32_to_simd16 pass that goes through and lowers all SIMD32
>>> instructions (that act on full-sized data-types) to SIMD16
>>> instructions. Once the rest of the work is done, we get this pass,
>>> and with it SIMD32 mode, almost for free.
>>>
>>> I know this approach looks like more work and, to be honest, it may
>>> be. However, I think it makes a lot of things far more
>>> straightforward. In particular, it means that people working on the
>>> visitor code don't have to think about whether or not an instruction
>>> needs splitting. You also don't have to deal with the complexity of
>>> zipping/unzipping sources every time. Instead, we put all that code
>>> in one place and get to stop thinking about it. Also, if we *ever*
>>> want to get SIMD32, we will need some sort of automatic instruction
>>> splitting and this seems like a reasonable way to do it.
>>>
>>> I've talked to Ken about this approach and he's 100% on-board. I
>>> don't remember what Matt thinks of it. If we like the approach, then
>>> we should just split the tasks up and make it happen. It's a bit of
>>> refactoring but it shouldn't be terrible. If we wanted to demo it, I
>>> would probably suggest starting with FB writes as those are fairly
>>> complex but yet self-contained. They also have a case where we do
>>> split an instruction so it would be interesting to see what the code
>>> looks like before and after.
>>>
>>
>> I generally like your proposal. I guess the question we need to answer
>> is whether we want this complexity to be in a lowering pass or in a
>> helper function used to build the send-like instruction -- In either
>> case we need code to handle zipping and unzipping of SIMD16 vectors,
>> it's just about whether this code is called by a lowering pass or higher
>> up in the visitor.
>>
>> I can think of several benefits of the approach you propose over mine:
>>
>> - It's more transparent for the visitor code emitting the message -- I
>> completely agree with you that the explicit loops are rather ugly.
>>
>> - Instructions with explicit separate sources are likely to be more
>> suitable for certain optimization passes. Pull constant loads use a
>> similar approach with an expression-style opcode which is at some
>> point lowered to a load payload and send message. This may not be
>> terribly important at this point because of the optimizations already
>> performed in GLSL IR and NIR and due to the nature of the majority of
>> opcodes that don't support SIMD16, but it still seems appealing.
>
> There's one more really big one that you missed:
>
> It scales! We can't afford to have a for loop for every ADD and MUL
> instruction. Sure, we might be able to afford it on sends, but not
> for everything.
>
Well, I doubt you'd want to implement your proposal to the letter for
non-send instructions: For those we don't need separate lowered and
non-lowered instructions because there's no payload to assemble, so we
can make do with a single opcode used at different execution widths.
When you take payloads out of the picture, all the disadvantages of the
lowering pass approach I mentioned no longer apply, so I totally agree
with you that we want a general lowering pass for instructions that
already expect their arguments as separate sources in their final form.
In any case, send-like instructions need a somewhat different (and less
scalable) treatment.
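Something along these lines is what I have in mind for the non-send
case -- a purely hypothetical skeleton, not code from any existing
branch; required_exec_width() and emit_narrowed_copy() are made-up
helpers:

  /* Hypothetical SIMD-width lowering pass for instructions whose
   * sources are already in their final (non-payload) form.
   * required_exec_width() would encapsulate the gen-specific SIMD8
   * vs. SIMD16 (and eventually SIMD32) restrictions, and
   * emit_narrowed_copy() would emit one copy of the instruction acting
   * on the i-th group of channels, taking the corresponding halves of
   * its sources and destination. */
  bool
  lower_simd_width(fs_visitor *v)
  {
     bool progress = false;

     foreach_block_and_inst_safe(block, fs_inst, inst, v->cfg) {
        const unsigned max_width = required_exec_width(v->devinfo, inst);

        if (inst->exec_size <= max_width)
           continue;

        for (unsigned i = 0; i < inst->exec_size / max_width; i++)
           emit_narrowed_copy(v, block, inst, i, max_width);

        inst->remove(block);
        progress = true;
     }

     return progress;
  }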
>> Some disadvantages come to my mind too:
>>
>> - It increases the amount of work required to add a new send-like
>> instruction because you need lowered and unlowered variants for each.
>>
>> - It seems tricky to get right when splitting an instruction in half
>> involves changing the actual contents of the payload beyond zipping
>> and unzipping its arguments -- This might not seem like a big deal
>> right now, but it will be a problem when we implement SIMD32. The
>> surface messages that take a sample mask as argument are a good
>> example, because they only have 16 bits of space for it so you
>> actually need to provide different values depending on the "slot
>> group" the message is meant for. This can be worked around easily in
>> the visitor by shifting the sample mask register but it seems harder
>> to fix up later.
>
> Why do sample masks need to be part of the logical instruction? Can't
> we figure that out when we lower from logical to physical based on the
> quarter control?
>
You can surely do anything you want during the logical-to-physical
conversion, including rewriting the header; the problem is that it
forces you to have a pile of message-specific handling code in the
lowering pass. How are you planning to address that? With a separate
lowering pass for each message opcode, or a general one with
message-specific knowledge?
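Just to illustrate the kind of message-specific fix-up I mean, assuming
a builder interface along the lines of the one proposed in this series
(the helper below is hypothetical, not actual code from my branch): in
the visitor it's easy to pick the right half of the sample mask for the
slot group a given half of the message is meant for, e.g.:

  /* Hypothetical helper: the header only has 16 bits of space for the
   * sample mask, so the second slot group (channels 16-31) of a SIMD32
   * program would need the upper half of the mask shifted down before
   * it's written into the header. */
  fs_reg
  sample_mask_for_group(const fs_builder &bld, const fs_reg &mask,
                        unsigned group)
  {
     const fs_reg tmp = bld.vgrf(BRW_REGISTER_TYPE_UD);

     if (group == 0)
        bld.MOV(tmp, mask);
     else
        bld.SHR(tmp, mask, fs_reg(16u)); /* select channels 16..31 */

     return tmp;
  }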
>> - My intention was to handle vector layout peculiarities beyond
>> SIMDN-to-SIMD8 splitting under a consistent framework, leaving the
>> decision of which quirks to apply under the control of the helper
>> function that builds the message. With the lowering pass approach
>> you'd either need message-specific lowering passes, message-specific
>> code in a general lowering pass, or some way for the builder code to
>> pass vector_layout-like metadata to the lowering pass -- in the first
>> two cases the policy on how to format each source in the payload
>> would be "delocalized" between the visitor and the lowering pass,
>> which I don't particularly like.
>
> We already do that and we would continue to. The logical instruction
> wouldn't be much more than a serialized version of the helper
> function.
>
> Thanks for your reply.
> --Jason
>
>>> Thoughts?
>>>
>>> Question for Curro: Supposing for the moment that we decide to do
>>> SIMD16 -> SIMD8 splitting this way, how much of the array_reg stuff
>>> would still be needed for image_load_store?
>>>
>> I think I've already answered this question, but I'll answer it again
>> here for clarity -- It would still help with building and passing around
>> payloads concisely using emit_collect(), regardless of how we implement
>> SIMD16 instruction splitting.
>>
>>> I hope this e-mail provides us with some good talking points and a way forward.
>>> --Jason
>>
>> Thanks Jason!
>>
>> [1] http://cgit.freedesktop.org/~currojerez/mesa/log/?h=image-load-store