[Mesa-dev] i965 implementation of the ARB_shader_image_load_store built-ins. (v3)

Mon May 18 19:08:00 PDT 2015

On Mon, May 18, 2015 at 10:34 AM, Francisco Jerez <currojerez at riseup.net> wrote:
> Francisco Jerez <currojerez at riseup.net> writes:

[...] Snip

>>>> >>> ### SIMD16 Instruction Splitting ###
>>>> >>>
>>>> >>> SIMD16 instruction splitting is an unfortunate fact of our hardware.
>>>> >>> There are a variety of times we have to do it including dual-source FB
>>>> >>> writes, some texturing, math ops on older gens and maybe another place
>>>> >>> or two.  Classically, this has been done in one of two places: The
>>>> >>> visitor as we emit the code, or the generator.  The problem with doing
>>>> >>> it in the generator is that we can't schedule it and, if it involves a
>>>> >>> payload, it's not really possible.  The result is that we usually do
>>>> >>> it in the visitor these days.
>>>> >>>
>>>> >>> Unfortunately, even in the visitor, it's gen-specific and annoying.
>>>> >>> It gets even worse when you're working with something such as the
>>>> >>> untyped surface read/write messages that work with multi-component
>>>> >>> values that have to be zipped/unzipped to use Curro's terminology.
>>>> >>> Curro came up with some helper functions to make it substantially less
>>>> >>> annoying but it still involves nasty looping.
>>>> >>>
>>>> >>> At some point in the past I proposed a completely different and more
>>>> >>> solution to this problem.  Unfortunately, while I've talked to Matt &
>>>> >>> Ken about it, it's never really been discussed all that publicly so
>>>> >>> Curro may not be aware of it.  I'm going to lay it out here for
>>>> >>> Curro's sake as well as the sake of public record.
>>>> >>>
>>>> >>> The solution involves first changing the way we handle sends into a
>>>> >>> two step process.  First, we emit a logical instruction that contains
>>>> >>> all of the data needed for the actual instruction.  Then, we convert
>>>> >>> from the logical to the actual in a lowering pass.  Take, for example,
>>>> >>> FB writes with which I am fairly familiar.  We would first emit a
>>>> >>> logical FS_FB_WRITE_# instruction that has separate sources for color,
>>>> >>> depth, replicated alpha, etc.   Then, in the lower_fb_writes pass
>>>> >>> (which we would have to implement), we would construct the payload
>>>> >>> from the sources provided on the logical instruction and emit the
>>>> >>> actual LOAD_PAYLOAD and FB_WRITE instructions.  This lower_fb_writes
>>>> >>> function would then get called before the optimization loop so that
>>>> >>> the rest of the optimization code would never see it.
>>>> >>>
>>>> >>> Second, we add a split_instruction helper that would take a SIMD16
>>>> >>> instruction and naively split it into two SIMD8 instructions.  Such a
>>>> >>> helper really shouldn't be that hard to write.  It would have to know
>>>> >>> how to take a SIMD16 vec4 and unzip it into two SIMD8 vec4's but that
>>>> >>> shouldn't be bad.  Any of these new logical send instructions would
>>>> >>> have their values as separate sources so they should be safe to split.
>>>> >>>
>>>> >>> Third, we add a lower_simd16_to_simd8 pass that walks the
>>>> >>> instructions, picks out the ones that need splitting, and calls
>>>> >>> split_instruction on them.  All of the gen-specific SIMD8 vs. SIMD16
>>>> >>> knowledge would be contained in this one pass.  This pass would happen
>>>> >>> between actually emitting code and running lower_fb_writes (or
>>>> >>> whatever other send lowering passes we have).
>>>> >>>
>>>> >>> Finally, and this is the icing on the cake, we write a
>>>> >>> lower_simd32_to_simd16 pass that goes through and lowers all SIMD32
>>>> >>> instructions (that act on full-sized data-types) to SIMD16
>>>> >>> instructions.  Once the rest of the work is done, we get this pass,
>>>> >>> and with it SIMD32 mode, almost for free.
>>>> >>>
>>>> >>> I know this approach looks like more work and, to be honest, it may
>>>> >>> be.  However, I think it makes a lot of things far more
>>>> >>> straightforward.  In particular, it means that people working on the
>>>> >>> visitor code don't have to think about whether or not an instruction
>>>> >>> needs splitting.  You also don't have to deal with the complexity of
>>>> >>> zipping/unzipping sources every time.  Instead, we put all that code
>>>> >>> in one place and get to stop thinking about it.  Also, if we *ever*
>>>> >>> want to get SIMD32, we will need some sort of automatic instruction
>>>> >>> splitting and this seems like a reasonable way to do it.
>>>> >>>
>>>> >>> I've talked to Ken about this approach and he's 100% on-board.  I
>>>> >>> don't remember what Matt thinks of it.  If we like the approach, then
>>>> >>> we should just split the tasks up and make it happen.  It's a bit of
>>>> >>> refactoring but it shouldn't be terrible.  If we wanted to demo it, I
>>>> >>> would probably suggest starting with FB writes as those are fairly
>>>> >>> complex but yet self-contained.  They also have a case where we do
>>>> >>> split an instruction so it would be interesting to see what the code
>>>> >>> looks like before and after.
>>>> >>>
>>>> >>
>>>> >> I generally like your proposal.  I guess the question we need to answer
>>>> >> is whether we want this complexity to be in a lowering pass or in a
>>>> >> helper function used to build the send-like instruction -- In either
>>>> >> case we need code to handle zipping and unzipping of SIMD16 vectors,
>>>> >> it's just about whether this code is called by a lowering pass or
>>>> >> higher up in the visitor.
>>>> >>
>>>> >> I can think of several benefits of the approach you propose over mine:
>>>> >>
>>>> >>  - It's more transparent for the visitor code emitting the message -- I
>>>> >>    completely agree with you that the explicit loops are rather ugly.
>>>> >>
>>>> >>  - Instructions with explicit separate sources are likely to be more
>>>> >>    suitable for certain optimization passes.  Pull constant loads use a
>>>> >>    similar approach with an expression-style opcode which is at some
>>>> >>    point lowered to a load payload and send message.  This may not be
>>>> >>    terribly important at this point because of the optimizations
>>>> >>    already performed in GLSL IR and NIR and due to the nature of the
>>>> >>    majority of opcodes that don't support SIMD16, but it still seems
>>>> >>    appealing.
>>>> >
>>>> > There's one more really big one that you missed:
>>>> >
>>>> > It scales!  We can't afford to have a for loop for every ADD and MUL
>>>> > instruction.  Sure, we might be able to afford it on sends, but not
>>>> > for everything.
>>>> >
>>>> Well, I doubt you'd want to implement your proposal to the letter for
>>>> non-send instructions: For those we don't need separate lowered and
>>>> non-lowered instructions because there's no payload to assemble, so we
>>>> can just do with a single opcode with different execution widths.  When
>>>> you take payloads out of the picture all the disadvantages mentioned of
>>>> the lowering pass approach no longer apply, so I totally agree with you
>>>> that we want a general lowering pass to lower instructions that expect
>>>> their arguments as separate sources in their final form.  In any case
>>>> send-like instructions need a somewhat different (and less scalable)
>>>> treatment.
>>>>
>>>> >> Some disadvantages come to my mind too:
>>>> >>
>>>> >>  - It increases the amount of work required to add a new send-like
>>>> >>    instruction because you need lowered and unlowered variants for
>>>> >>    each.
>>>> >>
>>>> >>  - It seems tricky to get right when splitting an instruction in halves
>>>> >>    involves changing the actual contents of the payload beyond zipping
>>>> >>    and unzipping its arguments -- This might not seem like a big deal
>>>> >>    right now, but it will be a problem when we implement SIMD32.  The
>>>> >>    surface messages that take a sample mask as argument are a good
>>>> >>    example, because they only have 16 bits of space for it so you
>>>> >>    actually need to provide different values depending on the "slot
>>>> >>    group" the message is meant for.  This can be worked around easily
>>>> >>    in the visitor by shifting the sample mask register but it seems
>>>> >>    harder to fix up later.
>>>> >
>>>> > Why do sample masks need to be part of the logical instruction?  Can't
>>>> > we figure that out when we lower from logical to physical based on the
>>>> > quarter control?
>>>> >
>>>> You can surely do anything you want during the logical-to-physical
>>>> conversion, including rewriting the header; the problem is that it
>>>> forces you to have a pile of message-specific handling code in the
>>>> lowering pass.  How are you planning to address that?  With a separate
>>>> lowering pass for each message opcode or a general one with
>>>> message-specific knowledge?
>>>
>>> Yes, the lowering pass will have *all* of the message-specific
>>> information.  Probably broken out into helper functions exactly the way the
>>> message emit code is broken out now.  The lowering pass then just knows
>>> what helper to call for what instruction.
>>>
>> I think I only buy your proposal if it saves us more work than it
>> creates in the long term, e.g. by using general splitting and payload
>> assembly algorithms shared among all opcodes with minimal
>> message-specific information.  Otherwise what you are describing sounds
>> like a "bureaucratic" variant of my proposal, with lowered and unlowered
>> versions of each opcode and with the payload assembly code (functionally
>> almost the same as mine) hidden behind a lowering pass under a
>> switch-case statement instead of being called up front.
>>
>
> I've given this idea a shot.  Can you have a look at the
> image-load-store-lower branch of my tree [1]?  It's just a quick and
> dirty proof of concept, so don't bother to review it carefully, just let
> me know if you agree with the general design before I spend more time on
> it.
>
> [1] http://cgit.freedesktop.org/~currojerez/mesa/log/?h=image-load-store-lower

I took a look at it.  I think patch 3 "Add pass to lower opcodes with
unsupported SIMD width." is more-or-less exactly what I'm talking
about.  What I don't understand is the stuff about split payloads.
While I think we *might* be able to split a payload it seems dangerous
and like something we shouldn't be doing.  This is where the "logical"
opcodes I mentioned come into play.  I think there has been some
miscommunication there; perhaps I didn't explain myself very well.
Allow me to be more explicit; I'll use image loads for my example.

 1) We would add an opcode SHADER_IMAGE_LOAD_LOGICAL (or some other
name) that takes 4 arguments: image, address, format, and dims just
like the emit_image_load helper.
 2) Instead of calling the helper, the visitor would just emit a
SHADER_IMAGE_LOAD_LOGICAL instruction with those arguments.
 3) We then run the splitting pass which can easily split the new load
instruction since no payloads are involved.
 4) We then have a lowering pass which knows how to turn
SHADER_IMAGE_LOAD_LOGICAL into an actual load including the payload,
pixel mask, and whatever other fiddly bits there are.
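
To make steps (3) and (4) a bit more concrete, here is a rough sketch
in C++ using a toy IR, not the real i965 backend IR.  Apart from the
SHADER_IMAGE_LOAD_LOGICAL and LOAD_PAYLOAD names, everything in it
(the Inst struct, first_channel, SHADER_IMAGE_LOAD_SEND, and both pass
names) is made up for illustration, and the payload assembly is
reduced to a placeholder.  Treat it as a sketch of the ordering of the
passes, not as a proposed implementation.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Toy IR for illustration only; all of these types are hypothetical.
enum class Opcode {
    SHADER_IMAGE_LOAD_LOGICAL,  // logical opcode with separate sources
    LOAD_PAYLOAD,               // gathers sources into one payload
    SHADER_IMAGE_LOAD_SEND      // hypothetical "physical" send opcode
};

struct Inst {
    Opcode op;
    unsigned exec_size;             // number of channels, e.g. 8 or 16
    unsigned first_channel;         // first channel this instruction covers
    std::vector<std::string> srcs;  // separate sources, or a single payload
};

// Step 3: split logical instructions that are wider than max_width into
// pieces of max_width channels.  No payload exists yet, so each piece
// keeps the same separate sources and only its channel range changes.
std::vector<Inst> lower_simd_width(const std::vector<Inst> &in,
                                   unsigned max_width)
{
    std::vector<Inst> out;
    for (const Inst &inst : in) {
        if (inst.op != Opcode::SHADER_IMAGE_LOAD_LOGICAL ||
            inst.exec_size <= max_width) {
            out.push_back(inst);
            continue;
        }
        for (unsigned off = 0; off < inst.exec_size; off += max_width) {
            Inst piece = inst;
            piece.exec_size = max_width;
            piece.first_channel = inst.first_channel + off;
            out.push_back(piece);
        }
    }
    return out;
}

// Step 4: turn each (already split) logical instruction into a payload
// assembly followed by the send.  Message-specific details such as the
// pixel mask would be handled here, based on first_channel.
std::vector<Inst> lower_logical_sends(const std::vector<Inst> &in)
{
    std::vector<Inst> out;
    for (const Inst &inst : in) {
        if (inst.op != Opcode::SHADER_IMAGE_LOAD_LOGICAL) {
            out.push_back(inst);
            continue;
        }
        out.push_back({Opcode::LOAD_PAYLOAD, inst.exec_size,
                       inst.first_channel, inst.srcs});
        out.push_back({Opcode::SHADER_IMAGE_LOAD_SEND, inst.exec_size,
                       inst.first_channel, {"payload"}});
    }
    return out;
}
```

Running lower_simd_width(prog, 8) followed by lower_logical_sends() on
a single SIMD16 logical load gives two LOAD_PAYLOAD/send pairs, one
covering channels 0-7 and one covering channels 8-15.  The splitting
pass never has to look inside a payload.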

Steps (1) and (2) may not be quite right (you'll have to help me out
here).  We may want to keep emit_image_load so that it can do format
conversion and emit an untyped logical instruction.  However, in any
case, the logical instruction does not have any payload sources if we
can at all help it.

Does that make more sense?  Is there something I'm missing?
--Jason