[Mesa-dev] [RFC 0/9] i965/fs: Combine constants and unconditionally emit MADs

Fri Oct 31 19:12:31 PDT 2014

On Fri, Oct 31, 2014 at 9:27 PM, Matt Turner <mattst88 at gmail.com> wrote:
> Three-source instructions on i965 have the annoying property that they
> cannot use immediate operands. They do have the alluring property
> that they perform multiple operations in basically the same number of
> cycles as any other instruction. But when your arguments are immediates
> we decided that a MOV+MAD is basically going to be the same as a MUL+ADD
> (with immediates).
>
> One thing we didn't consider is that Gen 7 hardware can co-issue some
> instructions (ADD, MUL, MAD included) if they're not using immediates,
> so MOV+MAD probably is better in practice.
>
> Secondly, immediates are used multiple times more often than not. For
> example in 2.0 * vec4 + 1.0, we don't actually need to load each constant
> four times. 2 MOVs + 4 MADs would be better than 4 MULs and 4 ADDs,
> especially when co-issuing is considered.
>
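
To put numbers on the example above, here is a rough counting sketch in
Python. It is not Mesa code: the function names are made up for
illustration, and it assumes one scalar instruction per vector component,
as the "4 MULs and 4 ADDs" figure implies.

```python
def mul_add_count(components):
    # One MUL and one ADD per component; immediates are encoded inline.
    return 2 * components


def mov_mad_count(components, distinct_constants):
    # One MOV per distinct constant, then one MAD per component.
    return distinct_constants + components


# 2.0 * vec4 + 1.0: four components, two distinct constants.
print(mul_add_count(4), mov_mad_count(4, 2))  # prints "8 6"
```

The MOV+MAD form wins on instruction count whenever there are fewer
distinct constants than components. The sketch says nothing about cycles
or co-issue.
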
> This series adds some infrastructure to the control flow graph, including
> code to create the dominance tree which I use to figure out where to place
> MOV immediate instructions.
>
> It then adds a pass that runs after optimizations to collect immediates
> and selectively promote some to registers. The immediates are packed 8x
> per register.
>
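
For readers following along, the collect-and-pack step described above
might look roughly like this in Python. This is only a sketch under my own
assumptions, not the code in the series: `pack_immediates` and the
`min_uses` threshold are hypothetical, and the real pass presumably uses a
more selective heuristic than "used at least twice". Only the 8 slots per
register comes from the cover letter.

```python
from collections import Counter

SLOTS_PER_REG = 8  # from the cover letter: immediates are packed 8x per register


def pack_immediates(imm_uses, min_uses=2):
    """Map each promoted immediate to a (register, slot) pair.

    imm_uses holds one entry per use of an immediate in the program.
    min_uses is a hypothetical promotion threshold.
    """
    counts = Counter(imm_uses)  # keeps first-seen order on Python 3.7+
    promoted = [value for value, n in counts.items() if n >= min_uses]
    # Fill register 0 slots 0..7, then register 1, and so on.
    return {value: divmod(i, SLOTS_PER_REG) for i, value in enumerate(promoted)}


# 2.0 and 1.0 are each used twice and get promoted; 0.5 stays an immediate.
print(pack_immediates([2.0, 1.0, 2.0, 1.0, 0.5]))
```
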
> The last one lets us emit MAD instructions unconditionally, safe in the
> knowledge that the constant-combining pass will clean things up for us.
>
> The series works and passes piglit. It also cuts more than 3% of
> instructions in affected programs, including huge reductions in select
> programs.
>
> But there's some work to do before it'll be finished. Since review is so
> hard to come by these days, I'm hoping people will have managed to take
> a look by the time I've solved the remaining problems.
>
> The remaining to-do items are:
>    Figure out if MAD instructions still co-issue if operands aren't
>    aligned (e.g., mad dst.0, src0.0, src1.0, src2.3)
>       If they don't, figure out whether packing operands is beneficial
>       at all.
>
>    Probably a bottom-up instruction scheduling pass to help sink MOV-imm
>       (Currently losing a bunch of SIMD16 programs, I expect because of
>        this)

Just wondering... what would a bottom-up scheduling pass do to help if
we already try to shorten live ranges? Another way to help solve the
problem might be to make the constant load insertion pass smarter
about where it inserts the loads in the first place. Finally, if
nothing else fixes the issue we might want to write something that
(efficiently & intelligently) calculates the register pressure as
we're walking the program (maybe we should walk it bottom-up and use
the live-out sets?), and then don't insert the mov if it increases the
register pressure in between the def and use above the magic number --
we'll probably want something like this more once we go to SSA and get
smarter about register allocation anyways.
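
To make the register-pressure idea concrete, here is a minimal sketch in
Python. It is not Mesa code and everything in it is an assumption for
illustration: the function names, the (dst, srcs) instruction tuples, a
single straight-line block, and each live value costing exactly one
register. It walks the block bottom-up from the live-out set, records how
many values are live after each instruction, and then checks whether
keeping one more value live between the inserted mov and its use would
exceed a limit.

```python
def pressure_after(insts, live_out):
    """Number of live values right after each instruction.

    insts is a list of (dst, srcs) tuples for one straight-line block.
    """
    live = set(live_out)
    pressure = [0] * len(insts)
    for i in range(len(insts) - 1, -1, -1):  # walk bottom-up
        dst, srcs = insts[i]
        pressure[i] = len(live)
        live = (live - {dst}) | set(srcs)  # kill the def, then add the uses
    return pressure


def mov_fits(pressure, after_idx, use_idx, limit):
    """True if a mov placed right after instruction after_idx, whose result
    is read by instruction use_idx, keeps pressure at or below limit."""
    return all(p + 1 <= limit for p in pressure[after_idx:use_idx])


block = [("a", []), ("b", []), ("c", ["a", "b"]), ("d", ["c"])]
p = pressure_after(block, live_out={"d"})
print(p)                     # prints "[1, 2, 1, 1]"
print(mov_fits(p, 0, 3, 2))  # prints "False": loading early goes over the limit
print(mov_fits(p, 2, 3, 2))  # prints "True": loading next to the use fits
```

A real version would have to handle control flow and multi-register
values. The point is only that the check is cheap once the per-point
pressure is known.
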

>
>    Modify instruction scheduler to estimate clock cycles
>       Make shader-db handle this data
>
>    Add a pass to insert destination dependency hints into the FS, now that
>    we're loading constants into the same register using mov(1).
>
>    Emit 4x constants at once with the :VF type. (:V/:UV can't help us load
>    8x floats at once, unfortunately)
>
>    Probably attempt some other constant loading tricks. I found a shader
>    that loads 0.1, 0.2, ..., 0.8, 0.9. We could load 2.0-9.0 with two VF
>    loads, 0.1 with a mov(1) and then do a mul(8), instead of 9 mov(1).
>
>    Some opt_algebraic on MADs, now that their arguments can be immediates
>    in the IR.
>
>    Probably even some code to break MADs into MUL+ADD when many MADs perform
>    the same multiplication.