[Mesa-dev] [RFC 0/9] i965/fs: Combine constants and unconditionally emit MADs
Matt Turner
mattst88 at gmail.com
Fri Oct 31 18:27:44 PDT 2014
Three-source instructions on i965 have an annoying property that they
cannot use immediate operands. They do have the alluring property
that they perform multiple operations in basically the same number of
cycles as any other instruction. But when your arguments are immediates
we decided that a MOV+MAD is basically going to be the same as a MUL+ADD
(with immediates).
We didn't consider two things. First, Gen 7 hardware can co-issue some
instructions (ADD, MUL, MAD included) if they're not using immediates,
so MOV+MAD probably is better in practice.
Secondly, immediates are used multiple times more often than not. For
example in 2.0 * vec4 + 1.0, we don't actually need to load each constant
four times. 2 MOVs + 4 MADs would be better than 4 MULs and 4 ADDs,
especially when co-issuing is considered.
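To make the instruction counts above concrete, here is a small sketch that writes out both sequences for 2.0 * vec4 + 1.0. The register names (src, tmp, dst, k) and the MAD operand order are schematic assumptions for illustration, not actual i965 assembly.

```python
def emit_mul_add(components, scale, bias):
    """One MUL and one ADD per component, each carrying its own immediate."""
    insts = []
    for c in components:
        insts.append(f"mul tmp.{c}, src.{c}, {scale}F")
        insts.append(f"add dst.{c}, tmp.{c}, {bias}F")
    return insts


def emit_mov_mad(components, scale, bias):
    """Load each constant once with a MOV, then one MAD per component."""
    insts = [f"mov k.0, {scale}F", f"mov k.1, {bias}F"]
    for c in components:
        # Operand order is schematic; k.0 and k.1 are hypothetical registers.
        insts.append(f"mad dst.{c}, k.1, src.{c}, k.0")
    return insts


mul_add = emit_mul_add("xyzw", 2.0, 1.0)
mov_mad = emit_mov_mad("xyzw", 2.0, 1.0)
print(len(mul_add), "instructions with MUL+ADD")
print(len(mov_mad), "instructions with MOV+MAD")
```

The count alone leaves out co-issue, which only helps the second sequence since none of its MADs use immediates.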
This series adds some infrastructure to the control flow graph, including
code to create the dominance tree which I use to figure out where to place
MOV immediate instructions.
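As a rough illustration of how a dominance tree can answer the placement question, the sketch below computes immediate dominators with the simple iterative set-based method and then picks the nearest block that dominates every use of a constant. The CFG representation, the function names, and the "nearest common dominator" policy are all assumptions made for this example; the patches themselves may do this differently.

```python
def immediate_dominators(succs, entry):
    """Return {block: immediate dominator}. Assumes every block is reachable."""
    blocks = list(succs)
    preds = {b: [] for b in blocks}
    for b, targets in succs.items():
        for t in targets:
            preds[t].append(b)

    # dom[b] = set of blocks that dominate b, refined until a fixed point.
    dom = {b: set(blocks) for b in blocks}
    dom[entry] = {entry}
    changed = True
    while changed:
        changed = False
        for b in blocks:
            if b == entry:
                continue
            new = set(blocks)
            for p in preds[b]:
                new &= dom[p]
            new.add(b)
            if new != dom[b]:
                dom[b] = new
                changed = True

    idom = {entry: None}
    for b in blocks:
        if b != entry:
            # Strict dominators form a chain; the closest one has the most dominators.
            idom[b] = max(dom[b] - {b}, key=lambda d: len(dom[d]))
    return idom


def common_dominator(idom, use_blocks):
    """Nearest block in the dominance tree that dominates all of use_blocks."""
    def chain(b):
        out = []
        while b is not None:
            out.append(b)
            b = idom[b]
        return out

    common = set(chain(use_blocks[0]))
    for b in use_blocks[1:]:
        common &= set(chain(b))
    return next(b for b in chain(use_blocks[0]) if b in common)


# Hypothetical diamond-shaped CFG: B0 branches to B1 and B2, which join at B3.
cfg = {"B0": ["B1", "B2"], "B1": ["B3"], "B2": ["B3"], "B3": []}
idom = immediate_dominators(cfg, "B0")
print(common_dominator(idom, ["B1", "B2"]))
```

A constant used in both arms of the branch has to be loaded in B0, while one used only in B3 can be loaded in B3 itself.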
It then adds a pass that runs after optimizations to collect immediates
and selectively promote some to registers. The immediates are packed 8x
per register.
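A minimal sketch of the collect-and-pack idea, under stated assumptions: instructions are modelled as (opcode, operands) tuples, float operands stand for immediates, and the promotion rule (always promote MAD operands, otherwise promote anything used at least twice) is a made-up heuristic rather than the one in the series. Only the 8-per-register packing comes from the description above.

```python
from collections import Counter

SLOTS_PER_REG = 8  # eight 32-bit floats per register, as described above


def promote_immediates(insts, min_uses=2):
    """Map each promoted immediate to a (register index, slot) pair.

    The selection rule here is hypothetical; the real pass is only said
    to promote "some" immediates selectively.
    """
    uses = Counter()
    required = set()
    for opcode, operands in insts:
        for op in operands:
            if isinstance(op, float):
                uses[op] += 1
                if opcode == "mad":
                    # Three-source instructions cannot take immediates.
                    required.add(op)

    table = {}
    for value, count in uses.items():
        if value in required or count >= min_uses:
            slot = len(table)
            table[value] = (slot // SLOTS_PER_REG, slot % SLOTS_PER_REG)
    return table


insts = [
    ("mad", ["dst0", "src0", 2.0, 1.0]),
    ("mad", ["dst1", "src1", 2.0, 1.0]),
    ("add", ["dst2", "src2", 0.5]),  # single use, stays an immediate
]
print(promote_immediates(insts))
```

Once a ninth distinct constant is promoted, it spills into slot 0 of a second register.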
The last patch lets us emit MAD instructions unconditionally, safe in the
knowledge that the constant-combining pass will clean things up for us.
The series works and passes piglit. It also cuts more than 3% of
instructions in affected programs, including huge reductions in select
programs.
But there's some work to do before it'll be finished. Since review is so
hard to come by these days, I'm hoping people will have managed to take
a look by the time I've solved the remaining problems.
The remaining to-do items are:
 - Figure out if MAD instructions still co-issue if operands aren't
   aligned (e.g., mad dst.0, src0.0, src1.0, src2.3)
    - If they don't, figure out whether packing operands is beneficial
      at all.
 - Probably a bottom-up instruction scheduling pass to help sink MOV-imm
   (Currently losing a bunch of SIMD16 programs, I expect because of
   this)
 - Modify instruction scheduler to estimate clock cycles
    - Make shader-db handle this data
 - Add a pass to insert destination dependency hints into the FS, now that
   we're loading constants into the same register using mov(1).
 - Emit 4x constants at once with the :VF type. (:V/:UV can't help us load
   8x floats at once, unfortunately)
 - Probably attempt some other constant loading tricks. I found a shader
   that loads 0.1, 0.2, ..., 0.8, 0.9. We could load 2.0-9.0 with two VF
   loads, 0.1 with a mov(1) and then do a mul(8), instead of 9 mov(1).
 - Some opt_algebraic on MADs, now that their arguments can be immediates
   in the IR.
 - Probably even some code to break MADs into MUL+ADD when many MADs perform
   the same multiplication.
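On the :VF items in the list above, the sketch below checks which values fit in a packed vector-float element. It assumes the restricted 8-bit float layout of 1 sign bit, 3 exponent bits with a bias of 3, and 4 mantissa bits. That layout is my reading of the hardware docs, not something taken from these patches, and the behaviour at the smallest exponent is not modelled, so treat it as an illustration only.

```python
import math


def fits_vf(x):
    """True if x is exactly representable under the assumed VF layout."""
    if x == 0.0:
        return True
    mantissa, exponent = math.frexp(abs(x))  # abs(x) == mantissa * 2**exponent
    unbiased = exponent - 1                  # rewrite as 1.f * 2**unbiased
    fraction = mantissa * 2.0 - 1.0          # f, in [0, 1)
    scaled = fraction * 16.0                 # only 4 mantissa bits available
    # Assumed exponent range: 3 bits with bias 3, so -3..4.
    return scaled == int(scaled) and -3 <= unbiased <= 4


vf_ok = [float(k) for k in range(2, 10) if fits_vf(float(k))]
print("fit in VF:", vf_ok)
print("0.1 fits in VF:", fits_vf(0.1))

# Instruction counts for the 0.1 ... 0.9 shader mentioned above.
naive = 9        # nine mov(1)
trick = 2 + 1 + 1  # two VF loads (4 values each), one mov(1), one mul(8)
print(naive, "vs", trick)
```

Under that assumption 2.0 through 9.0 all fit, which is what makes the two-VF-loads-plus-mul(8) trick possible, while 0.1 still needs its own mov(1).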