[Mesa-dev] [PATCH 05/16] nir: combine fmul and fadd across ffma operations

Wed Jan 2 22:18:58 UTC 2019

On 12/19/18 8:39 AM, Jonathan Marek wrote:
> This works by moving the fadd up across the ffma operations, so that it
> can eventually be combined with a fmul. I'm not sure it works in all
> cases, but it works in all the common cases.
> 
> This will only affect freedreno since it is the only driver using the
> fuse_ffma option.

tl;dr: Optimal generation of FFMAs is much more difficult than you would
think it should be.  You should collect some actual data before landing
this.

Any change to ffma generation is likely to have massive, unforeseen
changes to lots of shaders.  Seemingly simple, obvious changes result in
changes to live ranges, register pressure, scheduling, constant folding,
and on, and on.

I took this patch, substituted !options->lower_ffma for
options->fuse_ffma in the pattern you added, and ran it through
shader-db for Skylake and Haswell.  As I expected, the results were just
all over the place (see below).  Notice that register spills are helped
on one platform but hurt on the other.

There are some simple rules in nir_opt_algebraic for generating and
reassociating ffmas.  Given the complex interactions with live ranges,
register pressure, and scheduling, I feel like ffma generation should
happen much, much later in the process... it should almost certainly be
deep in the backend where register pressure and scheduling information
are available.

The Intel compiler has its own pass for ffma generation, and I've found
that it makes really, really bad choices due to lack of this
information.  For example, consider a sequence like

	(shaderInputA * uniformB) + (texture(...) * shaderInputC)

There are two ways to generate an ffma from that.  One will schedule
well, and the other will be horrible.  You /probably/ want

	ffma(texture(...), shaderInputC, (shaderInputA * uniformB))

so that the first multiply can happen during the latency of the texture
lookup.  But maybe not.  Maybe shaderInputA and uniformB are still live
after the multiply and storing the result of the multiply pushes
register pressure too high.
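
To make the scheduling argument concrete, here is a toy latency model in
Python.  It is not taken from the Intel backend: the 100-cycle texture
latency and 1-cycle ALU latency are made-up numbers, and ready_time,
good, and bad are names invented for this sketch.  It only tracks when
each value becomes available and ignores issue width and register
pressure.

```python
# Toy dependency-latency model; all latencies are illustrative assumptions.
LATENCY = {'texture': 100, 'fmul': 1, 'ffma': 1}

def ready_time(node, ops):
    """Earliest cycle at which `node` is available, ignoring issue limits."""
    if node not in ops:
        # Shader inputs and uniforms are treated as ready at cycle 0.
        return 0
    opcode, srcs = ops[node]
    return max((ready_time(s, ops) for s in srcs), default=0) + LATENCY[opcode]

# ffma(texture(...), C, A * B): the fmul can run during the texture latency.
good = {
    'tex': ('texture', []),
    'ab':  ('fmul', ['A', 'B']),
    'r':   ('ffma', ['tex', 'C', 'ab']),
}

# ffma(A, B, texture(...) * C): both ALU ops must wait for the texture.
bad = {
    'tex': ('texture', []),
    'tc':  ('fmul', ['tex', 'C']),
    'r':   ('ffma', ['A', 'B', 'tc']),
}

good_cycles = ready_time('r', good)
bad_cycles = ready_time('r', bad)
print(good_cycles, bad_cycles)
```

In this model the preferred form finishes one cycle after the texture
result arrives, while the other form leaves two dependent ALU operations
after it.  The cost is that the product ab has to stay live for the
whole texture latency, which is the register-pressure trade-off
described above.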

Right now our ffma pass is greedy.  If it sees a*b+c, it will always
generate ffma(a, b, c), regardless of whether or not c is also a
multiply.  In one of my experiments, I flipped the logic so a*b+c*d
would always generate ffma(c, d, a*b).  The number of helped and hurt
shaders was very close to even.  Some shaders were helped by a huge
amount, and other shaders were hurt by an equally huge amount.  I also
tried not generating an ffma at all for the a*b+c*d case.  My
recollection is that a few shaders were helped by a large amount, and
many thousands of shaders were hurt by small amounts.
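
The three policies from those experiments can be written down as a small
Python sketch.  This is only an illustration over nested tuples, not the
actual Intel pass; fuse, is_mul, and the policy names 'greedy',
'flipped', and 'skip_mul_mul' are invented here.

```python
def is_mul(expr):
    """True if `expr` is an ('fmul', x, y) node."""
    return isinstance(expr, tuple) and expr[0] == 'fmul'

def fuse(expr, policy):
    """Fuse a single fadd node into an ffma according to `policy`."""
    if not (isinstance(expr, tuple) and expr[0] == 'fadd' and is_mul(expr[1])):
        return expr
    _, lhs, rhs = expr
    if is_mul(rhs):
        if policy == 'flipped':
            # a*b + c*d -> ffma(c, d, a*b)
            return ('ffma', rhs[1], rhs[2], lhs)
        if policy == 'skip_mul_mul':
            # Leave a*b + c*d alone entirely.
            return expr
    # Greedy default: a*b + c -> ffma(a, b, c), whatever c is.
    return ('ffma', lhs[1], lhs[2], rhs)

mul_mul = ('fadd', ('fmul', 'a', 'b'), ('fmul', 'c', 'd'))
for policy in ('greedy', 'flipped', 'skip_mul_mul'):
    print(policy, fuse(mul_mul, policy))
```

Nothing in the expression itself says which of the three outputs is
best; that depends on what surrounds it, which is why each policy helped
some shaders and hurt others.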

If I add it all up, I probably spent several weeks last year poking at
changes like this in our ffma pass.  It began to feel like the old woman
who swallowed a fly.  Every change helped some things, but it made other
things fall off a cliff.  The next fix helped a few of the things
damaged by the previous change, but it made other things fall off a
different cliff.  I eventually abandoned the project.  If I ever pick it
back up, it will be as a pass that occurs closer to scheduling and
register allocation.

Skylake
total instructions in shared programs: 15031138 -> 15035206 (0.03%)
instructions in affected programs: 1230624 -> 1234692 (0.33%)
helped: 1428
HURT: 1067
helped stats (abs) min: 1 max: 671 x̄: 7.08 x̃: 3
helped stats (rel) min: 0.04% max: 24.72% x̄: 2.30% x̃: 1.78%
HURT stats (abs)   min: 1 max: 1601 x̄: 13.29 x̃: 4
HURT stats (rel)   min: 0.05% max: 352.64% x̄: 4.42% x̃: 2.35%
95% mean confidence interval for instructions value: 0.03 3.23
95% mean confidence interval for instructions %-change: 0.24% 0.91%
Instructions are HURT.

total cycles in shared programs: 369712682 -> 370166527 (0.12%)
cycles in affected programs: 128542483 -> 128996328 (0.35%)
helped: 1679
HURT: 2639
helped stats (abs) min: 1 max: 27317 x̄: 162.81 x̃: 18
helped stats (rel) min: <.01% max: 60.25% x̄: 2.34% x̃: 1.38%
HURT stats (abs)   min: 1 max: 57100 x̄: 275.56 x̃: 58
HURT stats (rel)   min: <.01% max: 147.37% x̄: 8.62% x̃: 5.01%
95% mean confidence interval for cycles value: 61.86 148.35
95% mean confidence interval for cycles %-change: 4.06% 4.66%
Cycles are HURT.

total spills in shared programs: 10158 -> 9688 (-4.63%)
spills in affected programs: 1829 -> 1359 (-25.70%)
helped: 140
HURT: 3

total fills in shared programs: 22117 -> 21371 (-3.37%)
fills in affected programs: 2575 -> 1829 (-28.97%)
helped: 140
HURT: 3

LOST:   7
GAINED: 0

Haswell
total instructions in shared programs: 13625863 -> 13635875 (0.07%)
instructions in affected programs: 1554579 -> 1564591 (0.64%)
helped: 844
HURT: 1651
helped stats (abs) min: 1 max: 96 x̄: 4.16 x̃: 3
helped stats (rel) min: 0.04% max: 10.26% x̄: 1.91% x̃: 1.90%
HURT stats (abs)   min: 1 max: 1602 x̄: 8.19 x̃: 5
HURT stats (rel)   min: 0.10% max: 346.00% x̄: 2.97% x̃: 1.45%
95% mean confidence interval for instructions value: 2.70 5.33
95% mean confidence interval for instructions %-change: 1.02% 1.63%
Instructions are HURT.

total cycles in shared programs: 372618507 -> 372381101 (-0.06%)
cycles in affected programs: 147167634 -> 146930228 (-0.16%)
helped: 1895
HURT: 2001
helped stats (abs) min: 1 max: 8639 x̄: 341.01 x̃: 26
helped stats (rel) min: <.01% max: 29.69% x̄: 2.78% x̃: 1.66%
HURT stats (abs)   min: 1 max: 40206 x̄: 204.30 x̃: 53
HURT stats (rel)   min: 0.01% max: 71.38% x̄: 6.06% x̃: 3.11%
95% mean confidence interval for cycles value: -92.06 -29.82
95% mean confidence interval for cycles %-change: 1.52% 2.00%
Inconclusive result (value mean confidence interval and %-change mean
confidence interval disagree).

total spills in shared programs: 82582 -> 83316 (0.89%)
spills in affected programs: 71432 -> 72166 (1.03%)
helped: 16
HURT: 348

total fills in shared programs: 93463 -> 94192 (0.78%)
fills in affected programs: 72319 -> 73048 (1.01%)
helped: 16
HURT: 379

LOST:   6
GAINED: 0
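
A note on the "Inconclusive result" line in the Haswell cycle numbers:
the mean absolute change and the mean relative change can have opposite
signs when a few large shaders improve while many small shaders regress.
The Python sketch below uses invented numbers (not shader-db data) to
show one way that can happen.

```python
from statistics import mean

# (cycles_before, cycles_after) for hypothetical shaders; NOT real data.
# One large shader is helped, ten small shaders are hurt.
shaders = [(100000, 95000)] + [(1000, 1050)] * 10

abs_change = [after - before for before, after in shaders]
rel_change = [100.0 * (after - before) / before for before, after in shaders]

mean_abs = mean(abs_change)   # dominated by the one large shader
mean_rel = mean(rel_change)   # dominated by the many small shaders

print(f"mean absolute change: {mean_abs:.2f} cycles")
print(f"mean relative change: {mean_rel:.2f}%")
```

Here the average shader loses cycles in absolute terms but gains them in
percentage terms, so the two confidence intervals point in different
directions.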

> Example:
>     matrix * vec4(coord, 1.0)
> is compiled as:
>     fmul, ffma, ffma, fadd
> and with this patch:
>     ffma, ffma, ffma
> 
> Signed-off-by: Jonathan Marek <jonathan at marek.ca>
> ---
>  src/compiler/nir/nir_opt_algebraic.py | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/src/compiler/nir/nir_opt_algebraic.py b/src/compiler/nir/nir_opt_algebraic.py
> index 506d45e55b..97a6c0d8dc 100644
> --- a/src/compiler/nir/nir_opt_algebraic.py
> +++ b/src/compiler/nir/nir_opt_algebraic.py
> @@ -137,6 +137,7 @@ optimizations = [
>     (('~fadd@64', a, ('fmul',         c , ('fadd', b, ('fneg', a)))), ('flrp', a, b, c), '!options->lower_flrp64'),
>     (('ffma', a, b, c), ('fadd', ('fmul', a, b), c), 'options->lower_ffma'),
>     (('~fadd', ('fmul', a, b), c), ('ffma', a, b, c), 'options->fuse_ffma'),
> +   (('~fadd', ('ffma', a, b, c), d), ('ffma', a, b, ('fadd', c, d)), 'options->fuse_ffma'),
>  
>     (('fdot4', ('vec4', a, b,   c,   1.0), d), ('fdph',  ('vec3', a, b, c), d)),
>     (('fdot4', ('vec4', a, 0.0, 0.0, 0.0), b), ('fmul', a, b)),
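
For reference, the effect of the new rule on the matrix * vec4 example
from the commit message can be traced with a toy rewriter in Python.
This is a sketch over nested tuples, not Mesa's nir_algebraic machinery.
The operand names (x, y, z, m0..m3) and the operand order of the
starting expression are assumptions, and unlike the real matcher it only
looks at the first fadd operand.

```python
def is_op(expr, name):
    return isinstance(expr, tuple) and expr[0] == name

def rewrite(expr):
    """Apply the two fusion rules bottom-up until nothing changes."""
    if not isinstance(expr, tuple):
        return expr
    op, *args = expr
    args = [rewrite(a) for a in args]
    if op == 'fadd' and is_op(args[0], 'fmul'):
        # Existing rule: fadd(fmul(a, b), c) -> ffma(a, b, c)
        _, a, b = args[0]
        return rewrite(('ffma', a, b, args[1]))
    if op == 'fadd' and is_op(args[0], 'ffma'):
        # Rule from this patch: fadd(ffma(a, b, c), d) -> ffma(a, b, fadd(c, d))
        _, a, b, c = args[0]
        return rewrite(('ffma', a, b, ('fadd', c, args[1])))
    return (op, *args)

def ops(expr):
    """List opcodes in evaluation (post-order) order."""
    if not isinstance(expr, tuple):
        return []
    out = []
    for arg in expr[1:]:
        out += ops(arg)
    return out + [expr[0]]

# One component of matrix * vec4(coord, 1.0): x*m0 + y*m1 + z*m2 + m3
before = ('fadd',
          ('ffma', 'z', 'm2', ('ffma', 'y', 'm1', ('fmul', 'x', 'm0'))),
          'm3')
after = rewrite(before)

print(ops(before))
print(ops(after))
```

The fadd is pushed down one ffma at a time until it meets the fmul, at
which point the existing fadd(fmul) rule turns it into the innermost
ffma.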