[Mesa-dev] [PATCH 05/16] nir: combine fmul and fadd across ffma operations

Wed Jan 2 23:21:13 UTC 2019

On Wed, Jan 2, 2019 at 5:19 PM Ian Romanick <idr at freedesktop.org> wrote:
>
> On 12/19/18 8:39 AM, Jonathan Marek wrote:
> > This works by moving the fadd up across the ffma operations, so that it
> > can eventually can be combined with a fmul. I'm not sure it works in all
> > cases, but it works in all the common cases.
> >
> > This will only affect freedreno since it is the only driver using the
> > fuse_ffma option.
>
> tl;dr: Optimal generation of FFMAs is much more difficult than you would
> think it should be.  You should collect some actual data before landing
> this.
>
> Any change to ffma generation is likely to have massive, unforeseen
> changes to lots of shaders.  Seemingly simple, obvious changes result in
> changes to live ranges, register pressure, scheduling, constant folding,
> and on, and on.
>
> I took this patch, substituted !options->lower_ffma for
> options->fuse_ffma in the pattern you added, and ran it through
> shader-db for Skylake and Haswell.  As I expected, the results were just
> all over the place (see below).  Notice that register spills are helped
> on one platform but hurt on the other.
>
> There are some simple rules in nir_opt_algebraic for generating and
> reassociating ffmas.  Given the complex interactions with live ranges,
> register pressure, and scheduling, I feel like ffma generation should
> happen much, much later in the process... it should almost certainly be
> deep in the backend where register pressure and scheduling information
> are available.

To be fair, I've tried a few different approaches, but I've yet to
come up with a good way to balance register pressure and instruction
scheduling in ir3 backend[1].. and for a2xx, given that it is limited
to more or less gles2 level shader features (which narrows the pool of
possible shaders to care about), a simple mostly-good-enough option to
improve things in nir, as long as it doesn't hurt other drivers, has
some appeal.

I guess before/after shader-db runs are in order.  But I guess for
a2xx, esp on a20x hw which seem to have pretty weak shader core this
could still be useful.

BR,
-R

[1] tbh, ir3 instruction scheduler needs to be semi-aggressive to fill
delay slots..  but I think it doesn't always make good decisions
without having some clue about register pressure

> The Intel compiler has its own pass for ffma generation, and I've found
> that makes really, really bad choices due to lack of this information.
> For example, consider a sequence like
>
>         (shaderInputA * uniformB) + (texture(...) * shaderInputC)
>
> There are two ways to generate an ffma from that.  One will schedule
> well, and the other will be horrible.  You /probably/ want
>
>         ffma(texture(...), shaderInputC, (shaderInputA * uniformB))
>
> so that the first multiply can happen during the latency of the texture
> lookup.  But maybe not.  Maybe shaderInputA and uniformB are still live
> after the multiply and storing the result of the multiply pushes
> register pressure too high.
>
> Right now our ffma pass is greedy.  If it sees a*b+c, it will always
> generate ffma(a, b, c), regardless of whether or not c is also a
> multiply.  In one of my experiments, I flipped the logic so a*b+c*d
> would always generate ffma(c, d, a*b).  The number of helped and hurt
> shaders was very close to even.  Some shaders were helped by a huge
> amount, and other shaders were hurt by an equally huge amount.  I also
> tried not generating an ffma at all for the a*b+c*d case.  My
> recollection is that a few shaders were helped by a large amount, and
> many thousands of shaders were hurt by small amounts.
>
> If I add it all up, I probably spent several weeks last year poking at
> changes like this in our ffma pass.  It began to feel like the old woman
> who swallowed a fly.  Every change helped some things, but it made other
> things fall off a cliff.  The next fix helped a few of the things
> damaged by the previous change, but it made other things fall of a
> different cliff.  I eventually abandoned the project.  If I ever pick it
> back up, it will be as a pass that occurs closer to scheduling and
> register allocation.
>
> Skylake
> total instructions in shared programs: 15031138 -> 15035206 (0.03%)
> instructions in affected programs: 1230624 -> 1234692 (0.33%)
> helped: 1428
> HURT: 1067
> helped stats (abs) min: 1 max: 671 x̄: 7.08 x̃: 3
> helped stats (rel) min: 0.04% max: 24.72% x̄: 2.30% x̃: 1.78%
> HURT stats (abs)   min: 1 max: 1601 x̄: 13.29 x̃: 4
> HURT stats (rel)   min: 0.05% max: 352.64% x̄: 4.42% x̃: 2.35%
> 95% mean confidence interval for instructions value: 0.03 3.23
> 95% mean confidence interval for instructions %-change: 0.24% 0.91%
> Instructions are HURT.
>
> total cycles in shared programs: 369712682 -> 370166527 (0.12%)
> cycles in affected programs: 128542483 -> 128996328 (0.35%)
> helped: 1679
> HURT: 2639
> helped stats (abs) min: 1 max: 27317 x̄: 162.81 x̃: 18
> helped stats (rel) min: <.01% max: 60.25% x̄: 2.34% x̃: 1.38%
> HURT stats (abs)   min: 1 max: 57100 x̄: 275.56 x̃: 58
> HURT stats (rel)   min: <.01% max: 147.37% x̄: 8.62% x̃: 5.01%
> 95% mean confidence interval for cycles value: 61.86 148.35
> 95% mean confidence interval for cycles %-change: 4.06% 4.66%
> Cycles are HURT.
>
> total spills in shared programs: 10158 -> 9688 (-4.63%)
> spills in affected programs: 1829 -> 1359 (-25.70%)
> helped: 140
> HURT: 3
>
> total fills in shared programs: 22117 -> 21371 (-3.37%)
> fills in affected programs: 2575 -> 1829 (-28.97%)
> helped: 140
> HURT: 3
>
> LOST:   7
> GAINED: 0
>
>
> Haswell
> total instructions in shared programs: 13625863 -> 13635875 (0.07%)
> instructions in affected programs: 1554579 -> 1564591 (0.64%)
> helped: 844
> HURT: 1651
> helped stats (abs) min: 1 max: 96 x̄: 4.16 x̃: 3
> helped stats (rel) min: 0.04% max: 10.26% x̄: 1.91% x̃: 1.90%
> HURT stats (abs)   min: 1 max: 1602 x̄: 8.19 x̃: 5
> HURT stats (rel)   min: 0.10% max: 346.00% x̄: 2.97% x̃: 1.45%
> 95% mean confidence interval for instructions value: 2.70 5.33
> 95% mean confidence interval for instructions %-change: 1.02% 1.63%
> Instructions are HURT.
>
> total cycles in shared programs: 372618507 -> 372381101 (-0.06%)
> cycles in affected programs: 147167634 -> 146930228 (-0.16%)
> helped: 1895
> HURT: 2001
> helped stats (abs) min: 1 max: 8639 x̄: 341.01 x̃: 26
> helped stats (rel) min: <.01% max: 29.69% x̄: 2.78% x̃: 1.66%
> HURT stats (abs)   min: 1 max: 40206 x̄: 204.30 x̃: 53
> HURT stats (rel)   min: 0.01% max: 71.38% x̄: 6.06% x̃: 3.11%
> 95% mean confidence interval for cycles value: -92.06 -29.82
> 95% mean confidence interval for cycles %-change: 1.52% 2.00%
> Inconclusive result (value mean confidence interval and %-change mean
> confidence interval disagree).
>
> total spills in shared programs: 82582 -> 83316 (0.89%)
> spills in affected programs: 71432 -> 72166 (1.03%)
> helped: 16
> HURT: 348
>
> total fills in shared programs: 93463 -> 94192 (0.78%)
> fills in affected programs: 72319 -> 73048 (1.01%)
> helped: 16
> HURT: 379
>
> LOST:   6
> GAINED: 0
>
>
> > Example:
> >     matrix * vec4(coord, 1.0)
> > is compiled as:
> >     fmul, ffma, ffma, fadd
> > and with this patch:
> >     ffma, ffma, ffma
> >
> > Signed-off-by: Jonathan Marek <jonathan at marek.ca>
> > ---
> >  src/compiler/nir/nir_opt_algebraic.py | 1 +
> >  1 file changed, 1 insertion(+)
> >
> > diff --git a/src/compiler/nir/nir_opt_algebraic.py b/src/compiler/nir/nir_opt_algebraic.py
> > index 506d45e55b..97a6c0d8dc 100644
> > --- a/src/compiler/nir/nir_opt_algebraic.py
> > +++ b/src/compiler/nir/nir_opt_algebraic.py
> > @@ -137,6 +137,7 @@ optimizations = [
> >     (('~fadd at 64', a, ('fmul',         c , ('fadd', b, ('fneg', a)))), ('flrp', a, b, c), '!options->lower_flrp64'),
> >     (('ffma', a, b, c), ('fadd', ('fmul', a, b), c), 'options->lower_ffma'),
> >     (('~fadd', ('fmul', a, b), c), ('ffma', a, b, c), 'options->fuse_ffma'),
> > +   (('~fadd', ('ffma', a, b, c), d), ('ffma', a, b, ('fadd', c, d)), 'options->fuse_ffma'),
> >
> >     (('fdot4', ('vec4', a, b,   c,   1.0), d), ('fdph',  ('vec3', a, b, c), d)),
> >     (('fdot4', ('vec4', a, 0.0, 0.0, 0.0), b), ('fmul', a, b)),
> _______________________________________________
> mesa-dev mailing list
> mesa-dev at lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/mesa-dev