[Mesa-dev] [PATCH 4/8] intel/compiler: More peephole select

Thu Jun 7 05:16:55 UTC 2018

On 06/06/2018 05:39 PM, Ian Romanick wrote:
> On 06/06/2018 04:26 PM, Matt Turner wrote:
>> On Wed, Jun 6, 2018 at 2:33 PM, Ian Romanick <idr at freedesktop.org> wrote:
>>> From: Ian Romanick <ian.d.romanick at intel.com>
>>>
>>> Shader-db results:
>>>
>>> Skylake and Broadwell had similar results. (Skylake shown)
>>> total instructions in shared programs: 14371513 -> 14346174 (-0.18%)
>>> instructions in affected programs: 890389 -> 865050 (-2.85%)
>>> helped: 3601
>>> HURT: 1
>>> helped stats (abs) min: 1 max: 92 x̄: 7.05 x̃: 4
>>> helped stats (rel) min: 0.10% max: 25.00% x̄: 3.95% x̃: 3.23%
>>> HURT stats (abs)   min: 43 max: 43 x̄: 43.00 x̃: 43
>>> HURT stats (rel)   min: 0.90% max: 0.90% x̄: 0.90% x̃: 0.90%
>>> 95% mean confidence interval for instructions value: -7.27 -6.80
>>> 95% mean confidence interval for instructions %-change: -4.05% -3.84%
>>> Instructions are helped.
>>>
>>> total cycles in shared programs: 532435951 -> 532154282 (-0.05%)
>>> cycles in affected programs: 69203137 -> 68921468 (-0.41%)
>>> helped: 2654
>>> HURT: 981
>>> helped stats (abs) min: 1 max: 4496 x̄: 177.17 x̃: 76
>>> helped stats (rel) min: <.01% max: 71.34% x̄: 9.16% x̃: 5.42%
>>> HURT stats (abs)   min: 1 max: 33338 x̄: 192.20 x̃: 19
>>> HURT stats (rel)   min: <.01% max: 36.36% x̄: 2.95% x̃: 1.46%
>>> 95% mean confidence interval for cycles value: -113.38 -41.60
>>> 95% mean confidence interval for cycles %-change: -6.24% -5.53%
>>> Cycles are helped.
>>>
>>> total spills in shared programs: 8114 -> 8122 (0.10%)
>>> spills in affected programs: 152 -> 160 (5.26%)
>>> helped: 0
>>> HURT: 2
>>>
>>> total fills in shared programs: 11082 -> 11100 (0.16%)
>>> fills in affected programs: 375 -> 393 (4.80%)
>>> helped: 1
>>> HURT: 1
>>>
>>> Haswell, Ivy Bridge, and Sandy Bridge had similar results. (Ivy Bridge shown)
>>> total instructions in shared programs: 9897654 -> 9890341 (-0.07%)
>>> instructions in affected programs: 213092 -> 205779 (-3.43%)
>>> helped: 775
>>> HURT: 18
>>> helped stats (abs) min: 1 max: 65 x̄: 9.62 x̃: 6
>>> helped stats (rel) min: 0.11% max: 25.00% x̄: 4.85% x̃: 3.70%
>>> HURT stats (abs)   min: 2 max: 20 x̄: 7.89 x̃: 6
>>> HURT stats (rel)   min: 0.70% max: 2.59% x̄: 1.63% x̃: 1.70%
>>> 95% mean confidence interval for instructions value: -9.93 -8.51
>>> 95% mean confidence interval for instructions %-change: -5.01% -4.40%
>>> Instructions are helped.
>>>
>>> total cycles in shared programs: 87653348 -> 87562421 (-0.10%)
>>> cycles in affected programs: 2411339 -> 2320412 (-3.77%)
>>> helped: 612
>>> HURT: 227
>>> helped stats (abs) min: 1 max: 2103 x̄: 162.83 x̃: 53
>>> helped stats (rel) min: 0.05% max: 58.41% x̄: 6.50% x̃: 2.65%
>>> HURT stats (abs)   min: 1 max: 772 x̄: 38.43 x̃: 10
>>> HURT stats (rel)   min: 0.04% max: 36.36% x̄: 3.60% x̃: 0.92%
>>> 95% mean confidence interval for cycles value: -128.53 -88.22
>>> 95% mean confidence interval for cycles %-change: -4.39% -3.14%
>>> Cycles are helped.
>>>
>>> No change on Iron Lake or GM45.
>>>
>>> Signed-off-by: Ian Romanick <ian.d.romanick at intel.com>
>>> ---
>>>  src/intel/compiler/brw_nir.c | 14 ++++++++++++++
>>>  1 file changed, 14 insertions(+)
>>>
>>> diff --git a/src/intel/compiler/brw_nir.c b/src/intel/compiler/brw_nir.c
>>> index 67c062d91f5..ca9b021767f 100644
>>> --- a/src/intel/compiler/brw_nir.c
>>> +++ b/src/intel/compiler/brw_nir.c
>>> @@ -557,7 +557,21 @@ brw_nir_optimize(nir_shader *nir, const struct brw_compiler *compiler,
>>>        OPT(nir_copy_prop);
>>>        OPT(nir_opt_dce);
>>>        OPT(nir_opt_cse);
>>> +
>>> +      /* Passing 0 to the peephole select pass causes it to convert
>>> +       * if-statements that contain only move instructions in the branches
>>> +       * regardless of the count.
>>> +       *
>>> +       * Passing 0 to the peephole select pass causes it to convert
> 
>              Passing 1
> 
> I thought I already fixed that. :(
> 
>>> +       * if-statements that contain at most a single ALU instruction (total)
>>> +       * in both branches.  The select instruction works somewhat differently
>>> +       * on Gen5 and earlier, and adding this pass on those platforms was
>>
>> It does? Something about min/max requiring the CMP?
> 
> I remember thinking the problem was obvious when I looked at the first
> shader that was hurt, but I don't recall exactly what it was now.  I'll
> try it again.

Here are the ILK results:

total instructions in shared programs: 7774514 -> 7773708 (-0.01%)
instructions in affected programs: 89355 -> 88549 (-0.90%)
helped: 162
HURT: 26
helped stats (abs) min: 2 max: 18 x̄: 6.46 x̃: 6
helped stats (rel) min: 0.17% max: 13.04% x̄: 2.29% x̃: 1.09%
HURT stats (abs)   min: 2 max: 20 x̄: 9.23 x̃: 8
HURT stats (rel)   min: 0.70% max: 2.48% x̄: 1.66% x̃: 1.61%
95% mean confidence interval for instructions value: -5.25 -3.32
95% mean confidence interval for instructions %-change: -2.14% -1.35%
Instructions are helped.

total cycles in shared programs: 177899700 -> 177958996 (0.03%)
cycles in affected programs: 753424 -> 812720 (7.87%)
helped: 88
HURT: 100
helped stats (abs) min: 2 max: 76 x̄: 22.84 x̃: 16
helped stats (rel) min: 0.05% max: 6.16% x̄: 0.91% x̃: 0.63%
HURT stats (abs)   min: 4 max: 2946 x̄: 613.06 x̃: 512
HURT stats (rel)   min: 0.33% max: 48.26% x̄: 12.16% x̃: 8.37%
95% mean confidence interval for cycles value: 236.67 394.14
95% mean confidence interval for cycles %-change: 4.41% 7.68%
Cycles are HURT.

Looking at the hurt shaders (both instructions and cycles), I'm now not
sure why I wrote this comment. :(  There appear to be two separate,
unrelated problems that cause shaders to be hurt:

 - Instructions are generally hurt when more Boolean resolves have to be
inserted.  It's actually a little hard to tell for sure because the
smallest hurt shader is 277 instructions.

 - Cycles are hurt when math box instructions are moved out of
if-statements.  shaders/unity/24-Tree.shader_test VS (cycles hurt by
44%) is an example of this.  A pow that was conditional became
unconditional.
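
To make the second problem concrete, here is a rough sketch of the kind of gate that would keep a conditional pow from being flattened. It is a toy model, not the NIR peephole select pass: sel_kind, sel_should_flatten, alu_limit and allow_math are all names made up for illustration. The only part taken from the patch is the meaning of the limit (0 = moves only, 1 = at most one ALU instruction total across both branches).

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical classification of the instructions inside a branch. */
enum sel_kind {
   SEL_MOVE,       /* plain move, always fine to speculate */
   SEL_CHEAP_ALU,  /* ordinary ALU instruction (add, mul, ...) */
   SEL_MATH        /* math box instruction (pow, sqrt, ...) */
};

/* Count the non-move instructions in one branch and note whether any of
 * them is a math box instruction.
 */
static unsigned
sel_count_alu(const enum sel_kind *insts, size_t count, bool *has_math)
{
   unsigned alu = 0;

   for (size_t i = 0; i < count; i++) {
      if (insts[i] == SEL_MOVE)
         continue;

      alu++;
      if (insts[i] == SEL_MATH)
         *has_math = true;
   }

   return alu;
}

/* Decide whether an if-statement may be turned into selects.
 *
 * alu_limit is the total number of non-move instructions allowed across
 * both branches.  allow_math is an assumed extra knob: when false, a
 * branch containing a math box instruction is never flattened, so the
 * instruction stays conditional.
 */
bool
sel_should_flatten(const enum sel_kind *then_insts, size_t then_count,
                   const enum sel_kind *else_insts, size_t else_count,
                   unsigned alu_limit, bool allow_math)
{
   bool has_math = false;
   const unsigned alu =
      sel_count_alu(then_insts, then_count, &has_math) +
      sel_count_alu(else_insts, else_count, &has_math);

   if (has_math && !allow_math)
      return false;

   return alu <= alu_limit;
}
```

With a limit of 1 and allow_math false, a branch holding a single add is still flattened, but a branch holding a single pow is left alone.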

I think both are solvable depending on how much work I feel like doing.
On <= Gen5 we emit a lot of stuff like

cmp.ge(f0)	g32, g3, g8
and		g32, g32, 1D
...
cmp.nz(f0)	null, -g32, 0D

Here the second compare is the only use of the masked g32.  We should just emit

cmp.ge(f0)	g32, g3, g8
...
and.nz(f0)	null, g32, 1D

Looking at the existing code, I don't think this will be too hard.  I am
a little confused that we emit the resolves at code generation instead
of handling them in a NIR lowering pass.
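
The rewrite described above can be sketched on a toy instruction array. This is not the i965 backend IR; toy_inst, toy_fuse_bool_resolves and the other names are hypothetical. The sketch also assumes a single straight-line block in which the register is dead at the end, and it gives up on any other read or write of the register, which is more conservative than a real pass would need to be.

```c
#include <stdbool.h>
#include <stddef.h>

#define TOY_NULL_REG (-1)

enum toy_op { TOY_NOP, TOY_CMP, TOY_AND, TOY_OTHER };
enum toy_cmod { TOY_NONE, TOY_GE, TOY_NZ };

/* Hypothetical, heavily simplified instruction. */
struct toy_inst {
   enum toy_op op;
   enum toy_cmod cmod;  /* conditional modifier; TOY_NONE leaves f0 alone */
   int dst;             /* destination register, or TOY_NULL_REG */
   int src0;            /* first source register, or TOY_NULL_REG */
   bool negate_src0;    /* negate source modifier on src0 */
   int src1;            /* second source register, or TOY_NULL_REG */
   int imm;             /* immediate, used when src1 is TOY_NULL_REG */
};

/* Matches "and gN, gN, 1D" with no conditional modifier. */
static bool
toy_is_resolve(const struct toy_inst *in)
{
   return in->op == TOY_AND && in->cmod == TOY_NONE &&
          in->dst != TOY_NULL_REG && in->dst == in->src0 &&
          !in->negate_src0 && in->src1 == TOY_NULL_REG && in->imm == 1;
}

/* Matches "cmp.nz(f0) null, -gN, 0D" for the given register. */
static bool
toy_is_flag_test(const struct toy_inst *in, int reg)
{
   return in->op == TOY_CMP && in->cmod == TOY_NZ &&
          in->dst == TOY_NULL_REG && in->src0 == reg && in->negate_src0 &&
          in->src1 == TOY_NULL_REG && in->imm == 0;
}

/* True if the instruction reads or writes the register. */
static bool
toy_touches(const struct toy_inst *in, int reg)
{
   if (in->op == TOY_NOP)
      return false;

   return in->dst == reg || in->src0 == reg || in->src1 == reg;
}

/* Fuse each resolve whose only later use is a flag test into a single
 * "and.nz(f0) null, gN, 1D" at the position of the test.  Returns the
 * number of pairs fused.
 */
int
toy_fuse_bool_resolves(struct toy_inst *insts, size_t count)
{
   int fused = 0;

   for (size_t i = 0; i < count; i++) {
      if (!toy_is_resolve(&insts[i]))
         continue;

      const int reg = insts[i].dst;
      size_t test = count;  /* index of the flag test, count if none yet */
      bool ok = true;

      for (size_t j = i + 1; j < count && ok; j++) {
         if (test == count && toy_is_flag_test(&insts[j], reg))
            test = j;
         else if (toy_touches(&insts[j], reg))
            ok = false;  /* some other use or write: leave it alone */
      }

      if (!ok || test == count)
         continue;

      /* Turn the compare into the and.nz and drop the original and. */
      insts[test].op = TOY_AND;
      insts[test].negate_src0 = false;
      insts[test].imm = 1;
      insts[i].op = TOY_NOP;
      fused++;
   }

   return fused;
}
```

Both forms test only bit 0 of the register, so the flag result is the same: the original pair masks to 0 or 1 and then checks that the negated value is nonzero, while the fused instruction checks that the masked value is nonzero directly.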
