[Mesa-dev] [PATCH 3/4] i965/fs: Optimize (gl_FrontFacing ? x : y) where x and y are ±1.0.

Wed Jan 14 17:04:14 PST 2015

On Wed, Jan 14, 2015 at 1:52 PM, Matt Turner <mattst88 at gmail.com> wrote:
> On Wed, Jan 14, 2015 at 1:29 PM, Matt Turner <mattst88 at gmail.com> wrote:
>> glsl: Optimize certain if-statements to just casts from the condition
>
> Cherry-picked to master, the shader-db results are
>
> total instructions in shared programs: 5965630 -> 5952789 (-0.22%)
> instructions in affected programs:     737228 -> 724387 (-1.74%)
> GAINED:                                5
> LOST:                                  16
>
> and we hurt 20 programs: 12 vec4 programs significantly (>68%) and 8
> SIMD8/16 programs by 1 instruction.

It looks like the vec4 programs have loops that are now able to be
unrolled, so those are actually improvements.

> This, and seemingly every other work-in-progress branch I have really
> highlights the improvements we need to make to instruction scheduling.
> It feels fitting that one of the last significant changes Eric made to
> scheduling has a commit message that says "This is madness, [...]"
>
>> i965/fs: Emit smarter code for b2f
>
> I wouldn't expect this one to change instruction counts, except on gen
> <= 5 where maybe we get to skip the true/false resolve (apparently I
> did it wrong in my merge -- it caused a bunch of failures on G45 and
> ILK according to Jenkins). On Haswell,
>
> total instructions in shared programs: 5954954 -> 5955030 (0.00%)
> instructions in affected programs:     4212 -> 4288 (1.80%)
>
> with three programs helped (that I just added to shader-db on Monday,
> yay!) and 19 hurt, 12 significantly. I'm surprised.

The smallest program helped (188->187 instructions) did this
immediately before an endif:

-and(8)          g3<1>D          g2<8,8,1>D      0x3f800000UD
-mov(8)          g23<1>F         g3<8,8,1>F
+mov.sat(8)      g23<1>F         g2<8,8,1>UD

I assume register coalescing wasn't able to get rid of the extra MOV.
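
To make the equivalence concrete, here is a minimal Python model of the
two b2f lowerings, assuming booleans are stored as 0 or ~0 (0xFFFFFFFF).
The helper names are made up for illustration and are not Mesa
functions; this models only the arithmetic, not the hardware.

```python
import struct


def bits_to_float(bits):
    """Reinterpret a 32-bit pattern as an IEEE-754 single-precision float."""
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFFFFFF))[0]


def b2f_and(b):
    """Old lowering: AND the bool with the bit pattern of 1.0f."""
    return bits_to_float(b & 0x3F800000)


def b2f_mov_sat(b):
    """New lowering: convert the UD value to float, then saturate to [0, 1]."""
    value = float(b & 0xFFFFFFFF)
    return min(max(value, 0.0), 1.0)


# Assumption: a bool is either 0 or ~0, so only these two inputs matter.
for b in (0x00000000, 0xFFFFFFFF):
    print(hex(b), b2f_and(b), b2f_mov_sat(b))
```

Both lowerings agree for 0 and ~0; they differ for other inputs, so the
replacement is only valid when the source really is a 0/~0 bool.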

The most hurt shader was affected like so:

-cmp.l.f0(8)     g12<1>D         g6<8,8,1>F      0F
-and(8)          g13<1>D         g12<8,8,1>D     0x3f800000UD
+cmp.l.f0(8)     g19<1>D         g6<8,8,1>F      0F
+mov.sat(8)      g7<1>F          g19<8,8,1>UD
+mov.sat(8)      g8<1>F          g19<8,8,1>UD
+mov.sat(8)      g9<1>F          g19<8,8,1>UD

because we don't CSE MOV instructions. Fixing CSE to handle saturated
MOVs is trivial though, and after that change we're left with
(ignoring potential gen <= 5 improvements) three programs helped by
one or two instructions because of the deficiency in register
coalescing. We can compact mov.sat dst:F src:UD though, so that's an
improvement over AND 0x3f800000.
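
As a rough illustration of what CSE on saturated MOVs buys, here is a toy
local CSE pass in Python. It is only a sketch: the Inst tuple, the
local_cse function and the string register names are hypothetical and do
not mirror the i965 backend's data structures. The point is that the
saturate flag is part of the expression key, so repeated mov.sat
instructions with the same source are recognized as the same value.

```python
from collections import namedtuple

# Hypothetical, simplified instruction: opcode, saturate flag, dst, sources.
Inst = namedtuple("Inst", "opcode saturate dst srcs")


def local_cse(insts):
    """Rewrite a repeated expression as a plain MOV from its first result."""
    available = {}  # (opcode, saturate, srcs) -> register holding the value
    out = []
    for inst in insts:
        key = (inst.opcode, inst.saturate, inst.srcs)
        prev = available.get(key)
        if prev is not None:
            out.append(Inst("mov", False, inst.dst, (prev,)))
        else:
            out.append(inst)
        # Writing inst.dst invalidates expressions that read it or live in it.
        available = {k: v for k, v in available.items()
                     if v != inst.dst and inst.dst not in k[2]}
        if prev is None and inst.dst not in inst.srcs:
            available[key] = inst.dst
    return out


# The "most hurt" sequence above, in the toy representation.
prog = [
    Inst("cmp", False, "g19", ("g6", "0F")),
    Inst("mov", True, "g7", ("g19",)),
    Inst("mov", True, "g8", ("g19",)),
    Inst("mov", True, "g9", ("g19",)),
]
optimized = local_cse(prog)
for inst in optimized:
    print(inst)
```

In this model the second and third mov.sat become plain copies of g7,
which later passes such as copy propagation or register coalescing would
be expected to remove.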

For the gl_FrontFacing case, I think using your more general approach
will actually be better. A bunch of shaders do multiple gl_FrontFacing
ternaries with different 1.0/-1.0/0.0 values and we could do two of
these in 3 instructions by eliminating one of the ASRs that expands
the front-facing bit to a bool.
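
One way to read the "two of these in 3 instructions" arithmetic, sketched
in Python: expand the facing bit to a 0/~0 bool once with an arithmetic
shift, then derive each ternary with a single AND. Everything concrete
here is an assumption for illustration: the bit position (15), its
polarity (set means front-facing), the helper names, and the choice of
ternaries whose other arm is 0.0. The real payload layout may differ.

```python
import struct

FRONT_BIT = 15  # hypothetical: sign bit of a 16-bit payload word


def bits_to_float(bits):
    """Reinterpret a 32-bit pattern as an IEEE-754 single-precision float."""
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFFFFFF))[0]


def asr16(word, shift):
    """Arithmetic shift right of a signed 16-bit word, widened to 32 bits."""
    word &= 0xFFFF
    if word & 0x8000:
        word -= 0x10000
    return (word >> shift) & 0xFFFFFFFF


def front_facing_mask(payload):
    """One ASR: replicate the facing bit into a 0 / 0xFFFFFFFF bool."""
    return asr16(payload, FRONT_BIT)


def two_ternaries(payload):
    """Model (ff ? 1.0 : 0.0) and (ff ? -1.0 : 0.0) with one shared ASR."""
    mask = front_facing_mask(payload)              # instruction 1: ASR
    a = bits_to_float(mask & 0x3F800000)           # instruction 2: AND
    b = bits_to_float(mask & 0xBF800000)           # instruction 3: AND
    return a, b


print(two_ternaries(0x8000))  # front-facing under the assumed polarity
print(two_ternaries(0x0000))  # back-facing under the assumed polarity
```

Without sharing the mask, each ternary would need its own ASR, which is
the redundancy the paragraph above suggests eliminating.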