[Mesa-dev] [PATCH 3/4] i965/fs: Optimize (gl_FrontFacing ? x : y) where x and y are ±1.0.
Matt Turner
mattst88 at gmail.com
Wed Jan 14 17:04:14 PST 2015
On Wed, Jan 14, 2015 at 1:52 PM, Matt Turner <mattst88 at gmail.com> wrote:
> On Wed, Jan 14, 2015 at 1:29 PM, Matt Turner <mattst88 at gmail.com> wrote:
>> glsl: Optimize certain if-statements to just casts from the condition
>
> Cherry-picked to master, the shader-db results are
>
> total instructions in shared programs: 5965630 -> 5952789 (-0.22%)
> instructions in affected programs: 737228 -> 724387 (-1.74%)
> GAINED: 5
> LOST: 16
>
> and we hurt 20 programs: 12 vec4 programs significantly (>68%) and 8
> SIMD8/16 programs by 1 instruction.
It looks like the vec4 programs have loops that are now able to be
unrolled, so those are actually improvements.
> This, and seemingly every other work-in-progress branch I have really
> highlights the improvements we need to make to instruction scheduling.
> It feels fitting that one of the last significant changes Eric made to
> scheduling has a commit message that says "This is madness, [...]"
>
>> i965/fs: Emit smarter code for b2f
>
> I wouldn't expect this one to change instruction counts, except on gen
> <= 5 where maybe we get to skip the true/false resolve (apparently I
> did it wrong in my merge -- it caused a bunch of failures on G45 and
> ILK according to Jenkins). On Haswell,
>
> total instructions in shared programs: 5954954 -> 5955030 (0.00%)
> instructions in affected programs: 4212 -> 4288 (1.80%)
>
> with three programs helped (that I just added to shader-db on Monday,
> yay!) and 19 hurt, 12 significantly. I'm surprised.
The smallest program helped (188->187 instructions) did this
immediately before an endif:

-and(8) g3<1>D g2<8,8,1>D 0x3f800000UD
-mov(8) g23<1>F g3<8,8,1>F
+mov.sat(8) g23<1>F g2<8,8,1>UD

I assume register coalescing wasn't able to get rid of the extra MOV.

The most hurt shader was affected like so:

-cmp.l.f0(8) g12<1>D g6<8,8,1>F 0F
-and(8) g13<1>D g12<8,8,1>D 0x3f800000UD
+cmp.l.f0(8) g19<1>D g6<8,8,1>F 0F
+mov.sat(8) g7<1>F g19<8,8,1>UD
+mov.sat(8) g8<1>F g19<8,8,1>UD
+mov.sat(8) g9<1>F g19<8,8,1>UD

because we don't CSE MOV instructions. Fixing CSE to handle saturated
MOVs is trivial though, and after that change we're left with
(ignoring potential gen <= 5 improvements) three programs helped by
one or two instructions because of the deficiency in register
coalescing. We can compact mov.sat dst:F src:UD though, so that's an
improvement over AND 0x3f800000.
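
To spell out why the two b2f sequences give the same result, here is a
rough C model of both. It assumes booleans are 0 or ~0 (all bits set),
which is my understanding of the gen6+ representation; the function
names are made up for illustration and are not Mesa code.

```c
#include <stdint.h>
#include <string.h>

/* Model of "and dst:D src:D 0x3f800000UD": masking an all-ones boolean
 * leaves exactly the bit pattern of 1.0f; masking zero leaves 0.0f.
 * Assumes the boolean is either 0 or 0xFFFFFFFF. */
static float b2f_and(uint32_t b)
{
    uint32_t bits = b & 0x3f800000u;
    float f;
    memcpy(&f, &bits, sizeof f);   /* reinterpret the bits as a float */
    return f;
}

/* Model of "mov.sat dst:F src:UD": convert the unsigned value to float,
 * then saturate, i.e. clamp the result to [0.0, 1.0]. */
static float b2f_mov_sat(uint32_t b)
{
    float f = (float)b;            /* 0xFFFFFFFF becomes roughly 4.29e9 */
    if (f > 1.0f)
        f = 1.0f;
    if (f < 0.0f)
        f = 0.0f;
    return f;
}
```

Both functions map 0 to 0.0 and 0xFFFFFFFF to 1.0. The mov.sat form
would also accept a boolean where only some bits are set, whereas the
AND form depends on the mask bits all being set in a true value.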
For the gl_FrontFacing case, I think using your more general approach
will actually be better. A bunch of shaders do multiple gl_FrontFacing
ternaries with different 1.0/-1.0/0.0 values and we could do two of
these in 3 instructions by eliminating one of the ASRs that expands
the front-facing bit to a bool.