[Mesa-dev] [PATCH] i965: Use NIR by default for vertex shaders on GEN8+

Sat May 16 13:07:03 PDT 2015

On Sat, May 16, 2015 at 1:01 PM, Jason Ekstrand <jason at jlekstrand.net> wrote:
> On Sat, May 16, 2015 at 12:59 PM, Matt Turner <mattst88 at gmail.com> wrote:
>> On Sat, May 16, 2015 at 12:45 PM, Jason Ekstrand <jason at jlekstrand.net> wrote:
>>> On Sat, May 16, 2015 at 12:12 PM, Matt Turner <mattst88 at gmail.com> wrote:
>>>> On Fri, May 8, 2015 at 3:27 AM, Kenneth Graunke <kenneth at whitecape.org> wrote:
>>>>> Looking at a couple of the shaders that are still worse off...it looks
>>>>> like a ton of Source shaders used to do MUL/ADD with an attribute and
>>>>> two immediates, and now are doing MOV/MOV/MAD.
>>>>
>>>> I just looked, and thought that too for a minute, but it actually
>>>> shouldn't be doing that. Take for instance:
>>>>
>>>> shaders/closed/steam/dota-2/498.shader_test VS SIMD8: 47 -> 53 (12.77%)
>>>>
>>>> It indeed replaces 6x MUL/ADD pairs with MOV/MAD (introducing 6 extra
>>>> MOVs), but....
>>>>
>>>> Without NIR we have
>>>>
>>>> mul(8)          g15<1>F         g6<8,8,1>F      6F
>>>> ...
>>>> add(8)          g16<1>F         g15<8,8,1>F     2.1F
>>>> add(8)          g35<1>F         g15<8,8,1>F     3.1F
>>>> add(8)          g42<1>F         g15<8,8,1>F     4.1F
>>>> add(8)          g45<1>F         g15<8,8,1>F     5.1F
>>>> add(8)          g48<1>F         g15<8,8,1>F     0.1F
>>>> add(8)          g51<1>F         g15<8,8,1>F     1.1F
>>>>
>>>> That is, one multiply is consumed by 6 adds.
>>>>
>>>> With NIR we have
>>>>
>>>> mov(1)          g22<1>F         2.1F
>>>> mov(1)          g22.1<1>F       6F
>>>> mad(8)          g16<1>F         g22<0,1,0>.xF   g22.1<0,1,0>.xF g6<4,4,1>F
>>>> mov(1)          g22.2<1>F       3.1F
>>>> mad(8)          g23<1>F         g22.2<0,1,0>.xF g22.1<0,1,0>.xF g6<4,4,1>F
>>>> mov(1)          g22.3<1>F       4.1F
>>>> mad(8)          g30<1>F         g22.3<0,1,0>.xF g22.1<0,1,0>.xF g6<4,4,1>F
>>>> mov(1)          g22.4<1>F       5.1F
>>>> mad(8)          g33<1>F         g22.4<0,1,0>.xF g22.1<0,1,0>.xF g6<4,4,1>F
>>>> mov(1)          g22.5<1>F       0.1F
>>>> mad(8)          g36<1>F         g22.5<0,1,0>.xF g22.1<0,1,0>.xF g6<4,4,1>F
>>>> mov(1)          g22.6<1>F       1.1F
>>>> mad(8)          g39<1>F         g22.6<0,1,0>.xF g22.1<0,1,0>.xF g6<4,4,1>F
>>>>
>>>> So we're doing the g6 * 6F operation 6 times! We see this in the NIR as well:
>>>>
>>>>         vec1 ssa_419 = ffma ssa_384, ssa_132, ssa_133
>>>>         vec1 ssa_423 = ffma ssa_384, ssa_132, ssa_135
>>>>         vec1 ssa_427 = ffma ssa_384, ssa_132, ssa_137
>>>>         vec1 ssa_428 = ffma ssa_384, ssa_132, ssa_139
>>>>         vec1 ssa_429 = ffma ssa_384, ssa_132, ssa_141
>>>>         vec1 ssa_430 = ffma ssa_384, ssa_132, ssa_144
>>>>
>>>> Whoops. Ideas for fixing that? I'm guessing that this accounts for
>>>> nearly all of the remaining 1120 hurt programs.
>>>
>>> Ugh... We've been tacitly assuming that your constant combine stuff
>>> will magically make immediates not a problem.  In this case, they are
>>> a problem.  I guess we could do something different for 1 vs. 2
>>> immediates.
>>
>> That's not really the problem as far as I see. I mean, we could split
>> MADs that do x * imm + imm, but I would think NIR shouldn't be
>> combining these operations if the multiply is used in a bunch of
>> places.
>>
>> The current code in the ffma peephole does... to quote the comment:
>>
>>       /* Only absorb a fmul into a ffma if the fmul is only used in fadd
>>        * operations.  This prevents us from being too aggressive with our
>>        * fusing which can actually lead to more instructions.
>>        */
>>
>> Can't we pretty trivially modify that to count the number of uses as
>> well and only combine if it's used in one place?
>>
>> To be honest, before I looked in the code I thought that's what it was doing.
>
> If you want to know why I did it that way, just run shader-db. :-)

Ok, longer, less snarky version:

I found a variety of places where the user was doing, for instance, 2
muls and 4 adds where the result of each mul is used twice.  Unfused,
that is 6 instructions instead of just the 4 MADs.  It's entirely
possible that, thanks to latencies, the 6 would actually be better,
but that's why I did it that way.
--Jason
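
To make the trade-off in this thread concrete, here is a rough
instruction-count sketch. It is not the Mesa peephole; the function
names and the "list of use counts" input are made up for illustration.
It assumes every use of each fmul is an fadd (the condition the quoted
comment describes) and it ignores the MOVs needed to load immediates
for a MAD.

```python
def count_unfused(mul_uses):
    """Instructions with no fusing: one MUL per fmul plus one ADD per use.

    mul_uses is a hypothetical input: one entry per fmul, giving the
    number of fadds that consume its result.
    """
    return sum(1 + uses for uses in mul_uses)


def count_fused(mul_uses, single_use_only=False):
    """Instructions after an ffma-style peephole.

    single_use_only=False models "fuse whenever all uses are fadds".
    single_use_only=True models the suggestion to fuse only when the
    fmul has exactly one use.
    """
    total = 0
    for uses in mul_uses:
        if single_use_only and uses > 1:
            total += 1 + uses  # keep the MUL, emit one ADD per use
        else:
            total += uses      # each ADD becomes a MAD, the MUL goes away
    return total


# dota-2/498 excerpt: one multiply consumed by 6 adds.
print(count_unfused([6]), count_fused([6]), count_fused([6], True))

# Jason's case: 2 muls, each used by 2 adds.
print(count_unfused([2, 2]), count_fused([2, 2]), count_fused([2, 2], True))
```

Under these assumptions the dota-2 excerpt goes from 7 instructions to
6 MADs, which looks like a win until the immediate loads are counted:
the listing above has 7 MOVs alongside the 6 MADs, so 13 instructions
against the original 7, matching the 47 -> 53 change. The two-use case
goes from 6 to 4, and a single-use-only rule would leave it at 6.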