[Mesa-dev] [PATCH 2/5] i965/fs: Emit better b2f of an expression on GEN4 and GEN5

Mon Mar 16 12:23:11 PDT 2015

On 03/16/2015 10:06 AM, Matt Turner wrote:
> On Wed, Mar 11, 2015 at 1:44 PM, Ian Romanick <idr at freedesktop.org> wrote:
>> From: Ian Romanick <ian.d.romanick at intel.com>
>>
>> On platforms that do not natively generate 0u and ~0u for Boolean
>> results, b2f expressions that look like
>>
>>    f = b2f(expr cmp 0)
>>
>> will generate better code by pretending the expression is
>>
>>     f = ir_triop_csel(0.0, 1.0, expr cmp 0)
>>
>> This is because the last instruction of "expr" can generate the
>> condition code for the "cmp 0".  This avoids having to do the "-(b & 1)"
>> trick to generate 0u or ~0u for the Boolean result.  This means code like
>>
>>     mov(16)         g16<1>F         1F
>>     mul.ge.f0(16)   null            g6<8,8,1>F      g14<8,8,1>F
>>     (+f0) sel(16)   m6<1>F          g16<8,8,1>F     0F
>>
>> will be generated instead of
>>
>>     mul(16)         g2<1>F          g12<8,8,1>F     g4<8,8,1>F
>>     cmp.ge.f0(16)   g2<1>D          g4<8,8,1>F      0F
> 
> Presumably this g4 should be g2?

Probably.  I was cutting that out of a diff of shader-db results, and I
must have botched it.  Here's the diff from shaders/anholt/6.shader_test:

@@ -129,7 +129,7 @@
 )
 
 Native code for unnamed fragment shader 3
-SIMD8 shader: 77 instructions. 0 loops. Compacted 1232 to 832 bytes (32%)
+SIMD8 shader: 76 instructions. 0 loops. Compacted 1216 to 816 bytes (33%)
    START B0
 add(8)          g9<1>UW         g1.4<2,4,0>UW   0x10101010V     { align1 };
 mov(8)          m3<1>F          16F                             { align1 };
@@ -163,7 +163,7 @@
 add(8)          g2<1>F          g2<8,8,1>F      g6<8,8,1>F      { align1 compacted };
 send(8) 2       g6<1>F          g2<8,8,1>F
                             math rsq mlen 1 rlen 1                          { align1 };
-mul(8)          g2<1>F          g5<8,8,1>F      g6<8,8,1>F      { align1 compacted };
+mul.ge.f0(8)    g2<1>F          g5<8,8,1>F      g6<8,8,1>F      { align1 compacted };
 mul(8)          g5<1>F          -g12<8,8,1>F    -g12<8,8,1>F    { align1 compacted };
 mul(8)          g7<1>F          -g16<8,8,1>F    -g16<8,8,1>F    { align1 compacted };
 mul(8)          g8<1>F          -g15<8,8,1>F    -g15<8,8,1>F    { align1 compacted };
@@ -194,14 +194,13 @@
 send(8) 2       g3<1>F          g3<8,8,1>F
                             math pow mlen 2 rlen 1                          { align1 };
 mul(8)          m6<1>F          g11<8,8,1>F     g8<8,8,1>F      { align1 };
-cmp.ge.f0(8)    g4<1>F          g2<8,8,1>F      0F              { align1 };
+mov(8)          g4<1>F          1F                              { align1 };
 mov(8)          m2<1>F          g6<8,8,1>F                      { align1 };
 mov(8)          m3<1>F          g7<8,8,1>F                      { align1 };
 add(8)          g9<1>F          g2<8,8,1>F      g3<8,8,1>F      { align1 compacted };
-and(8)          g8<1>D          g4<8,8,1>D      1D              { align1 };
+(+f0) sel(8)    g8<1>F          g4<8,8,1>F      0F              { align1 };
 send(8) 2       g4<1>UW         null
                             sampler (1, 0, 3, 1) mlen 5 rlen 4              { align1 };
-and(8)          g8<1>D          -g8<8,8,1>D     0x3f800000UD    { align1 };
 mul(8)          g9<1>F          g9<8,8,1>F      g4<8,8,1>F      { align1 compacted };
 mul(8)          m3<1>F          g8<8,8,1>F      g9<8,8,1>F      { align1 };
 mul(8)          g9<1>F          g2<8,8,1>F      0.7F            { align1 };

I think I can adjust the commit message to:

"...This means code like

    mul.ge.f0(8)    g2<1>F          g5<8,8,1>F      g6<8,8,1>F
    mov(8)          g4<1>F          1F
    (+f0) sel(8)    g8<1>F          g4<8,8,1>F      0F

will be generated instead of

    mul(8)          g2<1>F          g5<8,8,1>F      g6<8,8,1>F
    cmp.ge.f0(8)    g4<1>F          g2<8,8,1>F      0F
    and(8)          g8<1>D          g4<8,8,1>D      1D
    and(8)          g8<1>D          -g8<8,8,1>D     0x3f800000UD"

I'll update the comment in the code too.
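
For anyone following along, here is a rough C++ model of what the two
instruction sequences compute.  It is only a sketch: the helper names
are made up for illustration and are not part of the patch, and it
assumes that only bit 0 of the Boolean is meaningful, which is why the
"-(b & 1)" step is needed at all.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical model of the old sequence: and / negate / and.
// Assumption: only bit 0 of 'b' is meaningful; the other bits are junk.
static float b2f_via_mask(uint32_t b)
{
    uint32_t mask = 0u - (b & 1u);        // 0u or ~0u
    uint32_t bits = mask & 0x3f800000u;   // bit pattern of 1.0f, or 0
    float f;
    std::memcpy(&f, &bits, sizeof f);     // reinterpret the bits as a float
    return f;
}

// Hypothetical model of the new sequence: a predicated select between
// 1.0f and 0.0f, driven by the condition code set by the producer of expr.
static float b2f_via_select(bool cond)
{
    return cond ? 1.0f : 0.0f;
}
```

Both helpers return 1.0f for a true condition and 0.0f otherwise; the
second form just never has to materialize the Boolean in a register.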

>>     and(16)         g4<1>D          g2<8,8,1>D      1D
>>     and(16)         m6<1>D          -g4<8,8,1>D     0x3f800000UD
>>
>> v2: When the comparison is either == 0.0 or != 0.0, use the knowledge
>> that the true (or false) case already results in zero to allow better
>> code generation by possibly avoiding a load-immediate instruction.
>>
>> v3: Apply the optimization even when neither comparator is zero.
>>
>> Shader-db results:
>>
>> GM45 (0x2A42):
>> total instructions in shared programs: 3551002 -> 3550829 (-0.00%)
>> instructions in affected programs:     33269 -> 33096 (-0.52%)
>> helped:                                121
>>
>> Iron Lake (0x0046):
>> total instructions in shared programs: 4993327 -> 4993146 (-0.00%)
>> instructions in affected programs:     34199 -> 34018 (-0.53%)
>> helped:                                129
>>
>> No change on other platforms.
>>
>> Signed-off-by: Ian Romanick <ian.d.romanick at intel.com>
>> Cc: Tapani Palli <tapani.palli at intel.com>
>> ---
>>  src/mesa/drivers/dri/i965/brw_fs.h           |   2 +
>>  src/mesa/drivers/dri/i965/brw_fs_visitor.cpp | 101 +++++++++++++++++++++++++--
>>  2 files changed, 99 insertions(+), 4 deletions(-)
>>
>> diff --git a/src/mesa/drivers/dri/i965/brw_fs.h b/src/mesa/drivers/dri/i965/brw_fs.h
>> index d9d5858..075e90c 100644
>> --- a/src/mesa/drivers/dri/i965/brw_fs.h
>> +++ b/src/mesa/drivers/dri/i965/brw_fs.h
>> @@ -307,6 +307,7 @@ public:
>>                   const fs_reg &a);
>>     void emit_minmax(enum brw_conditional_mod conditionalmod, const fs_reg &dst,
>>                      const fs_reg &src0, const fs_reg &src1);
>> +   bool try_emit_b2f_of_comparison(ir_expression *ir);
>>     bool try_emit_saturate(ir_expression *ir);
>>     bool try_emit_line(ir_expression *ir);
>>     bool try_emit_mad(ir_expression *ir);
>> @@ -317,6 +318,7 @@ public:
>>     bool opt_saturate_propagation();
>>     bool opt_cmod_propagation();
>>     void emit_bool_to_cond_code(ir_rvalue *condition);
>> +   void emit_bool_to_cond_code_of_reg(ir_expression *expr, fs_reg op[3]);
>>     void emit_if_gen6(ir_if *ir);
>>     void emit_unspill(bblock_t *block, fs_inst *inst, fs_reg reg,
>>                       uint32_t spill_offset, int count);
>> diff --git a/src/mesa/drivers/dri/i965/brw_fs_visitor.cpp b/src/mesa/drivers/dri/i965/brw_fs_visitor.cpp
>> index 3025a9d..3d79796 100644
>> --- a/src/mesa/drivers/dri/i965/brw_fs_visitor.cpp
>> +++ b/src/mesa/drivers/dri/i965/brw_fs_visitor.cpp
>> @@ -475,6 +475,87 @@ fs_visitor::try_emit_mad(ir_expression *ir)
>>     return true;
>>  }
>>
>> +bool
>> +fs_visitor::try_emit_b2f_of_comparison(ir_expression *ir)
>> +{
>> +   /* On platforms that do not natively generate 0u and ~0u for Boolean
>> +    * results, b2f expressions that look like
>> +    *
>> +    *     f = b2f(expr cmp 0)
>> +    *
>> +    * will generate better code by pretending the expression is
>> +    *
>> +    *     f = ir_triop_csel(0.0, 1.0, expr cmp 0)
>> +    *
>> +    * This is because the last instruction of "expr" can generate the
>> +    * condition code for the "cmp 0".  This avoids having to do the "-(b & 1)"
>> +    * trick to generate 0u or ~0u for the Boolean result.  This means code like
>> +    *
>> +    *     mov(16)         g16<1>F         1F
>> +    *     mul.ge.f0(16)   null            g6<8,8,1>F      g14<8,8,1>F
>> +    *     (+f0) sel(16)   m6<1>F          g16<8,8,1>F     0F
>> +    *
>> +    * will be generated instead of
>> +    *
>> +    *     mul(16)         g2<1>F          g12<8,8,1>F     g4<8,8,1>F
>> +    *     cmp.ge.f0(16)   g2<1>D          g4<8,8,1>F      0F
>> +    *     and(16)         g4<1>D          g2<8,8,1>D      1D
>> +    *     and(16)         m6<1>D          -g4<8,8,1>D     0x3f800000UD
>> +    *
>> +    * When the comparison is either == 0.0 or != 0.0 using the knowledge that
>> +    * the true (or false) case already results in zero would allow better code
>> +    * generation by possibly avoiding a load-immediate instruction.
>> +    */
>> +   ir_expression *cmp = ir->operands[0]->as_expression();
>> +   if (cmp == NULL)
>> +      return false;
>> +
>> +   if (cmp->operation == ir_binop_equal || cmp->operation == ir_binop_nequal) {
>> +      for (unsigned i = 0; i < 2; i++) {
>> +         ir_constant *c = cmp->operands[i]->as_constant();
>> +         if (c == NULL || !c->is_zero())
>> +            continue;
>> +
>> +         ir_expression *expr = cmp->operands[i ^ 1]->as_expression();
>> +         if (expr != NULL) {
>> +            fs_reg op[2];
>> +
>> +            for (unsigned j = 0; j < 2; j++) {
>> +               cmp->operands[j]->accept(this);
>> +               op[j] = this->result;
>> +
>> +               resolve_ud_negate(&op[j]);
>> +            }
>> +
>> +            emit_bool_to_cond_code_of_reg(cmp, op);
>> +
>> +            /* In this case we know when the condition is true, op[i ^ 1]
>> +             * contains zero.  Invert the predicate, use op[i ^ 1] as src0,
>> +             * and immediate 1.0f as src1.
>> +             */
>> +            this->result = vgrf(ir->type);
>> +            op[i ^ 1].type = BRW_REGISTER_TYPE_F;
> 
> We just do op[1 - i] in tons of other places. No comment needed to explain 1-i.

It must be the old-timer in me, but I'd swear that i^1 typically
generates fewer instructions than 1-i on x86.  I know it's not
definitive, but with i^1 that function is 1025 bytes and with 1-i it's
1091 bytes (both excluding padding at the end).
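
Either spelling picks the same "other" operand as long as i is 0 or 1,
which is all the loop above ever uses.  A minimal sketch (the function
names are invented for illustration, not taken from the patch):

```cpp
// Index of the other operand of a two-operand expression.
// Both forms are only meant to be called with i == 0 or i == 1.
static unsigned other_operand_xor(unsigned i)
{
    return i ^ 1u;   // flips bit 0: 0 -> 1, 1 -> 0
}

static unsigned other_operand_sub(unsigned i)
{
    return 1u - i;   // 0 -> 1, 1 -> 0
}
```

The two only diverge outside that range (for i == 2, the first gives 3
and the second wraps around), so the choice here is purely about
readability versus whatever the compiler happens to emit.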