[Mesa-dev] [PATCH 2/5] i965/fs: Emit better b2f of an expression on GEN4 and GEN5
Ian Romanick
idr at freedesktop.org
Mon Mar 16 12:23:11 PDT 2015
On 03/16/2015 10:06 AM, Matt Turner wrote:
> On Wed, Mar 11, 2015 at 1:44 PM, Ian Romanick <idr at freedesktop.org> wrote:
>> From: Ian Romanick <ian.d.romanick at intel.com>
>>
>> On platforms that do not natively generate 0u and ~0u for Boolean
>> results, b2f expressions that look like
>>
>> f = b2f(expr cmp 0)
>>
>> will generate better code by pretending the expression is
>>
>> f = ir_triop_csel(0.0, 1.0, expr cmp 0)
>>
>> This is because the last instruction of "expr" can generate the
>> condition code for the "cmp 0". This avoids having to do the "-(b & 1)"
>> trick to generate 0u or ~0u for the Boolean result. This means code like
>>
>> mov(16) g16<1>F 1F
>> mul.ge.f0(16) null g6<8,8,1>F g14<8,8,1>F
>> (+f0) sel(16) m6<1>F g16<8,8,1>F 0F
>>
>> will be generated instead of
>>
>> mul(16) g2<1>F g12<8,8,1>F g4<8,8,1>F
>> cmp.ge.f0(16) g2<1>D g4<8,8,1>F 0F
>
> Presumably this g4 should be g2?
Probably. I was cutting out of a diff of shader-db results, and I must
have botched it. Here's the diff from shaders/anholt/6.shader_test:
@@ -129,7 +129,7 @@
)
Native code for unnamed fragment shader 3
-SIMD8 shader: 77 instructions. 0 loops. Compacted 1232 to 832 bytes (32%)
+SIMD8 shader: 76 instructions. 0 loops. Compacted 1216 to 816 bytes (33%)
START B0
add(8) g9<1>UW g1.4<2,4,0>UW 0x10101010V { align1 };
mov(8) m3<1>F 16F { align1 };
@@ -163,7 +163,7 @@
add(8) g2<1>F g2<8,8,1>F g6<8,8,1>F { align1 compacted };
send(8) 2 g6<1>F g2<8,8,1>F
math rsq mlen 1 rlen 1 { align1 };
-mul(8) g2<1>F g5<8,8,1>F g6<8,8,1>F { align1 compacted };
+mul.ge.f0(8) g2<1>F g5<8,8,1>F g6<8,8,1>F { align1 compacted };
mul(8) g5<1>F -g12<8,8,1>F -g12<8,8,1>F { align1 compacted };
mul(8) g7<1>F -g16<8,8,1>F -g16<8,8,1>F { align1 compacted };
mul(8) g8<1>F -g15<8,8,1>F -g15<8,8,1>F { align1 compacted };
@@ -194,14 +194,13 @@
send(8) 2 g3<1>F g3<8,8,1>F
math pow mlen 2 rlen 1 { align1 };
mul(8) m6<1>F g11<8,8,1>F g8<8,8,1>F { align1 };
-cmp.ge.f0(8) g4<1>F g2<8,8,1>F 0F { align1 };
+mov(8) g4<1>F 1F { align1 };
mov(8) m2<1>F g6<8,8,1>F { align1 };
mov(8) m3<1>F g7<8,8,1>F { align1 };
add(8) g9<1>F g2<8,8,1>F g3<8,8,1>F { align1 compacted };
-and(8) g8<1>D g4<8,8,1>D 1D { align1 };
+(+f0) sel(8) g8<1>F g4<8,8,1>F 0F { align1 };
send(8) 2 g4<1>UW null
sampler (1, 0, 3, 1) mlen 5 rlen 4 { align1 };
-and(8) g8<1>D -g8<8,8,1>D 0x3f800000UD { align1 };
mul(8) g9<1>F g9<8,8,1>F g4<8,8,1>F { align1 compacted };
mul(8) m3<1>F g8<8,8,1>F g9<8,8,1>F { align1 };
mul(8) g9<1>F g2<8,8,1>F 0.7F { align1 };
I think I can adjust the commit message to:
"...This means code like
mul.ge.f0(8) g2<1>F g5<8,8,1>F g6<8,8,1>F
mov(8) g4<1>F 1F
(+f0) sel(8) g8<1>F g4<8,8,1>F 0F
will be generated instead of
mul(8) g2<1>F g5<8,8,1>F g6<8,8,1>F
cmp.ge.f0(8) g4<1>F g2<8,8,1>F 0F
and(8) g8<1>D g4<8,8,1>D 1D
and(8) g8<1>D -g8<8,8,1>D 0x3f800000UD"
I'll update the comment in the code too.
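For anyone following along without the hardware docs at hand, here is a minimal C++ sketch of what the two instruction sequences compute. It is an illustration only, not code from the patch: the helper names (bits_to_float, b2f_via_mask, b2f_via_sel, check_b2f) are hypothetical, and it assumes that only the low bit of the raw comparison result is meaningful on these platforms, which is what the "-(b & 1)" trick is there to handle.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Reinterpret 32 raw bits as a float, the way a register is retyped D -> F.
static float bits_to_float(uint32_t bits)
{
   float f;
   std::memcpy(&f, &bits, sizeof f);
   return f;
}

// Old sequence (assumption: only bit 0 of the cmp result is meaningful):
//    and  tmp, b, 1
//    and  dst, -tmp, 0x3f800000
static float b2f_via_mask(uint32_t raw_bool)
{
   uint32_t tmp = raw_bool & 1u;   // 0 or 1
   uint32_t mask = 0u - tmp;       // 0u or ~0u, i.e. -(b & 1)
   return bits_to_float(mask & 0x3f800000u);   // 0x3f800000 is 1.0f
}

// New sequence: the expression sets the flag via a conditional modifier,
// then a predicated sel picks 1.0 or 0.0:
//    mul.ge.f0  tmp, a, b
//    (+f0) sel  dst, 1.0, 0.0
static float b2f_via_sel(float a, float b)
{
   bool f0 = (a * b) >= 0.0f;
   return f0 ? 1.0f : 0.0f;
}

// Both forms agree on the float they produce.
static void check_b2f()
{
   assert(b2f_via_mask(1u) == b2f_via_sel(2.0f, 3.0f));
   assert(b2f_via_mask(0u) == b2f_via_sel(-2.0f, 3.0f));
}
```

The mask form needs a separate cmp plus two and instructions after the mul, while the sel form folds the comparison into the mul and only needs the mov of 1.0 and the sel.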
>> and(16) g4<1>D g2<8,8,1>D 1D
>> and(16) m6<1>D -g4<8,8,1>D 0x3f800000UD
>>
>> v2: When the comparison is either == 0.0 or != 0.0, use the knowledge
>> that the true (or false) case already results in zero to allow better
>> code generation by possibly avoiding a load-immediate instruction.
>>
>> v3: Apply the optimization even when neither comparator is zero.
>>
>> Shader-db results:
>>
>> GM45 (0x2A42):
>> total instructions in shared programs: 3551002 -> 3550829 (-0.00%)
>> instructions in affected programs: 33269 -> 33096 (-0.52%)
>> helped: 121
>>
>> Iron Lake (0x0046):
>> total instructions in shared programs: 4993327 -> 4993146 (-0.00%)
>> instructions in affected programs: 34199 -> 34018 (-0.53%)
>> helped: 129
>>
>> No change on other platforms.
>>
>> Signed-off-by: Ian Romanick <ian.d.romanick at intel.com>
>> Cc: Tapani Palli <tapani.palli at intel.com>
>> ---
>> src/mesa/drivers/dri/i965/brw_fs.h | 2 +
>> src/mesa/drivers/dri/i965/brw_fs_visitor.cpp | 101 +++++++++++++++++++++++++--
>> 2 files changed, 99 insertions(+), 4 deletions(-)
>>
>> diff --git a/src/mesa/drivers/dri/i965/brw_fs.h b/src/mesa/drivers/dri/i965/brw_fs.h
>> index d9d5858..075e90c 100644
>> --- a/src/mesa/drivers/dri/i965/brw_fs.h
>> +++ b/src/mesa/drivers/dri/i965/brw_fs.h
>> @@ -307,6 +307,7 @@ public:
>> const fs_reg &a);
>> void emit_minmax(enum brw_conditional_mod conditionalmod, const fs_reg &dst,
>> const fs_reg &src0, const fs_reg &src1);
>> + bool try_emit_b2f_of_comparison(ir_expression *ir);
>> bool try_emit_saturate(ir_expression *ir);
>> bool try_emit_line(ir_expression *ir);
>> bool try_emit_mad(ir_expression *ir);
>> @@ -317,6 +318,7 @@ public:
>> bool opt_saturate_propagation();
>> bool opt_cmod_propagation();
>> void emit_bool_to_cond_code(ir_rvalue *condition);
>> + void emit_bool_to_cond_code_of_reg(ir_expression *expr, fs_reg op[3]);
>> void emit_if_gen6(ir_if *ir);
>> void emit_unspill(bblock_t *block, fs_inst *inst, fs_reg reg,
>> uint32_t spill_offset, int count);
>> diff --git a/src/mesa/drivers/dri/i965/brw_fs_visitor.cpp b/src/mesa/drivers/dri/i965/brw_fs_visitor.cpp
>> index 3025a9d..3d79796 100644
>> --- a/src/mesa/drivers/dri/i965/brw_fs_visitor.cpp
>> +++ b/src/mesa/drivers/dri/i965/brw_fs_visitor.cpp
>> @@ -475,6 +475,87 @@ fs_visitor::try_emit_mad(ir_expression *ir)
>> return true;
>> }
>>
>> +bool
>> +fs_visitor::try_emit_b2f_of_comparison(ir_expression *ir)
>> +{
>> + /* On platforms that do not natively generate 0u and ~0u for Boolean
>> + * results, b2f expressions that look like
>> + *
>> + * f = b2f(expr cmp 0)
>> + *
>> + * will generate better code by pretending the expression is
>> + *
>> + * f = ir_triop_csel(0.0, 1.0, expr cmp 0)
>> + *
>> + * This is because the last instruction of "expr" can generate the
>> + * condition code for the "cmp 0". This avoids having to do the "-(b & 1)"
>> + * trick to generate 0u or ~0u for the Boolean result. This means code like
>> + *
>> + * mov(16) g16<1>F 1F
>> + * mul.ge.f0(16) null g6<8,8,1>F g14<8,8,1>F
>> + * (+f0) sel(16) m6<1>F g16<8,8,1>F 0F
>> + *
>> + * will be generated instead of
>> + *
>> + * mul(16) g2<1>F g12<8,8,1>F g4<8,8,1>F
>> + * cmp.ge.f0(16) g2<1>D g4<8,8,1>F 0F
>> + * and(16) g4<1>D g2<8,8,1>D 1D
>> + * and(16) m6<1>D -g4<8,8,1>D 0x3f800000UD
>> + *
>> + * When the comparison is either == 0.0 or != 0.0 using the knowledge that
>> + * the true (or false) case already results in zero would allow better code
>> + * generation by possibly avoiding a load-immediate instruction.
>> + */
>> + ir_expression *cmp = ir->operands[0]->as_expression();
>> + if (cmp == NULL)
>> + return false;
>> +
>> + if (cmp->operation == ir_binop_equal || cmp->operation == ir_binop_nequal) {
>> + for (unsigned i = 0; i < 2; i++) {
>> + ir_constant *c = cmp->operands[i]->as_constant();
>> + if (c == NULL || !c->is_zero())
>> + continue;
>> +
>> + ir_expression *expr = cmp->operands[i ^ 1]->as_expression();
>> + if (expr != NULL) {
>> + fs_reg op[2];
>> +
>> + for (unsigned j = 0; j < 2; j++) {
>> + cmp->operands[j]->accept(this);
>> + op[j] = this->result;
>> +
>> + resolve_ud_negate(&op[j]);
>> + }
>> +
>> + emit_bool_to_cond_code_of_reg(cmp, op);
>> +
>> + /* In this case we know when the condition is true, op[i ^ 1]
>> + * contains zero. Invert the predicate, use op[i ^ 1] as src0,
>> + * and immediate 1.0f as src1.
>> + */
>> + this->result = vgrf(ir->type);
>> + op[i ^ 1].type = BRW_REGISTER_TYPE_F;
>
> We just do op[1 - i] in tons of other places. No comment needed to explain 1-i.
It must be the old timer in me, but I'd swear that i^1 typically generates fewer instructions than 1-i on x86. I know it's not definitive, but with i^1 that function is 1025 bytes and with 1-i it's 1091 bytes (both excluding padding at the end).
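To be clear, the two spellings are interchangeable here because i is only ever 0 or 1. A small sketch of that equivalence follows; the function names are made up for illustration and do not appear in the patch, and it says nothing about which form a given compiler turns into smaller code.

```cpp
#include <cassert>

// Two ways to pick "the other operand" of a binary expression.
// They agree only for i in {0, 1}, which is the case in the loop above.
constexpr unsigned other_index_xor(unsigned i) { return i ^ 1u; }
constexpr unsigned other_index_sub(unsigned i) { return 1u - i; }

static_assert(other_index_xor(0) == 1 && other_index_sub(0) == 1,
              "index 0 maps to 1");
static_assert(other_index_xor(1) == 0 && other_index_sub(1) == 0,
              "index 1 maps to 0");

// Runtime version of the same check.
static void check_other_index()
{
   for (unsigned i = 0; i < 2; i++)
      assert(other_index_xor(i) == other_index_sub(i));
}
```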