[Mesa-dev] [PATCH v2] i965/vec4: track and use independently each flag channel

Matt Turner mattst88 at gmail.com
Thu Oct 22 12:08:21 PDT 2015


On Thu, Oct 22, 2015 at 11:19 AM, Matt Turner <mattst88 at gmail.com> wrote:
> On Mon, Oct 19, 2015 at 10:38 AM, Alejandro PiƱeiro
> <apinheiro at igalia.com> wrote:
>> vec4_live_variables tracks now each flag channel independently, so
>> vec4_dead_code_eliminate can update the writemask of null registers,
>> based on which component are alive at the moment. This would allow
>> vec4_cmod_propagation to optimize out several movs involving null
>> registers.
>>
>> v2: added support to track each flag channel independently at vec4
>>     live_variables, as v1 assumed that it was already doing it, as
>>     pointed by Francisco Jerez
>> ---
>>
>> I was tempted to split this commit in two, one for tracking each
>> channel independently at vec4_live_variables, and other for using it
>> at vec4_dead_code_eliminate, but in the end I concluded that made more
>> sense as one commit, as are changes tightly tied.
>>
>> FWIW, the shader-db numbers for the optimization patch ("i965/vec4:
>> adding vec4_cmod_propagation optimization", 2/5) changed, being now
>> the following:
>>
>> total instructions in shared programs: 6240413 -> 6235841 (-0.07%)
>> instructions in affected programs:     401933 -> 397361 (-1.14%)
>> total loops in shared programs:        1979 -> 1979 (0.00%)
>> helped:                                2265
>> HURT:                                  0
>>
>> That are slightly worse that the numbers for the first version of the
>> patch, before adding that extra check. I will use this numbers if this
>> patch is approved.
>>
>> FWIW, without this change on live_variables and dead_code eliminate,
>> the numbers on the optimization would be the following ones:
>>
>> total instructions in shared programs: 6240413 -> 6240253 (-0.00%)
>> instructions in affected programs:     18965 -> 18805 (-0.84%)
>> total loops in shared programs:        1979 -> 1979 (0.00%)
>> helped:                                160
>> HURT:
>>
>>
>>  src/mesa/drivers/dri/i965/brw_ir_vec4.h            | 26 ++++++++++++++++++
>>  .../dri/i965/brw_vec4_dead_code_eliminate.cpp      | 31 +++++++++++++++-------
>>  .../drivers/dri/i965/brw_vec4_live_variables.cpp   | 14 ++++++----
>>  3 files changed, 57 insertions(+), 14 deletions(-)
>>
>> diff --git a/src/mesa/drivers/dri/i965/brw_ir_vec4.h b/src/mesa/drivers/dri/i965/brw_ir_vec4.h
>> index 96dd633..934d7b1 100644
>> --- a/src/mesa/drivers/dri/i965/brw_ir_vec4.h
>> +++ b/src/mesa/drivers/dri/i965/brw_ir_vec4.h
>> @@ -185,6 +185,32 @@ public:
>>        return predicate || opcode == VS_OPCODE_UNPACK_FLAGS_SIMD4X2;
>>     }
>>
>> +   bool reads_flag(unsigned c)
>> +   {
>> +      if (!reads_flag())
>> +         return false;
>> +
>> +      switch(predicate) {
>
> Space between switch and (
>
>> +      case BRW_PREDICATE_NONE:
>> +         return false;
>> +         break;
>
> No need for break since it returns early (same applies below).
>
>> +      case BRW_PREDICATE_ALIGN16_REPLICATE_X:
>> +         return (c == 0 ? true : false);
>
> Simply return c == 0. (same below)
>
>> +         break;
>> +      case BRW_PREDICATE_ALIGN16_REPLICATE_Y:
>> +         return (c == 1 ? true : false);
>> +         break;
>> +      case BRW_PREDICATE_ALIGN16_REPLICATE_Z:
>> +         return (c == 2 ? true : false);
>> +         break;
>> +      case BRW_PREDICATE_ALIGN16_REPLICATE_W:
>> +         return (c == 3 ? true : false);
>> +         break;
>> +      default:
>> +         return true;
>> +      }
>> +   }
>> +
>>     bool writes_flag()
>>     {
>>        return (conditional_mod && (opcode != BRW_OPCODE_SEL &&
>> diff --git a/src/mesa/drivers/dri/i965/brw_vec4_dead_code_eliminate.cpp b/src/mesa/drivers/dri/i965/brw_vec4_dead_code_eliminate.cpp
>> index 8fc7a36..d4216fd 100644
>> --- a/src/mesa/drivers/dri/i965/brw_vec4_dead_code_eliminate.cpp
>> +++ b/src/mesa/drivers/dri/i965/brw_vec4_dead_code_eliminate.cpp
>> @@ -78,13 +78,19 @@ vec4_visitor::dead_code_eliminate()
>>               sizeof(BITSET_WORD));
>>
>>        foreach_inst_in_block_reverse(vec4_instruction, inst, block) {
>> -         if (inst->dst.file == GRF && !inst->has_side_effects()) {
>> +         if ((inst->dst.file == GRF && !inst->has_side_effects()) ||
>> +             (inst->dst.is_null() && inst->writes_flag())){
>>              bool result_live[4] = { false };
>>
>> -            for (unsigned i = 0; i < inst->regs_written; i++) {
>> -               for (int c = 0; c < 4; c++)
>> -                  result_live[c] |= BITSET_TEST(
>> -                     live, var_from_reg(alloc, offset(inst->dst, i), c));
>> +            if (inst->dst.file == GRF) {
>> +               for (unsigned i = 0; i < inst->regs_written; i++) {
>> +                  for (int c = 0; c < 4; c++)
>> +                     result_live[c] |= BITSET_TEST(
>> +                        live, var_from_reg(alloc, offset(inst->dst, i), c));
>> +               }
>> +            } else {
>> +               for (unsigned c = 0; c < 4; c++)
>> +                  result_live[c] = BITSET_TEST(flag_live, c);
>>              }
>>
>>              /* If the instruction can't do writemasking, then it's all or
>
> Below this hunk is the code that changes the opcode to NOP when all
> the channels are writemasked:
>
>    for (int c = 0; c < 4; c++) {
>       if (!result_live[c] && inst->dst.writemask & (1 << c)) {
>          inst->dst.writemask &= ~(1 << c);
>          progress = true;
>
>          if (inst->dst.writemask == 0) {
>             if (inst->writes_accumulator || inst->writes_flag()) {
>                inst->dst = dst_reg(retype(brw_null_reg(), inst->dst.type));
>             } else {
>                inst->opcode = BRW_OPCODE_NOP;
>                continue;
>             }
>          }
>       }
>    }
>
> So we simply set the destination to null, when all channels are dead
> but the instruction is still writing the accumulator or the flag. But
> that's weird, because as we've learned, the writemask actually affects
> which channels of the flag (and presumably the accumulator too) are
> written. Also, setting dst to brw_null_reg() will *reset* the
> writemask to .xyzw.
>
> So if you had an instruction with a null destination writing the flag
> where only some channels of the flag are live:
>
>    cmp.ge.f0 null.xyzw, src0, src1
>    (+f0.x) if
>
> we would do the right thing, and be left with a writemask of only .x
> on the CMP, but if the whole flag is dead... wouldn't we end up with
> the same .xyzw writemask that we started with?
>
> Maybe try simply doing
>
>             for (int c = 0; c < 4; c++) {
>                if (!result_live[c] && inst->dst.writemask & (1 << c)) {
>                   inst->dst.writemask &= ~(1 << c);
>                   progress = true;
>                }
>             }
>
>             if (inst->dst.writemask == 0) {
>                inst->opcode = BRW_OPCODE_NOP;
>                continue;
>             }
>
> which as a side effect will fix the continue to restart the loop it
> was intended to restart (that is, restart the main instruction
> scanning loop, not the loop over the channels).
>
> If that works, then just squash this change into the patch.

Alejandro says this causes a bunch of piglit regressions.

In that case, let's skip this. In that case, with the small bits of
feedback addressed, this patch is

Reviewed-by: Matt Turner <mattst88 at gmail.com>


More information about the mesa-dev mailing list