[Mesa-dev] [PATCH v2] i965/vec4: track and use independently each flag channel
Matt Turner
mattst88 at gmail.com
Thu Oct 22 12:08:21 PDT 2015
On Thu, Oct 22, 2015 at 11:19 AM, Matt Turner <mattst88 at gmail.com> wrote:
> On Mon, Oct 19, 2015 at 10:38 AM, Alejandro PiƱeiro
> <apinheiro at igalia.com> wrote:
>> vec4_live_variables tracks now each flag channel independently, so
>> vec4_dead_code_eliminate can update the writemask of null registers,
>> based on which component are alive at the moment. This would allow
>> vec4_cmod_propagation to optimize out several movs involving null
>> registers.
>>
>> v2: added support to track each flag channel independently at vec4
>> live_variables, as v1 assumed that it was already doing it, as
>> pointed by Francisco Jerez
>> ---
>>
>> I was tempted to split this commit in two, one for tracking each
>> channel independently at vec4_live_variables, and other for using it
>> at vec4_dead_code_eliminate, but in the end I concluded that made more
>> sense as one commit, as are changes tightly tied.
>>
>> FWIW, the shader-db numbers for the optimization patch ("i965/vec4:
>> adding vec4_cmod_propagation optimization", 2/5) changed, being now
>> the following:
>>
>> total instructions in shared programs: 6240413 -> 6235841 (-0.07%)
>> instructions in affected programs: 401933 -> 397361 (-1.14%)
>> total loops in shared programs: 1979 -> 1979 (0.00%)
>> helped: 2265
>> HURT: 0
>>
>> That are slightly worse that the numbers for the first version of the
>> patch, before adding that extra check. I will use this numbers if this
>> patch is approved.
>>
>> FWIW, without this change on live_variables and dead_code eliminate,
>> the numbers on the optimization would be the following ones:
>>
>> total instructions in shared programs: 6240413 -> 6240253 (-0.00%)
>> instructions in affected programs: 18965 -> 18805 (-0.84%)
>> total loops in shared programs: 1979 -> 1979 (0.00%)
>> helped: 160
>> HURT:
>>
>>
>> src/mesa/drivers/dri/i965/brw_ir_vec4.h | 26 ++++++++++++++++++
>> .../dri/i965/brw_vec4_dead_code_eliminate.cpp | 31 +++++++++++++++-------
>> .../drivers/dri/i965/brw_vec4_live_variables.cpp | 14 ++++++----
>> 3 files changed, 57 insertions(+), 14 deletions(-)
>>
>> diff --git a/src/mesa/drivers/dri/i965/brw_ir_vec4.h b/src/mesa/drivers/dri/i965/brw_ir_vec4.h
>> index 96dd633..934d7b1 100644
>> --- a/src/mesa/drivers/dri/i965/brw_ir_vec4.h
>> +++ b/src/mesa/drivers/dri/i965/brw_ir_vec4.h
>> @@ -185,6 +185,32 @@ public:
>> return predicate || opcode == VS_OPCODE_UNPACK_FLAGS_SIMD4X2;
>> }
>>
>> + bool reads_flag(unsigned c)
>> + {
>> + if (!reads_flag())
>> + return false;
>> +
>> + switch(predicate) {
>
> Space between switch and (
>
>> + case BRW_PREDICATE_NONE:
>> + return false;
>> + break;
>
> No need for break since it returns early (same applies below).
>
>> + case BRW_PREDICATE_ALIGN16_REPLICATE_X:
>> + return (c == 0 ? true : false);
>
> Simply return c == 0. (same below)
>
>> + break;
>> + case BRW_PREDICATE_ALIGN16_REPLICATE_Y:
>> + return (c == 1 ? true : false);
>> + break;
>> + case BRW_PREDICATE_ALIGN16_REPLICATE_Z:
>> + return (c == 2 ? true : false);
>> + break;
>> + case BRW_PREDICATE_ALIGN16_REPLICATE_W:
>> + return (c == 3 ? true : false);
>> + break;
>> + default:
>> + return true;
>> + }
>> + }
>> +
>> bool writes_flag()
>> {
>> return (conditional_mod && (opcode != BRW_OPCODE_SEL &&
>> diff --git a/src/mesa/drivers/dri/i965/brw_vec4_dead_code_eliminate.cpp b/src/mesa/drivers/dri/i965/brw_vec4_dead_code_eliminate.cpp
>> index 8fc7a36..d4216fd 100644
>> --- a/src/mesa/drivers/dri/i965/brw_vec4_dead_code_eliminate.cpp
>> +++ b/src/mesa/drivers/dri/i965/brw_vec4_dead_code_eliminate.cpp
>> @@ -78,13 +78,19 @@ vec4_visitor::dead_code_eliminate()
>> sizeof(BITSET_WORD));
>>
>> foreach_inst_in_block_reverse(vec4_instruction, inst, block) {
>> - if (inst->dst.file == GRF && !inst->has_side_effects()) {
>> + if ((inst->dst.file == GRF && !inst->has_side_effects()) ||
>> + (inst->dst.is_null() && inst->writes_flag())){
>> bool result_live[4] = { false };
>>
>> - for (unsigned i = 0; i < inst->regs_written; i++) {
>> - for (int c = 0; c < 4; c++)
>> - result_live[c] |= BITSET_TEST(
>> - live, var_from_reg(alloc, offset(inst->dst, i), c));
>> + if (inst->dst.file == GRF) {
>> + for (unsigned i = 0; i < inst->regs_written; i++) {
>> + for (int c = 0; c < 4; c++)
>> + result_live[c] |= BITSET_TEST(
>> + live, var_from_reg(alloc, offset(inst->dst, i), c));
>> + }
>> + } else {
>> + for (unsigned c = 0; c < 4; c++)
>> + result_live[c] = BITSET_TEST(flag_live, c);
>> }
>>
>> /* If the instruction can't do writemasking, then it's all or
>
> Below this hunk is the code that changes the opcode to NOP when all
> the channels are writemasked:
>
> for (int c = 0; c < 4; c++) {
> if (!result_live[c] && inst->dst.writemask & (1 << c)) {
> inst->dst.writemask &= ~(1 << c);
> progress = true;
>
> if (inst->dst.writemask == 0) {
> if (inst->writes_accumulator || inst->writes_flag()) {
> inst->dst = dst_reg(retype(brw_null_reg(), inst->dst.type));
> } else {
> inst->opcode = BRW_OPCODE_NOP;
> continue;
> }
> }
> }
> }
>
> So we simply set the destination to null, when all channels are dead
> but the instruction is still writing the accumulator or the flag. But
> that's weird, because as we've learned, the writemask actually affects
> which channels of the flag (and presumably the accumulator too) are
> written. Also, setting dst to brw_null_reg() will *reset* the
> writemask to .xyzw.
>
> So if you had an instruction with a null destination writing the flag
> where only some channels of the flag are live:
>
> cmp.ge.f0 null.xyzw, src0, src1
> (+f0.x) if
>
> we would do the right thing, and be left with a writemask of only .x
> on the CMP, but if the whole flag is dead... wouldn't we end up with
> the same .xyzw writemask that we started with?
>
> Maybe try simply doing
>
> for (int c = 0; c < 4; c++) {
> if (!result_live[c] && inst->dst.writemask & (1 << c)) {
> inst->dst.writemask &= ~(1 << c);
> progress = true;
> }
> }
>
> if (inst->dst.writemask == 0) {
> inst->opcode = BRW_OPCODE_NOP;
> continue;
> }
>
> which as a side effect will fix the continue to restart the loop it
> was intended to restart (that is, restart the main instruction
> scanning loop, not the loop over the channels).
>
> If that works, then just squash this change into the patch.
Alejandro says this causes a bunch of piglit regressions.
In that case, let's skip this. In that case, with the small bits of
feedback addressed, this patch is
Reviewed-by: Matt Turner <mattst88 at gmail.com>
More information about the mesa-dev
mailing list