[Mesa-dev] [PATCH 41/47] i965/fs: Add reuse_16bit_conversions_register optimization

Sun Aug 27 18:24:08 UTC 2017

Hi,

On Aug 25, 2017 9:28 AM, "Alejandro Piñeiro" <apinheiro at igalia.com> wrote:

On 24/08/17 21:07, Connor Abbott wrote:
>
> Hi Alejandro,

Hi Connor,

>
> This seems really suspicious. If the live ranges are really
> independent, then the register allocator should be able to assign the
> two virtual registers to the same physical register if it needs to.

Yes, it is true, the register allocator should be able to assign two
virtual registers to the same physical register. But that is done at the
end (or really near the end), so late for the problem this optimization
is trying to fix.

Well, my understanding is that the problem is long compilation times due to
spilling and our not-so-great implementation of it. So no, register
allocation is not late for the problem. As both Curro and I explained, the
change by itself can only pessimise register allocation, so if it helps
then it must be due to a bug in the register allocator or a problem in a
subsequent pass that's getting hidden by this one.

We are also reducing the amount of instructions used.

The comments in the source code say otherwise. Any instructions eliminated
were from spilling, which this pass only accidentally reduces.

Probably not really clear on the commit message. When I say "reduce the
pressure of the register allocator" I mean having a code that the
register allocator would be able to handle without using too much time.
The problem this optimization tries to solve is that for some 16 bit CTS
tests (some with matrices and geometry shaders), the amount of virtual
registers used and instructions was really big. For the record,
initially, some tests needed 24 min just to compile. Right now, thanks
to other optimizations, the slower test without this optimization needs
1min 30 seconds. Adding some hacky timestamps, the time used  at
fs_visitor::allocate_registers (brw_fs.cpp:6096) is:

* While trying to schedule using the three available pre mode
heuristics: 7 seconds
* Allocation with spilling: 63 seconds
* Final schedule using SCHEDULE_POST: 19 seconds

With this optimization, the total time goes down to 14 seconds (10 + 0 +
3 on the previous bullet point list).

One could argue that 1min 30 seconds is okish. But taking into account
that it goes down to 14 seconds, even with some caveats (see below), I
still think that it is worth to use the optimization.

And a final comment. For that same test, this is the final stats (using
INTEL_DEBUG):

 * With the optimization: SIMD8 shader: 4610 instructions. 0 loops.
130320 cycles. 15:9 spills:fills.
 * Without the optimization: SIMD8 shader: 12312 instructions. 0 loops.
174816 cycles. 751:1851 spills:fills.

So, the fact that it helps at all with SIMD8 shows that my theory is wrong,
but since your pass reduces spilling, it clearly must be avoiding a bug
somewhere else. You need to compare the IR for a shader with the problem
with and without this pass right before register allocation. Maybe the
sources and destinations of the conversion instructions interfere without
the change due to some other pass that's increasing register pressure, in
which case that's the problem, but I doubt it. (IIRC there's a debug option
to print out the register pressure for each instruction, which will be
helpful here). If that's not the case, you should check to see if the
register allocator thinks the sources and destinations of the conversion
instructions interfere, and if so figure out why - that will likely be the
real bug.

> This change forces the two to be the same, which constrains the
> register allocator unecessarily and should make it worse, so I'm
> confused as to why this would help at all.

I didn't check that issue specifically, but I recently found that this
optimization affects copy propagation/dead code eliminate. So there are
still some room for improvement. But in any case, take into account that
this custom optimization is only used if there is a 32 to 16 bit
conversion, so only affects shaders with this specific feature.

>
> IIRC there were some issues where we unnecessarily made the sources
> and destination of an instruction interefere with each other, but if
> that's what's causing this, then we should fix that underlying issue.
>
> (From what I remember, a lot of SIMD16 were expanded to SIMD8 in the
> generator, in which case the second half of the source is read after
> the first half of the destination is written, and we falsely thought
> that the HW did that too, so we had some code to add a fake
> interference between them, but a while ago Curro moved the expansion
> to happen before register allocation. I don't have the code in front
> of me, but I think we still have this useless code lying around, and I
> would guess this is the source of the problem.)

Taking into account what I explained before, I don't think that the
problem is the interference or this code you mention (although perhaps
Im wrong).

> hhhh
> Connor
>
> On Aug 24, 2017 2:59 PM, "Alejandro Piñeiro" <apinheiro at igalia.com
> <mailto:apinheiro at igalia.com>> wrote:
>
>     When dealing with HF/U/UW, it is usual having a register with a
>     F/D/UD, and then convert it to HF/U/UW, and not use again the F/D/UD
>     value. In those cases it would be possible to reuse the register where
>     the F value is initially stored instead of having two. Take also into
>     account that when operating with HF/U/UW, you would need to use the
>     full register (so stride 2). Packs/unpacks would be only useful when
>     loading/storing several HF/W/UW.
>
>     Note that no instruction is removed. The main benefict is reducing the
>     amoung of registers used, so the pressure on the register allocator is
>     decreased with big shaders.
>
>     Possibly this could be integrated into an existing optimization, at it
>     is even already done by the register allocator, but was far easier to
>     write and cleaner to read as a separate optimization.
>
>     We found this issue when dealing with some Vulkan CTS tests that
>     needed several minutes to compile. Most of the time was spent on the
>     register allocator.
>
>     Right now the optimization only handles 32 to 16 bit conversion. It
>     could be possible to do the equivalent for 16 to 32 bit too, but in
>     practice, we didn't need it.
>     ---
>      src/intel/compiler/brw_fs.cpp | 77
>     +++++++++++++++++++++++++++++++++++++++++++
>      src/intel/compiler/brw_fs.h   |  1 +
>      2 files changed, 78 insertions(+)
>
>     diff --git a/src/intel/compiler/brw_fs.cpp
>     b/src/intel/compiler/brw_fs.cpp
>     index b6013a5ce85..1342150b44e 100644
>     --- a/src/intel/compiler/brw_fs.cpp
>     +++ b/src/intel/compiler/brw_fs.cpp
>     @@ -39,6 +39,7 @@
>      #include "compiler/glsl_types.h"
>      #include "compiler/nir/nir_builder.h"
>      #include "program/prog_parameter.h"
>     +#include "brw_fs_live_variables.h"
>
>      using namespace brw;
>
>     @@ -3133,6 +3134,81 @@ fs_visitor::remove_extra_rounding_modes()
>         return progress;
>      }
>
>     +/**
>     + * When dealing with HF/W/UW, it is usual having a register with
>     a F/D/UD, and
>     + * then convert it to HF/W/UW, and not use again the F/D/UD
>     value. In those
>     + * cases it would be possible to reuse the register where the F
>     value is
>     + * initially stored instead of having two. Take also into account
>     that when
>     + * operating with HF/W/UW, you would need to use the full
>     register (so stride
>     + * 2). Packs/unpacks would be only useful when loading/storing
>     several
>     + * HF/W/UWs.
>     + *
>     + * So something like this:
>     + *  mov(8) vgrf14<2>:HF, vgrf39:F
>     + *
>     + * Became:
>     + *  mov(8) vgrf39<2>:HF, vgrf39:F
>     + *
>     + * Note that no instruction is removed. The main benefict is
>     reducing the
>     + * amoung of registers used, so the pressure on the register
>     allocator is
>     + * decreased with big shaders.
>     + */
>     +bool
>     +fs_visitor::reuse_16bit_conversions_vgrf()
>     +{
>     +   bool progress = false;
>     +   int ip = -1;
>     +
>     +   calculate_live_intervals();
>     +
>     +   foreach_block_and_inst_safe (block, fs_inst, inst, cfg) {
>     +      ip++;
>     +
>     +      if (inst->dst.file != VGRF || inst->src[0].file != VGRF)
>     +         continue;
>     +
>     +      if (inst->opcode != BRW_OPCODE_MOV)
>     +         continue;
>     +
>     +      if (type_sz(inst->dst.type) != 2 || inst->dst.stride != 2 ||
>     +          type_sz(inst->src[0].type) != 4 || inst->src[0].stride
>     != 1) {
>     +         continue;
>     +      }
>     +
>     +      int src_reg = inst->src[0].nr;
>     +      int src_offset = inst->src[0].offset;
>     +      unsigned src_var = live_intervals->var_from_vgrf[src_reg];
>     +      int src_end = live_intervals->end[src_var];
>     +      int dst_reg = inst->dst.nr <http://dst.nr>;
>     +
>     +      if (src_end > ip)
>     +         continue;
>     +
>     +      foreach_block_and_inst(block, fs_inst, scan_inst, cfg) {
>     +         if (scan_inst->dst.file == VGRF &&
>     +             scan_inst->dst.nr <http://dst.nr> == dst_reg) {
>     +            scan_inst->dst.nr <http://dst.nr> = src_reg;
>     +            scan_inst->dst.offset = src_offset;
>     +            progress = true;
>     +         }
>     +
>     +         for (int i = 0; i < scan_inst->sources; i++) {
>     +            if (scan_inst->src[i].file == VGRF &&
>     +                scan_inst->src[i].nr == dst_reg) {
>     +               scan_inst->src[i].nr = src_reg;
>     +               scan_inst->src[i].offset = src_offset;
>     +               progress = true;
>     +            }
>     +         }
>     +      }
>     +   }
>     +
>     +   if (progress)
>     +      invalidate_live_intervals();
>     +
>     +   return progress;
>     +}
>     +
>
>      static void
>      clear_deps_for_inst_src(fs_inst *inst, bool *deps, int first_grf,
>     int grf_len)
>     @@ -5829,6 +5905,7 @@ fs_visitor::optimize()
>
>         OPT(opt_drop_redundant_mov_to_flags);
>         OPT(remove_extra_rounding_modes);
>     +   OPT(reuse_16bit_conversions_vgrf);
>
>         do {
>            progress = false;
>     diff --git a/src/intel/compiler/brw_fs.h b/src/intel/compiler/brw_fs.h
>     index b9476e69edb..2685f748b87 100644
>     --- a/src/intel/compiler/brw_fs.h
>     +++ b/src/intel/compiler/brw_fs.h
>     @@ -151,6 +151,7 @@ public:
>         bool dead_code_eliminate();
>         bool remove_duplicate_mrf_writes();
>         bool remove_extra_rounding_modes();
>     +   bool reuse_16bit_conversions_vgrf();
>
>         bool opt_sampler_eot();
>         bool virtual_grf_interferes(int a, int b);
>     --
>     2.11.0
>
>     _______________________________________________
>     mesa-dev mailing list
>     mesa-dev at lists.freedesktop.org <mailto:mesa-dev at lists.freedesktop.org>
>     https://lists.freedesktop.org/mailman/listinfo/mesa-dev
>     <https://lists.freedesktop.org/mailman/listinfo/mesa-dev>
>
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://lists.freedesktop.org/archives/mesa-dev/attachments/20170827/8d6fd750/attachment-0001.html>