[Mesa-dev] [PATCH 41/47] i965/fs: Add reuse_16bit_conversions_register optimization
Alejandro Piñeiro
apinheiro at igalia.com
Mon Aug 28 10:24:52 UTC 2017
On 27/08/17 20:24, Connor Abbott wrote:
> Hi,
>
> On Aug 25, 2017 9:28 AM, "Alejandro Piñeiro" <apinheiro at igalia.com
> <mailto:apinheiro at igalia.com>> wrote:
>
> On 24/08/17 21:07, Connor Abbott wrote:
> >
> > Hi Alejandro,
>
> Hi Connor,
>
> >
> > This seems really suspicious. If the live ranges are really
> > independent, then the register allocator should be able to
> assign the
> > two virtual registers to the same physical register if it needs to.
>
> Yes, it is true, the register allocator should be able to assign two
> virtual registers to the same physical register. But that is done
> at the
> end (or really near the end), so late for the problem this
> optimization
> is trying to fix.
>
>
> Well, my understanding is that the problem is long compilation times
> due to spilling and our not-so-great implementation of it. So no,
> register allocation is not late for the problem. As both Curro and I
> explained, the change by itself can only pessimise register
> allocation, so if it helps then it must be due to a bug in the
> register allocator or a problem in a subsequent pass that's getting
> hidden by this one.
Ok.
>
> We are also reducing the amount of instructions used.
>
>
> The comments in the source code say otherwise. Any instructions
> eliminated were from spilling, which this pass only accidentally reduces.
Yes, sorry, I explained myself poorly. The optimization itself doesn't
remove any instructions. But using it reduces the final number of
instructions, although as you say, they are likely due reducing the
spilling.
>
>
>
> Probably not really clear on the commit message. When I say
> "reduce the
> pressure of the register allocator" I mean having a code that the
> register allocator would be able to handle without using too much
> time.
> The problem this optimization tries to solve is that for some 16
> bit CTS
> tests (some with matrices and geometry shaders), the amount of virtual
> registers used and instructions was really big. For the record,
> initially, some tests needed 24 min just to compile. Right now, thanks
> to other optimizations, the slower test without this optimization
> needs
> 1min 30 seconds. Adding some hacky timestamps, the time used at
> fs_visitor::allocate_registers (brw_fs.cpp:6096) is:
>
> * While trying to schedule using the three available pre mode
> heuristics: 7 seconds
> * Allocation with spilling: 63 seconds
> * Final schedule using SCHEDULE_POST: 19 seconds
>
> With this optimization, the total time goes down to 14 seconds (10
> + 0 +
> 3 on the previous bullet point list).
>
> One could argue that 1min 30 seconds is okish. But taking into account
> that it goes down to 14 seconds, even with some caveats (see below), I
> still think that it is worth to use the optimization.
>
> And a final comment. For that same test, this is the final stats
> (using
> INTEL_DEBUG):
>
> * With the optimization: SIMD8 shader: 4610 instructions. 0 loops.
> 130320 cycles. 15:9 spills:fills.
> * Without the optimization: SIMD8 shader: 12312 instructions. 0
> loops.
> 174816 cycles. 751:1851 spills:fills.
>
>
> So, the fact that it helps at all with SIMD8 shows that my theory is
> wrong, but since your pass reduces spilling, it clearly must be
> avoiding a bug somewhere else. You need to compare the IR for a shader
> with the problem with and without this pass right before register
> allocation. Maybe the sources and destinations of the conversion
> instructions interfere without the change due to some other pass
> that's increasing register pressure, in which case that's the problem,
> but I doubt it.
Ok, thanks for the hints.
> (IIRC there's a debug option to print out the register pressure for
> each instruction, which will be helpful here).
I tried to use that option back when I was working on this patch,
without too much luck. Will try again.
> If that's not the case, you should check to see if the register
> allocator thinks the sources and destinations of the conversion
> instructions interfere, and if so figure out why - that will likely be
> the real bug.
Ok.
Thanks Connor and Curro for the comments. I will work on a different
solution.
>
>
> > This change forces the two to be the same, which constrains the
> > register allocator unecessarily and should make it worse, so I'm
> > confused as to why this would help at all.
>
> I didn't check that issue specifically, but I recently found that this
> optimization affects copy propagation/dead code eliminate. So
> there are
> still some room for improvement. But in any case, take into
> account that
> this custom optimization is only used if there is a 32 to 16 bit
> conversion, so only affects shaders with this specific feature.
>
> >
> > IIRC there were some issues where we unnecessarily made the sources
> > and destination of an instruction interefere with each other, but if
> > that's what's causing this, then we should fix that underlying
> issue.
> >
> > (From what I remember, a lot of SIMD16 were expanded to SIMD8 in the
> > generator, in which case the second half of the source is read after
> > the first half of the destination is written, and we falsely thought
> > that the HW did that too, so we had some code to add a fake
> > interference between them, but a while ago Curro moved the expansion
> > to happen before register allocation. I don't have the code in front
> > of me, but I think we still have this useless code lying around,
> and I
> > would guess this is the source of the problem.)
>
> Taking into account what I explained before, I don't think that the
> problem is the interference or this code you mention (although perhaps
> Im wrong).
>
> > hhhh
> > Connor
> >
> > On Aug 24, 2017 2:59 PM, "Alejandro Piñeiro"
> <apinheiro at igalia.com <mailto:apinheiro at igalia.com>
> > <mailto:apinheiro at igalia.com <mailto:apinheiro at igalia.com>>> wrote:
> >
> > When dealing with HF/U/UW, it is usual having a register with a
> > F/D/UD, and then convert it to HF/U/UW, and not use again
> the F/D/UD
> > value. In those cases it would be possible to reuse the
> register where
> > the F value is initially stored instead of having two. Take
> also into
> > account that when operating with HF/U/UW, you would need to
> use the
> > full register (so stride 2). Packs/unpacks would be only
> useful when
> > loading/storing several HF/W/UW.
> >
> > Note that no instruction is removed. The main benefict is
> reducing the
> > amoung of registers used, so the pressure on the register
> allocator is
> > decreased with big shaders.
> >
> > Possibly this could be integrated into an existing
> optimization, at it
> > is even already done by the register allocator, but was far
> easier to
> > write and cleaner to read as a separate optimization.
> >
> > We found this issue when dealing with some Vulkan CTS tests that
> > needed several minutes to compile. Most of the time was
> spent on the
> > register allocator.
> >
> > Right now the optimization only handles 32 to 16 bit
> conversion. It
> > could be possible to do the equivalent for 16 to 32 bit too,
> but in
> > practice, we didn't need it.
> > ---
> > src/intel/compiler/brw_fs.cpp | 77
> > +++++++++++++++++++++++++++++++++++++++++++
> > src/intel/compiler/brw_fs.h | 1 +
> > 2 files changed, 78 insertions(+)
> >
> > diff --git a/src/intel/compiler/brw_fs.cpp
> > b/src/intel/compiler/brw_fs.cpp
> > index b6013a5ce85..1342150b44e 100644
> > --- a/src/intel/compiler/brw_fs.cpp
> > +++ b/src/intel/compiler/brw_fs.cpp
> > @@ -39,6 +39,7 @@
> > #include "compiler/glsl_types.h"
> > #include "compiler/nir/nir_builder.h"
> > #include "program/prog_parameter.h"
> > +#include "brw_fs_live_variables.h"
> >
> > using namespace brw;
> >
> > @@ -3133,6 +3134,81 @@ fs_visitor::remove_extra_rounding_modes()
> > return progress;
> > }
> >
> > +/**
> > + * When dealing with HF/W/UW, it is usual having a register
> with
> > a F/D/UD, and
> > + * then convert it to HF/W/UW, and not use again the F/D/UD
> > value. In those
> > + * cases it would be possible to reuse the register where the F
> > value is
> > + * initially stored instead of having two. Take also into
> account
> > that when
> > + * operating with HF/W/UW, you would need to use the full
> > register (so stride
> > + * 2). Packs/unpacks would be only useful when loading/storing
> > several
> > + * HF/W/UWs.
> > + *
> > + * So something like this:
> > + * mov(8) vgrf14<2>:HF, vgrf39:F
> > + *
> > + * Became:
> > + * mov(8) vgrf39<2>:HF, vgrf39:F
> > + *
> > + * Note that no instruction is removed. The main benefict is
> > reducing the
> > + * amoung of registers used, so the pressure on the register
> > allocator is
> > + * decreased with big shaders.
> > + */
> > +bool
> > +fs_visitor::reuse_16bit_conversions_vgrf()
> > +{
> > + bool progress = false;
> > + int ip = -1;
> > +
> > + calculate_live_intervals();
> > +
> > + foreach_block_and_inst_safe (block, fs_inst, inst, cfg) {
> > + ip++;
> > +
> > + if (inst->dst.file != VGRF || inst->src[0].file != VGRF)
> > + continue;
> > +
> > + if (inst->opcode != BRW_OPCODE_MOV)
> > + continue;
> > +
> > + if (type_sz(inst->dst.type) != 2 || inst->dst.stride
> != 2 ||
> > + type_sz(inst->src[0].type) != 4 ||
> inst->src[0].stride
> > != 1) {
> > + continue;
> > + }
> > +
> > + int src_reg = inst->src[0].nr;
> > + int src_offset = inst->src[0].offset;
> > + unsigned src_var =
> live_intervals->var_from_vgrf[src_reg];
> > + int src_end = live_intervals->end[src_var];
> > + int dst_reg = inst->dst.nr <http://dst.nr>
> <http://dst.nr>;
> > +
> > + if (src_end > ip)
> > + continue;
> > +
> > + foreach_block_and_inst(block, fs_inst, scan_inst, cfg) {
> > + if (scan_inst->dst.file == VGRF &&
> > + scan_inst->dst.nr <http://dst.nr>
> <http://dst.nr> == dst_reg) {
> > + scan_inst->dst.nr <http://dst.nr>
> <http://dst.nr> = src_reg;
> > + scan_inst->dst.offset = src_offset;
> > + progress = true;
> > + }
> > +
> > + for (int i = 0; i < scan_inst->sources; i++) {
> > + if (scan_inst->src[i].file == VGRF &&
> > + scan_inst->src[i].nr == dst_reg) {
> > + scan_inst->src[i].nr = src_reg;
> > + scan_inst->src[i].offset = src_offset;
> > + progress = true;
> > + }
> > + }
> > + }
> > + }
> > +
> > + if (progress)
> > + invalidate_live_intervals();
> > +
> > + return progress;
> > +}
> > +
> >
> > static void
> > clear_deps_for_inst_src(fs_inst *inst, bool *deps, int
> first_grf,
> > int grf_len)
> > @@ -5829,6 +5905,7 @@ fs_visitor::optimize()
> >
> > OPT(opt_drop_redundant_mov_to_flags);
> > OPT(remove_extra_rounding_modes);
> > + OPT(reuse_16bit_conversions_vgrf);
> >
> > do {
> > progress = false;
> > diff --git a/src/intel/compiler/brw_fs.h
> b/src/intel/compiler/brw_fs.h
> > index b9476e69edb..2685f748b87 100644
> > --- a/src/intel/compiler/brw_fs.h
> > +++ b/src/intel/compiler/brw_fs.h
> > @@ -151,6 +151,7 @@ public:
> > bool dead_code_eliminate();
> > bool remove_duplicate_mrf_writes();
> > bool remove_extra_rounding_modes();
> > + bool reuse_16bit_conversions_vgrf();
> >
> > bool opt_sampler_eot();
> > bool virtual_grf_interferes(int a, int b);
> > --
> > 2.11.0
> >
> > _______________________________________________
> > mesa-dev mailing list
> > mesa-dev at lists.freedesktop.org
> <mailto:mesa-dev at lists.freedesktop.org>
> <mailto:mesa-dev at lists.freedesktop.org
> <mailto:mesa-dev at lists.freedesktop.org>>
> > https://lists.freedesktop.org/mailman/listinfo/mesa-dev
> <https://lists.freedesktop.org/mailman/listinfo/mesa-dev>
> > <https://lists.freedesktop.org/mailman/listinfo/mesa-dev
> <https://lists.freedesktop.org/mailman/listinfo/mesa-dev>>
> >
> >
> >
>
>
>
More information about the mesa-dev
mailing list