<div dir="auto"><div>Hi,<br><div class="gmail_extra"><br><div class="gmail_quote">On Aug 25, 2017 9:28 AM, "Alejandro Piñeiro" <<a href="mailto:apinheiro@igalia.com" target="_blank">apinheiro@igalia.com</a>> wrote:<br type="attribution"><blockquote class="m_-8615757320103330257quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">On 24/08/17 21:07, Connor Abbott wrote:<br> ><br> > Hi Alejandro,<br> <br> Hi Connor,<br> <div class="m_-8615757320103330257quoted-text"><br> ><br> > This seems really suspicious. If the live ranges are really<br> > independent, then the register allocator should be able to assign the<br> > two virtual registers to the same physical register if it needs to.<br> <br> </div>Yes, it is true, the register allocator should be able to assign two<br> virtual registers to the same physical register. But that is done at the<br> end (or really near the end), so late for the problem this optimization<br> is trying to fix.</blockquote></div></div></div><div dir="auto"><br></div><div dir="auto">Well, my understanding is that the problem is long compilation times due to spilling and our not-so-great implementation of it. So no, register allocation is not late for the problem. As both Curro and I explained, the change by itself can only pessimise register allocation, so if it helps then it must be due to a bug in the register allocator or a problem in a subsequent pass that's getting hidden by this one.</div><div dir="auto"><br></div><div dir="auto"><div class="gmail_extra"><div class="gmail_quote"><blockquote class="m_-8615757320103330257quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">We are also reducing the amount of instructions used.<br></blockquote></div></div></div><div dir="auto"><br></div><div dir="auto">The comments in the source code say otherwise. Any instructions eliminated were from spilling, which this pass only accidentally reduces.</div><div dir="auto"><br></div><div dir="auto"><br></div><div dir="auto"><div class="gmail_extra"><div class="gmail_quote"><blockquote class="m_-8615757320103330257quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"> <br> Probably not really clear on the commit message. When I say "reduce the<br> pressure of the register allocator" I mean having a code that the<br> register allocator would be able to handle without using too much time.<br> The problem this optimization tries to solve is that for some 16 bit CTS<br> tests (some with matrices and geometry shaders), the amount of virtual<br> registers used and instructions was really big. For the record,<br> initially, some tests needed 24 min just to compile. Right now, thanks<br> to other optimizations, the slower test without this optimization needs<br> 1min 30 seconds. Adding some hacky timestamps, the time used at<br> fs_visitor::allocate_registers (brw_fs.cpp:6096) is:<br> <br> * While trying to schedule using the three available pre mode<br> heuristics: 7 seconds<br> * Allocation with spilling: 63 seconds<br> * Final schedule using SCHEDULE_POST: 19 seconds<br> <br> With this optimization, the total time goes down to 14 seconds (10 + 0 +<br> 3 on the previous bullet point list).<br> <br> One could argue that 1min 30 seconds is okish. But taking into account<br> that it goes down to 14 seconds, even with some caveats (see below), I<br> still think that it is worth to use the optimization.<br> <br> And a final comment. For that same test, this is the final stats (using<br> INTEL_DEBUG):<br> <br> * With the optimization: SIMD8 shader: 4610 instructions. 0 loops.<br> 130320 cycles. 15:9 spills:fills.<br> * Without the optimization: SIMD8 shader: 12312 instructions. 0 loops.<br> 174816 cycles. 751:1851 spills:fills.<br></blockquote></div></div></div><div dir="auto"><br></div><div dir="auto">So, the fact that it helps at all with SIMD8 shows that my theory is wrong, but since your pass reduces spilling, it clearly must be avoiding a bug somewhere else. You need to compare the IR for a shader with the problem with and without this pass right before register allocation. Maybe the sources and destinations of the conversion instructions interfere without the change due to some other pass that's increasing register pressure, in which case that's the problem, but I doubt it. (IIRC there's a debug option to print out the register pressure for each instruction, which will be helpful here). If that's not the case, you should check to see if the register allocator thinks the sources and destinations of the conversion instructions interfere, and if so figure out why - that will likely be the real bug.</div><div dir="auto"><br></div><div dir="auto"><div class="gmail_extra"><div class="gmail_quote"><blockquote class="m_-8615757320103330257quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"> <div class="m_-8615757320103330257quoted-text"><br> > This change forces the two to be the same, which constrains the<br> > register allocator unecessarily and should make it worse, so I'm<br> > confused as to why this would help at all.<br> <br> </div>I didn't check that issue specifically, but I recently found that this<br> optimization affects copy propagation/dead code eliminate. So there are<br> still some room for improvement. But in any case, take into account that<br> this custom optimization is only used if there is a 32 to 16 bit<br> conversion, so only affects shaders with this specific feature.<br> <div class="m_-8615757320103330257quoted-text"><br> ><br> > IIRC there were some issues where we unnecessarily made the sources<br> > and destination of an instruction interefere with each other, but if<br> > that's what's causing this, then we should fix that underlying issue.<br> ><br> > (From what I remember, a lot of SIMD16 were expanded to SIMD8 in the<br> > generator, in which case the second half of the source is read after<br> > the first half of the destination is written, and we falsely thought<br> > that the HW did that too, so we had some code to add a fake<br> > interference between them, but a while ago Curro moved the expansion<br> > to happen before register allocation. I don't have the code in front<br> > of me, but I think we still have this useless code lying around, and I<br> > would guess this is the source of the problem.)<br> <br> </div>Taking into account what I explained before, I don't think that the<br> problem is the interference or this code you mention (although perhaps<br> Im wrong).<br> <br> > hhhh<br> <div class="m_-8615757320103330257quoted-text">> Connor<br> ><br> > On Aug 24, 2017 2:59 PM, "Alejandro Piñeiro" <<a href="mailto:apinheiro@igalia.com" target="_blank">apinheiro@igalia.com</a><br> </div><div class="m_-8615757320103330257elided-text">> <mailto:<a href="mailto:apinheiro@igalia.com" target="_blank">apinheiro@igalia.com</a>>> wrote:<br> ><br> > When dealing with HF/U/UW, it is usual having a register with a<br> > F/D/UD, and then convert it to HF/U/UW, and not use again the F/D/UD<br> > value. In those cases it would be possible to reuse the register where<br> > the F value is initially stored instead of having two. Take also into<br> > account that when operating with HF/U/UW, you would need to use the<br> > full register (so stride 2). Packs/unpacks would be only useful when<br> > loading/storing several HF/W/UW.<br> ><br> > Note that no instruction is removed. The main benefict is reducing the<br> > amoung of registers used, so the pressure on the register allocator is<br> > decreased with big shaders.<br> ><br> > Possibly this could be integrated into an existing optimization, at it<br> > is even already done by the register allocator, but was far easier to<br> > write and cleaner to read as a separate optimization.<br> ><br> > We found this issue when dealing with some Vulkan CTS tests that<br> > needed several minutes to compile. Most of the time was spent on the<br> > register allocator.<br> ><br> > Right now the optimization only handles 32 to 16 bit conversion. It<br> > could be possible to do the equivalent for 16 to 32 bit too, but in<br> > practice, we didn't need it.<br> > ---<br> > src/intel/compiler/brw_fs.cpp | 77<br> > +++++++++++++++++++++++++++++<wbr>++++++++++++++<br> > src/intel/compiler/brw_fs.h | 1 +<br> > 2 files changed, 78 insertions(+)<br> ><br> > diff --git a/src/intel/compiler/brw_fs.cp<wbr>p<br> > b/src/intel/compiler/brw_fs.c<wbr>pp<br> > index b6013a5ce85..1342150b44e 100644<br> > --- a/src/intel/compiler/brw_fs.cp<wbr>p<br> > +++ b/src/intel/compiler/brw_fs.cp<wbr>p<br> > @@ -39,6 +39,7 @@<br> > #include "compiler/glsl_types.h"<br> > #include "compiler/nir/nir_builder.h"<br> > #include "program/prog_parameter.h"<br> > +#include "brw_fs_live_variables.h"<br> ><br> > using namespace brw;<br> ><br> > @@ -3133,6 +3134,81 @@ fs_visitor::remove_extra_round<wbr>ing_modes()<br> > return progress;<br> > }<br> ><br> > +/**<br> > + * When dealing with HF/W/UW, it is usual having a register with<br> > a F/D/UD, and<br> > + * then convert it to HF/W/UW, and not use again the F/D/UD<br> > value. In those<br> > + * cases it would be possible to reuse the register where the F<br> > value is<br> > + * initially stored instead of having two. Take also into account<br> > that when<br> > + * operating with HF/W/UW, you would need to use the full<br> > register (so stride<br> > + * 2). Packs/unpacks would be only useful when loading/storing<br> > several<br> > + * HF/W/UWs.<br> > + *<br> > + * So something like this:<br> > + * mov(8) vgrf14<2>:HF, vgrf39:F<br> > + *<br> > + * Became:<br> > + * mov(8) vgrf39<2>:HF, vgrf39:F<br> > + *<br> > + * Note that no instruction is removed. The main benefict is<br> > reducing the<br> > + * amoung of registers used, so the pressure on the register<br> > allocator is<br> > + * decreased with big shaders.<br> > + */<br> > +bool<br> > +fs_visitor::reuse_16bit_conv<wbr>ersions_vgrf()<br> > +{<br> > + bool progress = false;<br> > + int ip = -1;<br> > +<br> > + calculate_live_intervals();<br> > +<br> > + foreach_block_and_inst_safe (block, fs_inst, inst, cfg) {<br> > + ip++;<br> > +<br> > + if (inst->dst.file != VGRF || inst->src[0].file != VGRF)<br> > + continue;<br> > +<br> > + if (inst->opcode != BRW_OPCODE_MOV)<br> > + continue;<br> > +<br> > + if (type_sz(inst->dst.type) != 2 || inst->dst.stride != 2 ||<br> > + type_sz(inst->src[0].type) != 4 || inst->src[0].stride<br> > != 1) {<br> > + continue;<br> > + }<br> > +<br> > + int src_reg = inst->src[0].nr;<br> > + int src_offset = inst->src[0].offset;<br> > + unsigned src_var = live_intervals->var_from_vgrf[<wbr>src_reg];<br> > + int src_end = live_intervals->end[src_var];<br> </div>> + int dst_reg = inst-><a href="http://dst.nr" rel="noreferrer" target="_blank">dst.nr</a> <<a href="http://dst.nr" rel="noreferrer" target="_blank">http://dst.nr</a>>;<br> <div class="m_-8615757320103330257quoted-text">> +<br> > + if (src_end > ip)<br> > + continue;<br> > +<br> > + foreach_block_and_inst(block, fs_inst, scan_inst, cfg) {<br> > + if (scan_inst->dst.file == VGRF &&<br> </div>> + scan_inst-><a href="http://dst.nr" rel="noreferrer" target="_blank">dst.nr</a> <<a href="http://dst.nr" rel="noreferrer" target="_blank">http://dst.nr</a>> == dst_reg) {<br> > + scan_inst-><a href="http://dst.nr" rel="noreferrer" target="_blank">dst.nr</a> <<a href="http://dst.nr" rel="noreferrer" target="_blank">http://dst.nr</a>> = src_reg;<br> <div class="m_-8615757320103330257elided-text">> + scan_inst->dst.offset = src_offset;<br> > + progress = true;<br> > + }<br> > +<br> > + for (int i = 0; i < scan_inst->sources; i++) {<br> > + if (scan_inst->src[i].file == VGRF &&<br> > + scan_inst->src[i].nr == dst_reg) {<br> > + scan_inst->src[i].nr = src_reg;<br> > + scan_inst->src[i].offset = src_offset;<br> > + progress = true;<br> > + }<br> > + }<br> > + }<br> > + }<br> > +<br> > + if (progress)<br> > + invalidate_live_intervals();<br> > +<br> > + return progress;<br> > +}<br> > +<br> ><br> > static void<br> > clear_deps_for_inst_src(fs_ins<wbr>t *inst, bool *deps, int first_grf,<br> > int grf_len)<br> > @@ -5829,6 +5905,7 @@ fs_visitor::optimize()<br> ><br> > OPT(opt_drop_redundant_mov_<wbr>to_flags);<br> > OPT(remove_extra_rounding_mod<wbr>es);<br> > + OPT(reuse_16bit_conversions_v<wbr>grf);<br> ><br> > do {<br> > progress = false;<br> > diff --git a/src/intel/compiler/brw_fs.h b/src/intel/compiler/brw_fs.h<br> > index b9476e69edb..2685f748b87 100644<br> > --- a/src/intel/compiler/brw_fs.h<br> > +++ b/src/intel/compiler/brw_fs.h<br> > @@ -151,6 +151,7 @@ public:<br> > bool dead_code_eliminate();<br> > bool remove_duplicate_mrf_writes();<br> > bool remove_extra_rounding_modes();<br> > + bool reuse_16bit_conversions_vgrf()<wbr>;<br> ><br> > bool opt_sampler_eot();<br> > bool virtual_grf_interferes(int a, int b);<br> > --<br> > 2.11.0<br> ><br> > _____________________________<wbr>__________________<br> > mesa-dev mailing list<br> </div>> <a href="mailto:mesa-dev@lists.freedesktop.org" target="_blank">mesa-dev@lists.freedesktop.<wbr>org</a> <mailto:<a href="mailto:mesa-dev@lists.freedesktop.org" target="_blank">mesa-dev@lists.freedes<wbr>ktop.org</a>><br> > <a href="https://lists.freedesktop.org/mailman/listinfo/mesa-dev" rel="noreferrer" target="_blank">https://lists.freedesktop.<wbr>org/mailman/listinfo/mesa-dev</a><br> > <<a href="https://lists.freedesktop.org/mailman/listinfo/mesa-dev" rel="noreferrer" target="_blank">https://lists.freedesktop.or<wbr>g/mailman/listinfo/mesa-dev</a>><br> ><br> ><br> ><br> <br> <br> </blockquote></div><br></div></div></div>