<div dir="auto"><br><div class="gmail_extra" dir="auto"><span style="font-family:sans-serif">Hi Alejandro</span><span style="font-family:sans-serif">,</span></div><div class="gmail_extra" dir="auto"><div dir="auto" style="font-family:sans-serif"><br></div><div dir="auto" style="font-family:sans-serif">This seems really suspicious. If the live ranges are really independent, then the register allocator should be able to assign the two virtual registers to the same physical register if it needs to. This change forces the two to be the same, which constrains the register allocator unecessarily and should make it worse, so I'm confused as to why this would help at all.</div><div dir="auto" style="font-family:sans-serif"><br></div><div dir="auto" style="font-family:sans-serif">IIRC there were some issues where we unnecessarily made the sources and destination of an instruction interefere with each other, but if that's what's causing this, then we should fix that underlying issue.</div><div dir="auto" style="font-family:sans-serif"><br></div><div dir="auto" style="font-family:sans-serif">(From what I remember, a lot of SIMD16 were expanded to SIMD8 in the generator, in which case the second half of the source is read after the first half of the destination is written, and we falsely thought that the HW did that too, so we had some code to add a fake interference between them, but a while ago Curro moved the expansion to happen before register allocation. I don't have the code in front of me, but I think we still have this useless code lying around, and I would guess this is the source of the problem.)</div><div dir="auto" style="font-family:sans-serif"><br></div><div dir="auto" style="font-family:sans-serif">Connor</div><br><div class="gmail_quote" dir="auto">On Aug 24, 2017 2:59 PM, "Alejandro Piñeiro" <<a href="mailto:apinheiro@igalia.com" target="_blank">apinheiro@igalia.com</a>> wrote:<br type="attribution"><blockquote class="m_2012952128143901494quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">When dealing with HF/U/UW, it is usual having a register with a<br>
F/D/UD, and then convert it to HF/U/UW, and not use again the F/D/UD<br>
value. In those cases it would be possible to reuse the register where<br>
the F value is initially stored instead of having two. Take also into<br>
account that when operating with HF/U/UW, you would need to use the<br>
full register (so stride 2). Packs/unpacks would be only useful when<br>
loading/storing several HF/W/UW.<br>
<br>
Note that no instruction is removed. The main benefict is reducing the<br>
amoung of registers used, so the pressure on the register allocator is<br>
decreased with big shaders.<br>
<br>
Possibly this could be integrated into an existing optimization, at it<br>
is even already done by the register allocator, but was far easier to<br>
write and cleaner to read as a separate optimization.<br>
<br>
We found this issue when dealing with some Vulkan CTS tests that<br>
needed several minutes to compile. Most of the time was spent on the<br>
register allocator.<br>
<br>
Right now the optimization only handles 32 to 16 bit conversion. It<br>
could be possible to do the equivalent for 16 to 32 bit too, but in<br>
practice, we didn't need it.<br>
---<br>
src/intel/compiler/brw_fs.cpp | 77 ++++++++++++++++++++++++++++++<wbr>+++++++++++++<br>
src/intel/compiler/brw_fs.h | 1 +<br>
2 files changed, 78 insertions(+)<br>
<br>
diff --git a/src/intel/compiler/brw_fs.cp<wbr>p b/src/intel/compiler/brw_fs.cp<wbr>p<br>
index b6013a5ce85..1342150b44e 100644<br>
--- a/src/intel/compiler/brw_fs.cp<wbr>p<br>
+++ b/src/intel/compiler/brw_fs.cp<wbr>p<br>
@@ -39,6 +39,7 @@<br>
#include "compiler/glsl_types.h"<br>
#include "compiler/nir/nir_builder.h"<br>
#include "program/prog_parameter.h"<br>
+#include "brw_fs_live_variables.h"<br>
<br>
using namespace brw;<br>
<br>
@@ -3133,6 +3134,81 @@ fs_visitor::remove_extra_round<wbr>ing_modes()<br>
return progress;<br>
}<br>
<br>
+/**<br>
+ * When dealing with HF/W/UW, it is usual having a register with a F/D/UD, and<br>
+ * then convert it to HF/W/UW, and not use again the F/D/UD value. In those<br>
+ * cases it would be possible to reuse the register where the F value is<br>
+ * initially stored instead of having two. Take also into account that when<br>
+ * operating with HF/W/UW, you would need to use the full register (so stride<br>
+ * 2). Packs/unpacks would be only useful when loading/storing several<br>
+ * HF/W/UWs.<br>
+ *<br>
+ * So something like this:<br>
+ * mov(8) vgrf14<2>:HF, vgrf39:F<br>
+ *<br>
+ * Became:<br>
+ * mov(8) vgrf39<2>:HF, vgrf39:F<br>
+ *<br>
+ * Note that no instruction is removed. The main benefict is reducing the<br>
+ * amoung of registers used, so the pressure on the register allocator is<br>
+ * decreased with big shaders.<br>
+ */<br>
+bool<br>
+fs_visitor::reuse_16bit_conve<wbr>rsions_vgrf()<br>
+{<br>
+ bool progress = false;<br>
+ int ip = -1;<br>
+<br>
+ calculate_live_intervals();<br>
+<br>
+ foreach_block_and_inst_safe (block, fs_inst, inst, cfg) {<br>
+ ip++;<br>
+<br>
+ if (inst->dst.file != VGRF || inst->src[0].file != VGRF)<br>
+ continue;<br>
+<br>
+ if (inst->opcode != BRW_OPCODE_MOV)<br>
+ continue;<br>
+<br>
+ if (type_sz(inst->dst.type) != 2 || inst->dst.stride != 2 ||<br>
+ type_sz(inst->src[0].type) != 4 || inst->src[0].stride != 1) {<br>
+ continue;<br>
+ }<br>
+<br>
+ int src_reg = inst->src[0].nr;<br>
+ int src_offset = inst->src[0].offset;<br>
+ unsigned src_var = live_intervals->var_from_vgrf[<wbr>src_reg];<br>
+ int src_end = live_intervals->end[src_var];<br>
+ int dst_reg = inst-><a href="http://dst.nr" rel="noreferrer" target="_blank">dst.nr</a>;<br>
+<br>
+ if (src_end > ip)<br>
+ continue;<br>
+<br>
+ foreach_block_and_inst(block, fs_inst, scan_inst, cfg) {<br>
+ if (scan_inst->dst.file == VGRF &&<br>
+ scan_inst-><a href="http://dst.nr" rel="noreferrer" target="_blank">dst.nr</a> == dst_reg) {<br>
+ scan_inst-><a href="http://dst.nr" rel="noreferrer" target="_blank">dst.nr</a> = src_reg;<br>
+ scan_inst->dst.offset = src_offset;<br>
+ progress = true;<br>
+ }<br>
+<br>
+ for (int i = 0; i < scan_inst->sources; i++) {<br>
+ if (scan_inst->src[i].file == VGRF &&<br>
+ scan_inst->src[i].nr == dst_reg) {<br>
+ scan_inst->src[i].nr = src_reg;<br>
+ scan_inst->src[i].offset = src_offset;<br>
+ progress = true;<br>
+ }<br>
+ }<br>
+ }<br>
+ }<br>
+<br>
+ if (progress)<br>
+ invalidate_live_intervals();<br>
+<br>
+ return progress;<br>
+}<br>
+<br>
<br>
static void<br>
clear_deps_for_inst_src(fs_in<wbr>st *inst, bool *deps, int first_grf, int grf_len)<br>
@@ -5829,6 +5905,7 @@ fs_visitor::optimize()<br>
<br>
OPT(opt_drop_redundant_mov_to_<wbr>flags);<br>
OPT(remove_extra_rounding_mode<wbr>s);<br>
+ OPT(reuse_16bit_conversions_v<wbr>grf);<br>
<br>
do {<br>
progress = false;<br>
diff --git a/src/intel/compiler/brw_fs.h b/src/intel/compiler/brw_fs.h<br>
index b9476e69edb..2685f748b87 100644<br>
--- a/src/intel/compiler/brw_fs.h<br>
+++ b/src/intel/compiler/brw_fs.h<br>
@@ -151,6 +151,7 @@ public:<br>
bool dead_code_eliminate();<br>
bool remove_duplicate_mrf_writes();<br>
bool remove_extra_rounding_modes();<br>
+ bool reuse_16bit_conversions_vgrf()<wbr>;<br>
<br>
bool opt_sampler_eot();<br>
bool virtual_grf_interferes(int a, int b);<br>
<font color="#888888">--<br>
2.11.0<br>
<br>
______________________________<wbr>_________________<br>
mesa-dev mailing list<br>
<a href="mailto:mesa-dev@lists.freedesktop.org" target="_blank">mesa-dev@lists.freedesktop.org</a><br>
<a href="https://lists.freedesktop.org/mailman/listinfo/mesa-dev" rel="noreferrer" target="_blank">https://lists.freedesktop.org/<wbr>mailman/listinfo/mesa-dev</a><br>
</font></blockquote></div><br><br></div></div>