<div dir="ltr"><div class="gmail_extra"><div class="gmail_quote">On Wed, Sep 6, 2017 at 3:58 PM, Chema Casanova <<a href="mailto:jmcasanova@igalia.com" target="_blank">jmcasanova@igalia.com</a>> wrote: <blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">Hi Connor and Curro, <div><div class="h5"> On 28/08/17 12:24, Alejandro Piñeiro wrote: > On 27/08/17 20:24, Connor Abbott wrote: >> Hi, >> >> On Aug 25, 2017 9:28 AM, "Alejandro Piñeiro" <<a href="mailto:apinheiro@igalia.com">apinheiro@igalia.com</a> >> <mailto:<a href="mailto:apinheiro@igalia.com">apinheiro@igalia.com</a>>> wrote: >> >> On 24/08/17 21:07, Connor Abbott wrote: >> > >> > Hi Alejandro, >> >> Hi Connor, >> >> > >> > This seems really suspicious. If the live ranges are really >> > independent, then the register allocator should be able to >> assign the >> > two virtual registers to the same physical register if it needs to. >> >> Yes, it is true, the register allocator should be able to assign two >> virtual registers to the same physical register. But that is done >> at the >> end (or really near the end), so late for the problem this >> optimization >> is trying to fix. >> >> >> Well, my understanding is that the problem is long compilation times >> due to spilling and our not-so-great implementation of it. So no, >> register allocation is not late for the problem. As both Curro and I >> explained, the change by itself can only pessimise register >> allocation, so if it helps then it must be due to a bug in the >> register allocator or a problem in a subsequent pass that's getting >> hidden by this one. > > Ok. > >> >> We are also reducing the amount of instructions used. >> >> >> The comments in the source code say otherwise. Any instructions >> eliminated were from spilling, which this pass only accidentally reduces. > > Yes, sorry, I explained myself poorly. The optimization itself doesn't > remove any instructions. But using it reduces the final number of > instructions, although as you say, they are likely due reducing the > spilling. > >> >> >> >> Probably not really clear on the commit message. When I say >> "reduce the >> pressure of the register allocator" I mean having a code that the >> register allocator would be able to handle without using too much >> time. >> The problem this optimization tries to solve is that for some 16 >> bit CTS >> tests (some with matrices and geometry shaders), the amount of virtual >> registers used and instructions was really big. For the record, >> initially, some tests needed 24 min just to compile. Right now, thanks >> to other optimizations, the slower test without this optimization >> needs >> 1min 30 seconds. Adding some hacky timestamps, the time used at >> fs_visitor::allocate_registers (brw_fs.cpp:6096) is: >> >> * While trying to schedule using the three available pre mode >> heuristics: 7 seconds >> * Allocation with spilling: 63 seconds >> * Final schedule using SCHEDULE_POST: 19 seconds >> >> With this optimization, the total time goes down to 14 seconds (10 >> + 0 + >> 3 on the previous bullet point list). >> >> One could argue that 1min 30 seconds is okish. But taking into account >> that it goes down to 14 seconds, even with some caveats (see below), I >> still think that it is worth to use the optimization. >> >> And a final comment. For that same test, this is the final stats >> (using >> INTEL_DEBUG): >> >> * With the optimization: SIMD8 shader: 4610 instructions. 0 loops. >> 130320 cycles. 15:9 spills:fills. >> * Without the optimization: SIMD8 shader: 12312 instructions. 0 >> loops. >> 174816 cycles. 751:1851 spills:fills. >> >> >> So, the fact that it helps at all with SIMD8 shows that my theory is >> wrong, but since your pass reduces spilling, it clearly must be >> avoiding a bug somewhere else. You need to compare the IR for a shader >> with the problem with and without this pass right before register >> allocation. Maybe the sources and destinations of the conversion >> instructions interfere without the change due to some other pass >> that's increasing register pressure, in which case that's the problem, >> but I doubt it. > > Ok, thanks for the hints. </div></div>After some research we found that we need to adapt the live_variables algorithm to support 32 to 16-bit conversions. Because of the HW alignment restrictions these conversions need that the result register uses stride=2, so it is not continuous (stride!=1) so by definition is_partial_write returns true. Any of the next last 3 conditions could be true when we use 16-bit types. bool fs_inst::is_partial_write() const { return ((this->predicate && this->opcode != BRW_OPCODE_SEL) || (this->exec_size * type_sz(this->dst.type)) < 32 || !this->dst.is_contiguous() || this->dst.offset % REG_SIZE != 0); } So at the check on the setup_one_write function at brw_fs_live_variables.cpp the variable isn't marked as defined completely in the block. if (inst->dst.file == VGRF && !inst->is_partial_write()) { if (!BITSET_TEST(bd->use, var)) BITSET_SET(bd->def, var); } That makes that the live start of the variable is expected to defined before the beginning of the block. In the commented test by Alejandro we have a big unrolled loop that increases the pressure of the register artificially making all result registers of a conversion to start life at instruction 0. </blockquote><div> </div><div>Quick drive-by comment. Curro and I talked about this quite a bit in the office yesterday as I've been hitting similar issues with the vec2->double conversion that we do when loading 64-bit values from UBOs and SSBOs. He's got a patch to liveness which makes registers which are only ever partially written not extned their live ranges infinitely upward. That's probably easier than trying to track things per-component.</div><div> </div><div>--Jason </div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"> A simpler approach to the submitted optimization would be to consider as a fully defined variable in a block the result of a 32-16 bit conversion. This way the register allocator has more freedom to assign the register because the live_interval is correct for this variable. Something like the following code: diff --git a/src/intel/compiler/brw_fs_live_variables.cpp b/src/intel/compiler/brw_fs_live_variables.cpp index c449672a51..98d8583eaa 100644 --- a/src/intel/compiler/brw_fs_live_variables.cpp +++ b/src/intel/compiler/brw_fs_live_variables.cpp @@ -83,7 +83,10 @@ fs_live_variables::setup_one_write(struct block_data *bd, fs_inst *inst, /* The def[] bitset marks when an initialization in a block completely * screens off previous updates of that variable (VGRF channel). */ - if (inst->dst.file == VGRF && !inst->is_partial_write()) { + if (inst->dst.file == VGRF && (!inst->is_partial_write() || + ((type_sz(inst->dst.type) == 2) && + (inst->opcode == BRW_OPCODE_MOV) && + (type_sz(inst->src[0].type) == 4)))) { if (!BITSET_TEST(bd->use, var)) BITSET_SET(bd->def, var); } This new code reduces the spills from 1217 to 1089 in our worse tests compared to using the submitted optimization and without regressions. Although this is a simple fix that address the problem with the conversions (that is the main use that 16bit arithmetic of the VK_KHR_16bit_storage extension), there are other cases that we didn't addressed with the reuse_16bit_conversions_register that need a more complex approach. I'm thinking about the cases when 16-bit types are packed/unpacked or shuffled/unshuffled in the storage operations. They are also partial_writes but the aggregation of several operations make that the register could be considered as completely defined in a block. - This would need doing a "subregister" tracking for stride=2 and offset=[0,2] for shuffled components that are written using 32-bit storage operations. - And the same for packet 16-bit values on SIMD8 that use half register continuously. So if only half of the register is defined for a given variable and only that half is read by the shader we can consider consistent that the variable is completely defined. But in the case that the different halves are defined in different blocks things get complicated but could stay as partial write. Does it makes sense to go deeper with an approach like this one? Maybe is not strictly needed for this series but it could be an interesting follow up for adjusting the liveness of the payloads of the 16-bit store operations and any future use of 16-bit types. Thanks in advance for your feedback. Chema >> (IIRC there's a debug option to print out the register pressure for >> each instruction, which will be helpful here). > > I tried to use that option back when I was working on this patch, > without too much luck. Will try again. > >> If that's not the case, you should check to see if the register >> allocator thinks the sources and destinations of the conversion >> instructions interfere, and if so figure out why - that will likely be >> the real bug. > > Ok. > > Thanks Connor and Curro for the comments. I will work on a different > solution. <div class="HOEnZb"><div class="h5">_______________________________________________ mesa-dev mailing list <a href="mailto:mesa-dev@lists.freedesktop.org">mesa-dev@lists.freedesktop.org</a> <a href="https://lists.freedesktop.org/mailman/listinfo/mesa-dev" rel="noreferrer" target="_blank">https://lists.freedesktop.org/mailman/listinfo/mesa-dev</a> </div></div></blockquote></div> </div></div>