<div dir="ltr"><div dir="ltr"><div dir="ltr"><div dir="ltr"><div dir="ltr"><div dir="ltr"><div dir="ltr"><div dir="ltr">Bah... previous e-mail unfinished. Please ignore.<br></div><br><div class="gmail_quote"><div dir="ltr">On Thu, Jan 17, 2019 at 4:15 AM Francisco Jerez <<a href="mailto:currojerez@riseup.net">currojerez@riseup.net</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">Jason Ekstrand <<a href="mailto:jason@jlekstrand.net" target="_blank">jason@jlekstrand.net</a>> writes:<br> <br> > The pass was discovered to cause problems with the MUL+MACH combination<br> > we emit for nir_[iu]mul_high. In an experimental branch of mine, I ran<br> > into issues where the MUL+MACH ended up using a strided source due to<br> > working on half of a uint64_t and the new lowering pass helpfully tried<br> > to fix the multiply which wrote to an unstriated accumulator.<br> <br> > Not only did the multiply not need to be fixed<br> <br> That's far from clear, and inconsistent with what this patch is doing,<br> since the fix is still being applied (Wouldn't it make sense to clarify<br> that in the commit message since it's slightly misleading about it?).<br> <br> The original instruction was technically violating the first CHV/BXT<br> double-precision regioning restriction before the pass was introduced,<br> that's why it made any changes in the first place. The integer<br> multiplication lowering code was just lucky enough that violating the<br> restriction didn't matter in this case, but I doubt that the reason for<br> that had anything to do with the accumulator being the explicit<br> destination...<br></blockquote><div><br></div><div><div>Explicit, no, but I do suspect that does have to do with it being the accumulator. This restriction isn't theoretical; if you violate it with a GRF, you will get data corruption; I've seen it myself. I could see two possible explanations:</div><div><br></div><div> 1. Under the hood the accumulator is written with a Q type and an internal stride of 8 bytes, hence the restriction does apply but is implicitly satisfied for D type source strides of 1 and 2.</div><div> 2. The data path to the accumulator is a special case in the hardware and doesn't use the normal general-purpose regioning logic and so doesn't require the restriction.<br></div><div><br></div><div>I don't find the first one very convincing at all. Among other things, if it were the true reason, it would imply that we would need to use a stride of exactly 2 on D type sources which but empirical evidence suggests that "mul(8) acc0<1> g5<8,8,1>UD g9<16,8,2>UW" works just fine.<br></div></div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"> > but the "fix" ended up breaking it because a MOV to the accumulator is<br> > not the same as using it as a multiply destination due to the magic<br> > way the 33/64 bits of the<br> <br> Technically it has 66 bits (it wasn't a typo when I said that to you<br> earlier on IRC). That's how it can t hold the result of a SIMD16<br> 16x16-bit integer multiplication with 33-bit signed precision per scalar<br> component.<br></blockquote><div><br></div><div><div>Yes, there are 33 bits available for WxW multiplies but this is dealing with a DxD multiply which only has 64 bits according to this bit of bspec text:<br></div><div><br></div><div>As there are only 64 bits per channel in DWord mode (D and UD), it is sufficient to store the multiplication result of two DWord operands as long as the post source modified sources are still within 32 bits. If any one source is type UD and is negated, the negated result becomes 33 bits. The DWord multiplication result is then 65 bits, bigger than the storage capacity of accumulators. Consequently, the results are unpredictable.</div></div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"> > accumulator are handled for different instruction types.<br> ><br> > Fixes: efa4e4bc5fc "intel/fs: Introduce regioning lowering pass"<br> > Cc: Francisco Jerez <<a href="mailto:currojerez@riseup.net" target="_blank">currojerez@riseup.net</a>><br> > ---<br> > src/intel/compiler/brw_fs_lower_regioning.cpp | 16 +++++++++++++++-<br> > 1 file changed, 15 insertions(+), 1 deletion(-)<br> ><br> > diff --git a/src/intel/compiler/brw_fs_lower_regioning.cpp b/src/intel/compiler/brw_fs_lower_regioning.cpp<br> > index cc4163b4c2c..b8a89e82272 100644<br> > --- a/src/intel/compiler/brw_fs_lower_regioning.cpp<br> > +++ b/src/intel/compiler/brw_fs_lower_regioning.cpp<br> > @@ -53,7 +53,13 @@ namespace {<br> > unsigned<br> > required_dst_byte_stride(const fs_inst *inst)<br> > {<br> > - if (type_sz(inst->dst.type) < get_exec_type_size(inst) &&<br> > + if (inst->dst.is_accumulator()) {<br> > + /* Even though it's not explicitly documented in the PRMs or the<br> > + * BSpec, writes to the accumulator appear to not need any special<br> > + * treatment with respect too their destination stride alignment.<br> > + */<br> <br> The code is not really doing what the comment says. The<br> destination/source stride alignment restriction will still be honored<br> for this instruction. It's just that the destination *has* to be left<br> untouched while doing that in the case of an integer MUL/MACH<br> instruction (that's the only reason I asked you to return the original<br> byte stride of the destination), because splitting off the region into a<br> MOV would lead to data loss due to the inconsistent semantics of the<br> accumulator destination for integer MUL/MACH (which update the whole 66<br> bits) and every other integer arithmetic instruction (which update the<br> bottom 33 bits and *apparently* leave the top 33 bits uninitialized) --<br> IOW this is only here so that the assert below doesn't fire.<br></blockquote><div><br></div><div>Ok, now I'm very confused. It sounds to me like you still think it's broken but it's less broken than a MOV to acc0 so we should just go with it? That's very disconcerting.... If there's more investigation you'd like to see done, I'm willing to poke at it a bit more and see if we can get a better understanding.<br></div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"> > + return inst->dst.stride * type_sz(inst->dst.type);<br> > + } else if (type_sz(inst->dst.type) < get_exec_type_size(inst) &&<br> <br> The code changes themselves are just as I wished, so this gets my:<br> <br> Reviewed-by: Francisco Jerez <<a href="mailto:currojerez@riseup.net" target="_blank">currojerez@riseup.net</a>><br> <br> assuming that you clarify the commit message and comment above.<br></blockquote><div><br></div><div>I agree that the commit message above was a bit terse but it's entirely unclear what it *should* say. How about something like this:</div><div><br></div><div>BEGIN COMMITMSG<br></div><div>In some shaders, you can end up with a stride in the source of a SHADER_OPCODE_MULH. One way this can happen is if the MULH is acting on the top bits of a 64-bit value due to 64-bit integer lowering. In this case, the compiler will produce something like this:</div><div><br></div><div>mul(8) acc0<1>UD g5<8,4,2>UD 0x0004UW { align1 1Q };<br>mach(8) g6<1>UD g5<8,4,2>UD 0x00000004UD { align1 1Q AccWrEnable };<br><br></div><div>The new region fixup pass looks at the MUL and sees a strided source and unstrided destination and determines that the sequence is illegal. It then attempts to fix the illegal stride by replacing the destination of the MUL with a temporary and emitting a MOV into the accumulator:</div><div><br></div><div>mul(8) g9<2>UD g5<8,4,2>UD 0x0004UW { align1 1Q };<br>mov(8) acc0<1>UD g9<8,4,2>UD { align1 1Q };<br>mach(8) g6<1>UD g5<8,4,2>UD 0x00000004UD { align1 1Q AccWrEnable };<br><br></div><div>Unfortunately, this new sequence isn't correct because MOV accesses the accumulator with a different precision to MUL and, instead of filling the bottom 32 bits with the source and zeroing the top 32 bits, it leaves the top 32 (or maybe 31) bits alone and full of garbage. When the MACH comes along and tries to complete the multiplication, the result is correct in the bottom 32 bits (which we throw away) and garbage in the top 32 bits which are actually returned by MACH.</div><div><br></div><div>This commit does two things: First, it adds an assert to ensure that we don't try to rewrite accumulator destinations of MUL instructions so we can avoid this precision issue. Second, it modifies required_dst_byte_stride to require a tightly packed stride so that we fix up the sources instead and the actual code which gets emitted is this:</div><div><br></div><div>mov(8) g9<1>UD g5<8,4,2>UD { align1 1Q };<br>mul(8) acc0<1>UD g9<8,8,1>UD 0x0004UW { align1 1Q };<br>mach(8) g6<1>UD g5<8,4,2>UD 0x00000004UD { align1 1Q AccWrEnable };</div><div>END COMMITMSG<br></div><div><br></div><div>Is that a better commit message?</div><div><br></div><div>You may be saying to yourself, "See, the working code has an unstrided source because we fix the source; the restriction still applies" Well... I tried that too... I modified the fixup pass to just completely ignore the restriction if it has an accumulator destination and it emits the following:</div><div><br></div><div>mul(8) acc0<1>UD g5<8,4,2>UD 0x0004UW { align1 1Q };<br>mach(8) g6<1>UD g5<8,4,2>UD 0x00000004UD { align1 1Q AccWrEnable };<br></div><div><br></div><div>which works perfectly well. I suspect that the source strides still have to match but I really don't think the hardware cares about the destination stride when it's an accumulator.</div><div><br></div><div>--Jason<br></div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"> > !is_byte_raw_mov(inst)) {<br> > return get_exec_type_size(inst);<br> > } else {<br> > @@ -316,6 +322,14 @@ namespace {<br> > bool<br> > lower_dst_region(fs_visitor *v, bblock_t *block, fs_inst *inst)<br> > {<br> > + /* We cannot replace the result of an integer multiply which writes the<br> > + * accumulator because MUL+MACH pairs act on the accumulator as a 64-bit<br> > + * value whereas the MOV will act on only 32 or 33 bits of the<br> > + * accumulator.<br> > + */<br> > + assert(inst->opcode != BRW_OPCODE_MUL || !inst->dst.is_accumulator() ||<br> > + brw_reg_type_is_floating_point(inst->dst.type));<br> > +<br> > const fs_builder ibld(v, block, inst);<br> > const unsigned stride = required_dst_byte_stride(inst) /<br> > type_sz(inst->dst.type);<br> > -- <br> > 2.20.1<br> </blockquote></div></div></div></div></div></div></div></div>