<div dir="ltr"><div class="gmail_quote"><div dir="ltr">On Thu, Jan 17, 2019 at 5:11 PM Francisco Jerez <<a href="mailto:currojerez@riseup.net">currojerez@riseup.net</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">Subject is still inaccurate. How about "intel/fs: Don't touch<br> accumulator destination while applying regioning alignment rule."<br></blockquote><div><br></div><div>Sounds good to me. I'll fix it.<br></div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"> <br> Jason Ekstrand <<a href="mailto:jason@jlekstrand.net" target="_blank">jason@jlekstrand.net</a>> writes:<br> <br> > In some shaders, you can end up with a stride in the source of a<br> > SHADER_OPCODE_MULH. One way this can happen is if the MULH is acting on<br> > the top bits of a 64-bit value due to 64-bit integer lowering. In this<br> > case, the compiler will produce something like this:<br> ><br> > mul(8) acc0<1>UD g5<8,4,2>UD 0x0004UW { align1 1Q };<br> > mach(8) g6<1>UD g5<8,4,2>UD 0x00000004UD { align1 1Q AccWrEnable };<br> ><br> > The new region fixup pass looks at the MUL and sees a strided source and<br> > unstrided destination and determines that the sequence is illegal. It<br> > then attempts to fix the illegal stride by replacing the destination of<br> > the MUL with a temporary and emitting a MOV into the accumulator:<br> ><br> > mul(8) g9<2>UD g5<8,4,2>UD 0x0004UW { align1 1Q };<br> > mov(8) acc0<1>UD g9<8,4,2>UD { align1 1Q };<br> > mach(8) g6<1>UD g5<8,4,2>UD 0x00000004UD { align1 1Q AccWrEnable };<br> ><br> > Unfortunately, this new sequence isn't correct because MOV accesses the<br> > accumulator with a different precision to MUL and, instead of filling<br> > the bottom 32 bits with the source and zeroing the top 32 bits, it<br> > leaves the top 32 (or maybe 31) bits alone and full of garbage. When<br> > the MACH comes along and tries to complete the multiplication, the<br> > result is correct in the bottom 32 bits (which we throw away) and<br> > garbage in the top 32 bits which are actually returned by MACH.<br> ><br> > This commit does two things: First, it adds an assert to ensure that we<br> > don't try to rewrite accumulator destinations of MUL instructions so we<br> > can avoid this precision issue. Second, it modifies<br> > required_dst_byte_stride to require a tightly packed stride so that we<br> > fix up the sources instead and the actual code which gets emitted is<br> > this:<br> ><br> > mov(8) g9<1>UD g5<8,4,2>UD { align1 1Q };<br> > mul(8) acc0<1>UD g9<8,8,1>UD 0x0004UW { align1 1Q };<br> > mach(8) g6<1>UD g5<8,4,2>UD 0x00000004UD { align1 1Q AccWrEnable };<br> ><br> > Fixes: efa4e4bc5fc "intel/fs: Introduce regioning lowering pass"<br> > Cc: Francisco Jerez <<a href="mailto:currojerez@riseup.net" target="_blank">currojerez@riseup.net</a>><br> > ---<br> > src/intel/compiler/brw_fs_lower_regioning.cpp | 24 ++++++++++++++++++-<br> > 1 file changed, 23 insertions(+), 1 deletion(-)<br> ><br> > diff --git a/src/intel/compiler/brw_fs_lower_regioning.cpp b/src/intel/compiler/brw_fs_lower_regioning.cpp<br> > index cc4163b4c2c..00cb5769ebe 100644<br> > --- a/src/intel/compiler/brw_fs_lower_regioning.cpp<br> > +++ b/src/intel/compiler/brw_fs_lower_regioning.cpp<br> > @@ -53,7 +53,21 @@ namespace {<br> > unsigned<br> > required_dst_byte_stride(const fs_inst *inst)<br> > {<br> > - if (type_sz(inst->dst.type) < get_exec_type_size(inst) &&<br> > + if (inst->dst.is_accumulator()) {<br> > + /* If the destination is an accumulator, insist that we leave the<br> > + * stride alone. We cannot "fix" accumulator destinations by writing<br> > + * to a temporary and emitting a MOV into the original destination.<br> > + * For multiply instructions (our one use of the accumulator), the<br> > + * MUL writes the full 66 bits of the accumulator whereas the MOV we<br> > + * would emit only writes 33 bits ane leaves the top 33 bits<br> <br> and*<br> <br> > + * undefined.<br> > + *<br> > + * It's safe to just require the original stride here because the<br> > + * lowering pass will detect the mismatch in required_src_byte_stride<br> > + * just fix up the sources of the multiply instead of the destination.<br> <br> There isn't such a thing as "required_src_byte_stride". Conjunction<br> missing between that sentence and the "just fix up..." one.<br> <br> Code is still:<br> <br> Reviewed-by: Francisco Jerez <<a href="mailto:currojerez@riseup.net" target="_blank">currojerez@riseup.net</a>><br></blockquote><div><br></div><div>Thanks! I'll do one more Jenkins run and push it.</div><div><br></div><div>--Jason<br></div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"> > + */<br> > + return inst->dst.stride * type_sz(inst->dst.type);<br> > + } else if (type_sz(inst->dst.type) < get_exec_type_size(inst) &&<br> > !is_byte_raw_mov(inst)) {<br> > return get_exec_type_size(inst);<br> > } else {<br> > @@ -316,6 +330,14 @@ namespace {<br> > bool<br> > lower_dst_region(fs_visitor *v, bblock_t *block, fs_inst *inst)<br> > {<br> > + /* We cannot replace the result of an integer multiply which writes the<br> > + * accumulator because MUL+MACH pairs act on the accumulator as a 66-bit<br> > + * value whereas the MOV will act on only 32 or 33 bits of the<br> > + * accumulator.<br> > + */<br> > + assert(inst->opcode != BRW_OPCODE_MUL || !inst->dst.is_accumulator() ||<br> > + brw_reg_type_is_floating_point(inst->dst.type));<br> > +<br> > const fs_builder ibld(v, block, inst);<br> > const unsigned stride = required_dst_byte_stride(inst) /<br> > type_sz(inst->dst.type);<br> > -- <br> > 2.20.1<br> </blockquote></div></div>