<div dir="ltr"><div class="gmail_quote"><div dir="ltr">On Mon, Jan 21, 2019 at 4:55 AM Iago Toral <<a href="mailto:itoral@igalia.com">itoral@igalia.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div style="text-align:left;direction:ltr"><div>On Fri, 2019-01-18 at 11:51 -0600, Jason Ekstrand wrote:</div><blockquote type="cite" style="margin:0px 0px 0px 0.8ex;border-left:2px solid rgb(114,159,207);padding-left:1ex"><div dir="ltr"><div class="gmail_quote"><div dir="ltr">On Tue, Jan 15, 2019 at 7:55 AM Iago Toral Quiroga <<a href="mailto:itoral@igalia.com" target="_blank">itoral@igalia.com</a>> wrote:<br></div><blockquote type="cite" style="margin:0px 0px 0px 0.8ex;border-left:2px solid rgb(114,159,207);padding-left:1ex">Broadwell hardware has a bug that manifests in SIMD8 executions of<br> 16-bit MAD instructions when any of the sources is a Y or W component.<br> We pack these components in the same SIMD register as components X and<br> Z respectively, but starting at offset 16B (so they live in the second<br> half of the register). The problem does not exist in SKL or later.<br> <br> We work around this issue by moving any such sources to a temporary<br> starting at offset 0B. We want to do this after the main optimization loop<br> to prevent copy-propagation and friends to undo the fix.<br> <br> Reviewed-by: Topi Pohjolainen <<a href="mailto:topi.pohjolainen@intel.com" target="_blank">topi.pohjolainen@intel.com</a>><br> ---<br> src/intel/compiler/brw_fs.cpp | 48 +++++++++++++++++++++++++++++++++++<br> src/intel/compiler/brw_fs.h | 1 +<br> 2 files changed, 49 insertions(+)<br> <br> diff --git a/src/intel/compiler/brw_fs.cpp b/src/intel/compiler/brw_fs.cpp<br> index 0b3ec94e2d2..d6096cd667d 100644<br> --- a/src/intel/compiler/brw_fs.cpp<br> +++ b/src/intel/compiler/brw_fs.cpp<br> @@ -6540,6 +6540,48 @@ fs_visitor::optimize()<br> validate();<br> }<br> <br> +/**<br> + * Broadwell hardware has a bug that manifests in SIMD8 executions of 16-bit<br> + * MAD instructions when any of the sources is a Y or W component. We pack<br> + * these components in the same SIMD register as components X and Z<br> + * respectively, but starting at offset 16B (so they live in the second half<br> + * of the register).<br></blockquote><div><br></div><div>What exactly do you mean by a Y or W component? Is this for the case where you have a scalar that happens to land at certain offsets? Or does it apply to regular stride == 1 MADs? If it applied in the stride == 1 case, then I really don't see what this is doing to fix it. It might help if you provided some before and after assembly example.</div></div></div></blockquote><div><br></div><div>This happens when a source to a half-float MAD instruction starts at offset 16B, which at the time I wrote this patch, happened when we were packing the Y component or the W component of a 16-bit vec2/vec4 into the second half of a SIMD register. I have an old version of that branch and the CTS tests and was able to reproduce the problem. Here is a code trace which is not working as expected, making some CTS tests fail:</div><div><br></div><div>send(8) g16<1>UW g19<8,8,1>UD</div><div> data ( DC byte scattered read, 0, 4) mlen 1 rlen 1 { align1 1Q };</div><div>mov(8) g14.8<1>HF g16<16,8,2>HF { align1 1Q };</div><div>mad(8) g18<1>HF -g17<4,4,1>HF g14.8<4,4,1>HF g11<4,4,1>HF { align16 1Q };</div><div>mov(8) g21<2>W g18<8,8,1>W { align1 1Q };</div><div><br></div><div>If we run this pass, we would produce the same code, only that we would replace the MAD instruction with this:</div><div><br></div><div>mov(8) g22<1>HF g14.8<8,8,1>HF { align1 WE_all 1Q };</div><div>mad(8) g18<1>HF -g17<4,4,1>HF g22<4,4,1>HF g11<4,4,1>HF { align16 1Q };</div><div><br></div><div>Which makes the test pass.</div><div><br></div><div>It seems our compiler produces different code now than when I found this and these same tests now pass without this pass because we simply don't hit that scenario any more. It seems as if our shuffling code after a load is not attempting to pack 2 16-bit vector components in each VGRF anymore as we used to do when we implemented 16-bit storage and therefore we no longer hit this scenario. Independently of whether this change was intended or a bug, the hardware bug is there so I think we still want to have code to dea with it.</div></div></blockquote><div><br></div><div>Thanks for the info! It makes a lot more sense now. Does this only apply to wide reads or does it also fail with a stride of 0? Also, is it only 16 or is it all non-zero values or everything >= 16? It'd be good if we could get more data.<br></div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div style="text-align:left;direction:ltr"><blockquote type="cite" style="margin:0px 0px 0px 0.8ex;border-left:2px solid rgb(114,159,207);padding-left:1ex"><div dir="ltr"><blockquote type="cite" style="margin:0px 0px 0px 0.8ex;border-left:2px solid rgb(114,159,207);padding-left:1ex"><br></blockquote><div class="gmail_quote"><div>Also, this seems like something that should go in the new region restrictions pass as a special case in has_invalid_src_region.</div></div></div></blockquote><div><br></div><div>Yes, I guess that makes sense now that we have this pass. I'll put this there.</div><div><br></div><blockquote type="cite" style="margin:0px 0px 0px 0.8ex;border-left:2px solid rgb(114,159,207);padding-left:1ex"><div dir="ltr"><div class="gmail_quote"><div>--Jason<br></div><div> </div><blockquote type="cite" style="margin:0px 0px 0px 0.8ex;border-left:2px solid rgb(114,159,207);padding-left:1ex"> + *<br> + * We work around this issue by moving any such sources to a temporary<br> + * starting at offset 0B. We want to do this after the main optimization loop<br> + * to prevent copy-propagation and friends to undo the fix.<br> + */<br> +void<br> +fs_visitor::fixup_hf_mad()<br> +{<br> + if (devinfo->gen != 8)<br> + return;<br> +<br> + bool progress = false;<br> +<br> + foreach_block_and_inst_safe (block, fs_inst, inst, cfg) {<br> + if (inst->opcode != BRW_OPCODE_MAD ||<br> + inst->dst.type != BRW_REGISTER_TYPE_HF ||<br> + inst->exec_size > 8)<br> + continue;<br> +<br> + for (int i = 0; i < 3; i++) {<br> + if (inst->src[i].offset > 0) {<br> + assert(inst->src[i].type == BRW_REGISTER_TYPE_HF);<br> + const fs_builder ibld =<br> + <a href="http://bld.at" rel="noreferrer" target="_blank">bld.at</a>(block, inst).exec_all().group(inst->exec_size, 0);<br> + fs_reg tmp = ibld.vgrf(inst->src[i].type);<br> + ibld.MOV(tmp, inst->src[i]);<br> + inst->src[i] = tmp;<br> + progress = true;<br> + }<br> + }<br> + }<br> +<br> + if (progress)<br> + invalidate_live_intervals();<br> +}<br> +<br> /**<br> * Three source instruction must have a GRF/MRF destination register.<br> * ARF NULL is not allowed. Fix that up by allocating a temporary GRF.<br> @@ -6698,6 +6740,7 @@ fs_visitor::run_vs()<br> assign_curb_setup();<br> assign_vs_urb_setup();<br> <br> + fixup_hf_mad();<br> fixup_3src_null_dest();<br> allocate_registers(8, true);<br> <br> @@ -6782,6 +6825,7 @@ fs_visitor::run_tcs_single_patch()<br> assign_curb_setup();<br> assign_tcs_single_patch_urb_setup();<br> <br> + fixup_hf_mad();<br> fixup_3src_null_dest();<br> allocate_registers(8, true);<br> <br> @@ -6816,6 +6860,7 @@ fs_visitor::run_tes()<br> assign_curb_setup();<br> assign_tes_urb_setup();<br> <br> + fixup_hf_mad();<br> fixup_3src_null_dest();<br> allocate_registers(8, true);<br> <br> @@ -6865,6 +6910,7 @@ fs_visitor::run_gs()<br> assign_curb_setup();<br> assign_gs_urb_setup();<br> <br> + fixup_hf_mad();<br> fixup_3src_null_dest();<br> allocate_registers(8, true);<br> <br> @@ -6965,6 +7011,7 @@ fs_visitor::run_fs(bool allow_spilling, bool do_rep_send)<br> <br> assign_urb_setup();<br> <br> + fixup_hf_mad();<br> fixup_3src_null_dest();<br> allocate_registers(8, allow_spilling);<br> <br> @@ -7009,6 +7056,7 @@ fs_visitor::run_cs(unsigned min_dispatch_width)<br> <br> assign_curb_setup();<br> <br> + fixup_hf_mad();<br> fixup_3src_null_dest();<br> allocate_registers(min_dispatch_width, true);<br> <br> diff --git a/src/intel/compiler/brw_fs.h b/src/intel/compiler/brw_fs.h<br> index 68287bcdcea..1879d4bc7f7 100644<br> --- a/src/intel/compiler/brw_fs.h<br> +++ b/src/intel/compiler/brw_fs.h<br> @@ -103,6 +103,7 @@ public:<br> void setup_vs_payload();<br> void setup_gs_payload();<br> void setup_cs_payload();<br> + void fixup_hf_mad();<br> void fixup_3src_null_dest();<br> void assign_curb_setup();<br> void calculate_urb_setup();<br> </blockquote></div></div></blockquote></div> </blockquote></div></div>