[Mesa-dev] [PATCH v3 25/42] intel/compiler: workaround for SIMD8 half-float MAD in gen8
Iago Toral
itoral at igalia.com
Mon Jan 21 10:55:31 UTC 2019
On Fri, 2019-01-18 at 11:51 -0600, Jason Ekstrand wrote:
> On Tue, Jan 15, 2019 at 7:55 AM Iago Toral Quiroga <itoral at igalia.com
> > wrote:
> > Broadwell hardware has a bug that manifests in SIMD8 executions of
> >
> > 16-bit MAD instructions when any of the sources is a Y or W
> > component.
> >
> > We pack these components in the same SIMD register as components X
> > and
> >
> > Z respectively, but starting at offset 16B (so they live in the
> > second
> >
> > half of the register). The problem does not exist in SKL or later.
> >
> >
> >
> > We work around this issue by moving any such sources to a temporary
> >
> > starting at offset 0B. We want to do this after the main
> > optimization loop
> >
> > to prevent copy-propagation and friends to undo the fix.
> >
> >
> >
> > Reviewed-by: Topi Pohjolainen <topi.pohjolainen at intel.com>
> >
> > ---
> >
> > src/intel/compiler/brw_fs.cpp | 48
> > +++++++++++++++++++++++++++++++++++
> >
> > src/intel/compiler/brw_fs.h | 1 +
> >
> > 2 files changed, 49 insertions(+)
> >
> >
> >
> > diff --git a/src/intel/compiler/brw_fs.cpp
> > b/src/intel/compiler/brw_fs.cpp
> >
> > index 0b3ec94e2d2..d6096cd667d 100644
> >
> > --- a/src/intel/compiler/brw_fs.cpp
> >
> > +++ b/src/intel/compiler/brw_fs.cpp
> >
> > @@ -6540,6 +6540,48 @@ fs_visitor::optimize()
> >
> > validate();
> >
> > }
> >
> >
> >
> > +/**
> >
> > + * Broadwell hardware has a bug that manifests in SIMD8 executions
> > of 16-bit
> >
> > + * MAD instructions when any of the sources is a Y or W component.
> > We pack
> >
> > + * these components in the same SIMD register as components X and
> > Z
> >
> > + * respectively, but starting at offset 16B (so they live in the
> > second half
> >
> > + * of the register).
>
> What exactly do you mean by a Y or W component? Is this for the case
> where you have a scalar that happens to land at certain offsets? Or
> does it apply to regular stride == 1 MADs? If it applied in the
> stride == 1 case, then I really don't see what this is doing to fix
> it. It might help if you provided some before and after assembly
> example.
This happens when a source to a half-float MAD instruction starts at
offset 16B, which at the time I wrote this patch, happened when we
were packing the Y component or the W component of a 16-bit vec2/vec4
into the second half of a SIMD register. I have an old version of that
branch and the CTS tests and was able to reproduce the problem. Here is
a code trace which is not working as expected, making some CTS tests
fail:
send(8) g16<1>UW g19<8,8,1>UD
data ( DC byte scattered read, 0, 4) mlen 1 rlen 1 { align1 1Q
};mov(8) g14.8<1>HF g16<16,8,2>HF {
align1 1Q };mad(8) g18<1>HF -
g17<4,4,1>HF g14.8<4,4,1>HF g11<4,4,1>HF { align16 1Q
};mov(8) g21<2>W g18<8,8,1>W {
align1 1Q };
If we run this pass, we would produce the same code, only that we would
replace the MAD instruction with this:
mov(8) g22<1>HF g14.8<8,8,1>HF {
align1 WE_all 1Q };mad(8) g18<1>HF -
g17<4,4,1>HF g22<4,4,1>HF g11<4,4,1>HF { align16 1Q };
Which makes the test pass.
It seems our compiler produces different code now than when I found
this and these same tests now pass without this pass because we simply
don't hit that scenario any more. It seems as if our shuffling code
after a load is not attempting to pack 2 16-bit vector components in
each VGRF anymore as we used to do when we implemented 16-bit storage
and therefore we no longer hit this scenario. Independently of whether
this change was intended or a bug, the hardware bug is there so I think
we still want to have code to dea with it.
> Also, this seems like something that should go in the new region
> restrictions pass as a special case in has_invalid_src_region.
Yes, I guess that makes sense now that we have this pass. I'll put this
there.
> --Jason
>
> > + *
> >
> > + * We work around this issue by moving any such sources to a
> > temporary
> >
> > + * starting at offset 0B. We want to do this after the main
> > optimization loop
> >
> > + * to prevent copy-propagation and friends to undo the fix.
> >
> > + */
> >
> > +void
> >
> > +fs_visitor::fixup_hf_mad()
> >
> > +{
> >
> > + if (devinfo->gen != 8)
> >
> > + return;
> >
> > +
> >
> > + bool progress = false;
> >
> > +
> >
> > + foreach_block_and_inst_safe (block, fs_inst, inst, cfg) {
> >
> > + if (inst->opcode != BRW_OPCODE_MAD ||
> >
> > + inst->dst.type != BRW_REGISTER_TYPE_HF ||
> >
> > + inst->exec_size > 8)
> >
> > + continue;
> >
> > +
> >
> > + for (int i = 0; i < 3; i++) {
> >
> > + if (inst->src[i].offset > 0) {
> >
> > + assert(inst->src[i].type == BRW_REGISTER_TYPE_HF);
> >
> > + const fs_builder ibld =
> >
> > + bld.at(block, inst).exec_all().group(inst-
> > >exec_size, 0);
> >
> > + fs_reg tmp = ibld.vgrf(inst->src[i].type);
> >
> > + ibld.MOV(tmp, inst->src[i]);
> >
> > + inst->src[i] = tmp;
> >
> > + progress = true;
> >
> > + }
> >
> > + }
> >
> > + }
> >
> > +
> >
> > + if (progress)
> >
> > + invalidate_live_intervals();
> >
> > +}
> >
> > +
> >
> > /**
> >
> > * Three source instruction must have a GRF/MRF destination
> > register.
> >
> > * ARF NULL is not allowed. Fix that up by allocating a temporary
> > GRF.
> >
> > @@ -6698,6 +6740,7 @@ fs_visitor::run_vs()
> >
> > assign_curb_setup();
> >
> > assign_vs_urb_setup();
> >
> >
> >
> > + fixup_hf_mad();
> >
> > fixup_3src_null_dest();
> >
> > allocate_registers(8, true);
> >
> >
> >
> > @@ -6782,6 +6825,7 @@ fs_visitor::run_tcs_single_patch()
> >
> > assign_curb_setup();
> >
> > assign_tcs_single_patch_urb_setup();
> >
> >
> >
> > + fixup_hf_mad();
> >
> > fixup_3src_null_dest();
> >
> > allocate_registers(8, true);
> >
> >
> >
> > @@ -6816,6 +6860,7 @@ fs_visitor::run_tes()
> >
> > assign_curb_setup();
> >
> > assign_tes_urb_setup();
> >
> >
> >
> > + fixup_hf_mad();
> >
> > fixup_3src_null_dest();
> >
> > allocate_registers(8, true);
> >
> >
> >
> > @@ -6865,6 +6910,7 @@ fs_visitor::run_gs()
> >
> > assign_curb_setup();
> >
> > assign_gs_urb_setup();
> >
> >
> >
> > + fixup_hf_mad();
> >
> > fixup_3src_null_dest();
> >
> > allocate_registers(8, true);
> >
> >
> >
> > @@ -6965,6 +7011,7 @@ fs_visitor::run_fs(bool allow_spilling, bool
> > do_rep_send)
> >
> >
> >
> > assign_urb_setup();
> >
> >
> >
> > + fixup_hf_mad();
> >
> > fixup_3src_null_dest();
> >
> > allocate_registers(8, allow_spilling);
> >
> >
> >
> > @@ -7009,6 +7056,7 @@ fs_visitor::run_cs(unsigned
> > min_dispatch_width)
> >
> >
> >
> > assign_curb_setup();
> >
> >
> >
> > + fixup_hf_mad();
> >
> > fixup_3src_null_dest();
> >
> > allocate_registers(min_dispatch_width, true);
> >
> >
> >
> > diff --git a/src/intel/compiler/brw_fs.h
> > b/src/intel/compiler/brw_fs.h
> >
> > index 68287bcdcea..1879d4bc7f7 100644
> >
> > --- a/src/intel/compiler/brw_fs.h
> >
> > +++ b/src/intel/compiler/brw_fs.h
> >
> > @@ -103,6 +103,7 @@ public:
> >
> > void setup_vs_payload();
> >
> > void setup_gs_payload();
> >
> > void setup_cs_payload();
> >
> > + void fixup_hf_mad();
> >
> > void fixup_3src_null_dest();
> >
> > void assign_curb_setup();
> >
> > void calculate_urb_setup();
> >
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://lists.freedesktop.org/archives/mesa-dev/attachments/20190121/0743bd07/attachment.html>
More information about the mesa-dev
mailing list