[Mesa-dev] [PATCH v3 25/42] intel/compiler: workaround for SIMD8 half-float MAD in gen8

Tue Jan 22 00:48:45 UTC 2019

On Mon, Jan 21, 2019 at 4:55 AM Iago Toral <itoral at igalia.com> wrote:

> On Fri, 2019-01-18 at 11:51 -0600, Jason Ekstrand wrote:
>
> On Tue, Jan 15, 2019 at 7:55 AM Iago Toral Quiroga <itoral at igalia.com>
> wrote:
>
> Broadwell hardware has a bug that manifests in SIMD8 executions of
> 16-bit MAD instructions when any of the sources is a Y or W component.
> We pack these components in the same SIMD register as components X and
> Z respectively, but starting at offset 16B (so they live in the second
> half of the register). The problem does not exist in SKL or later.
>
> We work around this issue by moving any such sources to a temporary
> starting at offset 0B. We want to do this after the main optimization loop
> to prevent copy-propagation and friends to undo the fix.
>
> Reviewed-by: Topi Pohjolainen <topi.pohjolainen at intel.com>
> ---
>  src/intel/compiler/brw_fs.cpp | 48 +++++++++++++++++++++++++++++++++++
>  src/intel/compiler/brw_fs.h   |  1 +
>  2 files changed, 49 insertions(+)
>
> diff --git a/src/intel/compiler/brw_fs.cpp b/src/intel/compiler/brw_fs.cpp
> index 0b3ec94e2d2..d6096cd667d 100644
> --- a/src/intel/compiler/brw_fs.cpp
> +++ b/src/intel/compiler/brw_fs.cpp
> @@ -6540,6 +6540,48 @@ fs_visitor::optimize()
>     validate();
>  }
>
> +/**
> + * Broadwell hardware has a bug that manifests in SIMD8 executions of
> 16-bit
> + * MAD instructions when any of the sources is a Y or W component. We pack
> + * these components in the same SIMD register as components X and Z
> + * respectively, but starting at offset 16B (so they live in the second
> half
> + * of the register).
>
>
> What exactly do you mean by a Y or W component?  Is this for the case
> where you have a scalar that happens to land at certain offsets?  Or does
> it apply to regular stride == 1 MADs?  If it applied in the stride == 1
> case, then I really don't see what this is doing to fix it.  It might help
> if you provided some before and after assembly example.
>
>
> This happens when a source to a half-float MAD instruction starts at
> offset 16B, which at the time I wrote this patch, happened when we were
> packing the Y component or the W component of a 16-bit vec2/vec4 into the
> second half of a SIMD register. I have an old version of that branch and
> the CTS tests and was able to reproduce the problem. Here is a code trace
> which is not working as expected, making some CTS tests fail:
>
> send(8)         g16<1>UW        g19<8,8,1>UD
>                             data ( DC byte scattered read, 0, 4) mlen 1
> rlen 1 { align1 1Q };
> mov(8)          g14.8<1>HF      g16<16,8,2>HF                   { align1
> 1Q };
> mad(8)          g18<1>HF        -g17<4,4,1>HF   g14.8<4,4,1>HF  g11<4,4,1>HF
> { align16 1Q };
> mov(8)          g21<2>W         g18<8,8,1>W                     { align1
> 1Q };
>
> If we run this pass, we would produce the same code, only that we would
> replace the MAD instruction with this:
>
> mov(8)          g22<1>HF        g14.8<8,8,1>HF                  { align1
> WE_all 1Q };
> mad(8)          g18<1>HF        -g17<4,4,1>HF   g22<4,4,1>HF    g11<4,4,1>HF
> { align16 1Q };
>
> Which makes the test pass.
>
> It seems our compiler produces different code now than when I found this
> and these same tests now pass without this pass because we simply don't hit
> that scenario any more. It seems as if our shuffling code after a load is
> not attempting to pack 2 16-bit vector components in each VGRF anymore as
> we used to do when we implemented 16-bit storage and therefore we no longer
> hit this scenario. Independently of whether this change was intended or a
> bug, the hardware bug is there so I think we still want to have code to dea
> with it.
>

Thanks for the info!  It makes a lot more sense now.  Does this only apply
to wide reads or does it also fail with a stride of 0?  Also, is it only 16
or is it all non-zero values or everything >= 16?  It'd be good if we could
get more data.

>
> Also, this seems like something that should go in the new region
> restrictions pass as a special case in has_invalid_src_region.
>
>
> Yes, I guess that makes sense now that we have this pass. I'll put this
> there.
>
> --Jason
>
>
> + *
> + * We work around this issue by moving any such sources to a temporary
> + * starting at offset 0B. We want to do this after the main optimization
> loop
> + * to prevent copy-propagation and friends to undo the fix.
> + */
> +void
> +fs_visitor::fixup_hf_mad()
> +{
> +   if (devinfo->gen != 8)
> +      return;
> +
> +   bool progress = false;
> +
> +   foreach_block_and_inst_safe (block, fs_inst, inst, cfg) {
> +      if (inst->opcode != BRW_OPCODE_MAD ||
> +          inst->dst.type != BRW_REGISTER_TYPE_HF ||
> +          inst->exec_size > 8)
> +         continue;
> +
> +      for (int i = 0; i < 3; i++) {
> +         if (inst->src[i].offset > 0) {
> +            assert(inst->src[i].type == BRW_REGISTER_TYPE_HF);
> +            const fs_builder ibld =
> +               bld.at(block, inst).exec_all().group(inst->exec_size, 0);
> +            fs_reg tmp = ibld.vgrf(inst->src[i].type);
> +            ibld.MOV(tmp, inst->src[i]);
> +            inst->src[i] = tmp;
> +            progress = true;
> +         }
> +      }
> +   }
> +
> +   if (progress)
> +      invalidate_live_intervals();
> +}
> +
>  /**
>   * Three source instruction must have a GRF/MRF destination register.
>   * ARF NULL is not allowed.  Fix that up by allocating a temporary GRF.
> @@ -6698,6 +6740,7 @@ fs_visitor::run_vs()
>     assign_curb_setup();
>     assign_vs_urb_setup();
>
> +   fixup_hf_mad();
>     fixup_3src_null_dest();
>     allocate_registers(8, true);
>
> @@ -6782,6 +6825,7 @@ fs_visitor::run_tcs_single_patch()
>     assign_curb_setup();
>     assign_tcs_single_patch_urb_setup();
>
> +   fixup_hf_mad();
>     fixup_3src_null_dest();
>     allocate_registers(8, true);
>
> @@ -6816,6 +6860,7 @@ fs_visitor::run_tes()
>     assign_curb_setup();
>     assign_tes_urb_setup();
>
> +   fixup_hf_mad();
>     fixup_3src_null_dest();
>     allocate_registers(8, true);
>
> @@ -6865,6 +6910,7 @@ fs_visitor::run_gs()
>     assign_curb_setup();
>     assign_gs_urb_setup();
>
> +   fixup_hf_mad();
>     fixup_3src_null_dest();
>     allocate_registers(8, true);
>
> @@ -6965,6 +7011,7 @@ fs_visitor::run_fs(bool allow_spilling, bool
> do_rep_send)
>
>        assign_urb_setup();
>
> +      fixup_hf_mad();
>        fixup_3src_null_dest();
>        allocate_registers(8, allow_spilling);
>
> @@ -7009,6 +7056,7 @@ fs_visitor::run_cs(unsigned min_dispatch_width)
>
>     assign_curb_setup();
>
> +   fixup_hf_mad();
>     fixup_3src_null_dest();
>     allocate_registers(min_dispatch_width, true);
>
> diff --git a/src/intel/compiler/brw_fs.h b/src/intel/compiler/brw_fs.h
> index 68287bcdcea..1879d4bc7f7 100644
> --- a/src/intel/compiler/brw_fs.h
> +++ b/src/intel/compiler/brw_fs.h
> @@ -103,6 +103,7 @@ public:
>     void setup_vs_payload();
>     void setup_gs_payload();
>     void setup_cs_payload();
> +   void fixup_hf_mad();
>     void fixup_3src_null_dest();
>     void assign_curb_setup();
>     void calculate_urb_setup();
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://lists.freedesktop.org/archives/mesa-dev/attachments/20190121/4cfd9415/attachment-0001.html>