[Mesa-dev] [PATCH 40/95] i965/vec4: add a SIMD lowering pass

Francisco Jerez currojerez at riseup.net
Wed Aug 3 21:32:07 UTC 2016


Iago Toral Quiroga <itoral at igalia.com> writes:

> Generally, instructions in Align16 mode only ever write to a single
> register and don't need any form of SIMD splitting, which is why we
> have never had a SIMD splitting pass in the vec4 backend. However,
> double-precision instructions typically write 2 registers and in
> some cases they run into certain hardware bugs and limitations
> that we need to work around by splitting the instructions so we only
> write to 1 register at a time.
>
> This patch implements a basic SIMD splitting pass for this purpose.
> Notice that it does not attempt to be as flexible and generic as the
> FS version because, as explained above, the cases where it needs to
> act are more limited, and we take advantage of that to simplify the
> implementation.
>
> Because we only use double-precision instructions in Align16 mode
> on gen7 (gen8+ is fully scalar and gens < 7 do not implement fp64),
> the pass is restricted to gen7 hardware.
>
> For now the pass only handles the gen7 restriction where any
> instruction that writes 2 registers also needs to read 2 registers.
> This affects double-precision instructions reading uniforms, for
> example. Later patches will extend the lowering pass to handle a few
> more cases.
> ---
>  src/mesa/drivers/dri/i965/brw_ir_vec4.h |   1 +
>  src/mesa/drivers/dri/i965/brw_vec4.cpp  | 100 +++++++++++++++++++++++++++++++-
>  src/mesa/drivers/dri/i965/brw_vec4.h    |   2 +
>  3 files changed, 102 insertions(+), 1 deletion(-)
>
> diff --git a/src/mesa/drivers/dri/i965/brw_ir_vec4.h b/src/mesa/drivers/dri/i965/brw_ir_vec4.h
> index 721772e..f66c093 100644
> --- a/src/mesa/drivers/dri/i965/brw_ir_vec4.h
> +++ b/src/mesa/drivers/dri/i965/brw_ir_vec4.h
> @@ -167,6 +167,7 @@ public:
>     unsigned sol_vertex; /**< gen6: used for setting dst index in SVB header */
>  
>     uint8_t exec_size;
> +   uint8_t group;
>  
>     bool is_send_from_grf();
>     unsigned regs_read(unsigned arg) const;
> diff --git a/src/mesa/drivers/dri/i965/brw_vec4.cpp b/src/mesa/drivers/dri/i965/brw_vec4.cpp
> index 8316691..829b7d3 100644
> --- a/src/mesa/drivers/dri/i965/brw_vec4.cpp
> +++ b/src/mesa/drivers/dri/i965/brw_vec4.cpp
> @@ -1947,6 +1947,101 @@ vec4_visitor::convert_to_hw_regs()
>     }
>  }
>  
> +/**
> + * Get the closest native SIMD width supported by the hardware for instruction
> + * \p inst.  The instruction will be left untouched by
> + * vec4_visitor::lower_simd_width() if the returned value matches the
> + * instruction's original execution size.
> + */
> +static unsigned
> +get_lowered_simd_width(const struct brw_device_info *devinfo,
> +                       const vec4_instruction *inst)
> +{
> +   /* For now we only need to split some cases of double-precision instructions
> +    * that write 2 registers. We only need to care about this in gen7 because
> +    * that is the only hardware that implements fp64 in Align16.
> +    */
> +   if (devinfo->gen != 7 || inst->regs_written < 2)
> +      return inst->exec_size;
> +
This check is logically part of the Gen7 workaround below, and it's
going to prevent adding more workarounds to this function that affect
non-Gen7 platforms or instructions with regs_written less than two.
Maybe wrap the workaround below in an if conditional with the converse
condition instead of returning early here?
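
Something like this is what I mean -- just a sketch on top of your
code, using only the names already in the patch, and keeping the PRM
quote next to the check it motivates:

   /* HSW PRM, 3D Media GPGPU Engine, Region Alignment Rules for Direct
    * Register Addressing:
    *
    *    "When destination spans two registers, the source MUST span two
    *     registers."
    */
   if (devinfo->gen == 7 && inst->regs_written >= 2) {
      for (unsigned i = 0; i < 3; i++) {
         if (inst->src[i].file != BAD_FILE && inst->regs_read(i) < 2)
            lowered_width = MIN2(lowered_width, 4);
      }
   }

That way additional workarounds for other platforms or instruction
shapes can be appended to the function as independent blocks.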

> +   unsigned lowered_width = MIN2(8, inst->exec_size);

The 8 above should be a 16; Align16 instructions can potentially be
wider than 8 channels.

> +
> +   /* HSW PRM, 3D Media GPGPU Engine, Region Alignment Rules for Direct
> +    * Register Addressing:
> +    *
> +    *    "When destination spans two registers, the source MUST span two
> +    *     registers."
> +    */
> +   for (unsigned i = 0; i < 3; i++) {
> +      if (inst->src[i].file == BAD_FILE)
> +         continue;
> +      if (inst->regs_read(i) < 2)
> +         lowered_width = MIN2(lowered_width, 4);
> +   }
> +

This is going to need a number of additional workarounds for IVB, but I
guess they can be added later on...

> +   return lowered_width;
> +}
> +
> +bool
> +vec4_visitor::lower_simd_width()
> +{
> +   bool progress = false;
> +
> +   foreach_block_and_inst_safe(block, vec4_instruction, inst, cfg) {
> +      const unsigned lowered_width = get_lowered_simd_width(devinfo, inst);
> +      assert(lowered_width <= inst->exec_size);
> +      if (lowered_width == inst->exec_size)
> +         continue;
> +
> +      /* For now we only support splitting 8-wide instructions into 4-wide */
> +      assert(inst->exec_size == 8 && lowered_width == 4);
> +
This seems artificial; there is no reason for the logic below to care
whether the original and lowered execution sizes are exactly these
values, and SIMD16 vec4 instructions are useful.

> +      /* We always split so that each lowered instruction writes exactly to
> +       * one register.
> +       */
> +      assert(inst->regs_written == inst->exec_size / lowered_width);
> +

AFAICT the only reason you need this is that the vec4 back-end is
missing a horiz_offset() helper like the FS back-end's, so you end up
using offset() below, which counts in GRF units instead of channels,
and the result would be wrong if each half of the destination register
weren't exactly one GRF.  Using something like horiz_offset() would
make the implementation easier to follow overall, so I suggest you
take that path and drop this restriction.
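
For reference, the kind of helper I have in mind would look roughly
like this -- just a sketch, the exact handling of sub-GRF offsets in
the vec4 IR may need fixing up, and you would want a dst_reg overload
as well:

   static inline src_reg
   horiz_offset(const src_reg &reg, unsigned delta)
   {
      /* Advance the region by \p delta channels of its type (i.e. by
       * delta * type_sz(reg.type) bytes) instead of by whole GRFs
       * like offset() does.
       */
      const unsigned byte_delta = delta * type_sz(reg.type);
      src_reg off = offset(reg, byte_delta / REG_SIZE);
      off.subnr += byte_delta % REG_SIZE;
      return off;
   }

With that the loop below could simply use
'horiz_offset(inst->dst, lowered_width * n)' and the equivalent for
the non-uniform sources, regardless of the type size.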

> +      for (unsigned n = 0; n < inst->exec_size / lowered_width; n++)  {
> +         dst_reg dst = offset(inst->dst, n);
> +
> +         src_reg srcs[3];
> +         for (int i = 0; i < 3; i++) {
> +            srcs[i] = inst->src[i];
> +
> +            if (srcs[i].file == BAD_FILE)
> +               continue;
> +
> +            if (!is_uniform(srcs[i])) {
> +               if (type_sz(srcs[i].type) == 8) {
> +                  srcs[i] = offset(srcs[i], n);
> +               } else {
> +                  assert(lowered_width * n < 8);
> +                  srcs[i].subnr += lowered_width * n;
> +               }
> +            }

You need to check for source/destination overlap in order to make sure
that the destination of the first lowered instruction doesn't
inadvertently corrupt the sources of the second if they aren't properly
aligned in the GRF file.  That's one of the reasons why the FS SIMD
lowering pass used to allocate temporaries for the sources and
destinations of lowered instructions; copy propagation would be able to
get rid of the copies in most cases.  I ended up implementing the same
optimization recently, which involved a fair amount of additional
complexity, so I wouldn't encourage you to do the same unless it's
really necessary.  I believe the easiest thing you could do would be to
allocate a temporary register to hold the destination of each lowered
half of the instruction and then emit MOVs into the real destination;
copy propagation will then hopefully eliminate the copies where they
aren't necessary.
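
Roughly like the following at the end of the loop body, once the
lowered instruction has been constructed -- sketch only, I haven't
double-checked the constructors, and the temporary's type and
writemask handling will need some care:

         /* Write each lowered half into a freshly allocated
          * temporary...
          */
         dst_reg tmp_dst(VGRF, alloc.allocate(1));
         tmp_dst.type = inst->dst.type;
         tmp_dst.writemask = inst->dst.writemask;

         linst->dst = tmp_dst;
         inst->insert_before(block, linst);

         /* ...and copy it into the real destination afterwards.  Copy
          * propagation should get rid of the MOV whenever there is no
          * actual overlap.
          */
         vec4_instruction *mov = new(mem_ctx)
            vec4_instruction(BRW_OPCODE_MOV, dst, src_reg(tmp_dst));
         mov->exec_size = lowered_width;
         mov->group = lowered_width * n;
         inst->insert_before(block, mov);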

> +         }
> +
> +         vec4_instruction *linst = new(mem_ctx)
> +            vec4_instruction(inst->opcode, dst, srcs[0], srcs[1], srcs[2]);
> +         linst->exec_size = lowered_width;
> +         linst->group = lowered_width * n;
> +         linst->regs_written = 1;
> +         linst->conditional_mod = inst->conditional_mod;
> +         linst->predicate = inst->predicate;
> +         linst->saturate = inst->saturate;

This seems like a buggy and open-coded copy constructor.  You're only
copying four of the many fields that can potentially affect the
behavior of a vec4_instruction.  Maybe define a proper copy constructor
instead?
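
I.e. do something similar to what the FS SIMD lowering pass does and
start from a copy of the original instruction, roughly -- sketch only,
assuming a copy constructor equivalent to fs_inst's is defined for
vec4_instruction:

         vec4_instruction *linst = new(mem_ctx) vec4_instruction(*inst);
         linst->dst = dst;
         for (int i = 0; i < 3; i++)
            linst->src[i] = srcs[i];
         linst->exec_size = lowered_width;
         linst->group = lowered_width * n;
         linst->regs_written = 1;
         inst->insert_before(block, linst);

That inherits conditional_mod, predicate, saturate and every other
field that might matter without having to enumerate them by hand.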

> +         inst->insert_before(block, linst);
> +      }
> +
> +      inst->remove(block);
> +      progress = true;
> +   }
> +
> +   if (progress)
> +      invalidate_live_intervals();
> +
> +   return progress;
> +}
> +
>  bool
>  vec4_visitor::run()
>  {
> @@ -2002,9 +2097,12 @@ vec4_visitor::run()
>        backend_shader::dump_instructions(filename);
>     }
>  
> -   bool progress;
> +   bool progress = false;
>     int iteration = 0;
>     int pass_num = 0;
> +
> +   OPT(lower_simd_width);
> +

I have a feeling that doing this before the optimization loop isn't
going to be a good plan.  How is this going to handle cases like
conditional SIMD splitting based on whether the strides and swizzles of
the instruction are supported, which are likely to change during
optimization and other lowering passes?

>     do {
>        progress = false;
>        pass_num = 0;
> diff --git a/src/mesa/drivers/dri/i965/brw_vec4.h b/src/mesa/drivers/dri/i965/brw_vec4.h
> index cf7cdab..e4c4e91 100644
> --- a/src/mesa/drivers/dri/i965/brw_vec4.h
> +++ b/src/mesa/drivers/dri/i965/brw_vec4.h
> @@ -160,6 +160,8 @@ public:
>     void opt_schedule_instructions();
>     void convert_to_hw_regs();
>  
> +   bool lower_simd_width();
> +
>     vec4_instruction *emit(vec4_instruction *inst);
>  
>     vec4_instruction *emit(enum opcode opcode);
> -- 
> 2.7.4
>