[Mesa-dev] [PATCH] i965/vec4: Opportunistically coalesce SIMD8 instructions

Tue Feb 17 16:59:37 PST 2015

On Tue, Feb 17, 2015 at 4:44 PM, Ben Widawsky
<benjamin.widawsky at intel.com> wrote:
> With scalar VS, it so happens that many vertex shaders will line up in such a
> way that two SIMD8 instructions can be collapsed into 1 SIMD16 instruction. For
> example
>
> The following two MOVs
> mov(8)          g124<1>F        g6<8,8,1>F                      { align1 1Q compacted };
> mov(8)          g125<1>F        g7<8,8,1>F                      { align1 1Q compacted };
>
> Could be represented as a single MOV
> mov(16)         g124<1>F        g6<8,8,1>F                      { align1 1H compacted };
>
> The basic algorithm is very simple. For two consecutive instructions, check if
> all source and dst registers are adjacent. If so, reuse the first instruction
> by adjusting the compression bits and then killing the second instruction. The
> caveat (shown above) is that 1Q->1H is insufficient. As mentioned in the comments,
> the second quarter of the DMask is invalid for us, so we actually must generate
> the following if possible:
> mov(16)         g124<1>F        g6<8,8,1>F                      { align1 WE_all 1H compacted };
>
> The next step would be to try informing the instruction scheduler and register
> allocator to make this happen more often. Anecdotally the most frequent occurrence
> is for the blit shader generated by meta, and it always leaves things in good
> order for us.
>
> The scalar VS is only available on later platforms. This same thing could be
> applied to the FS, but there we hope to be using SIMD16 already for most
> instructions. It shouldn't hurt to throw this same optimization at the FS for
> cases where we have to fall back though.
>
> Cc: Kenneth Graunke <kenneth at whitecape.org>
> Cc: Kristian Høgsberg <krh at bitplanet.net>
> Signed-off-by: Ben Widawsky <ben at bwidawsk.net>
> ---
>
> I have not had time to benchmark this very much, nor run piglit on it. I am just
> sending it out before it bitrots too much further.

I would be surprised if it had a measurable effect. Compressed
instructions (i.e., SIMD16) are apparently just split into a pair of
SIMD8 instructions by the instruction decoder. So, this should
basically just be reducing code size, like instruction compaction.
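
For reference, the merge described in the commit message can be sketched on a
simplified model. This is only an illustration, not the patch's code: the Inst
struct, ops_adjacent() and coalesce_simd8() are hypothetical names, each
instruction is reduced to one destination and one source GRF number, and the
sketch assumes WE_all is always safe to set, which the real pass would have to
check.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical, simplified stand-in for fs_inst: one destination and one
// source, each identified only by a GRF number.
struct Inst {
   int opcode;
   int exec_size;   // 8 or 16 channels
   int dst_reg;
   int src_reg;
   bool we_all;     // true = ignore the execution mask (WE_all)
};

// Same idea as is_ops_adjacent() in the patch: identical opcode, both SIMD8,
// and the second instruction's registers directly follow the first's.
static bool ops_adjacent(const Inst &a, const Inst &b)
{
   return a.opcode == b.opcode &&
          a.exec_size == 8 && b.exec_size == 8 &&
          a.dst_reg == b.dst_reg - 1 &&
          a.src_reg == b.src_reg - 1;
}

// Walk consecutive pairs; on a match, widen the first instruction to SIMD16
// with WE_all and delete the second. Returns the number of merges performed.
static std::size_t coalesce_simd8(std::vector<Inst> &insts)
{
   std::size_t merged = 0;
   for (std::size_t i = 0; i + 1 < insts.size(); i++) {
      if (!ops_adjacent(insts[i], insts[i + 1]))
         continue;
      insts[i].exec_size = 16;
      insts[i].we_all = true;   // assumption: always legal in this sketch
      insts.erase(insts.begin() + static_cast<std::ptrdiff_t>(i) + 1);
      merged++;
   }
   return merged;
}
```

Feeding it the two MOVs from the commit message (g124 <- g6, g125 <- g7)
leaves a single 16-wide instruction writing g124 with WE_all set. An already
widened instruction is never merged again because ops_adjacent() requires both
inputs to be SIMD8.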

>
> ---
>
>  src/mesa/drivers/dri/i965/brw_fs.cpp | 74 ++++++++++++++++++++++++++++++++++++
>  src/mesa/drivers/dri/i965/brw_fs.h   |  1 +
>  2 files changed, 75 insertions(+)
>
> diff --git a/src/mesa/drivers/dri/i965/brw_fs.cpp b/src/mesa/drivers/dri/i965/brw_fs.cpp
> index 200a494..cc21cdf 100644
> --- a/src/mesa/drivers/dri/i965/brw_fs.cpp
> +++ b/src/mesa/drivers/dri/i965/brw_fs.cpp
> @@ -3716,6 +3716,78 @@ fs_visitor::allocate_registers()
>        prog_data->total_scratch = brw_get_scratch_size(last_scratch);
>  }
>
> +static bool
> +is_ops_adjacent(fs_inst *a, fs_inst *b)
> +{
> +   if (a->opcode != b->opcode)
> +      return false;
> +
> +   if (a->dst.reg != b->dst.reg - 1)
> +      return false;
> +
> +   assert(a->sources == b->sources);
> +
> +   for (int i = 0; i < a->sources; i++) {
> +      if (a->src[i].file != b->src[i].file)
> +         return false;
> +
> +      if (a->src[i].file == HW_REG &&
> +          (a->src[i].fixed_hw_reg.nr == b->src[i].fixed_hw_reg.nr - 1))
> +         continue;
> +      else if (a->src[i].file == GRF &&
> +               (a->src[i].reg ==  b->src[i].reg - 1))
> +         continue;
> +      else if (a->src[i].file == IMM &&
> +               a->src[i].fixed_hw_reg.dw1.ud == b->src[i].fixed_hw_reg.dw1.ud)
> +         continue;
> +
> +      return false;
> +   }
> +
> +   return true;
> +}
> +
> +/* Try to upconvert a SIMD8 instruction into a fake SIMD16 instruction.
> + *
> + * If we have two operations in sequence, and they are using sequentially
> + * contiguous operands, the two SIMD8 instructions may be combined into 1 SIMD16
> + * instruction. For example:
> + * mov(8)          g124<1>F        g6<8,8,1>F
> + * mov(8)          g125<1>F        g7<8,8,1>F
> + *
> + * Is the same as:
> + * mov(16)         g124<1>F        g6<8,8,1>F
> + *
> + * This is trickier than it initially sounds. On the surface it sounds like a
> + * good idea to simply combine the instructions as shown above, and convert
> + * 1Q->1H. The main problem is that we're executing the shader with SIMD8 mode.
> + * This means that 1/4 of the DMask is useful, and the rest is junk. All we can
> + * do therefore is use WE_all if possible.

Oh, wow.

Presumably you tried without setting WE_all and it failed piglit?

I've never figured out what the high bits of the execution mask
contain in a SIMD8 shader. Something I read made me think the low 8
bits simply repeated (seems like a useful behavior), and other text
makes me think they're undefined. From the IVB PRM:

Note: When branching instructions are predicated, branching is
evaluated on all channels enabled at dispatch. This means, the
appropriate number of flag register bits must be initialized or used
in predication depending on the execution mask (EMask). Uninitialized
flags may result in undesired branching. For example, if using DMask
as EMask and if all 32 channels of DMask are enabled, a SIMD8 kernel
must initialize unused flag bits so that predication on branching is
evaluated correctly.
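
To make the concern concrete, here is a toy model of channel enables. It is
not hardware-accurate: enabled_channels() is a hypothetical helper, it ignores
predication and quarter control, and it simply treats the execution mask as a
32-bit value whose bits above the dispatch width may hold anything.

```cpp
#include <cassert>
#include <cstdint>

// Toy model only: which channels a non-predicated instruction of the given
// width would write. With WE_all the execution mask is ignored entirely;
// otherwise the low exec_size bits of the mask decide.
static std::uint32_t enabled_channels(std::uint32_t emask, unsigned exec_size,
                                      bool we_all)
{
   const std::uint32_t width_mask =
      exec_size >= 32 ? 0xffffffffu : ((1u << exec_size) - 1u);
   return we_all ? width_mask : (emask & width_mask);
}
```

In this model a SIMD8 instruction only ever looks at bits 0-7, so junk in the
upper bits is harmless. A 16-wide instruction without WE_all picks up whatever
sits in bits 8-15, while WE_all enables all 16 channels regardless. That is
the behavior the patch's comment relies on, under the assumptions above.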