[Mesa-dev] [PATCH] i965/vec4: Opportunistically coalesce SIMD8 instructions
Kenneth Graunke
kenneth at whitecape.org
Tue Feb 17 21:39:11 PST 2015
On Tuesday, February 17, 2015 04:59:37 PM Matt Turner wrote:
> On Tue, Feb 17, 2015 at 4:44 PM, Ben Widawsky
> <benjamin.widawsky at intel.com> wrote:
> > With scalar VS, it so happens that many vertex shaders will line up in such a
> > way that two SIMD8 instructions can be collapsed into one SIMD16 instruction. For
> > example:
> >
> > The following two MOVs
> > mov(8) g124<1>F g6<8,8,1>F { align1 1Q compacted };
> > mov(8) g125<1>F g7<8,8,1>F { align1 1Q compacted };
> >
> > Could be represented as a single MOV
> > mov(16) g124<1>F g6<8,8,1>F { align1 1H compacted };
> >
> > The basic algorithm is very simple. For two consecutive instructions, check if
> > all source and dst registers are adjacent. If so, reuse the first instruction
> > by adjusting the compression bits, then kill the second instruction. The
> > caveat (shown above) is that 1Q->1H is insufficient. As mentioned in the comments,
> > the second quarter of the DMask is invalid for us, so we actually must generate
> > the following if possible:
> > mov(16) g124<1>F g6<8,8,1>F { align1 WE_all 1H compacted };
> >
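To make the pairing rule above concrete, here is a minimal standalone sketch of it, outside of Mesa. The Inst struct and the adjacent() and coalesce() helpers are hypothetical stand-ins invented for illustration; they are not fs_inst or anything in the patch, and the sketch ignores predication, register files, and immediates.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for an instruction; NOT Mesa's fs_inst.
struct Inst {
    int opcode;
    int exec_size;          // 8 or 16 channels
    bool we_all;            // ignore the execution mask when true
    int dst;                // destination GRF number
    std::vector<int> src;   // source GRF numbers
};

// True when b's destination and every source sit exactly one register
// after a's, and both are SIMD8 instructions with the same opcode.
static bool adjacent(const Inst &a, const Inst &b)
{
    if (a.opcode != b.opcode || a.exec_size != 8 || b.exec_size != 8)
        return false;
    if (a.src.size() != b.src.size() || a.dst + 1 != b.dst)
        return false;
    for (std::size_t i = 0; i < a.src.size(); i++)
        if (a.src[i] + 1 != b.src[i])
            return false;
    return true;
}

// Merge each adjacent SIMD8 pair into one SIMD16 instruction with WE_all
// set, reusing the first instruction and erasing the second.
// Returns the number of instructions removed.
static int coalesce(std::vector<Inst> &prog)
{
    int removed = 0;
    for (std::size_t i = 0; i + 1 < prog.size(); i++) {
        if (!adjacent(prog[i], prog[i + 1]))
            continue;
        assert(prog[i].exec_size == 8);
        prog[i].exec_size = 16;
        prog[i].we_all = true;
        prog.erase(prog.begin() + i + 1);
        removed++;
    }
    return removed;
}
```

Feeding it the two MOVs from the example (dst 124/125, src 6/7) leaves a single 16-wide instruction at dst 124. Because a merged instruction is no longer SIMD8, it is never merged a second time.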
> > The next step would be to try informing the instruction scheduler and register
> > allocator to make this happen more often. Anecdotally, the most frequent occurrence
> > is in the blit shader generated by meta, and it always leaves things in good
> > order for us.
> >
> > The scalar VS is only available on later platforms. This same thing could be
> > applied to the FS, but there we hope to be using SIMD16 already for most
> > instructions. It shouldn't hurt to throw this same optimization at the FS for
> > cases where we have to fall back though.
> >
> > Cc: Kenneth Graunke <kenneth at whitecape.org>
> > Cc: Kristian Høgsberg <krh at bitplanet.net>
> > Signed-off-by: Ben Widawsky <ben at bwidawsk.net>
> > ---
> >
> > I have not had time to benchmark this very much, nor run piglit on it. I am just
> > sending it out before it bitrots too much further.
>
> I would be surprised if it had a measurable effect. Compressed
> instructions (i.e., SIMD16) are apparently just split into a pair of
> SIMD8 instructions by the instruction decoder. So, this should
> basically just be reducing code size, like instruction compaction.
>
> >
> > ---
> >
> > src/mesa/drivers/dri/i965/brw_fs.cpp | 74 ++++++++++++++++++++++++++++++++++++
> > src/mesa/drivers/dri/i965/brw_fs.h | 1 +
> > 2 files changed, 75 insertions(+)
> >
> > diff --git a/src/mesa/drivers/dri/i965/brw_fs.cpp b/src/mesa/drivers/dri/i965/brw_fs.cpp
> > index 200a494..cc21cdf 100644
> > --- a/src/mesa/drivers/dri/i965/brw_fs.cpp
> > +++ b/src/mesa/drivers/dri/i965/brw_fs.cpp
> > @@ -3716,6 +3716,78 @@ fs_visitor::allocate_registers()
> > prog_data->total_scratch = brw_get_scratch_size(last_scratch);
> > }
> >
> > +static bool
> > +is_ops_adjacent(fs_inst *a, fs_inst *b)
> > +{
> > +   if (a->opcode != b->opcode)
> > +      return false;
> > +
> > +   if (a->dst.reg != b->dst.reg - 1)
> > +      return false;
> > +
> > +   assert(a->sources == b->sources);
> > +
> > +   for (int i = 0; i < a->sources; i++) {
> > +      if (a->src[i].file != b->src[i].file)
> > +         return false;
> > +
> > +      if (a->src[i].file == HW_REG &&
> > +          (a->src[i].fixed_hw_reg.nr == b->src[i].fixed_hw_reg.nr - 1))
> > +         continue;
> > +      else if (a->src[i].file == GRF &&
> > +               (a->src[i].reg == b->src[i].reg - 1))
> > +         continue;
> > +      else if (a->src[i].file == IMM &&
> > +               a->src[i].fixed_hw_reg.dw1.ud == b->src[i].fixed_hw_reg.dw1.ud)
> > +         continue;
> > +
> > +      return false;
> > +   }
> > +
> > +   return true;
> > +}
> > +
> > +/* Try to upconvert a SIMD8 instruction into a fake SIMD16 instruction.
> > + *
> > + * If we have two operations in sequence, and they are using sequentially
> > + * contiguous operands, the two SIMD8 instructions may be combined into 1 SIMD16
> > + * instruction. For example:
> > + * mov(8) g124<1>F g6<8,8,1>F
> > + * mov(8) g125<1>F g7<8,8,1>F
> > + *
> > + * Is the same as:
> > + * mov(16) g124<1>F g6<8,8,1>F
> > + *
> > + * This is trickier than it initially sounds. On the surface it sounds like a
> > + * good idea to simply combine the instructions as shown above, and convert
> > + * 1Q->1H. The main problem is that we're executing the shader with SIMD8 mode.
> > + * This means that 1/4 of the DMask is useful, and the rest is junk. All we can
> > + * do therefore is use WE_all if possible.
>
> Oh, wow.
>
> Presumably you tried without setting WE_all and it failed piglit?
>
> I've never figured out what the high bits of the execution mask
> contain in a SIMD8 shader. Something I read made me think the low 8
> bits simply repeated (seems like a useful behavior), and other text
> makes me think they're undefined. From the IVB PRM:
>
> Note: When branching instructions are predicated, branching is
> evaluated on all channels enabled at dispatch. This means, the
> appropriate number of flag register bits must be initialized or used
> in predication depending on the execution mask (EMask). Uninitialized
> flags may result in undesired branching. For example, if using DMask
> as EMask and if all 32 channels of DMask are enabled, a SIMD8 kernel
> must initialize unused flag bits so that predication on branching is
> evaluated correctly.
Another thing of note: on Gen8+, we use VMask as EMask. Previously, we
used DMask. I believe I saw failures in glsl-fs-derivs without the
VECTOR_MASK_ENABLE flag in gen8_ps_state.c.
Just in case it's related.
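
To illustrate why the contents of the upper mask bits matter for a coalesced instruction, here is a toy model of a 16-channel write. It is a sketch under assumptions, not real hardware semantics: exec_mov16() and channels_written() are hypothetical helpers, and the mask values used with them are made up, since what the high bits actually hold in a SIMD8 dispatch is exactly the open question.

```cpp
#include <array>
#include <cstdint>

// Hypothetical toy model of one 16-wide MOV; not real hardware behavior.
// A channel is written when its bit is set in emask, or unconditionally
// when we_all is true.
static std::array<int, 16> exec_mov16(std::array<int, 16> dst,
                                      const std::array<int, 16> &src,
                                      uint32_t emask, bool we_all)
{
    for (int ch = 0; ch < 16; ch++)
        if (we_all || ((emask >> ch) & 1u))
            dst[ch] = src[ch];
    return dst;
}

// Count how many of the 16 channels a MOV of all-ones into zeros writes.
static int channels_written(uint32_t emask, bool we_all)
{
    std::array<int, 16> dst{};   // all zeros
    std::array<int, 16> src;
    src.fill(1);
    const std::array<int, 16> out = exec_mov16(dst, src, emask, we_all);
    int n = 0;
    for (int v : out)
        n += v;
    return n;
}
```

In this model, a mask whose upper byte is clear writes only the first eight channels of the merged instruction, and a mask with arbitrary upper bits writes an arbitrary subset of the second eight. Setting we_all writes all sixteen regardless, but it also ignores the low eight bits, which would be consistent with the patch only using WE_all "if possible".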