[Mesa-dev] [PATCH 37/59] i965/fs: generalize SIMD16 interference workaround

Tue May 3 19:46:10 UTC 2016

On Tuesday, May 3, 2016 3:13:26 PM PDT Connor Abbott wrote:
> On Tue, May 3, 2016 at 2:52 PM, Kenneth Graunke <kenneth at whitecape.org> 
wrote:
> > On Friday, April 29, 2016 1:29:34 PM PDT Samuel Iglesias Gonsálvez wrote:
> >> From: Connor Abbott <connor.w.abbott at intel.com>
> >>
> >> Work based on registers read/written instead of dispatch_width, so that
> >> the interferences are added for 64-bit sources/destinations as well.
> >> ---
> >>  src/mesa/drivers/dri/i965/brw_fs_reg_allocate.cpp | 67 +++++++++++++++
> > +-------
> >>  1 file changed, 48 insertions(+), 19 deletions(-)
> >>
> >> diff --git a/src/mesa/drivers/dri/i965/brw_fs_reg_allocate.cpp b/src/
mesa/
> > drivers/dri/i965/brw_fs_reg_allocate.cpp
> >> index 2347cd5..f0e96f9 100644
> >> --- a/src/mesa/drivers/dri/i965/brw_fs_reg_allocate.cpp
> >> +++ b/src/mesa/drivers/dri/i965/brw_fs_reg_allocate.cpp
> >> @@ -541,6 +541,34 @@ setup_mrf_hack_interference(fs_visitor *v, struct
> > ra_graph *g,
> >>     }
> >>  }
> >>
> >> +static unsigned
> >> +get_reg_node(const fs_reg& reg, unsigned first_payload_node)
> >> +{
> >> +   switch (reg.file) {
> >> +   case VGRF:
> >> +      return reg.nr;
> >> +   case ARF:
> >> +   case FIXED_GRF:
> >> +      return first_payload_node + reg.nr;
> >
> > This can't possibly be right for ARFs.  I think it's OK for FIXED_GRF.
> >
> > I imagine we can just drop ARF.
> >
> >> +   case MRF:
> >> +   default:
> >> +      unreachable("unhandled register file");
> >> +   }
> >> +
> >> +   return 0;
> >> +}
> >> +
> >> +static void
> >> +add_reg_interference(struct ra_graph *g, const fs_reg& reg1,
> >> +                     const fs_reg& reg2, unsigned first_payload_node)
> >> +{
> >> +   if ((reg1.file == VGRF || reg1.file == ARF || reg1.file == FIXED_GRF) 
&&
> >> +       (reg2.file == VGRF || reg2.file == ARF || reg2.file == 
FIXED_GRF)) {
> >> +      ra_add_node_interference(g, get_reg_node(reg1, 
first_payload_node),
> >> +                               get_reg_node(reg2, first_payload_node));
> >> +   }
> >> +}
> >> +
> >>  bool
> >>  fs_visitor::assign_regs(bool allow_spilling)
> >>  {
> >> @@ -643,26 +671,27 @@ fs_visitor::assign_regs(bool allow_spilling)
> >>        }
> >>     }
> >>
> >> -   if (dispatch_width > 8) {
> >> -      /* In 16-wide dispatch we have an issue where a compressed
> >> -       * instruction is actually two instructions executed 
simultaneiously.
> >> -       * It's actually ok to have the source and destination registers 
be
> >> -       * the same.  In this case, each instruction over-writes its own
> >> -       * source and there's no problem.  The real problem here is if the
> >> -       * source and destination registers are off by one.  Then you can 
end
> >> -       * up in a scenario where the first instruction over-writes the
> >> -       * source of the second instruction.  Since the compiler doesn't 
know
> >> -       * about this level of granularity, we simply make the source and
> >> -       * destination interfere.
> >> -       */
> >> -      foreach_block_and_inst(block, fs_inst, inst, cfg) {
> >> -         if (inst->dst.file != VGRF)
> >> -            continue;
> >> +   /* When instructions both read/write more than a single SIMD8 
register,
> > we
> >> +    * have an issue where an instruction is actually two instructions
> > executed
> >> +    * simultaneiously.  It's actually ok to have the source and 
destination
> >> +    * registers be the same.  In this case, each instruction over-writes
> > its
> >> +    * own source and there's no problem.  The real problem here is if 
the
> >> +    * source and destination registers are off by one.  Then you can end 
up
> > in
> >> +    * a scenario where the first instruction over-writes the source of 
the
> >> +    * second instruction.  Since the compiler doesn't know about this 
level
> > of
> >> +    * granularity, we simply make the source and destination interfere.
> >> +    */
> >> +   foreach_block_and_inst(block, fs_inst, inst, cfg) {
> >> +      if (inst->dst.file != VGRF &&
> >> +          inst->dst.file != ARF && inst->dst.file != FIXED_GRF)
> >> +         continue;
> >>
> >> -         for (int i = 0; i < inst->sources; ++i) {
> >> -            if (inst->src[i].file == VGRF) {
> >> -               ra_add_node_interference(g, inst->dst.nr, inst-
>src[i].nr);
> >> -            }
> >> +      if (inst->regs_written <= 1)
> >> +         continue;
> >> +
> >> +      for (int i = 0; i < inst->sources; ++i) {
> >> +         if (inst->regs_read(i) > 1) {
> >> +            add_reg_interference(g, inst->dst, inst->src[i],
> > first_payload_node);
> >>           }
> >>        }
> >>     }
> >>
> >
> > This seems wrong to me.  The point of the original code is that SIMD16
> > instructions internally decode as two SIMD8 instructions, which operate
> > sequentially.  So, the first runs, and clobbers the source for the
> > second.
> >
> > I can't imagine that this happens for double floats - I imagine the
> > hardware reads and writes entire DFs in one instruction.
> >
> > I'm not sure why this patch is needed at all.  Can we drop it, or do
> > things break?
> 
> AFAICT, that's not how things work. Essentially, the HW can't
> read/write more than one SIMD8 register at a time, so even for SIMD8
> it has to break double reads/writes into two. If you think about it,
> this makes the restriction on horizontal strides not crossing a
> register make a lot more sense -- the HW just uses the horizontal
> stride inside a single read/write, and the vertical stride to figure
> out how to break up an instruction into multiple instructions. To me,
> this theory seemed to be confirmed by the fact that this patch did fix
> things, at least when I wrote it. You could try reverting it and
> seeing if it breaks things now, though.

Okay, I think I misunderstood.  I was thinking this meant it would
somehow break up a DF operation into multiple instructions, each of
which operated on...part of a double...which would be insane.

But I can definitely see it breaking up something like:

   add(8) g10:DF g10:DF g20:DF

into

   add(4) g10:DF g10:DF g20:DF
   add(4) g11:DF g11:DF g21:DF

to avoid crossing register boundaries.  That makes sense.

I'm still curious whether this helps anything today.

I also think this may have some unintended side effects: a send message
usually reads multiple registers, and writes multiple registers, and
I'm not sure that we need them to conflict.  (I think the original code
this is replacing was unnecessarily harsh as well...)
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 819 bytes
Desc: This is a digitally signed message part.
URL: <https://lists.freedesktop.org/archives/mesa-dev/attachments/20160503/a2328bca/attachment.sig>