[Mesa-dev] [PATCH 37/59] i965/fs: generalize SIMD16 interference workaround

Tue May 3 19:56:39 UTC 2016

On Tue, May 3, 2016 at 3:46 PM, Kenneth Graunke <kenneth at whitecape.org> wrote:
> On Tuesday, May 3, 2016 3:13:26 PM PDT Connor Abbott wrote:
>> On Tue, May 3, 2016 at 2:52 PM, Kenneth Graunke <kenneth at whitecape.org>
> wrote:
>> > On Friday, April 29, 2016 1:29:34 PM PDT Samuel Iglesias Gonsálvez wrote:
>> >> From: Connor Abbott <connor.w.abbott at intel.com>
>> >>
>> >> Work based on registers read/written instead of dispatch_width, so that
>> >> the interferences are added for 64-bit sources/destinations as well.
>> >> ---
>> >>  src/mesa/drivers/dri/i965/brw_fs_reg_allocate.cpp | 67 +++++++++++++++
>> > +-------
>> >>  1 file changed, 48 insertions(+), 19 deletions(-)
>> >>
>> >> diff --git a/src/mesa/drivers/dri/i965/brw_fs_reg_allocate.cpp b/src/
> mesa/
>> > drivers/dri/i965/brw_fs_reg_allocate.cpp
>> >> index 2347cd5..f0e96f9 100644
>> >> --- a/src/mesa/drivers/dri/i965/brw_fs_reg_allocate.cpp
>> >> +++ b/src/mesa/drivers/dri/i965/brw_fs_reg_allocate.cpp
>> >> @@ -541,6 +541,34 @@ setup_mrf_hack_interference(fs_visitor *v, struct
>> > ra_graph *g,
>> >>     }
>> >>  }
>> >>
>> >> +static unsigned
>> >> +get_reg_node(const fs_reg& reg, unsigned first_payload_node)
>> >> +{
>> >> +   switch (reg.file) {
>> >> +   case VGRF:
>> >> +      return reg.nr;
>> >> +   case ARF:
>> >> +   case FIXED_GRF:
>> >> +      return first_payload_node + reg.nr;
>> >
>> > This can't possibly be right for ARFs.  I think it's OK for FIXED_GRF.
>> >
>> > I imagine we can just drop ARF.
>> >
>> >> +   case MRF:
>> >> +   default:
>> >> +      unreachable("unhandled register file");
>> >> +   }
>> >> +
>> >> +   return 0;
>> >> +}
>> >> +
>> >> +static void
>> >> +add_reg_interference(struct ra_graph *g, const fs_reg& reg1,
>> >> +                     const fs_reg& reg2, unsigned first_payload_node)
>> >> +{
>> >> +   if ((reg1.file == VGRF || reg1.file == ARF || reg1.file == FIXED_GRF)
> &&
>> >> +       (reg2.file == VGRF || reg2.file == ARF || reg2.file ==
> FIXED_GRF)) {
>> >> +      ra_add_node_interference(g, get_reg_node(reg1,
> first_payload_node),
>> >> +                               get_reg_node(reg2, first_payload_node));
>> >> +   }
>> >> +}
>> >> +
>> >>  bool
>> >>  fs_visitor::assign_regs(bool allow_spilling)
>> >>  {
>> >> @@ -643,26 +671,27 @@ fs_visitor::assign_regs(bool allow_spilling)
>> >>        }
>> >>     }
>> >>
>> >> -   if (dispatch_width > 8) {
>> >> -      /* In 16-wide dispatch we have an issue where a compressed
>> >> -       * instruction is actually two instructions executed
> simultaneiously.
>> >> -       * It's actually ok to have the source and destination registers
> be
>> >> -       * the same.  In this case, each instruction over-writes its own
>> >> -       * source and there's no problem.  The real problem here is if the
>> >> -       * source and destination registers are off by one.  Then you can
> end
>> >> -       * up in a scenario where the first instruction over-writes the
>> >> -       * source of the second instruction.  Since the compiler doesn't
> know
>> >> -       * about this level of granularity, we simply make the source and
>> >> -       * destination interfere.
>> >> -       */
>> >> -      foreach_block_and_inst(block, fs_inst, inst, cfg) {
>> >> -         if (inst->dst.file != VGRF)
>> >> -            continue;
>> >> +   /* When instructions both read/write more than a single SIMD8
> register,
>> > we
>> >> +    * have an issue where an instruction is actually two instructions
>> > executed
>> >> +    * simultaneiously.  It's actually ok to have the source and
> destination
>> >> +    * registers be the same.  In this case, each instruction over-writes
>> > its
>> >> +    * own source and there's no problem.  The real problem here is if
> the
>> >> +    * source and destination registers are off by one.  Then you can end
> up
>> > in
>> >> +    * a scenario where the first instruction over-writes the source of
> the
>> >> +    * second instruction.  Since the compiler doesn't know about this
> level
>> > of
>> >> +    * granularity, we simply make the source and destination interfere.
>> >> +    */
>> >> +   foreach_block_and_inst(block, fs_inst, inst, cfg) {
>> >> +      if (inst->dst.file != VGRF &&
>> >> +          inst->dst.file != ARF && inst->dst.file != FIXED_GRF)
>> >> +         continue;
>> >>
>> >> -         for (int i = 0; i < inst->sources; ++i) {
>> >> -            if (inst->src[i].file == VGRF) {
>> >> -               ra_add_node_interference(g, inst->dst.nr, inst-
>>src[i].nr);
>> >> -            }
>> >> +      if (inst->regs_written <= 1)
>> >> +         continue;
>> >> +
>> >> +      for (int i = 0; i < inst->sources; ++i) {
>> >> +         if (inst->regs_read(i) > 1) {
>> >> +            add_reg_interference(g, inst->dst, inst->src[i],
>> > first_payload_node);
>> >>           }
>> >>        }
>> >>     }
>> >>
>> >
>> > This seems wrong to me.  The point of the original code is that SIMD16
>> > instructions internally decode as two SIMD8 instructions, which operate
>> > sequentially.  So, the first runs, and clobbers the source for the
>> > second.
>> >
>> > I can't imagine that this happens for double floats - I imagine the
>> > hardware reads and writes entire DFs in one instruction.
>> >
>> > I'm not sure why this patch is needed at all.  Can we drop it, or do
>> > things break?
>>
>> AFAICT, that's not how things work. Essentially, the HW can't
>> read/write more than one SIMD8 register at a time, so even for SIMD8
>> it has to break double reads/writes into two. If you think about it,
>> this makes the restriction on horizontal strides not crossing a
>> register make a lot more sense -- the HW just uses the horizontal
>> stride inside a single read/write, and the vertical stride to figure
>> out how to break up an instruction into multiple instructions. To me,
>> this theory seemed to be confirmed by the fact that this patch did fix
>> things, at least when I wrote it. You could try reverting it and
>> seeing if it breaks things now, though.
>
> Okay, I think I misunderstood.  I was thinking this meant it would
> somehow break up a DF operation into multiple instructions, each of
> which operated on...part of a double...which would be insane.
>
> But I can definitely see it breaking up something like:
>
>    add(8) g10:DF g10:DF g20:DF
>
> into
>
>    add(4) g10:DF g10:DF g20:DF
>    add(4) g11:DF g11:DF g21:DF
>
> to avoid crossing register boundaries.  That makes sense.

Right, I think that's what's happening.

>
> I'm still curious whether this helps anything today.

I can try and run piglit with the patch reverted later (not today),
although you're free to try to :)

>
> I also think this may have some unintended side effects: a send message
> usually reads multiple registers, and writes multiple registers, and
> I'm not sure that we need them to conflict.  (I think the original code
> this is replacing was unnecessarily harsh as well...)

Right, I think sends should read/write registers atomically... I guess
we need to fix this up.