[Mesa-dev] [PATCH 40/95] i965/vec4: add a SIMD lowering pass

Tue Aug 23 09:15:58 UTC 2016

On Wed, 2016-08-03 at 14:32 -0700, Francisco Jerez wrote:
> Iago Toral Quiroga <itoral at igalia.com> writes:...)writes:...)

(...)

> > +   return lo)+}++bool+vec4_visitor::lower_simd_width()
> > +{
> > +   bool progress = false;
> > +
> > +   foreach_block_and_inst_safe(block, vec4_instruction, inst, cfg)
> > {
> > +      const unsigned lowered_width =
> > get_lowered_simd_width(devinfo, inst);
> > +      assert(lowered_width <= inst->exec_size);
> > +      if (lowered_width == inst->exec_size)
> > +         continue;
> > +
> > +      /* For now we only support splitting 8-wide instructions
> > into 4-wide */
> > +      assert(inst->exec_size == 8 && lowered_width == 4);
> > +
> Seems artificial, there is no reason for the logic below to care
> whether
> the original and lowered execution size are exactly equal to these,
> and
> SIMD16 vec4 instructions are useful.

Right, I tried to keep things simple because so far we only ever needed
to split 8-wide DF instructions and left the assertion to make sure
that if we ever changed that we had to review that the splitting did
the right thing for other cases, but I can remove this and try to make
the pass a bit more general as you suggest.

There is still something that I think we need to check here and it is
that in align16, as far as I understand things, we can't split in a way
that leads to regions that are not 16-byte aligned, right? I suppose I
can add an assert for that below when we generate the split
instructions.

> > 
> > +      /* We always split so that each lowered instruction writes
> > exactly to
> > +       * one register.
> > +       */
> > +      assert(inst->regs_written == inst->exec_size /
> > lowered_width);
> > +
> AFAICT the only reason why you need this is because the vec4 back-end 
> is
> missing a horiz_offset() helper like the FS back-end's, so you end up
> using offset() below that counts in GRF units instead of channels, so
> the result would be wrong if each half of the destination register
> wasn't exactly one GRF.  Because using something like horiz_offset()
> would make the implementation easier to follow overall I suggest you
> take that path and drop this restriction.

Sure, I'll do this.

> > 
> > +      for (unsigned n = 0; n < inst->exec_size / lowered_width;
> > n++)  {
> > +         dst_reg dst = offset(inst->dst, n);
> > +
> > +         src_reg srcs[3];
> > +         for (int i = 0; i < 3; i++) {
> > +            srcs[i] = inst->src[i];
> > +
> > +            if (srcs[i].file == BAD_FILE)
> > +               continue;
> > +
> > +            if (!is_uniform(srcs[i])) {
> > +               if (type_sz(srcs[i].type) == 8) {
> > +                  srcs[i] = offset(srcs[i], n);
> > +               } else {
> > +                  assert(lowered_width * n < 8);
> > +                  srcs[i].subnr += lowered_width * n;
> > +               }
> > +            }
> You need to check for source/destination overlap in order to make
> sure
> that the destination of the first lowered instruction doesn't
> inadvertently corrupt the sources of the second if they aren't
> properly
> aligned in the GRF file.  That's one of the reasons why the FS SIMD
> lowering pass used to allocate temporaries for the sources and
> destinations of lowered instructions, copy propagation would be able
> to
> get rid of the copy in most cases.  I ended up implementing the same
> optimization recently which involved an amount of additional
> complexity,
> so I wouldn't encourage you to do the same unless it's really
> necessary...  I believe the easiest thing you could do would be to
> allocate temporary registers to hold the destination of each lowered
> half of the instruction and then emit MOVs into the real destination,
> copy propagation will then hopefully eliminate the copies where they
> aren't necessary.

Aha, ok.

> > 
> > +         }
> > +
> > +         vec4_instruction *linst = new(mem_ctx)
> > +            vec4_instruction(inst->opcode, dst, srcs[0], srcs[1],
> > srcs[2]);
> > +         linst->exec_size = lowered_width;
> > +         linst->group = lowered_width * n;
> > +         linst->regs_written = 1;
> > +         linst->conditional_mod = inst->conditional_mod;
> > +         linst->predicate = inst->predicate;
> > +         linst->saturate = inst->saturate;
> This seems like a buggy and open-coded copy constructor.  You're only
> copying four of the many fields that can potentially affect the
> behavior
> of a vec4_instruction.  Maybe define proper copy constructor instead?

Yeah, I forgot to clean this up... a copy-constructor sounds like a
good plan.

> > 
> > +         inst->insert_before(block, linst);
> > +      }
> > +
> > +      inst->remove(block);
> > +      progress = true;
> > +   }
> > +
> > +   if (progress)
> > +      invalidate_live_intervals();
> > +
> > +   return progress;
> > +}
> > +
> >  bool
> >  vec4_visitor::run()
> >  {
> > @@ -2002,9 +2097,12 @@ vec4_visitor::run()
> >        backend_shader::dump_instructions(filename);
> >     }
> >  
> > -   bool progress;
> > +   bool progress = false;
> >     int iteration = 0;
> >     int pass_num = 0;
> > +
> > +   OPT(lower_simd_width);
> > +
> I have a feeling that doing this before the optimization loop isn't
> going to be a good plan.  How is this going to handle cases like
> conditional SIMD splitting based on whether the strides and swizzles
> of
> the instruction are supported, which are likely to change during
> optimization and other lowering passes?

Right, since we are scalarizing everything in this series that has not
been a problem, but you're right that we will eventually need to move
this after the main optimization loop.

Iago

> > 
> >     do {
> >        progress = false;
> >        pass_num = 0;
> > diff --git a/src/mesa/drivers/dri/i965/brw_vec4.h
> > b/src/mesa/drivers/dri/i965/brw_vec4.h
> > index cf7cdab..e4c4e91 100644
> > --- a/src/mesa/drivers/dri/i965/brw_vec4.h
> > +++ b/src/mesa/drivers/dri/i965/brw_vec4.h
> > @@ -160,6 +160,8 @@ public:
> >     void opt_schedule_instructions();
> >     void convert_to_hw_regs();
> >  
> > +   bool lower_simd_width();
> > +
> >     vec4_instruction *emit(vec4_instruction *inst);
> >  
> >     vec4_instruction *emit(enum opcode opcode);