[Mesa-dev] [PATCH v2 3/6] i965/fs/gen7: split instructions that run into exec masking bugs

Tue Jul 12 06:26:23 UTC 2016

On Mon, 2016-07-11 at 12:19 -0700, Francisco Jerez wrote:
> Samuel Iglesias Gonsálvez <siglesias at igalia.com> writes:
> 
> > 
> > From: Iago Toral Quiroga <itoral at igalia.com>
> > 
> > In fp64 we can produce code like this:
> > 
> > mov(16) vgrf2<2>:UD, vgrf3<2>:UD
> > 
> > That our simd lowering pass would typically split in instructions
> > with a
> > width of 8, writing to two consecutive registers each.
> > Unfortunately, gen7
> > hardware has a bug affecting execution masking and as a result, the
> > second GRF register write won't work properly. Curro verified this:
> > 
> > "The problem is that pre-Gen8 EUs are hardwired to use the
> > QtrCtrl+1
> >  (where QtrCtrl is the 8-bit quarter of the execution mask signals
> >  specified in the instruction control fields) for the second
> >  compressed half of any single-precision instruction (for
> >  double-precision instructions it's hardwired to use NibCtrl+1,
> >  at least on HSW), which means that the EU will apply the wrong
> >  execution controls for the second sequential GRF write if the
> > number
> >  of channels per GRF is not exactly eight in single-precision mode
> > (or
> >  four in double-float mode)."
> > 
> > In practice, this means that we cannot write more than one
> > consecutive GRF in a single instruction if the number of channels
> > per GRF is not exactly eight in single-precision mode (or four
> > in double-float mode).
> > 
> > This patch makes our SIMD lowering pass split this kind of
> > instructions
> > so that the split versions only write to a single register. In the
> > example above this means that we split the write in 4 instructions,
> > each
> > one writing 4 UD elements (width = 4) to a single register.
> > 
> > v2 (Curro):
> >  - Make explicit that the thing about hardwiring NibCtrl+1 for the
> > second
> >    compressed half is known to happen in Haswell and the issue with
> > IVB
> >    might not be exactly the same.
> >  - Assign max_width instead of returning early so that we can
> > handle
> >    multiple restrictions affecting to the same instruction.
> >  - Avoid division by 0 if the instruction does not write any
> > registers.
> >  - Ignore instructions what have WE_all set.
> >  - Use the instruction execution type size instead of the dst type
> > size.
> > ---
> >  src/mesa/drivers/dri/i965/brw_fs.cpp | 28
> > ++++++++++++++++++++++++++++
> >  1 file changed, 28 insertions(+)
> > 
> > diff --git a/src/mesa/drivers/dri/i965/brw_fs.cpp
> > b/src/mesa/drivers/dri/i965/brw_fs.cpp
> > index 2f473cc..4d57412 100644
> > --- a/src/mesa/drivers/dri/i965/brw_fs.cpp
> > +++ b/src/mesa/drivers/dri/i965/brw_fs.cpp
> > @@ -4691,6 +4691,34 @@ get_fpu_lowered_simd_width(const struct
> > brw_device_info *devinfo,
> >      */
> >     unsigned reg_count = inst->regs_written;
> >  
> You've put this right in the middle of one of my previous workarounds
> ;), can you move it down a bit more next to the "According to the IVB
> PRMs" line, or close to the end of the function?

Oh, sure.

> > 
> > +   /* Pre-Gen8 EUs are hardwired to use the QtrCtrl+1 (where
> > QtrCtrl is
> > +    * the 8-bit quarter of the execution mask signals specified in
> > the
> > +    * instruction control fields) for the second compressed half
> > of any
> > +    * single-precision instruction (for double-precision
> > instructions
> > +    * it's hardwired to use NibCtrl+1, at least on HSW), which
> > means that
> > +    * the EU will apply the wrong execution controls for the
> > second
> > +    * sequential GRF write if the number of channels per GRF is
> > not exactly
> > +    * eight in single-precision mode (or four in double-float
> > mode).
> > +    *
> > +    * In this situation we calculate the maximum size of the split
> > +    * instructions so they only ever write to a single register.
> > +    */
> > +   if (devinfo->gen < 8 && inst->regs_written > 1 &&
> > +       !inst->force_writemask_all) {
> > +      unsigned channels_per_grf = inst->exec_size / inst-
> > >regs_written;
> Could be declared const.

Ok.

> > 
> > +      unsigned exec_type_size = 0;
> > +      for (int i = 0; i < inst->sources; i++) {
> > +         if (inst->src[i].file == BAD_FILE)
> > +            break;
> It wouldn't be right to break early if the instruction has any valid
> sources after a non-present one.  This should probably be:

I thought that couldn't happen. Will fix it.

> > 
> >           if (inst->src[i].file != BAD_FILE)
> >              exec_type_size = MAX2(exec_type_size, type_sz(inst-
> > >src[i].type));
> instead.
> 
> > 
> > +         exec_type_size = MAX2(exec_type_size, type_sz(inst-
> > >src[i].type));
> > +      }
> > +      assert(exec_type_size);
> > +
> > +      if (channels_per_grf != REG_SIZE / exec_type_size) {
> I think you really need to use (exec_type_size == 8 ? 4 : 8) instead
> of
> the RHS of this expression.  The hardware shifts exactly 8 channels
> per
> compressed half of the instruction regardless of the execution type,
> so
> this formula would give you an incorrect answer for exec_type_size <
> 4.

Ok, so it really has two strict working modes. Will fix this too.

Thanks!

> > 
> > +         max_width = MIN2(max_width, channels_per_grf);
> > +      }
> Redundant braces.
> 
> > 
> > +   }
> > +
> >     for (unsigned i = 0; i < inst->sources; i++)
> >        reg_count = MAX2(reg_count, (unsigned)inst->regs_read(i));
> >  
> > _______________________________________________
> > mesa-dev mailing list
> > mesa-dev at lists.freedesktop.org
> > https://lists.freedesktop.org/mailman/listinfo/mesa-dev