[Mesa-dev] [PATCH 3/6] i965/fs/gen7: split instructions that run into exec masking bugs

Fri Jul 8 08:28:32 UTC 2016

On Fri, 2016-07-08 at 09:21 +0200, Iago Toral wrote:
> On Thu, 2016-07-07 at 19:36 -0700, Francisco Jerez wrote:
> > 
> > Samuel Iglesias Gonsálvez <siglesias at igalia.com> writes:
> > 
> > > 
> > > 
> > > From: Iago Toral Quiroga <itoral at igalia.com>
> > > 
> > > In fp64 we can produce code like this:
> > > 
> > > mov(16) vgrf2<2>:UD, vgrf3<2>:UD
> > > 
> > > That our simd lowering pass would typically split in instructions
> > > with a
> > > width of 8, writing to two consecutive registers each.
> > > Unfortunately, gen7
> > > hardware has a bug affecting execution masking and as a result,
> > > the
> > > second GRF register write won't work properly. Curro verified
> > > this:
> > > 
> > > "The problem is that pre-Gen8 EUs are hardwired to use the
> > > QtrCtrl+1
> > >  (where QtrCtrl is the 8-bit quarter of the execution mask
> > > signals
> > >  specified in the instruction control fields) for the second
> > >  compressed half of any single-precision instruction (for
> > >  double-precision instructions it's hardwired to use NibCtrl+1),
> > >  which means that the EU will apply the wrong execution controls
> > >  for the second sequential GRF write if the number of channels
> > > per
> > >  GRF is not exactly eight in single-precision mode (or four in
> > >  double-float mode)."
> > > 
> > > In practice, this means that we cannot write more than one
> > > consecutive GRF in a single instruction if the number of channels
> > > per GRF is not exactly eight in single-precision mode (or four
> > > in double-float mode).
> > > 
> > > This patch makes our SIMD lowering pass split this kind of
> > > instructions
> > > so that the split versions only write to a single register. In
> > > the
> > > example above this means that we split the write in 4
> > > instructions,
> > > each
> > > one writing 4 UD elements (width = 4) to a single register.
> > > ---
> > >  src/mesa/drivers/dri/i965/brw_fs.cpp | 20 ++++++++++++++++++++
> > >  1 file changed, 20 insertions(+)
> > > 
> > > diff --git a/src/mesa/drivers/dri/i965/brw_fs.cpp
> > > b/src/mesa/drivers/dri/i965/brw_fs.cpp
> > > index 2f473cc..caf88d1 100644
> > > --- a/src/mesa/drivers/dri/i965/brw_fs.cpp
> > > +++ b/src/mesa/drivers/dri/i965/brw_fs.cpp
> > > @@ -4677,6 +4677,26 @@ static unsigned
> > >  get_fpu_lowered_simd_width(const struct brw_device_info
> > > *devinfo,
> > >                             const fs_inst *inst)
> > >  {
> > > +   /* Pre-Gen8 EUs are hardwired to use the QtrCtrl+1 (where
> > > QtrCtrl is
> > > +    * the 8-bit quarter of the execution mask signals specified
> > > in
> > > the
> > > +    * instruction control fields) for the second compressed half
> > > of any
> > > +    * single-precision instruction (for double-precision
> > > instructions
> > > +    * it's hardwired to use NibCtrl+1), which means that the EU
> > > will
> > When I found out the hardware issue you describe in this comment I
> > only
> > had a HSW at hand, so I looked into this again today in order to
> > verify
> > whether IVB/VLV behave in the same way, and I'm afraid they
> > don't...  I
> > haven't tried it on real IVB hardware, but at least the simulator
> > behaves the same as HSW for single precision execution types, while
> > for
> > double precision types instruction decompression seems to be
> > completely
> > busted.  AFAICT it applies the same channel enable signals to both
> > halves of the compressed instruction which will be just wrong under
> > non-uniform control flow.  Can you clarify in the comment above
> > that
> > the
> > text in parentheses referring to double-precision instructions may
> > only
> > apply to HSW?
> Sure.
> 
> > 
> > Have you been able to get any of the FP64 non-uniform control flow
> > tests
> > to pass on IVB?  If you have I guess this may be a simulator-only
> > bug,
> > although I'm not sure the FS tests you have written will be non-
> > uniform
> > enough to reproduce the issue.  If you haven't, we may have to add
> > another check here in order to lower all non-force_writemask_all DF
> > instructions to SIMD4 on IVB/VLV...  :(
> For now we have focused on HSW only but thanks for the update
> regarding
> IVB. I'll make sure to pay special attention to non-uniform control-
> flow cases there when we start testing things there (we intend to
> start
> putting effort on IVB in the next days).
> 
> I suppose that for now we can leave this patch as gen7 specific
> (rather
> than HSW specific) and fix it up if needed for IVB when we actually
> verify that it needs extra work there. Does that sound like a good
> plan?
> 
> > 
> > > 
> > > 
> > > +    * apply the wrong execution controls for the second
> > > sequential
> > > GRF
> > > +    * write if the number of channels per GRF is not exactly
> > > eight
> > > in
> > > +    * single-precision mode (or four in double-float mode).
> > > +    *
> > > +    * In this situation we calculate the maximum size of the
> > > split
> > > +    * instructions so they only ever write to a single register.
> > > +    */
> > > +   unsigned type_size = type_sz(inst->dst.type);
> > > +   unsigned channels_per_grf = inst->exec_size / inst-
> > > > 
> > > > regs_written;
> > This will cause a division by zero if the instruction doesn't write
> > any
> > registers.  Strictly speaking you'd need to check the source types
> > too
> > in order to find out whether the instruction is compressed...
> Aha, makes sense, so I check the dst and sources and pick the largest
> to figure out the actual size of the instruction.
> 
> > 
> > > 
> > > 
> > > +   assert(channels_per_grf > 0);
> > > +   if (devinfo->gen < 8 && inst->regs_written > 1 &&
> > > +       channels_per_grf != REG_SIZE / type_size) {
> > I believe the hardware is more stupid than that, it doesn't really
> > calculate the number of components that fit in a single GRF and
> > then
> > shifts QtrCtrl based on that, but rather it's hardwired to shift by
> > four
> > channels in DF mode (at least on HSW) or by eight channels for any
> > other
> > execution type, so you need to find out what the execution type of
> > the
> > instruction is (which is not necessarily the same as the
> > destination
> > type).
> Ok, then summing up, the process would be like this:
> 
> Split gen7 instructions that write more than 1 register and have a DF
> execution type (i.e. either the dst or any of the sources are DF) and
> have a stride that is not exactly 1 on either the DST (if DF) or any
> of
> the DF sources (with the caveat that for IVB we might need to do
> something different in the end).

Well, not DF sources/dst but any source/dst that has a stride other
than 1 since the problem happens when we don't use all the channels
available in the register for either 32b or 64b types.

> Does that sound right to you?
> 
> > 
> > > 
> > > 
> > > +      return channels_per_grf;
> > For this to interact nicely with the other restrictions implemented
> > in
> > the same function in case several of them ever apply at the same
> > time,
> > move the check down and have it assign 'max_width = MIN2(max_width,
> > channels_per_grf)' instead of the early return.
> 
> Ah, good point. Will do.
> 
> Thanks Curro!
> 
> > 
> > > 
> > > 
> > > +   }
> > > +
> > >     /* Maximum execution size representable in the instruction
> > > controls. */
> > >     unsigned max_width = MIN2(32, inst->exec_size);
> > >  
> > > _______________________________________________
> > > mesa-dev mailing list
> > > mesa-dev at lists.freedesktop.org
> > > https://lists.freedesktop.org/mailman/listinfo/mesa-dev
> _______________________________________________
> mesa-dev mailing list
> mesa-dev at lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/mesa-dev