[Mesa-dev] [PATCH 3/6] i965/fs/gen7: split instructions that run into exec masking bugs

Iago Toral itoral at igalia.com
Fri Jul 8 07:21:16 UTC 2016


On Thu, 2016-07-07 at 19:36 -0700, Francisco Jerez wrote:
> Samuel Iglesias Gonsálvez <siglesias at igalia.com> writes:
> 
> > 
> > From: Iago Toral Quiroga <itoral at igalia.com>
> > 
> > In fp64 we can produce code like this:
> > 
> > mov(16) vgrf2<2>:UD, vgrf3<2>:UD
> > 
> > That our simd lowering pass would typically split in instructions
> > with a
> > width of 8, writing to two consecutive registers each.
> > Unfortunately, gen7
> > hardware has a bug affecting execution masking and as a result, the
> > second GRF register write won't work properly. Curro verified this:
> > 
> > "The problem is that pre-Gen8 EUs are hardwired to use the
> > QtrCtrl+1
> >  (where QtrCtrl is the 8-bit quarter of the execution mask signals
> >  specified in the instruction control fields) for the second
> >  compressed half of any single-precision instruction (for
> >  double-precision instructions it's hardwired to use NibCtrl+1),
> >  which means that the EU will apply the wrong execution controls
> >  for the second sequential GRF write if the number of channels per
> >  GRF is not exactly eight in single-precision mode (or four in
> >  double-float mode)."
> > 
> > In practice, this means that we cannot write more than one
> > consecutive GRF in a single instruction if the number of channels
> > per GRF is not exactly eight in single-precision mode (or four
> > in double-float mode).
> > 
> > This patch makes our SIMD lowering pass split this kind of
> > instructions
> > so that the split versions only write to a single register. In the
> > example above this means that we split the write in 4 instructions,
> > each
> > one writing 4 UD elements (width = 4) to a single register.
> > ---
> >  src/mesa/drivers/dri/i965/brw_fs.cpp | 20 ++++++++++++++++++++
> >  1 file changed, 20 insertions(+)
> > 
> > diff --git a/src/mesa/drivers/dri/i965/brw_fs.cpp
> > b/src/mesa/drivers/dri/i965/brw_fs.cpp
> > index 2f473cc..caf88d1 100644
> > --- a/src/mesa/drivers/dri/i965/brw_fs.cpp
> > +++ b/src/mesa/drivers/dri/i965/brw_fs.cpp
> > @@ -4677,6 +4677,26 @@ static unsigned
> >  get_fpu_lowered_simd_width(const struct brw_device_info *devinfo,
> >                             const fs_inst *inst)
> >  {
> > +   /* Pre-Gen8 EUs are hardwired to use the QtrCtrl+1 (where
> > QtrCtrl is
> > +    * the 8-bit quarter of the execution mask signals specified in
> > the
> > +    * instruction control fields) for the second compressed half
> > of any
> > +    * single-precision instruction (for double-precision
> > instructions
> > +    * it's hardwired to use NibCtrl+1), which means that the EU
> > will
> When I found out the hardware issue you describe in this comment I
> only
> had a HSW at hand, so I looked into this again today in order to
> verify
> whether IVB/VLV behave in the same way, and I'm afraid they
> don't...  I
> haven't tried it on real IVB hardware, but at least the simulator
> behaves the same as HSW for single precision execution types, while
> for
> double precision types instruction decompression seems to be
> completely
> busted.  AFAICT it applies the same channel enable signals to both
> halves of the compressed instruction which will be just wrong under
> non-uniform control flow.  Can you clarify in the comment above that
> the
> text in parentheses referring to double-precision instructions may
> only
> apply to HSW?

Sure.

> Have you been able to get any of the FP64 non-uniform control flow
> tests
> to pass on IVB?  If you have I guess this may be a simulator-only
> bug,
> although I'm not sure the FS tests you have written will be non-
> uniform
> enough to reproduce the issue.  If you haven't, we may have to add
> another check here in order to lower all non-force_writemask_all DF
> instructions to SIMD4 on IVB/VLV...  :(

For now we have focused on HSW only but thanks for the update regarding
IVB. I'll make sure to pay special attention to non-uniform control-
flow cases there when we start testing things there (we intend to start
putting effort on IVB in the next days).

I suppose that for now we can leave this patch as gen7 specific (rather
than HSW specific) and fix it up if needed for IVB when we actually
verify that it needs extra work there. Does that sound like a good
plan?

> > 
> > +    * apply the wrong execution controls for the second sequential
> > GRF
> > +    * write if the number of channels per GRF is not exactly eight
> > in
> > +    * single-precision mode (or four in double-float mode).
> > +    *
> > +    * In this situation we calculate the maximum size of the split
> > +    * instructions so they only ever write to a single register.
> > +    */
> > +   unsigned type_size = type_sz(inst->dst.type);
> > +   unsigned channels_per_grf = inst->exec_size / inst-
> > >regs_written;
> This will cause a division by zero if the instruction doesn't write
> any
> registers.  Strictly speaking you'd need to check the source types
> too
> in order to find out whether the instruction is compressed...

Aha, makes sense, so I check the dst and sources and pick the largest
to figure out the actual size of the instruction.

> > 
> > +   assert(channels_per_grf > 0);
> > +   if (devinfo->gen < 8 && inst->regs_written > 1 &&
> > +       channels_per_grf != REG_SIZE / type_size) {
> I believe the hardware is more stupid than that, it doesn't really
> calculate the number of components that fit in a single GRF and then
> shifts QtrCtrl based on that, but rather it's hardwired to shift by
> four
> channels in DF mode (at least on HSW) or by eight channels for any
> other
> execution type, so you need to find out what the execution type of
> the
> instruction is (which is not necessarily the same as the destination
> type).

Ok, then summing up, the process would be like this:

Split gen7 instructions that write more than 1 register and have a DF
execution type (i.e. either the dst or any of the sources are DF) and
have a stride that is not exactly 1 on either the DST (if DF) or any of
the DF sources (with the caveat that for IVB we might need to do
something different in the end).

Does that sound right to you?

> > 
> > +      return channels_per_grf;
> For this to interact nicely with the other restrictions implemented
> in
> the same function in case several of them ever apply at the same
> time,
> move the check down and have it assign 'max_width = MIN2(max_width,
> channels_per_grf)' instead of the early return.


Ah, good point. Will do.

Thanks Curro!

> > 
> > +   }
> > +
> >     /* Maximum execution size representable in the instruction
> > controls. */
> >     unsigned max_width = MIN2(32, inst->exec_size);
> >  
> > _______________________________________________
> > mesa-dev mailing list
> > mesa-dev at lists.freedesktop.org
> > https://lists.freedesktop.org/mailman/listinfo/mesa-dev


More information about the mesa-dev mailing list