[Mesa-dev] [PATCH 3/6] i965/fs/gen7: split instructions that run into exec masking bugs

Fri Jul 8 20:50:23 UTC 2016

Iago Toral <itoral at igalia.com> writes:

> On Thu, 2016-07-07 at 19:36 -0700, Francisco Jerez wrote:
>> Samuel Iglesias Gonsálvez <siglesias at igalia.com> writes:
>> 
>> > 
>> > From: Iago Toral Quiroga <itoral at igalia.com>
>> > 
>> > In fp64 we can produce code like this:
>> > 
>> > mov(16) vgrf2<2>:UD, vgrf3<2>:UD
>> > 
>> > That our simd lowering pass would typically split in instructions
>> > with a
>> > width of 8, writing to two consecutive registers each.
>> > Unfortunately, gen7
>> > hardware has a bug affecting execution masking and as a result, the
>> > second GRF register write won't work properly. Curro verified this:
>> > 
>> > "The problem is that pre-Gen8 EUs are hardwired to use the
>> > QtrCtrl+1
>> >  (where QtrCtrl is the 8-bit quarter of the execution mask signals
>> >  specified in the instruction control fields) for the second
>> >  compressed half of any single-precision instruction (for
>> >  double-precision instructions it's hardwired to use NibCtrl+1),
>> >  which means that the EU will apply the wrong execution controls
>> >  for the second sequential GRF write if the number of channels per
>> >  GRF is not exactly eight in single-precision mode (or four in
>> >  double-float mode)."
>> > 
>> > In practice, this means that we cannot write more than one
>> > consecutive GRF in a single instruction if the number of channels
>> > per GRF is not exactly eight in single-precision mode (or four
>> > in double-float mode).
>> > 
>> > This patch makes our SIMD lowering pass split this kind of
>> > instructions
>> > so that the split versions only write to a single register. In the
>> > example above this means that we split the write in 4 instructions,
>> > each
>> > one writing 4 UD elements (width = 4) to a single register.
>> > ---
>> >  src/mesa/drivers/dri/i965/brw_fs.cpp | 20 ++++++++++++++++++++
>> >  1 file changed, 20 insertions(+)
>> > 
>> > diff --git a/src/mesa/drivers/dri/i965/brw_fs.cpp
>> > b/src/mesa/drivers/dri/i965/brw_fs.cpp
>> > index 2f473cc..caf88d1 100644
>> > --- a/src/mesa/drivers/dri/i965/brw_fs.cpp
>> > +++ b/src/mesa/drivers/dri/i965/brw_fs.cpp
>> > @@ -4677,6 +4677,26 @@ static unsigned
>> >  get_fpu_lowered_simd_width(const struct brw_device_info *devinfo,
>> >                             const fs_inst *inst)
>> >  {
>> > +   /* Pre-Gen8 EUs are hardwired to use the QtrCtrl+1 (where
>> > QtrCtrl is
>> > +    * the 8-bit quarter of the execution mask signals specified in
>> > the
>> > +    * instruction control fields) for the second compressed half
>> > of any
>> > +    * single-precision instruction (for double-precision
>> > instructions
>> > +    * it's hardwired to use NibCtrl+1), which means that the EU
>> > will
>> When I found out the hardware issue you describe in this comment I
>> only
>> had a HSW at hand, so I looked into this again today in order to
>> verify
>> whether IVB/VLV behave in the same way, and I'm afraid they
>> don't...  I
>> haven't tried it on real IVB hardware, but at least the simulator
>> behaves the same as HSW for single precision execution types, while
>> for
>> double precision types instruction decompression seems to be
>> completely
>> busted.  AFAICT it applies the same channel enable signals to both
>> halves of the compressed instruction which will be just wrong under
>> non-uniform control flow.  Can you clarify in the comment above that
>> the
>> text in parentheses referring to double-precision instructions may
>> only
>> apply to HSW?
>
> Sure.
>
>> Have you been able to get any of the FP64 non-uniform control flow
>> tests
>> to pass on IVB?  If you have I guess this may be a simulator-only
>> bug,
>> although I'm not sure the FS tests you have written will be non-
>> uniform
>> enough to reproduce the issue.  If you haven't, we may have to add
>> another check here in order to lower all non-force_writemask_all DF
>> instructions to SIMD4 on IVB/VLV...  :(
>
> For now we have focused on HSW only but thanks for the update regarding
> IVB. I'll make sure to pay special attention to non-uniform control-
> flow cases there when we start testing things there (we intend to start
> putting effort on IVB in the next days).
>
> I suppose that for now we can leave this patch as gen7 specific (rather
> than HSW specific) and fix it up if needed for IVB when we actually
> verify that it needs extra work there. Does that sound like a good
> plan?
>
Sure.

>> > 
>> > +    * apply the wrong execution controls for the second sequential
>> > GRF
>> > +    * write if the number of channels per GRF is not exactly eight
>> > in
>> > +    * single-precision mode (or four in double-float mode).
>> > +    *
>> > +    * In this situation we calculate the maximum size of the split
>> > +    * instructions so they only ever write to a single register.
>> > +    */
>> > +   unsigned type_size = type_sz(inst->dst.type);
>> > +   unsigned channels_per_grf = inst->exec_size / inst-
>> > >regs_written;
>> This will cause a division by zero if the instruction doesn't write
>> any
>> registers.  Strictly speaking you'd need to check the source types
>> too
>> in order to find out whether the instruction is compressed...
>
> Aha, makes sense, so I check the dst and sources and pick the largest
> to figure out the actual size of the instruction.
>
AFAIUI it's somewhat more complex than that because decompression works
slightly differently for sources and destinations: A single instruction
is expanded into multiple physical instructions so that each portion
writes some amount of data fully contained within one GRF (including
padding if the stride is greater than one) and so that each source pulls
at most one GRF worth of data from the register file (not including
padding left between components, the limiting factor is the source type
size times the decompressed execution width, regardless of the
horizontal and vertical strides and whether the source is an immediate,
scalar or vector register).  That means that in practice it will be hard
(but probably not impossible) to hit this hardware bug with the sources
alone (you'd likely need to use 32-wide W execution type instructions),
so I think it's okay if you consider the amount of data written by the
destination only.  Make sure to put the channels_per_grf declaration
inside the conditional though in order to avoid division by zero if the
instruction doesn't write any registers.

>> > 
>> > +   assert(channels_per_grf > 0);
>> > +   if (devinfo->gen < 8 && inst->regs_written > 1 &&
>> > +       channels_per_grf != REG_SIZE / type_size) {
>> I believe the hardware is more stupid than that, it doesn't really
>> calculate the number of components that fit in a single GRF and then
>> shifts QtrCtrl based on that, but rather it's hardwired to shift by
>> four
>> channels in DF mode (at least on HSW) or by eight channels for any
>> other
>> execution type, so you need to find out what the execution type of
>> the
>> instruction is (which is not necessarily the same as the destination
>> type).
>
> Ok, then summing up, the process would be like this:
>
> Split gen7 instructions that write more than 1 register and have a DF
> execution type (i.e. either the dst or any of the sources are DF) and

The destination type is not considered during the selection of the
execution type AFAIK.

> have a stride that is not exactly 1 on either the DST (if DF) or any of
> the DF sources (with the caveat that for IVB we might need to do
> something different in the end).
>

What matters in the end is the number of channels of each decompressed
chunk of the instruction, so assuming that the original instruction is
compressed the right condition will be:

| force_writemask_all ||
| decompressed_size == (exec_type == BRW_REGISTER_TYPE_DF ? 4 : 8);

for the instruction to be valid.  Though feel free to keep your
approximation of decompressed_size as channels_per_grf, it doesn't look
like the difference will matter much in practice.

> Does that sound right to you?
>
>> > 
>> > +      return channels_per_grf;
>> For this to interact nicely with the other restrictions implemented
>> in
>> the same function in case several of them ever apply at the same
>> time,
>> move the check down and have it assign 'max_width = MIN2(max_width,
>> channels_per_grf)' instead of the early return.
>
>
> Ah, good point. Will do.
>
> Thanks Curro!
>
>> > 
>> > +   }
>> > +
>> >     /* Maximum execution size representable in the instruction
>> > controls. */
>> >     unsigned max_width = MIN2(32, inst->exec_size);
>> >  
>> > _______________________________________________
>> > mesa-dev mailing list
>> > mesa-dev at lists.freedesktop.org
>> > https://lists.freedesktop.org/mailman/listinfo/mesa-dev
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 212 bytes
Desc: not available
URL: <https://lists.freedesktop.org/archives/mesa-dev/attachments/20160708/f43a063c/attachment.sig>