[Mesa-dev] [PATCH 3/6] i965/fs/gen7: split instructions that run into exec masking bugs
Iago Toral
itoral at igalia.com
Fri Jul 8 08:28:32 UTC 2016
On Fri, 2016-07-08 at 09:21 +0200, Iago Toral wrote:
> On Thu, 2016-07-07 at 19:36 -0700, Francisco Jerez wrote:
> >
> > Samuel Iglesias Gonsálvez <siglesias at igalia.com> writes:
> >
> > >
> > >
> > > From: Iago Toral Quiroga <itoral at igalia.com>
> > >
> > > In fp64 we can produce code like this:
> > >
> > > mov(16) vgrf2<2>:UD, vgrf3<2>:UD
> > >
> > > That our simd lowering pass would typically split in instructions
> > > with a
> > > width of 8, writing to two consecutive registers each.
> > > Unfortunately, gen7
> > > hardware has a bug affecting execution masking and as a result,
> > > the
> > > second GRF register write won't work properly. Curro verified
> > > this:
> > >
> > > "The problem is that pre-Gen8 EUs are hardwired to use the
> > > QtrCtrl+1
> > > (where QtrCtrl is the 8-bit quarter of the execution mask
> > > signals
> > > specified in the instruction control fields) for the second
> > > compressed half of any single-precision instruction (for
> > > double-precision instructions it's hardwired to use NibCtrl+1),
> > > which means that the EU will apply the wrong execution controls
> > > for the second sequential GRF write if the number of channels
> > > per
> > > GRF is not exactly eight in single-precision mode (or four in
> > > double-float mode)."
> > >
> > > In practice, this means that we cannot write more than one
> > > consecutive GRF in a single instruction if the number of channels
> > > per GRF is not exactly eight in single-precision mode (or four
> > > in double-float mode).
> > >
> > > This patch makes our SIMD lowering pass split this kind of
> > > instructions
> > > so that the split versions only write to a single register. In
> > > the
> > > example above this means that we split the write in 4
> > > instructions,
> > > each
> > > one writing 4 UD elements (width = 4) to a single register.
> > > ---
> > > src/mesa/drivers/dri/i965/brw_fs.cpp | 20 ++++++++++++++++++++
> > > 1 file changed, 20 insertions(+)
> > >
> > > diff --git a/src/mesa/drivers/dri/i965/brw_fs.cpp
> > > b/src/mesa/drivers/dri/i965/brw_fs.cpp
> > > index 2f473cc..caf88d1 100644
> > > --- a/src/mesa/drivers/dri/i965/brw_fs.cpp
> > > +++ b/src/mesa/drivers/dri/i965/brw_fs.cpp
> > > @@ -4677,6 +4677,26 @@ static unsigned
> > > get_fpu_lowered_simd_width(const struct brw_device_info
> > > *devinfo,
> > > const fs_inst *inst)
> > > {
> > > + /* Pre-Gen8 EUs are hardwired to use the QtrCtrl+1 (where
> > > QtrCtrl is
> > > + * the 8-bit quarter of the execution mask signals specified
> > > in
> > > the
> > > + * instruction control fields) for the second compressed half
> > > of any
> > > + * single-precision instruction (for double-precision
> > > instructions
> > > + * it's hardwired to use NibCtrl+1), which means that the EU
> > > will
> > When I found out the hardware issue you describe in this comment I
> > only
> > had a HSW at hand, so I looked into this again today in order to
> > verify
> > whether IVB/VLV behave in the same way, and I'm afraid they
> > don't... I
> > haven't tried it on real IVB hardware, but at least the simulator
> > behaves the same as HSW for single precision execution types, while
> > for
> > double precision types instruction decompression seems to be
> > completely
> > busted. AFAICT it applies the same channel enable signals to both
> > halves of the compressed instruction which will be just wrong under
> > non-uniform control flow. Can you clarify in the comment above
> > that
> > the
> > text in parentheses referring to double-precision instructions may
> > only
> > apply to HSW?
> Sure.
>
> >
> > Have you been able to get any of the FP64 non-uniform control flow
> > tests
> > to pass on IVB? If you have I guess this may be a simulator-only
> > bug,
> > although I'm not sure the FS tests you have written will be non-
> > uniform
> > enough to reproduce the issue. If you haven't, we may have to add
> > another check here in order to lower all non-force_writemask_all DF
> > instructions to SIMD4 on IVB/VLV... :(
> For now we have focused on HSW only but thanks for the update
> regarding
> IVB. I'll make sure to pay special attention to non-uniform control-
> flow cases there when we start testing things there (we intend to
> start
> putting effort on IVB in the next days).
>
> I suppose that for now we can leave this patch as gen7 specific
> (rather
> than HSW specific) and fix it up if needed for IVB when we actually
> verify that it needs extra work there. Does that sound like a good
> plan?
>
> >
> > >
> > >
> > > + * apply the wrong execution controls for the second
> > > sequential
> > > GRF
> > > + * write if the number of channels per GRF is not exactly
> > > eight
> > > in
> > > + * single-precision mode (or four in double-float mode).
> > > + *
> > > + * In this situation we calculate the maximum size of the
> > > split
> > > + * instructions so they only ever write to a single register.
> > > + */
> > > + unsigned type_size = type_sz(inst->dst.type);
> > > + unsigned channels_per_grf = inst->exec_size / inst-
> > > >
> > > > regs_written;
> > This will cause a division by zero if the instruction doesn't write
> > any
> > registers. Strictly speaking you'd need to check the source types
> > too
> > in order to find out whether the instruction is compressed...
> Aha, makes sense, so I check the dst and sources and pick the largest
> to figure out the actual size of the instruction.
>
> >
> > >
> > >
> > > + assert(channels_per_grf > 0);
> > > + if (devinfo->gen < 8 && inst->regs_written > 1 &&
> > > + channels_per_grf != REG_SIZE / type_size) {
> > I believe the hardware is more stupid than that, it doesn't really
> > calculate the number of components that fit in a single GRF and
> > then
> > shifts QtrCtrl based on that, but rather it's hardwired to shift by
> > four
> > channels in DF mode (at least on HSW) or by eight channels for any
> > other
> > execution type, so you need to find out what the execution type of
> > the
> > instruction is (which is not necessarily the same as the
> > destination
> > type).
> Ok, then summing up, the process would be like this:
>
> Split gen7 instructions that write more than 1 register and have a DF
> execution type (i.e. either the dst or any of the sources are DF) and
> have a stride that is not exactly 1 on either the DST (if DF) or any
> of
> the DF sources (with the caveat that for IVB we might need to do
> something different in the end).
Well, not DF sources/dst but any source/dst that has a stride other
than 1 since the problem happens when we don't use all the channels
available in the register for either 32b or 64b types.
> Does that sound right to you?
>
> >
> > >
> > >
> > > + return channels_per_grf;
> > For this to interact nicely with the other restrictions implemented
> > in
> > the same function in case several of them ever apply at the same
> > time,
> > move the check down and have it assign 'max_width = MIN2(max_width,
> > channels_per_grf)' instead of the early return.
>
> Ah, good point. Will do.
>
> Thanks Curro!
>
> >
> > >
> > >
> > > + }
> > > +
> > > /* Maximum execution size representable in the instruction
> > > controls. */
> > > unsigned max_width = MIN2(32, inst->exec_size);
> > >
> > > _______________________________________________
> > > mesa-dev mailing list
> > > mesa-dev at lists.freedesktop.org
> > > https://lists.freedesktop.org/mailman/listinfo/mesa-dev
> _______________________________________________
> mesa-dev mailing list
> mesa-dev at lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/mesa-dev
More information about the mesa-dev
mailing list