[Mesa-dev] [PATCH 12/13] nir: add pack_double_2x32_split_y opcode

Wed Apr 13 15:29:33 UTC 2016

On Apr 13, 2016 7:57 AM, "Connor Abbott" <cwabbott0 at gmail.com> wrote:
>
> On Wed, Apr 13, 2016 at 3:24 AM, Iago Toral <itoral at igalia.com> wrote:
> > On Tue, 2016-04-12 at 13:16 -0400, Connor Abbott wrote:
> >> I'm not sure I'm so comfortable with this. For one, this is the sort
> >> of thing the backend can (and should) do. Why should be be adding
> >> stuff at the NIR level to help the backend's copy propagation pass?
> >> And perhaps more importantly, why should we be adding stuff at the NIR
> >> level to workaround the lack of spilling support for doubles? I know
> >> it's more difficult to implement than normal spilling, but it just
> >> feels dirty to workaround it because some tests happen to use too many
> >> registers.
> >
> > My thoughts for this were that NIR is producing suboptimal output for
> > this case: it is unpacking and repacking an entire 64-bit float when it
> > only really needs to operate on the high 32-bit, so fixing that in NIR,
> > where the issue is introduced, instead of having each backend fix it,
> > seemed to make more sense to me.
> >
> > I am not sure that backends can easily fix this either: what they are
> > going to see is a nir_op_pack_double_2x32_split operation, they don't
> > have any context to tell if the low 32-bit argument is actually
> > different from what is already in the register, other than emitting code
> > to check for that which kind of defeats the purpose of the optimization.
> > I suppose they could still see if the instruction that generated the low
> > 32-bit value is actually an unpack of the same register and do the right
> > thing in that case. Would you like that solution better?
> >
> > In any case, I don't think this is something for copy-propagation to
> > handle this is really not about copy-propagating a value, it is about
> > not needing that value to begin with. Register pressure comes from
> > needing to have the low 32-bit value alive until we recombine it in the
> > 64-bit value. This patch makes it so that the low 32-bit is not alive at
> > any moment, we don't need to care about it at any point, and that is why
> > it helps reduce register pressure.

I think we have two fundamental problems here.  One is that spilling is
broken.  The second is that NIR is generating code that the backend can't
handle without spilling.  I can totally believe that the later is happening
given that, thanks to the flag, a the backend can't move two SELs past each
other when it schedules and we use SEL very heavily in the double lowering.

In this particular case, I do think we have a coalescing problem more than
a need for a new opcode.  Thanks to SSA you could just as easily see a pack
where one source is an unpack and do what you're doing with this "replace
the top bits" opcode in fs/vec4_nir without adding a special opcode for
it.  Therefore, even if this little optimization helps, I think the opcode
is unneeded.  I also think it's probably solvable as Connor says with
better coalescing.

> Think of what the backend is going to turn this into, though. Say you
> have a sequence like:
>
> hi = unpack_double_2x32_split_y foo
>
> new_hi = ...
>
> lo = unpack_double_2x32_split_x foo
> bar = pack_double_2x32_split lo, new_hi
>
> then this gets transformed to, in the backend:
>
> hi = foo (stride 2, offset 1)
>
> new_hi  = ..
>
> lo = foo (stride 2, offset 0)
> bar (stride 2, offset 0) = lo (stride 2, offset 0) # <- notice this
> bar (stride 2, offset 1) hi (stride 2, offset 1)
>
> Now the move from lo to bar has the same stride, and the same offset,
> and the same offset, and bar obviously doesn't conflict with lo, so
> register coalescing can combine lo and bar. This will result in:
>
> hi = foo (stride 2, offset 1)
>
> new_hi  = ..
>
> bar (stride 2, offset 0) = foo (stride 2, offset 0)
> bar (stride 2, offset 1) hi (stride 2, offset 1)
>
> This is the same code (with the same register pressure) as if you had
> used pack_double_2x32_split_y. In fact, in the bugzilla bug you gave
> me an example of a shader where this happens, and I pointed it out to
> you. I think all you need to do to fix your problem is move down the
> unpack_2x32_split_x to right before the pack, since the backend isn't
> smart enough to reorder it.
>
> In any case, I think that we should get spilling to work on doubles
regardless.
>
> >
> > Jason also mentioned that what we should do is fix the spilling code. I
> > agree that we should do that, but I think the underlying issue here is
> > not that, but that NIR's output is leading to inefficient code, which is
> > then leading to unnecessary spilling in some cases. If we fix the
> > spilling then we can live with the problem, but the problem is still
> > there: we are still having live values that we don't need taking some
> > register space and leading to unnecessary spilling.
> >
> > In any case, I understand why you don't like the patch, so I am fine
> > with holding it back and focus on fixing spilling code first. FWIW, I
> > already had a stab at that, but I run into something that I didn't quite
> > figure out at that time:
> >
> > https://lists.freedesktop.org/archives/mesa-dev/2016-March/109370.html
>
> I didn't notice this fly by, but you're not doing the spilling right.
> For things where the stride * type_size is 64, you can't decompose it
> into two scratch reads/writes, since then the exec mask will be wrong.
> You only want to read/write a channel if the exec mask is enabled for
> it. You have to use a different message to spill doubles, since the
> normal one (the scattered read/write) assumes the exec mask channels
> correspond to 32-bit chunks. Jason would know more about this.
>
> >
> > I still have plans to get back to that though.
> >
> > Iago
> >
> >> Jason, what do you think?
> >>
> >> On Tue, Apr 12, 2016 at 4:05 AM, Samuel Iglesias Gonsálvez
> >> <siglesias at igalia.com> wrote:
> >> > From: Iago Toral Quiroga <itoral at igalia.com>
> >> >
> >> > This is useful when we only need to modify the high 32-bit chunk of
a double.
> >> > This is a common case, because this is the part that encodes the
exponent
> >> > which we manipulate in some double lowering passes. Although we can
accomplish
> >> > the same by using pack_double_2x32, this new opcode is better for
register
> >> > pressure, since we don't have to unpack both parts of the double and
keep
> >> > the low 32-bits around until we can recombine the new exponent.
> >> >
> >> > We will use this opcode in the set_exponent() function of the double
lowering
> >> > pass and with that we will fix spilling issues in some dmat4 divide
> >> > piglit tests on Intel.
> >> > ---
> >> >  src/compiler/nir/nir_opcodes.py | 14 ++++++++++++++
> >> >  1 file changed, 14 insertions(+)
> >> >
> >> > diff --git a/src/compiler/nir/nir_opcodes.py
b/src/compiler/nir/nir_opcodes.py
> >> > index 9f62e08..f92dd8d 100644
> >> > --- a/src/compiler/nir/nir_opcodes.py
> >> > +++ b/src/compiler/nir/nir_opcodes.py
> >> > @@ -276,6 +276,20 @@ di.i2 = src0.y;
> >> >  dst.x = di.u64;
> >> >  """)
> >> >
> >> > +opcode("pack_double_2x32_split_y", 0, tuint64, [0, 0], [tuint64,
tuint32], "", """
> >> > +union {
> >> > +    uint64_t u64;
> >> > +    struct {
> >> > +        uint32_t i1;
> >> > +        uint32_t i2;
> >> > +    };
> >> > +} di;
> >> > +
> >> > +di.u64 = src0;
> >> > +di.i1 = src1;
> >> > +dst = di.u64;
> >> > +""")
> >> > +
> >> >  unop_horiz("unpack_double_2x32", 2, tuint32, 1, tuint64, """
> >> >  union {
> >> >      uint64_t u64;
> >> > --
> >> > 2.5.0
> >> >
> >> > _______________________________________________
> >> > mesa-dev mailing list
> >> > mesa-dev at lists.freedesktop.org
> >> > https://lists.freedesktop.org/mailman/listinfo/mesa-dev
> >> _______________________________________________
> >> mesa-dev mailing list
> >> mesa-dev at lists.freedesktop.org
> >> https://lists.freedesktop.org/mailman/listinfo/mesa-dev
> >
> >
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://lists.freedesktop.org/archives/mesa-dev/attachments/20160413/e3412aee/attachment-0001.html>