[Mesa-dev] [PATCH 12/13] nir: add pack_double_2x32_split_y opcode

Thu Apr 14 08:03:30 UTC 2016

On Wed, 2016-04-13 at 10:57 -0400, Connor Abbott wrote:
> On Wed, Apr 13, 2016 at 3:24 AM, Iago Toral <itoral at igalia.com> wrote:
> > On Tue, 2016-04-12 at 13:16 -0400, Connor Abbott wrote:
> >> I'm not sure I'm so comfortable with this. For one, this is the sort
> >> of thing the backend can (and should) do. Why should be be adding
> >> stuff at the NIR level to help the backend's copy propagation pass?
> >> And perhaps more importantly, why should we be adding stuff at the NIR
> >> level to workaround the lack of spilling support for doubles? I know
> >> it's more difficult to implement than normal spilling, but it just
> >> feels dirty to workaround it because some tests happen to use too many
> >> registers.
> >
> > My thoughts for this were that NIR is producing suboptimal output for
> > this case: it is unpacking and repacking an entire 64-bit float when it
> > only really needs to operate on the high 32-bit, so fixing that in NIR,
> > where the issue is introduced, instead of having each backend fix it,
> > seemed to make more sense to me.
> >
> > I am not sure that backends can easily fix this either: what they are
> > going to see is a nir_op_pack_double_2x32_split operation, they don't
> > have any context to tell if the low 32-bit argument is actually
> > different from what is already in the register, other than emitting code
> > to check for that which kind of defeats the purpose of the optimization.
> > I suppose they could still see if the instruction that generated the low
> > 32-bit value is actually an unpack of the same register and do the right
> > thing in that case. Would you like that solution better?
> >
> > In any case, I don't think this is something for copy-propagation to
> > handle this is really not about copy-propagating a value, it is about
> > not needing that value to begin with. Register pressure comes from
> > needing to have the low 32-bit value alive until we recombine it in the
> > 64-bit value. This patch makes it so that the low 32-bit is not alive at
> > any moment, we don't need to care about it at any point, and that is why
> > it helps reduce register pressure.
> 
> Think of what the backend is going to turn this into, though. Say you
> have a sequence like:
> 
> hi = unpack_double_2x32_split_y foo
> 
> new_hi = ...
> 
> lo = unpack_double_2x32_split_x foo
> bar = pack_double_2x32_split lo, new_hi
> 
> then this gets transformed to, in the backend:
> 
> hi = foo (stride 2, offset 1)
> 
> new_hi  = ..
> 
> lo = foo (stride 2, offset 0)
> bar (stride 2, offset 0) = lo (stride 2, offset 0) # <- notice this
> bar (stride 2, offset 1) hi (stride 2, offset 1)
> 
> Now the move from lo to bar has the same stride, and the same offset,
> and the same offset, and bar obviously doesn't conflict with lo, so
> register coalescing can combine lo and bar. This will result in:
> 
> hi = foo (stride 2, offset 1)
> 
> new_hi  = ..
> 
> bar (stride 2, offset 0) = foo (stride 2, offset 0)
> bar (stride 2, offset 1) hi (stride 2, offset 1)
> 
> This is the same code (with the same register pressure) as if you had
> used pack_double_2x32_split_y. In fact, in the bugzilla bug you gave
> me an example of a shader where this happens, and I pointed it out to
> you. I think all you need to do to fix your problem is move down the
> unpack_2x32_split_x to right before the pack, since the backend isn't
> smart enough to reorder it.

Sure, I can try that and see if that is enough. I'd rather see this
addressed in NIR if possible better than in the backends.

> In any case, I think that we should get spilling to work on doubles regardless.

Yes, we all agree on this.

> >
> > Jason also mentioned that what we should do is fix the spilling code. I
> > agree that we should do that, but I think the underlying issue here is
> > not that, but that NIR's output is leading to inefficient code, which is
> > then leading to unnecessary spilling in some cases. If we fix the
> > spilling then we can live with the problem, but the problem is still
> > there: we are still having live values that we don't need taking some
> > register space and leading to unnecessary spilling.
> >
> > In any case, I understand why you don't like the patch, so I am fine
> > with holding it back and focus on fixing spilling code first. FWIW, I
> > already had a stab at that, but I run into something that I didn't quite
> > figure out at that time:
> >
> > https://lists.freedesktop.org/archives/mesa-dev/2016-March/109370.html
> 
> I didn't notice this fly by, but you're not doing the spilling right.
> For things where the stride * type_size is 64, you can't decompose it
> into two scratch reads/writes, since then the exec mask will be wrong.
> You only want to read/write a channel if the exec mask is enabled for
> it. You have to use a different message to spill doubles, since the
> normal one (the scattered read/write) assumes the exec mask channels
> correspond to 32-bit chunks. Jason would know more about this.

Right, I also thought this might be the problem some time after sending
that e-mail and was the next thing I wanted to check. I was wondering,
if there is a reason for scratch read/writes to honor the execmask
(unless we just can't avoid that of course). If we just ignored the
execmask we would be always spilling and unspilling the exact, full
registers and not just the parts of them which would save us these
problems... Anyway, I plan to look into this, it is good to have
confirmation that this is indeed a likely cause for the problem, so
thanks for bringing it up! :)

Iago

> >
> > I still have plans to get back to that though.
> >
> > Iago
> >
> >> Jason, what do you think?
> >>
> >> On Tue, Apr 12, 2016 at 4:05 AM, Samuel Iglesias Gonsálvez
> >> <siglesias at igalia.com> wrote:
> >> > From: Iago Toral Quiroga <itoral at igalia.com>
> >> >
> >> > This is useful when we only need to modify the high 32-bit chunk of a double.
> >> > This is a common case, because this is the part that encodes the exponent
> >> > which we manipulate in some double lowering passes. Although we can accomplish
> >> > the same by using pack_double_2x32, this new opcode is better for register
> >> > pressure, since we don't have to unpack both parts of the double and keep
> >> > the low 32-bits around until we can recombine the new exponent.
> >> >
> >> > We will use this opcode in the set_exponent() function of the double lowering
> >> > pass and with that we will fix spilling issues in some dmat4 divide
> >> > piglit tests on Intel.
> >> > ---
> >> >  src/compiler/nir/nir_opcodes.py | 14 ++++++++++++++
> >> >  1 file changed, 14 insertions(+)
> >> >
> >> > diff --git a/src/compiler/nir/nir_opcodes.py b/src/compiler/nir/nir_opcodes.py
> >> > index 9f62e08..f92dd8d 100644
> >> > --- a/src/compiler/nir/nir_opcodes.py
> >> > +++ b/src/compiler/nir/nir_opcodes.py
> >> > @@ -276,6 +276,20 @@ di.i2 = src0.y;
> >> >  dst.x = di.u64;
> >> >  """)
> >> >
> >> > +opcode("pack_double_2x32_split_y", 0, tuint64, [0, 0], [tuint64, tuint32], "", """
> >> > +union {
> >> > +    uint64_t u64;
> >> > +    struct {
> >> > +        uint32_t i1;
> >> > +        uint32_t i2;
> >> > +    };
> >> > +} di;
> >> > +
> >> > +di.u64 = src0;
> >> > +di.i1 = src1;
> >> > +dst = di.u64;
> >> > +""")
> >> > +
> >> >  unop_horiz("unpack_double_2x32", 2, tuint32, 1, tuint64, """
> >> >  union {
> >> >      uint64_t u64;
> >> > --
> >> > 2.5.0
> >> >
> >> > _______________________________________________
> >> > mesa-dev mailing list
> >> > mesa-dev at lists.freedesktop.org
> >> > https://lists.freedesktop.org/mailman/listinfo/mesa-dev
> >> _______________________________________________
> >> mesa-dev mailing list
> >> mesa-dev at lists.freedesktop.org
> >> https://lists.freedesktop.org/mailman/listinfo/mesa-dev
> >
> >
>