[Mesa-dev] [PATCH 12/13] nir: add pack_double_2x32_split_y opcode

Iago Toral itoral at igalia.com
Fri Apr 15 11:24:44 UTC 2016


On Thu, 2016-04-14 at 13:31 -0700, Jason Ekstrand wrote:
> 
> 
> On Thu, Apr 14, 2016 at 1:50 AM, Iago Toral <itoral at igalia.com> wrote:
>         On Wed, 2016-04-13 at 08:29 -0700, Jason Ekstrand wrote:
>         >
>         > On Apr 13, 2016 7:57 AM, "Connor Abbott"
>         <cwabbott0 at gmail.com> wrote:
>         > >
>         > > On Wed, Apr 13, 2016 at 3:24 AM, Iago Toral
>         <itoral at igalia.com>
>         > wrote:
>         > > > On Tue, 2016-04-12 at 13:16 -0400, Connor Abbott wrote:
>         > > >> I'm not sure I'm so comfortable with this. For one,
>         this is the
>         > sort
>         > > >> of thing the backend can (and should) do. Why should be
>         be adding
>         > > >> stuff at the NIR level to help the backend's copy
>         propagation
>         > pass?
>         > > >> And perhaps more importantly, why should we be adding
>         stuff at
>         > the NIR
>         > > >> level to workaround the lack of spilling support for
>         doubles? I
>         > know
>         > > >> it's more difficult to implement than normal spilling,
>         but it
>         > just
>         > > >> feels dirty to workaround it because some tests happen
>         to use too
>         > many
>         > > >> registers.
>         > > >
>         > > > My thoughts for this were that NIR is producing
>         suboptimal output
>         > for
>         > > > this case: it is unpacking and repacking an entire
>         64-bit float
>         > when it
>         > > > only really needs to operate on the high 32-bit, so
>         fixing that in
>         > NIR,
>         > > > where the issue is introduced, instead of having each
>         backend fix
>         > it,
>         > > > seemed to make more sense to me.
>         > > >
>         > > > I am not sure that backends can easily fix this either:
>         what they
>         > are
>         > > > going to see is a nir_op_pack_double_2x32_split
>         operation, they
>         > don't
>         > > > have any context to tell if the low 32-bit argument is
>         actually
>         > > > different from what is already in the register, other
>         than
>         > emitting code
>         > > > to check for that which kind of defeats the purpose of
>         the
>         > optimization.
>         > > > I suppose they could still see if the instruction that
>         generated
>         > the low
>         > > > 32-bit value is actually an unpack of the same register
>         and do the
>         > right
>         > > > thing in that case. Would you like that solution better?
>         > > >
>         > > > In any case, I don't think this is something for
>         copy-propagation
>         > to
>         > > > handle this is really not about copy-propagating a
>         value, it is
>         > about
>         > > > not needing that value to begin with. Register pressure
>         comes from
>         > > > needing to have the low 32-bit value alive until we
>         recombine it
>         > in the
>         > > > 64-bit value. This patch makes it so that the low 32-bit
>         is not
>         > alive at
>         > > > any moment, we don't need to care about it at any point,
>         and that
>         > is why
>         > > > it helps reduce register pressure.
>         >
>         > I think we have two fundamental problems here.  One is that
>         spilling
>         > is broken.  The second is that NIR is generating code that
>         the backend
>         > can't handle without spilling.  I can totally believe that
>         the later
>         > is happening given that, thanks to the flag, a the backend
>         can't move
>         > two SELs past each other when it schedules and we use SEL
>         very heavily
>         > in the double lowering.
>         >
>         > In this particular case, I do think we have a coalescing
>         problem more
>         > than a need for a new opcode.  Thanks to SSA you could just
>         as easily
>         > see a pack where one source is an unpack and do what you're
>         doing with
>         > this "replace the top bits" opcode in fs/vec4_nir without
>         adding a
>         > special opcode for it.  Therefore, even if this little
>         optimization
>         > helps, I think the opcode is unneeded.  I also think it's
>         probably
>         > solvable as Connor says with better coalescing.
>         
>         
>         Yes, this is what I suggested as an alternative to do this in
>         the
>         backend in my original reply to Connor. There is a small
>         caveat here
>         though: we would have to assume that if the low 32-bit operand
>         comes
>         from an unpack opcode, it is an unpack of the same value. We
>         can't
>         verify this precisely because of SSA.
> 
> 
> I don't follow.  If you have
> 
> 
> ssa_1 = unpack(ssa_0)
> 
> ssa_2 = stuff
> 
> ssa_3 = pack_split(ssa_2, ssa_1.y)

Yes, I don't know what I was thinking. In my mind I was playing with
more complex scenarios:

ssa2 = unpack_x(ssa0)
ssa3 = unpack_y(ssa0)
ssa4 = unpack_x(ssa1)
ssa5 = unpack_y(ssa1)
ssa7 = ...
ssa8 = pack_split(ssa4, ssa7);
ssa6 = pack_split(ssa3, ssa5)

The first pack_split is exactly the case we have been discussing with
set_exponent, the second is not and that is what got me confused, but it
is similar enough and could also be optimized similarly (we just need to
see what unpack opcode we are using to select the component we want to
optimize from).
 
pack_split(ssa0.x, ssa7);
pack_split(ssa0.y, ssa5);

I suppose we could do a similar thing for the high 32-bit too, and then
we would just leave:

pack_split(ssa0.x, ssa7);
pack_split(ssa0.y, ssa1.y);

And with that we would probably get rid of all 4 unpack opcodes...

I am not sure how interesting it is to optimize all these scenarios but
if we want to do that I am thinking that we probably want to do this in
NIR as a separate opt pass?

> that should be exactly equivalent to
> 
> 
> ssa_2 = stuff
> 
> ssa_3 = pack_split_y(ssa_0, ssa_2)
> 
> 
> You can easily detect it and just do the slightly simpler thing.
> 
>  
>         It is probably okay to assume
>         this, since I don't imagine situations where we want to unpack
>         a double
>         and repack its low 32-bit into a different double value, but
>         it is a bit
>         of a hack :)
> 
> 
> Sounds like we have hacks all round :-)
> 
> 
> I don't know if you're doing this, but one thing that might help
> substantially with your spilling problems is if we run
> lower_alu_to_scalar before we lower double ops.

Yes, we are doing this already. lower_alu_to_scalar is done first in
nir_optimize and nir_lower_doubles is done last.

> 
> Is there a branch I can play with that has these spilling problems?
> If so, what tests do I need to look at?  I'm planning to start playing
> with making spilling work better and would like to fix it for the fp64
> case while I'm at it.

Great, thanks! If there is anything we can do to help just let me know.

Our gen8 development branch here:
https://github.com/Igalia/mesa/tree/i965-fp64

And our gen7 development branch (which simply adds a few patches on top
of the gen8 branch):
https://github.com/Igalia/mesa/tree/i965-fp64-gen7

Both have some piglit tests break because of spilling (with or without
the patch we have been discussing here). The varying-packing tests in
particular:

spec/arb_gpu_shader_fp64/varying-packing/simple 

With the patch(es) we have been discussing here we fix a group of 4
other tests that would otherwise fail to spill too, but since these
branches include the patches you won't see them failing. If you are
interested I can check the exact tests, I think they were in this group:

spec/arb_gpu_shader_fp64/execution/built-in-functions/fs-op-div-*

when the types involved dmat4 or something like that.

Iago

>  
>         > > Think of what the backend is going to turn this into,
>         though. Say
>         > you
>         > > have a sequence like:
>         > >
>         > > hi = unpack_double_2x32_split_y foo
>         > >
>         > > new_hi = ...
>         > >
>         > > lo = unpack_double_2x32_split_x foo
>         > > bar = pack_double_2x32_split lo, new_hi
>         > >
>         > > then this gets transformed to, in the backend:
>         > >
>         > > hi = foo (stride 2, offset 1)
>         > >
>         > > new_hi  = ..
>         > >
>         > > lo = foo (stride 2, offset 0)
>         > > bar (stride 2, offset 0) = lo (stride 2, offset 0) # <-
>         notice this
>         > > bar (stride 2, offset 1) hi (stride 2, offset 1)
>         > >
>         > > Now the move from lo to bar has the same stride, and the
>         same
>         > offset,
>         > > and the same offset, and bar obviously doesn't conflict
>         with lo, so
>         > > register coalescing can combine lo and bar. This will
>         result in:
>         > >
>         > > hi = foo (stride 2, offset 1)
>         > >
>         > > new_hi  = ..
>         > >
>         > > bar (stride 2, offset 0) = foo (stride 2, offset 0)
>         > > bar (stride 2, offset 1) hi (stride 2, offset 1)
>         > >
>         > > This is the same code (with the same register pressure) as
>         if you
>         > had
>         > > used pack_double_2x32_split_y. In fact, in the bugzilla
>         bug you gave
>         > > me an example of a shader where this happens, and I
>         pointed it out
>         > to
>         > > you. I think all you need to do to fix your problem is
>         move down the
>         > > unpack_2x32_split_x to right before the pack, since the
>         backend
>         > isn't
>         > > smart enough to reorder it.
>         > >
>         > > In any case, I think that we should get spilling to work
>         on doubles
>         > regardless.
>         > >
>         > > >
>         > > > Jason also mentioned that what we should do is fix the
>         spilling
>         > code. I
>         > > > agree that we should do that, but I think the underlying
>         issue
>         > here is
>         > > > not that, but that NIR's output is leading to
>         inefficient code,
>         > which is
>         > > > then leading to unnecessary spilling in some cases. If
>         we fix the
>         > > > spilling then we can live with the problem, but the
>         problem is
>         > still
>         > > > there: we are still having live values that we don't
>         need taking
>         > some
>         > > > register space and leading to unnecessary spilling.
>         > > >
>         > > > In any case, I understand why you don't like the patch,
>         so I am
>         > fine
>         > > > with holding it back and focus on fixing spilling code
>         first.
>         > FWIW, I
>         > > > already had a stab at that, but I run into something
>         that I didn't
>         > quite
>         > > > figure out at that time:
>         > > >
>         > > >
>         >
>         https://lists.freedesktop.org/archives/mesa-dev/2016-March/109370.html
>         > >
>         > > I didn't notice this fly by, but you're not doing the
>         spilling
>         > right.
>         > > For things where the stride * type_size is 64, you can't
>         decompose
>         > it
>         > > into two scratch reads/writes, since then the exec mask
>         will be
>         > wrong.
>         > > You only want to read/write a channel if the exec mask is
>         enabled
>         > for
>         > > it. You have to use a different message to spill doubles,
>         since the
>         > > normal one (the scattered read/write) assumes the exec
>         mask channels
>         > > correspond to 32-bit chunks. Jason would know more about
>         this.
>         > >
>         > > >
>         > > > I still have plans to get back to that though.
>         > > >
>         > > > Iago
>         > > >
>         > > >> Jason, what do you think?
>         > > >>
>         > > >> On Tue, Apr 12, 2016 at 4:05 AM, Samuel Iglesias
>         Gonsálvez
>         > > >> <siglesias at igalia.com> wrote:
>         > > >> > From: Iago Toral Quiroga <itoral at igalia.com>
>         > > >> >
>         > > >> > This is useful when we only need to modify the high
>         32-bit
>         > chunk of a double.
>         > > >> > This is a common case, because this is the part that
>         encodes
>         > the exponent
>         > > >> > which we manipulate in some double lowering passes.
>         Although we
>         > can accomplish
>         > > >> > the same by using pack_double_2x32, this new opcode
>         is better
>         > for register
>         > > >> > pressure, since we don't have to unpack both parts of
>         the
>         > double and keep
>         > > >> > the low 32-bits around until we can recombine the new
>         exponent.
>         > > >> >
>         > > >> > We will use this opcode in the set_exponent()
>         function of the
>         > double lowering
>         > > >> > pass and with that we will fix spilling issues in
>         some dmat4
>         > divide
>         > > >> > piglit tests on Intel.
>         > > >> > ---
>         > > >> >  src/compiler/nir/nir_opcodes.py | 14 ++++++++++++++
>         > > >> >  1 file changed, 14 insertions(+)
>         > > >> >
>         > > >> > diff --git a/src/compiler/nir/nir_opcodes.py
>         > b/src/compiler/nir/nir_opcodes.py
>         > > >> > index 9f62e08..f92dd8d 100644
>         > > >> > --- a/src/compiler/nir/nir_opcodes.py
>         > > >> > +++ b/src/compiler/nir/nir_opcodes.py
>         > > >> > @@ -276,6 +276,20 @@ di.i2 = src0.y;
>         > > >> >  dst.x = di.u64;
>         > > >> >  """)
>         > > >> >
>         > > >> > +opcode("pack_double_2x32_split_y", 0, tuint64, [0,
>         0],
>         > [tuint64, tuint32], "", """
>         > > >> > +union {
>         > > >> > +    uint64_t u64;
>         > > >> > +    struct {
>         > > >> > +        uint32_t i1;
>         > > >> > +        uint32_t i2;
>         > > >> > +    };
>         > > >> > +} di;
>         > > >> > +
>         > > >> > +di.u64 = src0;
>         > > >> > +di.i1 = src1;
>         > > >> > +dst = di.u64;
>         > > >> > +""")
>         > > >> > +
>         > > >> >  unop_horiz("unpack_double_2x32", 2, tuint32, 1,
>         tuint64, """
>         > > >> >  union {
>         > > >> >      uint64_t u64;
>         > > >> > --
>         > > >> > 2.5.0
>         > > >> >
>         > > >> > _______________________________________________
>         > > >> > mesa-dev mailing list
>         > > >> > mesa-dev at lists.freedesktop.org
>         > > >> >
>         https://lists.freedesktop.org/mailman/listinfo/mesa-dev
>         > > >> _______________________________________________
>         > > >> mesa-dev mailing list
>         > > >> mesa-dev at lists.freedesktop.org
>         > > >> https://lists.freedesktop.org/mailman/listinfo/mesa-dev
>         > > >
>         > > >
>         >
>         >
>         
>         
>         
> 
> 




More information about the mesa-dev mailing list