[Mesa-dev] Register spilling issues in the NIR->vec4 backend

Mon Jul 20 06:35:26 PDT 2015

Hi,
On Thu, 2015-07-16 at 08:15 -0700, Jason Ekstrand wrote:
> 
> On Jul 15, 2015 11:20 PM, "Iago Toral" <itoral at igalia.com> wrote:
> >
> > On Wed, 2015-07-15 at 11:02 -0700, Connor Abbott wrote:
> > > On Wed, Jul 15, 2015 at 7:49 AM, Iago Toral <itoral at igalia.com>
> wrote:
> > > > Hi,
> > > >
> > > > when we sent the patches for the new nir->vec4 backend we
> mentioned that
> > > > we had a few dEQP tests that would fail to link because of
> register
> > > > spilling. Now that we have added GS support we see a few
> instances of
> > > > this problem popping up in a few GS piglit tests too, for
> example this
> > > > one:
> > > >
> > > >
> tests/spec/glsl-1.50/execution/variable-indexing/gs-input-array-vec4-index-rd.shader_test
> > > >
> > > > I have been looking into what is going on with these tests and I
> came to
> > > > the conclusion that the problem is a consequence of various
> factors, but
> > > > probably the main thing contributing to it is the way our SSA
> pass
> > > > works. That said, I am not that experienced with NIR, so it
> could also
> > > > be that my analysis is missing something and I am just arriving
> to wrong
> > > > conclusions, so I'll explain my thoughts below and hopefully
> someone
> > > > else with more NIR experience can jump in and confirm or reject
> my
> > > > analysis.
> > > >
> > > > The GS code in that test looks like this:
> > > >
> > > > for (int p = 0; p < 3; p++) {
> > > >    color = ((index >= ins[p].m1.length() ?
> > > >             ins[p].m2[index-ins[p].m1.length()] :
> > > >             ins[p].m1[index]) == expect) ?
> > > >                vec4(0.0, 1.0, 0.0, 1.0) : vec4(1.0, 0.0, 0.0,
> 1.0);
> > > >    gl_Position = gl_in[p].gl_Position;
> > > >    EmitVertex();
> > > > }
> > > >
> > > > One thing that is immediately contributing to the register
> pressure is
> > > > some really awful code generated because of the indirect array
> indexing
> > > > on the inputs inside the loop. This is because of the
> > > > lower_variable_index_to_cond_assign lowering pass called from
> > > > brw_shader.cpp. This pass will convert that color assignment
> into a
> > > > bunch of nested if/else statements which makes the generated
> GLSL IR
> > > > code rather large, involving plenty of temporaries too. This is
> only
> > > > made worse by the fact that loop unrolling will replicate that 3
> times.
> > > > The result is a huge pile of GLSL IR with a few dozens of nested
> if/else
> > > > statements and temporaries that looks like [1] (that is only a
> fragment
> > > > of the GLSL IR).
> > > >
> > > > One thing that is particularly relevant in that code is that it
> has
> > > > multiple conditional assignments to the same variable
> > > > (dereference_array_value) as a consequence of this lowering
> pass.
> > > >
> > > > That much, however, is common to the NIR and non-NIR paths. The
> problem
> > > > in the NIR case is that all these assignments generate new SSA
> values,
> > > > which then become new registers in the final NIR form. This
> leads to NIR
> > > > code like [2].  In contrast, the old vec4 visitor path, is able
> to have
> > > > writes to the same variable write to the same register.
> > > >
> > > > As a result, if I print the code right before register
> allocation in the
> > > > NIR path [3] and I compare that to what we get with the old vec4
> visitor
> > > > path at that same point [4], it is clearly visible that this
> difference
> > > > is allowing the vec4 visitor path to reduce register pressure
> (see how
> > > > in [4] we have multiple writes to vgrf5, while in [3] we always
> write to
> > > > a new vgrf every time).
> > > >
> > > > So, am I missing something or is this kind of result expected
> with NIR
> > > > programs? Is there anything in the nir->vec4 pass that we can do
> to fix
> > > > this or does this need to be fixed when going out of SSA moe
> inside NIR?
> > > >
> > > > Iago
> > > >
> > > > [1] http://pastebin.com/5uA8ex2S
> > > > [2] http://pastebin.com/pqLfvAVN
> > > > [3] http://pastebin.com/64nSuUH8
> > > > [4] http://pastebin.com/WCrdYxzt
> > > >
> > > > _______________________________________________
> > > > mesa-dev mailing list
> > > > mesa-dev at lists.freedesktop.org
> > > > http://lists.freedesktop.org/mailman/listinfo/mesa-dev
> > >
> > > Hi Iago,
> > >
> > > Indeed, NIR does convert conditional writes to conditional
> selectss --
> > > it's a required part of the conversion to SSA, and since our HW
> has a
> > > conditional select instruction that's just as fast as doing a
> > > conditional move, we haven't bothered much to try and change it
> back
> > > during out-of-SSA. However, doing this shouldn't make things
> worse. In
> > > your example, vgrf9, vgrf15, and vgrf17 all have very short live
> > > intervals and don't interfere with vgrf11 (unless there's another
> use
> > > of them somewhere after the snippet you pasted), which means that
> the
> > > register allocator is free to allocate the destinations of all the
> > > selects to the same register.
> > >
> > > What's happening, though, is that you're running into our terrible
> > > liveness analysis. After doing the proper liveness analysis, we
> figure
> > > out the place each register first becomes live and last becomes
> dead,
> > > and then we consider registers that have overlapping ranges to
> > > interfere. So we consider vgrf11 to interfere with vgrf15 and
> vgrf17,
> > > even though it really doesn't. The trouble with making it do the
> right
> > > thing is that we may actually need to extend the live ranges of
> > > registers when the exec masks don't match up, either because one
> uses
> > > writemask_all or because they have incompatible exec masks due to
> > > containing different datatypes (half-float vs. float, etc.). For
> > > example, in your snippet, if vgrf15 were to be written with
> > > force_writemask_all, then it actually would interfere with vgrf11.
> But
> > > right now we're simply being conservative and assuming everything
> is
> > > incompatible, which produces extra register pressure in situations
> > > like the one you have. We have the same problem with FS, and I
> have
> > > some idea how to make it better there, but I'm not sure if we
> would
> > > care enough to port it to vec4.
> > >
> > > So the long answer is that there are some unfortunate features of
> the
> > > backend that are making things much worse than they should be wrt
> > > register allocation, and separating things out like SSA does seems
> to
> > > make it worse in certain situations. The short answer is: go fix
> > > register spilling!
> > >
> > > Connor
> >
> > I see, thanks for the detailed reply Connor. I'll have a look at the
> > vec4 register spilling code and Ben's work in that area and see what
> I
> > can do.
> 
> Thanks for working on that, it is the core problem here.  In the FS,
> we have an INTEL_DEBUG=spill option that triggers tells it to spill
> *everything*. That helps a lot when debugging spilling.
> --Jason

FYI, I have just fixed this. The problem was not related to register
spilling though but to incorrect live analysis for conditional writes
with the select opcode. I have just sent this patch for review:
http://lists.freedesktop.org/archives/mesa-dev/2015-July/089219.html

That fixes all the piglit and dEQP tests that failed to link due to
excessive register spilling with the NIR-vec4 backend.

I can still have a go at fixing issues with register spilling though.
Looking at previous work from Ben it looks like he was trying to make it
work with register sizes >1. Is that what you all had in mind or is
there something else?

Iago