[Mesa-dev] Register spilling issues in the NIR->vec4 backend

Wed Jul 15 11:02:03 PDT 2015

On Wed, Jul 15, 2015 at 7:49 AM, Iago Toral <itoral at igalia.com> wrote:
> Hi,
>
> when we sent the patches for the new nir->vec4 backend we mentioned that
> we had a few dEQP tests that would fail to link because of register
> spilling. Now that we have added GS support we see a few instances of
> this problem popping up in a few GS piglit tests too, for example this
> one:
>
> tests/spec/glsl-1.50/execution/variable-indexing/gs-input-array-vec4-index-rd.shader_test
>
> I have been looking into what is going on with these tests and I came to
> the conclusion that the problem is a consequence of various factors, but
> probably the main thing contributing to it is the way our SSA pass
> works. That said, I am not that experienced with NIR, so it could also
> be that my analysis is missing something and I am just arriving to wrong
> conclusions, so I'll explain my thoughts below and hopefully someone
> else with more NIR experience can jump in and confirm or reject my
> analysis.
>
> The GS code in that test looks like this:
>
> for (int p = 0; p < 3; p++) {
>    color = ((index >= ins[p].m1.length() ?
>             ins[p].m2[index-ins[p].m1.length()] :
>             ins[p].m1[index]) == expect) ?
>                vec4(0.0, 1.0, 0.0, 1.0) : vec4(1.0, 0.0, 0.0, 1.0);
>    gl_Position = gl_in[p].gl_Position;
>    EmitVertex();
> }
>
> One thing that is immediately contributing to the register pressure is
> some really awful code generated because of the indirect array indexing
> on the inputs inside the loop. This is because of the
> lower_variable_index_to_cond_assign lowering pass called from
> brw_shader.cpp. This pass will convert that color assignment into a
> bunch of nested if/else statements which makes the generated GLSL IR
> code rather large, involving plenty of temporaries too. This is only
> made worse by the fact that loop unrolling will replicate that 3 times.
> The result is a huge pile of GLSL IR with a few dozens of nested if/else
> statements and temporaries that looks like [1] (that is only a fragment
> of the GLSL IR).
>
> One thing that is particularly relevant in that code is that it has
> multiple conditional assignments to the same variable
> (dereference_array_value) as a consequence of this lowering pass.
>
> That much, however, is common to the NIR and non-NIR paths. The problem
> in the NIR case is that all these assignments generate new SSA values,
> which then become new registers in the final NIR form. This leads to NIR
> code like [2].  In contrast, the old vec4 visitor path, is able to have
> writes to the same variable write to the same register.
>
> As a result, if I print the code right before register allocation in the
> NIR path [3] and I compare that to what we get with the old vec4 visitor
> path at that same point [4], it is clearly visible that this difference
> is allowing the vec4 visitor path to reduce register pressure (see how
> in [4] we have multiple writes to vgrf5, while in [3] we always write to
> a new vgrf every time).
>
> So, am I missing something or is this kind of result expected with NIR
> programs? Is there anything in the nir->vec4 pass that we can do to fix
> this or does this need to be fixed when going out of SSA moe inside NIR?
>
> Iago
>
> [1] http://pastebin.com/5uA8ex2S
> [2] http://pastebin.com/pqLfvAVN
> [3] http://pastebin.com/64nSuUH8
> [4] http://pastebin.com/WCrdYxzt
>
> _______________________________________________
> mesa-dev mailing list
> mesa-dev at lists.freedesktop.org
> http://lists.freedesktop.org/mailman/listinfo/mesa-dev

Hi Iago,

Indeed, NIR does convert conditional writes to conditional selectss --
it's a required part of the conversion to SSA, and since our HW has a
conditional select instruction that's just as fast as doing a
conditional move, we haven't bothered much to try and change it back
during out-of-SSA. However, doing this shouldn't make things worse. In
your example, vgrf9, vgrf15, and vgrf17 all have very short live
intervals and don't interfere with vgrf11 (unless there's another use
of them somewhere after the snippet you pasted), which means that the
register allocator is free to allocate the destinations of all the
selects to the same register.

What's happening, though, is that you're running into our terrible
liveness analysis. After doing the proper liveness analysis, we figure
out the place each register first becomes live and last becomes dead,
and then we consider registers that have overlapping ranges to
interfere. So we consider vgrf11 to interfere with vgrf15 and vgrf17,
even though it really doesn't. The trouble with making it do the right
thing is that we may actually need to extend the live ranges of
registers when the exec masks don't match up, either because one uses
writemask_all or because they have incompatible exec masks due to
containing different datatypes (half-float vs. float, etc.). For
example, in your snippet, if vgrf15 were to be written with
force_writemask_all, then it actually would interfere with vgrf11. But
right now we're simply being conservative and assuming everything is
incompatible, which produces extra register pressure in situations
like the one you have. We have the same problem with FS, and I have
some idea how to make it better there, but I'm not sure if we would
care enough to port it to vec4.

So the long answer is that there are some unfortunate features of the
backend that are making things much worse than they should be wrt
register allocation, and separating things out like SSA does seems to
make it worse in certain situations. The short answer is: go fix
register spilling!

Connor