[Mesa-dev] Register spilling issues in the NIR->vec4 backend

Wed Jul 15 14:30:35 PDT 2015

On Wed, Jul 15, 2015 at 11:02:03AM -0700, Connor Abbott wrote:
> On Wed, Jul 15, 2015 at 7:49 AM, Iago Toral <itoral at igalia.com> wrote:
> > Hi,
> >
> > when we sent the patches for the new nir->vec4 backend we mentioned that
> > we had a few dEQP tests that would fail to link because of register
> > spilling. Now that we have added GS support we see a few instances of
> > this problem popping up in a few GS piglit tests too, for example this
> > one:
> >
> > tests/spec/glsl-1.50/execution/variable-indexing/gs-input-array-vec4-index-rd.shader_test
> >
> > I have been looking into what is going on with these tests and I came to
> > the conclusion that the problem is a consequence of various factors, but
> > probably the main thing contributing to it is the way our SSA pass
> > works. That said, I am not that experienced with NIR, so it could also
> > be that my analysis is missing something and I am just arriving to wrong
> > conclusions, so I'll explain my thoughts below and hopefully someone
> > else with more NIR experience can jump in and confirm or reject my
> > analysis.
> >
> > The GS code in that test looks like this:
> >
> > for (int p = 0; p < 3; p++) {
> >    color = ((index >= ins[p].m1.length() ?
> >             ins[p].m2[index-ins[p].m1.length()] :
> >             ins[p].m1[index]) == expect) ?
> >                vec4(0.0, 1.0, 0.0, 1.0) : vec4(1.0, 0.0, 0.0, 1.0);
> >    gl_Position = gl_in[p].gl_Position;
> >    EmitVertex();
> > }
> >
> > One thing that is immediately contributing to the register pressure is
> > some really awful code generated because of the indirect array indexing
> > on the inputs inside the loop. This is because of the
> > lower_variable_index_to_cond_assign lowering pass called from
> > brw_shader.cpp. This pass will convert that color assignment into a
> > bunch of nested if/else statements which makes the generated GLSL IR
> > code rather large, involving plenty of temporaries too. This is only
> > made worse by the fact that loop unrolling will replicate that 3 times.
> > The result is a huge pile of GLSL IR with a few dozens of nested if/else
> > statements and temporaries that looks like [1] (that is only a fragment
> > of the GLSL IR).
> >
> > One thing that is particularly relevant in that code is that it has
> > multiple conditional assignments to the same variable
> > (dereference_array_value) as a consequence of this lowering pass.
> >
> > That much, however, is common to the NIR and non-NIR paths. The problem
> > in the NIR case is that all these assignments generate new SSA values,
> > which then become new registers in the final NIR form. This leads to NIR
> > code like [2].  In contrast, the old vec4 visitor path, is able to have
> > writes to the same variable write to the same register.
> >
> > As a result, if I print the code right before register allocation in the
> > NIR path [3] and I compare that to what we get with the old vec4 visitor
> > path at that same point [4], it is clearly visible that this difference
> > is allowing the vec4 visitor path to reduce register pressure (see how
> > in [4] we have multiple writes to vgrf5, while in [3] we always write to
> > a new vgrf every time).
> >
> > So, am I missing something or is this kind of result expected with NIR
> > programs? Is there anything in the nir->vec4 pass that we can do to fix
> > this or does this need to be fixed when going out of SSA moe inside NIR?
> >
> > Iago
> >
> > [1] http://pastebin.com/5uA8ex2S
> > [2] http://pastebin.com/pqLfvAVN
> > [3] http://pastebin.com/64nSuUH8
> > [4] http://pastebin.com/WCrdYxzt
> >
> > _______________________________________________
> > mesa-dev mailing list
> > mesa-dev at lists.freedesktop.org
> > http://lists.freedesktop.org/mailman/listinfo/mesa-dev
> 
> Hi Iago,
> 
> Indeed, NIR does convert conditional writes to conditional selectss --
> it's a required part of the conversion to SSA, and since our HW has a
> conditional select instruction that's just as fast as doing a
> conditional move, we haven't bothered much to try and change it back
> during out-of-SSA. However, doing this shouldn't make things worse. In
> your example, vgrf9, vgrf15, and vgrf17 all have very short live
> intervals and don't interfere with vgrf11 (unless there's another use
> of them somewhere after the snippet you pasted), which means that the
> register allocator is free to allocate the destinations of all the
> selects to the same register.
> 
> What's happening, though, is that you're running into our terrible
> liveness analysis. After doing the proper liveness analysis, we figure
> out the place each register first becomes live and last becomes dead,
> and then we consider registers that have overlapping ranges to
> interfere. So we consider vgrf11 to interfere with vgrf15 and vgrf17,
> even though it really doesn't. The trouble with making it do the right
> thing is that we may actually need to extend the live ranges of
> registers when the exec masks don't match up, either because one uses
> writemask_all or because they have incompatible exec masks due to
> containing different datatypes (half-float vs. float, etc.). For
> example, in your snippet, if vgrf15 were to be written with
> force_writemask_all, then it actually would interfere with vgrf11. But
> right now we're simply being conservative and assuming everything is
> incompatible, which produces extra register pressure in situations
> like the one you have. We have the same problem with FS, and I have
> some idea how to make it better there, but I'm not sure if we would
> care enough to port it to vec4.
> 
> So the long answer is that there are some unfortunate features of the
> backend that are making things much worse than they should be wrt
> register allocation, and separating things out like SSA does seems to
> make it worse in certain situations. The short answer is: go fix
> register spilling!
> 

Ack on go fix!!! I started trying to fix this, but quickly got in over my head.
In case it's of any use:

http://cgit.freedesktop.org/~bwidawsk/mesa/log/?h=vec4-more-spill2
http://cgit.freedesktop.org/~bwidawsk/mesa/log/?h=vec4-more-spill3