[Mesa-dev] [PATCH] i965/fs: In the pre-regalloc schedule, try harder at reducing reg pressure.
mattst88 at gmail.com
Wed Oct 16 21:24:53 CEST 2013
On Mon, Oct 14, 2013 at 4:14 PM, Eric Anholt <eric at anholt.net> wrote:
> Previously, the best thing we had was to schedule the things unblocked by
> the current instruction, on the hope that it would be consuming two values
> at the end of their live intervals while only producing one new value.
> Sometimes that wasn't the case.
> Now, when an instruction is the first user of a GRF we schedule (i.e. it
> will probably be the virtual_grf_def instruction after computing live
> intervals again), penalize it by how many regs it would take up. When an
> instruction is the last user of a GRF we have to schedule (when it will
> probably be the virtual_grf_end instruction), give it a boost by how
> many regs it would free.
> The new functions are made virtual (only 1 of 2 really needs to be
> virtual) because I expect we'll soon lift the pre-regalloc scheduling
> heuristic over to the vec4 backend.
> total instructions in shared programs: 1512756 -> 1511604 (-0.08%)
> instructions in affected programs: 10292 -> 9140 (-11.19%)
> GAINED: 121
> LOST: 38
> Improves tropics performance at my current settings by 4.50602% +/-
> 2.60694% (n=5). No difference on Lightsmark (n=5). No difference on
> GLB2.7 (n=11).
> Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=70445
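The heuristic Eric describes above can be sketched roughly as follows. This is purely illustrative, not the actual scheduler code; `grf_ref` and `pressure_benefit` are hypothetical names:

```cpp
#include <cassert>
#include <vector>

// Hypothetical sketch of the pressure heuristic from the patch description:
// penalize an instruction that is the first user of a GRF (it will likely
// become the virtual_grf_def), and boost one that is the last user (likely
// the virtual_grf_end).
struct grf_ref {
   int reg;        // virtual GRF number
   int size;       // registers the GRF occupies
   bool first_use; // scheduling this would start the GRF's live interval
   bool last_use;  // scheduling this would end the GRF's live interval
};

// Net register-pressure benefit of scheduling a candidate instruction now:
// freeing registers is good (positive), newly occupying them is bad.
int pressure_benefit(const std::vector<grf_ref> &refs)
{
   int benefit = 0;
   for (const grf_ref &r : refs) {
      if (r.first_use)
         benefit -= r.size; // penalize by how many regs it would take up
      if (r.last_use)
         benefit += r.size; // boost by how many regs it would free
   }
   return benefit;
}
```

A candidate that only starts live intervals gets a negative score, while one that ends more than it starts gets a positive one.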
I think we're on the right track by considering register pressure when
scheduling, but one aspect we're not considering is simply how many
registers we think we're using.
If I understand correctly, the pre-register-allocation scheduler wants to
shorten live intervals as much as possible, which reduces register
pressure but at the cost of larger stalls and less instruction-level
parallelism. We end up scheduling things like
produce result 4
produce result 3
produce result 2
produce result 1
use result 1
use result 2
use result 3
use result 4
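To see the pressure cost of that nested ordering concretely, here is a toy count of simultaneously live values (a hypothetical `peak_pressure` helper, assuming each result occupies one register and is live from its def to its use):

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <set>
#include <utility>
#include <vector>

// Toy model of a schedule: each event is (register, is_def). A def makes
// the register live, a use kills it. Peak pressure is the maximum number
// of simultaneously live registers. Illustrative only.
int peak_pressure(const std::vector<std::pair<int, bool>> &events)
{
   std::set<int> live;
   std::size_t peak = 0;
   for (const auto &e : events) {
      if (e.second)
         live.insert(e.first);
      else
         live.erase(e.first);
      peak = std::max(peak, live.size());
   }
   return (int) peak;
}
```

In the nested schedule above, all four results are live at once before the first use; a fully interleaved produce/use schedule would hold only one live at a time.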
(this is why the MRF writes for the FB write are always done in the …)
Take the main shader from FillTestC24Z16 in GLB2.5 or 2.7 as an
example. Before texture-grf we serialized the eight texture sends.
After that branch landed, we scheduled the code much better, leading
to a performance improvement.
This patch causes us again to serialize the 8 texture ops in
GLB25_FillTestC24Z16, like we did before texture-from-grf. It reduces
performance from 7.0 billion texels/sec to ~6.5 on IVB.
The shader in question is structured, prior to scheduling, as

  16 PLNs to interpolate the texture coordinates
     - 10 registers consumed, 16 results produced
  8 texture sends
     - 16 registers consumed, 32 results produced
  28 ADDs to sum the texture results into gl_FragColor
     - 32 registers consumed, 4 results produced
  1 FB write
     - 4 registers consumed
Even doubling these numbers for SIMD16, we don't spill. There's no need
to reduce live ranges, and therefore ILP, for this shader.
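A back-of-the-envelope check of that claim, assuming the worst case is the values of two adjacent stages coexisting and IVB's 128-register GRF file (`simd8_peak` is an illustrative helper, not real code):

```cpp
#include <algorithm>
#include <cassert>

// Rough SIMD8 peak pressure for the shader above: an upper bound where the
// outputs of one stage are all live alongside the next stage's outputs.
// Numbers are the per-stage figures quoted in the email.
int simd8_peak()
{
   int pln_out = 16; // interpolated texture coordinates
   int tex_out = 32; // results of the 8 texture sends
   int add_out = 4;  // summed gl_FragColor
   // Coordinates still live while all texture results land, or results
   // still live while the ADD chain produces the final color.
   return std::max(pln_out + tex_out, tex_out + add_out);
}
```

That bounds SIMD8 pressure at 48 registers, so even doubled for SIMD16 it stays under the 128-register file.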
Can we accurately track the number of registers in use and decide what
to do based on that?