[Mesa-dev] [PATCH] i965/fs: In the pre-regalloc schedule, try harder at reducing reg pressure.

Mon Oct 21 23:40:50 CEST 2013

Matt Turner <mattst88 at gmail.com> writes:

> On Mon, Oct 14, 2013 at 4:14 PM, Eric Anholt <eric at anholt.net> wrote:
>> Previously, the best thing we had was to schedule the things unblocked by
>> the current instruction, on the hope that it would be consuming two values
>> at the end of their live intervals while only producing one new value.
>> Sometimes that wasn't the case.
>>
>> Now, when an instruction is the first user of a GRF we schedule (i.e. it
>> will probably be the virtual_grf_def[] instruction after computing live
>> intervals again), penalize it by how many regs it would take up.  When an
>> instruction is the last user of a GRF we have to schedule (when it will
>> probably be the virtual_grf_end[] instruction), give it a boost by how
>> many regs it would free.
>>
>> The new functions are made virtual (only 1 of 2 really needs to be
>> virtual) because I expect we'll soon lift the pre-regalloc scheduling
>> heuristic over to the vec4 backend.
>>
>> shader-db:
>> total instructions in shared programs: 1512756 -> 1511604 (-0.08%)
>> instructions in affected programs:     10292 -> 9140 (-11.19%)
>> GAINED:                                121
>> LOST:                                  38
>>
>> Improves tropics performance at my current settings by 4.50602% +/-
>> 2.60694% (n=5).  No difference on Lightsmark (n=5).  No difference on
>> GLB2.7 (n=11).
>>
>> Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=70445
>> ---
>
> I think we're on the right track by considering register pressure when
> scheduling, but one aspect we're not considering is simply how many
> registers we think we're using.
>
> If I understand correctly, the pre-register-allocation scheduler wants
> to shorten live intervals as much as possible, which reduces register
> pressure but at the cost of larger stalls and less instruction-level
> parallelism. We end up scheduling things like
>
> produce result 4
> produce result 3
> produce result 2
> produce result 1
> use result 1
> use result 2
> use result 3
> use result 4
>
> (this is why the MRF writes for the FB write are always done in the
> reverse order)
>
> Take the main shader from FillTestC24Z16 in GLB2.5 or 2.7 as an
> example. Before texture-grf we serialized the eight texture sends.
> After that branch landed, we scheduled the code much better, leading
> to a performance improvement.
>
> This patch causes us again to serialize the 8 texture ops in
> GLB25_FillTestC24Z16, like we did before texture-from-grf. It reduces
> performance from 7.0 billion texels/sec to ~6.5 on IVB.

This is mostly a problem, as far as I can see, of unfortunate GRF
choices between the send sources and dests.  I haven't seen an easy way
out of that beyond what we're doing with the round_robin flag in the
register allocator already, so let's play with scheduling some more for
the moment...
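
To put a number on the trade-off in the quoted example, here is a
minimal sketch (not Mesa code) that counts the peak number of live
values for a given issue order. The peak_live() helper and the
("def"/"use", value) encoding are made up for illustration, and every
value is assumed to occupy exactly one register and to die at its last
use.

```python
def peak_live(schedule):
    """Return the peak number of simultaneously live values.

    schedule is a list of ("def" | "use", value) pairs in issue order.
    Assumption: each value takes one register and dies at its last use.
    """
    last_use = {v: i for i, (kind, v) in enumerate(schedule) if kind == "use"}
    live, peak = set(), 0
    for i, (kind, v) in enumerate(schedule):
        if kind == "def":
            live.add(v)
        peak = max(peak, len(live))
        if kind == "use" and last_use[v] == i:
            live.discard(v)  # last use: the register is free again
    return peak


# The order from the quoted example: all producers first, then all users.
batched = [("def", 4), ("def", 3), ("def", 2), ("def", 1),
           ("use", 1), ("use", 2), ("use", 3), ("use", 4)]

# The opposite extreme: each result is consumed right after it is produced.
interleaved = [("def", 1), ("use", 1), ("def", 2), ("use", 2),
               ("def", 3), ("use", 3), ("def", 4), ("use", 4)]

print(peak_live(batched), peak_live(interleaved))
```

The batched order keeps all four results live at once and hides the
producers' latency. The interleaved order needs only one register, but
every use waits on the producer issued immediately before it.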

> Can we accurately track the number of registers in use and decide what
> to do based on that?

An attempt to do this is on the betterthanlifo-3 branch of my tree.  The
quick results:

total instructions in shared programs: 1599565 -> 1599757 (0.01%)
instructions in affected programs:     2014 -> 2206 (9.53%)
GAINED:                                22
LOST:                                  110

That's not at all what I hoped for.  But maybe the problem is that we
end up faced with a ton of multiplies of components of texture results
and we don't know which one we should pick next once we've picked one of
them?  Maybe if we give a higher weight to things that will help finish
off a VGRF's use?  I present betterthanlifo-6:

anholt at eliezer:anholt/src/shader-db% ./report.py sched-lifo3 sched-lifo6 
total instructions in shared programs: 1606060 -> 1606060 (0.00%)
instructions in affected programs:     0 -> 0
GAINED:                                0
LOST:                                  0

Well, that wasn't the result I was expecting.  But it kinda makes sense:
Once we've scheduled processing of .x, the next thing we'll probably
choose even in the absence of weighting is .y, not some *other* texture
which had been inserted into the list at a totally separate time.
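
As a rough illustration of why the weighting changes nothing here, the
following is a minimal sketch of the first-user penalty and last-user
boost from the original patch description, with ties going to the most
recently added candidate. It is not the i965 scheduler code: Inst,
pressure_benefit(), pick(), and the register names and sizes are all
made up for the example, and whole registers are tracked rather than
individual components.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Inst:
    name: str
    dst: Optional[str] = None
    srcs: Tuple[str, ...] = ()

    def regs(self):
        """Distinct virtual registers this instruction reads or writes."""
        return set(self.srcs) | ({self.dst} if self.dst else set())


def pressure_benefit(inst, seen, refs_left, size):
    """Estimated registers freed (positive) or newly occupied (negative)."""
    benefit = 0
    for reg in inst.regs():
        if reg not in seen:
            benefit -= size[reg]  # first user: the register becomes live
        if refs_left[reg] == 1:
            benefit += size[reg]  # last user: the register can be freed
    return benefit


def pick(ready, seen, refs_left, size):
    """Highest benefit wins; ties go to the most recently added candidate."""
    best, best_score = None, None
    for inst in ready:  # ready is in insertion order, oldest first
        score = pressure_benefit(inst, seen, refs_left, size)
        if best is None or score >= best_score:
            best, best_score = inst, score
    return best


# Hypothetical state: both texture sends and the multiply on tex0.x have
# already been scheduled, and only .x and .y of each result are used.
size = {"tex0": 4, "tex1": 4, "a1": 1, "b0": 1, "b1": 1}
seen = {"tex0", "tex1"}
refs_left = {"tex0": 1, "tex1": 2, "a1": 2, "b0": 2, "b1": 2}

mul_t0y = Inst("mul tex0.y", dst="a1", srcs=("tex0",))
mul_t1x = Inst("mul tex1.x", dst="b0", srcs=("tex1",))
mul_t1y = Inst("mul tex1.y", dst="b1", srcs=("tex1",))

# tex1's multiplies were added first, tex0's remaining multiply last.
ready = [mul_t1x, mul_t1y, mul_t0y]
print(pick(ready, seen, refs_left, size).name)
```

In this setup the multiply on tex0.y wins on weight because it finishes
off tex0, but it was also the most recently added candidate, so plain
LIFO order would have picked it anyway.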

Looking at performance going from betterthanlifo-2 to betterthanlifo-3:

GLB2.7: 1.39845% +/- 0.797931% (n=15/16)
lm: No difference (n=3)
minecraft: No difference (n=10)
tropics: -4.12118% +/- 2.48834% (n=4)
nexuiz: No difference (n=8)
openarena: -1.46747% +/- 1.08201% (n=110)

At this point I think I want to go forward with -2 (this patch) as
opposed to -3.

(Note: Results presented in this thread, after the original patch
posting, are on top of glsl-cse, trying to reduce the significance of
that one crazy Tropics shader that spawned all this flailing about in
register allocation).