[Mesa-dev] [PATCH] i965/fs: In the pre-regalloc schedule, try harder at reducing reg pressure.

Chia-I Wu olvaffe at gmail.com
Fri Oct 18 09:24:22 CEST 2013


On Thu, Oct 17, 2013 at 3:24 AM, Matt Turner <mattst88 at gmail.com> wrote:
> On Mon, Oct 14, 2013 at 4:14 PM, Eric Anholt <eric at anholt.net> wrote:
>> Previously, the best thing we had was to schedule the things unblocked by
>> the current instruction, on the hope that it would be consuming two values
>> at the end of their live intervals while only producing one new value.
>> Sometimes that wasn't the case.
>>
>> Now, when an instruction is the first user of a GRF we schedule (i.e. it
>> will probably be the virtual_grf_def[] instruction after computing live
>> intervals again), penalize it by how many regs it would take up.  When an
>> instruction is the last user of a GRF we have to schedule (when it will
>> probably be the virtual_grf_end[] instruction), give it a boost by how
>> many regs it would free.
>>
>> The new functions are made virtual (only 1 of 2 really needs to be
>> virtual) because I expect we'll soon lift the pre-regalloc scheduling
>> heuristic over to the vec4 backend.
>>
>> shader-db:
>> total instructions in shared programs: 1512756 -> 1511604 (-0.08%)
>> instructions in affected programs:     10292 -> 9140 (-11.19%)
>> GAINED:                                121
>> LOST:                                  38
>>
>> Improves tropics performance at my current settings by 4.50602% +/-
>> 2.60694% (n=5).  No difference on Lightsmark (n=5).  No difference on
>> GLB2.7 (n=11).
>>
>> Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=70445
>> ---
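(For reference, the penalize-first-user / boost-last-user heuristic described
above could be sketched roughly like this.  This is just an illustration of
the idea, not the actual patch; the names are made up.)

```python
# Toy sketch (hypothetical, not the i965 scheduler): score a ready
# instruction by the net register-pressure change scheduling it would cause.

def pressure_benefit(insn, first_user_of, last_user_of, grf_size):
    """Return net GRFs freed if `insn` is scheduled now (negative = cost)."""
    benefit = 0
    for grf in first_user_of.get(insn, []):   # insn starts this live range
        benefit -= grf_size[grf]              # penalize: new value goes live
    for grf in last_user_of.get(insn, []):    # insn ends this live range
        benefit += grf_size[grf]              # boost: this value dies here
    return benefit

grf_size = {"a": 2, "b": 1, "c": 1}           # size of each virtual GRF
first_user_of = {"i0": ["a"]}                 # i0 makes "a" (2 regs) live
last_user_of = {"i1": ["b", "c"]}             # i1 kills "b" and "c"
print(pressure_benefit("i0", first_user_of, last_user_of, grf_size))  # -2
print(pressure_benefit("i1", first_user_of, last_user_of, grf_size))  # 2
```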
>
> I think we're on the right track by considering register pressure when
> scheduling, but one aspect we're not considering is simply how many
> registers we think we're using.
>
> If I understand correctly, the pre-register-allocation scheduler wants
> to shorten live intervals as much as possible, which reduces register
> pressure but at the cost of larger stalls and less instruction-level
> parallelism. We end up scheduling things like
>
> produce result 4
> produce result 3
> produce result 2
> produce result 1
> use result 1
> use result 2
> use result 3
> use result 4
>
> (this is why the MRF writes for the FB write are always done in the
> reverse order)
In this example, it will actually be

 produce result 4
 use result 4
 produce result 3
 use result 3
 produce result 2
 use result 2
 produce result 1
 use result 1

and post-regalloc will schedule again to something like

 produce result 4
 produce result 3
 produce result 2
 produce result 1
 use result 4
 use result 3
 use result 2
 use result 1

The pre-regalloc scheduling attempts to consume the results as soon as
they are available.
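A quick way to see why the interleaved order is what the pre-regalloc pass
wants: it keeps far fewer values live at once.  A rough sketch (hypothetical
helper, not Mesa code) that computes peak pressure for both orders:

```python
# Each instruction is modeled as (defs, uses) on virtual register names;
# a value is assumed to die at its last use in the given order.

def peak_pressure(schedule):
    """Peak number of simultaneously live values for this order."""
    last_use = {}
    for i, (defs, uses) in enumerate(schedule):
        for v in uses:
            last_use[v] = i
    live, peak = set(), 0
    for i, (defs, uses) in enumerate(schedule):
        live |= set(defs)
        peak = max(peak, len(live))
        live -= {v for v in uses if last_use[v] == i}
    return peak

# pre-regalloc order: consume each result as soon as it is available
interleaved = [({"r4"}, set()), (set(), {"r4"}),
               ({"r3"}, set()), (set(), {"r3"}),
               ({"r2"}, set()), (set(), {"r2"}),
               ({"r1"}, set()), (set(), {"r1"})]

# post-regalloc order: all produces first, then all uses (more ILP)
batched = [({"r4"}, set()), ({"r3"}, set()),
           ({"r2"}, set()), ({"r1"}, set()),
           (set(), {"r4"}), (set(), {"r3"}),
           (set(), {"r2"}), (set(), {"r1"})]

print(peak_pressure(interleaved))  # 1
print(peak_pressure(batched))      # 4
```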

FB write is done in reverse order because, when a result is available,
its consumers are scheduled in reverse order.  The epilog of fragment
shaders is usually like this:

 placeholder_halt
 mov m1, g1
 mov m2, g2
 mov m3, g3
 mov m4, g4
 send

The MOVs depend on placeholder_halt, and the send depends on the MOVs.
The scheduler will schedule it as follows:

 placeholder_halt
 mov m4, g4
 mov m3, g3
 mov m2, g2
 mov m1, g1
 send
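The reversal can be modeled with a toy dependency scheduler whose ready list
behaves like a stack (just an illustration of the behavior, not the real
fs scheduler code):

```python
# Toy model: instructions whose dependencies are all satisfied are appended
# to the ready list, and the scheduler pops from the back, so consumers that
# become ready together come back out in reverse program order.

def schedule(deps, order):
    """deps: insn -> set of insns it depends on; order: program order."""
    remaining = {i: set(d) for i, d in deps.items()}
    ready = [i for i in order if not remaining[i]]
    out = []
    while ready:
        insn = ready.pop()            # LIFO pick among ready instructions
        out.append(insn)
        for succ in order:            # unblock successors in program order
            if insn in remaining[succ]:
                remaining[succ].discard(insn)
                if not remaining[succ]:
                    ready.append(succ)
    return out

order = ["halt", "mov m1", "mov m2", "mov m3", "mov m4", "send"]
deps = {"halt": set(),
        "mov m1": {"halt"}, "mov m2": {"halt"},
        "mov m3": {"halt"}, "mov m4": {"halt"},
        "send": {"mov m1", "mov m2", "mov m3", "mov m4"}}
print(schedule(deps, order))
# halt first, then the MOVs in reverse order, then send
```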

The order can be corrected with the change proposed here:

  http://lists.freedesktop.org/archives/mesa-dev/2013-October/046570.html

But there is no point in making that change if the current pre-regalloc
heuristic is going to be reworked.

>
> Take the main shader from FillTestC24Z16 in GLB2.5 or 2.7 as an
> example. Before texture-grf we serialized the eight texture sends.
> After that branch landed, we scheduled the code much better, leading
> to a performance improvement.
>
> This patch causes us again to serialize the 8 texture ops in
> GLB25_FillTestC24Z16, like we did before texture-from-grf. It reduces
> performance from 7.0 billion texels/sec to ~6.5 on IVB.
>
> The shader in question is structured, prior to scheduling as
>
> 16 PLNs to interpolate the texture coordinates
>  - 10 registers consumed, 16 results produced
> 8 TEX
>  - 16 registers consumed, 32 results produced
> 28 ADDs to sum the texture results into gl_FragColor.
>  - 32 registers consumed, 4 results produced
> FB write.
>  - 4 registers consumed
>
> Even doubling these numbers for SIMD16 we don't spill. There's no need
> to reduce live ranges and therefore ILP for this shader.
>
> Can we accurately track the number of registers in use and decide what
> to do based on that?
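Matt's suggestion could be sketched roughly like this (hypothetical code,
not anything in Mesa; MAX_GRF and the threshold are made-up illustrative
values): keep a running estimate of live registers while scheduling and only
switch to the live-range-shortening heuristic when the estimate approaches
the limit.

```python
MAX_GRF = 128                 # per-thread GRF count, for illustration
PRESSURE_THRESHOLD = 0.75     # made-up cutoff for this sketch

def update_pressure(live_regs, regs_defined, regs_freed):
    """New pressure estimate after scheduling one instruction."""
    return live_regs + regs_defined - regs_freed

def choose_heuristic(live_regs):
    """Pick a scheduling mode from the current pressure estimate."""
    if live_regs > MAX_GRF * PRESSURE_THRESHOLD:
        return "minimize-pressure"    # shorten live ranges, sacrifice ILP
    return "maximize-ilp"             # shader fits comfortably; keep ILP

pressure = update_pressure(18, regs_defined=4, regs_freed=2)
print(choose_heuristic(pressure))      # maximize-ilp
print(choose_heuristic(110))           # minimize-pressure
```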
> _______________________________________________
> mesa-dev mailing list
> mesa-dev at lists.freedesktop.org
> http://lists.freedesktop.org/mailman/listinfo/mesa-dev



-- 
olv at LunarG.com

