[Mesa-dev] [PATCH] i965/fs: In the pre-regalloc schedule, try harder at reducing reg pressure.

Tue Oct 22 02:31:47 CEST 2013

On Tue, Oct 22, 2013 at 3:05 AM, Eric Anholt <eric at anholt.net> wrote:
> Chia-I Wu <olvaffe at gmail.com> writes:
>
>> On Thu, Oct 17, 2013 at 3:24 AM, Matt Turner <mattst88 at gmail.com> wrote:
>>> On Mon, Oct 14, 2013 at 4:14 PM, Eric Anholt <eric at anholt.net> wrote:
>>>> Previously, the best thing we had was to schedule the things unblocked by
>>>> the current instruction, on the hope that it would be consuming two values
>>>> at the end of their live intervals while only producing one new value.
>>>> Sometimes that wasn't the case.
>>>>
>>>> Now, when an instruction is the first user of a GRF we schedule (i.e. it
>>>> will probably be the virtual_grf_def[] instruction after computing live
>>>> intervals again), penalize it by how many regs it would take up.  When an
>>>> instruction is the last user of a GRF we have to schedule (when it will
>>>> probably be the virtual_grf_end[] instruction), give it a boost by how
>>>> many regs it would free.
>>>>
>>>> The new functions are made virtual (only 1 of 2 really needs to be
>>>> virtual) because I expect we'll soon lift the pre-regalloc scheduling
>>>> heuristic over to the vec4 backend.
>>>>
>>>> shader-db:
>>>> total instructions in shared programs: 1512756 -> 1511604 (-0.08%)
>>>> instructions in affected programs:     10292 -> 9140 (-11.19%)
>>>> GAINED:                                121
>>>> LOST:                                  38
>>>>
>>>> Improves tropics performance at my current settings by 4.50602% +/-
>>>> 2.60694% (n=5).  No difference on Lightsmark (n=5).  No difference on
>>>> GLB2.7 (n=11).
>>>>
>>>> Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=70445
>>>> ---
>>>
>>> I think we're on the right track by considering register pressure when
>>> scheduling, but one aspect we're not considering is simply how many
>>> registers we think we're using.
>>>
>>> If I understand correctly, the pre-register-allocation scheduling wants
>>> to shorten live intervals as much as possible, which reduces register
>>> pressure but at the cost of larger stalls and less instruction-level
>>> parallelism. We end up scheduling things like
>>>
>>> produce result 4
>>> produce result 3
>>> produce result 2
>>> produce result 1
>>> use result 1
>>> use result 2
>>> use result 3
>>> use result 4
>>>
>>> (this is why the MRF writes for the FB write are always done in the
>>> reverse order)
>> In this example, it will actually be
>>
>>  produce result 4
>>  use result 4
>>  produce result 3
>>  use result 3
>>  produce result 2
>>  use result 2
>>  produce result 1
>>  use result 1
>>
>> and post-regalloc will schedule again to something like
>>
>>  produce result 4
>>  produce result 3
>>  produce result 2
>>  produce result 1
>>  use result 4
>>  use result 3
>>  use result 2
>>  use result 1
>>
>> The pre-regalloc scheduling attempts to consume the results as soon as
>> they are available.
>>
>> FB write is done in reverse order because, when a result is available,
>> its consumers are scheduled in reverse order.  The epilog of fragment
>> shaders is usually like this:
>>
>>  placeholder_halt
>>  mov m1, g1
>>  mov m2, g2
>>  mov m3, g3
>>  mov m4, g4
>>  send
>>
>> MOVs depend on placeholder_halt, and send depends on MOVs.  The
>> scheduler will schedule it as follows:
>>
>>  placeholder_halt
>>  mov m4, g4
>>  mov m3, g3
>>  mov m2, g2
>>  mov m1, g1
>>  send
>>
>> The order can be corrected with the change proposed here
>>
>>   http://lists.freedesktop.org/archives/mesa-dev/2013-October/046570.html
>>
>> But there is no point in making the change if the current heuristic for
>> pre-regalloc is to be reworked.
>
> Flipping the order in which we prefer ties (on betterthanlifo-2):
>
> commit 11a511576e465f02875f39c452561775a97416a1
> Author: Eric Anholt <eric at anholt.net>
> Date:   Mon Oct 21 11:45:53 2013 -0700
>
>     otherway
>
> diff --git a/src/mesa/drivers/dri/i965/brw_schedule_instructions.cpp b/src/mesa/
> index 9a480b4..b123015 100644
> --- a/src/mesa/drivers/dri/i965/brw_schedule_instructions.cpp
> +++ b/src/mesa/drivers/dri/i965/brw_schedule_instructions.cpp
> @@ -1049,9 +1049,9 @@ fs_instruction_scheduler::choose_instruction_to_schedule()
>         * it's the first use of a GRF, reduce its score since it means it
>         * should be increasing register pressure.
>         */
> -      for (schedule_node *node = (schedule_node *)instructions.get_tail();
> -           node != instructions.get_head()->prev;
> -           node = (schedule_node *)node->prev) {
> +      for (schedule_node *node = (schedule_node *)instructions.get_head();
> +           node != instructions.get_head()->next;
> +           node = (schedule_node *)node->next) {
>           schedule_node *n = (schedule_node *)node;
>           fs_inst *inst = (fs_inst *)n->inst;
>
> gives:
>
> total instructions in shared programs: 1544638 -> 1546794 (0.14%)
> instructions in affected programs:     7163 -> 9319 (30.10%)
> GAINED:                                16
> LOST:                                  289
>
> with massive spilling on tropics, and a bit on lightsmark and csgo.
Children of a schedule_node also need to be pushed to the head in reverse order:

  for (int i = chosen->child_count - 1; i >= 0; i--) {
    ...;
    if (child->parent_count == 0)
      instructions.push_head(child);
  }

so that when you loop from head, you still get LIFO.
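
To illustrate the ordering argument, here is a minimal standalone sketch.
It is not the scheduler code: ReadyList, the helper names and the integer
node ids are all made up for illustration, and std::deque stands in for
the exec_list of schedule_nodes.

```cpp
#include <deque>
#include <vector>

// Hypothetical stand-in for the scheduler's list of ready nodes.
// The real code keeps schedule_node objects in an exec_list.
using ReadyList = std::deque<int>;

// Existing behaviour: newly unblocked children are appended in order.
void push_tail_in_order(ReadyList &ready, const std::vector<int> &children) {
    for (int child : children)
        ready.push_back(child);
}

// Suggested behaviour: walk the children backwards and push each to the head.
void push_head_reversed(ReadyList &ready, const std::vector<int> &children) {
    for (int i = static_cast<int>(children.size()) - 1; i >= 0; i--)
        ready.push_front(children[i]);
}

// Order in which candidates are visited when looping from the tail.
std::vector<int> scan_from_tail(const ReadyList &ready) {
    return std::vector<int>(ready.rbegin(), ready.rend());
}

// Order in which candidates are visited when looping from the head.
std::vector<int> scan_from_head(const ReadyList &ready) {
    return std::vector<int>(ready.begin(), ready.end());
}
```

With an older batch {1, 2} and a newer batch {10, 11, 12}, push-to-tail
plus a tail scan visits 12 11 10 2 1.  Only flipping the loop visits
1 2 10 11 12, i.e. the oldest nodes first, which is no longer LIFO.
Pushing each batch to the head in reverse and scanning from the head
visits 10 11 12 1 2: the newest batch still comes first, and only the
order within a batch is flipped.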

-- 
olv at LunarG.com