[Mesa-dev] [PATCH] i965/fs: Don't immediately schedule instructions that were just made available.

Fri Mar 29 16:07:04 PDT 2013

The original goal of pre-register-allocation scheduling was to reduce
live ranges so we'd use fewer registers and hopefully fit into 16-wide.
This patch drops that LIFO heuristic and picks instructions by
unblocked time, exactly as the post-register-allocation pass does.
In shader-db, this change causes us to lose 30 16-wide programs but
gain 29, so it's a toss-up. At least by choosing instructions in a
better order, all programs should be slightly faster.

On Haswell GLB2.5 C24Z16_DXT1 1600x900 non-composited:

x before-
+ after-
+--------------------------------------------------------------------------+
|                                                                        ++|
|  x x x                                                               + ++|
|xxxxx xx                                                             +++++|
| |__A_|                                                               |AM||
+--------------------------------------------------------------------------+
    N           Min           Max        Median           Avg        Stddev
x  10        8794.6       8825.44       8812.44       8811.01     10.288483
+  10       9110.87       9129.38       9124.95      9122.438       6.38743
Difference at 95.0% confidence
	311.428 +/- 8.04582
	3.53453% +/- 0.0913155%
	(Student's t, pooled s = 8.56306)
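
As a sanity check, the interval above can be reproduced from the two
summary rows alone. This is only a sketch: Sample, pooled_stddev and
ci_half_width are names made up for the illustration, and the critical
value 2.101 (two-sided 95%, 18 degrees of freedom) is taken from a
standard t table rather than computed.

```cpp
#include <cassert>
#include <cmath>

// Summary statistics of one run set (hypothetical helper type).
struct Sample {
    int n;
    double avg;
    double stddev;
};

// Pooled standard deviation of two independent samples.
double pooled_stddev(const Sample &a, const Sample &b)
{
    assert(a.n + b.n > 2);
    double num = (a.n - 1) * a.stddev * a.stddev +
                 (b.n - 1) * b.stddev * b.stddev;
    return std::sqrt(num / (a.n + b.n - 2));
}

// Half-width of the confidence interval on the difference of means.
// t_crit is the Student's t critical value for a.n + b.n - 2 degrees
// of freedom; the caller looks it up in a table.
double ci_half_width(const Sample &a, const Sample &b, double t_crit)
{
    return t_crit * pooled_stddev(a, b) * std::sqrt(1.0 / a.n + 1.0 / b.n);
}
```

Feeding in the before row (N=10, Avg=8811.01, Stddev=10.288483) and the
after row (N=10, Avg=9122.438, Stddev=6.38743) with t_crit = 2.101 gives
a pooled s of about 8.563 and a half-width of about 8.046 around the
311.428 difference.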

Consider the trivial case of

uniform vec3 a, b;
void main() { gl_FragColor = vec4(cross(a, b)); }

Before the patch we compile this to

mov.sat(8)  m4<1>F  0F
mul(8)      g3<1>F  g2.4<0,1,0>F  g2<0,1,0>F
mad.sat(8)  m3<1>F  -g3<4,1,1>F   g2.3<4,1,1>F.x  g2.1<4,1,1>F.x
mul(8)      g3<1>F  g2.3<0,1,0>F  g2.2<0,1,0>F
mad.sat(8)  m2<1>F  -g3<4,1,1>F   g2.5<4,1,1>F.x  g2<4,1,1>F.x
mul(8)      g3<1>F  g2.5<0,1,0>F  g2.1<0,1,0>F
mad.sat(8)  m1<1>F  -g3<4,1,1>F   g2.4<4,1,1>F.x  g2.2<4,1,1>F.x
sendc(8)    null    m1<8,8,1>F

where we stall on each mad.sat waiting for the mul to finish. The sendc
is issued at cycle 66. After the patch it compiles to

mul(8)      g3<1>F  g2.5<0,1,0>F  g2.1<0,1,0>F
mul(8)      g4<1>F  g2.3<0,1,0>F  g2.2<0,1,0>F
mul(8)      g5<1>F  g2.4<0,1,0>F  g2<0,1,0>F
mov.sat(8)  m4<1>F  0F
mad.sat(8)  m1<1>F  -g3<4,1,1>F   g2.4<4,1,1>F.x  g2.2<4,1,1>F.x
mad.sat(8)  m2<1>F  -g4<4,1,1>F   g2.5<4,1,1>F.x  g2<4,1,1>F.x
mad.sat(8)  m3<1>F  -g5<4,1,1>F   g2.3<4,1,1>F.x  g2.1<4,1,1>F.x
sendc(8)    null    m1<8,8,1>F

By hiding much of the latency, the sendc instruction is issued by cycle
32.
---
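
For illustration only, here is a toy model of the two selection
policies applied to the dependency graph of the cross() example. It is
not the driver code: Node, Policy, schedule and make_cross_dag are
hypothetical names, every ALU result is assumed to take 14 cycles, and
each instruction is assumed to cost one issue cycle. The multi-result
(texturing) exception of the old LIFO path is left out. The absolute
cycle counts therefore do not match the 66 and 32 quoted above; only
the shape of the two orderings does.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

struct Node {
    int latency;                // cycles until the result is usable (assumed)
    std::vector<int> children;  // indices of instructions that consume it
    int parents_left;           // dependencies not yet scheduled
    int unblocked_time;         // earliest cycle all inputs are ready
};

enum class Policy { Lifo, EarliestUnblocked };

// Returns the cycle at which the last instruction issues; optionally
// records the chosen order.
int schedule(std::vector<Node> nodes, Policy policy,
             std::vector<int> *order = nullptr)
{
    std::vector<int> ready;
    for (int i = 0; i < (int)nodes.size(); i++)
        if (nodes[i].parents_left == 0)
            ready.push_back(i);

    int time = 0, last_issue = 0;
    while (!ready.empty()) {
        size_t pick = 0;
        if (policy == Policy::Lifo) {
            pick = ready.size() - 1;  // most recently made available
        } else {
            // Smallest unblocked_time; strict '<' keeps the oldest on ties.
            for (size_t i = 1; i < ready.size(); i++)
                if (nodes[ready[i]].unblocked_time <
                    nodes[ready[pick]].unblocked_time)
                    pick = i;
        }
        int chosen = ready[pick];
        ready.erase(ready.begin() + pick);

        time = std::max(time, nodes[chosen].unblocked_time);  // stall if needed
        last_issue = time;
        if (order)
            order->push_back(chosen);

        for (int c : nodes[chosen].children) {
            nodes[c].unblocked_time = std::max(nodes[c].unblocked_time,
                                               time + nodes[chosen].latency);
            assert(nodes[c].parents_left > 0);
            if (--nodes[c].parents_left == 0)
                ready.push_back(c);
        }
        time += 1;  // assumed issue cost
    }
    return last_issue;
}

// 0-2: mul, 3-5: mad.sat (one per mul), 6: mov.sat, 7: sendc.
std::vector<Node> make_cross_dag()
{
    return {
        {14, {3}, 0, 0}, {14, {4}, 0, 0}, {14, {5}, 0, 0},
        {14, {7}, 1, 0}, {14, {7}, 1, 0}, {14, {7}, 1, 0},
        {14, {7}, 0, 0},
        {0, {}, 4, 0},
    };
}
```

Under these assumptions, Policy::Lifo picks mov, mul, mad, mul, mad,
mul, mad, sendc and issues the sendc at cycle 59, while
Policy::EarliestUnblocked picks the three muls and the mov first and
issues the sendc at cycle 30.
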
 .../dri/i965/brw_fs_schedule_instructions.cpp      | 46 ++++------------------
 1 file changed, 7 insertions(+), 39 deletions(-)

diff --git a/src/mesa/drivers/dri/i965/brw_fs_schedule_instructions.cpp b/src/mesa/drivers/dri/i965/brw_fs_schedule_instructions.cpp
index 997341b..4aeb738 100644
--- a/src/mesa/drivers/dri/i965/brw_fs_schedule_instructions.cpp
+++ b/src/mesa/drivers/dri/i965/brw_fs_schedule_instructions.cpp
@@ -725,48 +725,16 @@ instruction_scheduler::schedule_instructions(fs_inst *next_block_header)
       schedule_node *chosen = NULL;
       int chosen_time = 0;
 
-      if (post_reg_alloc) {
-         /* Of the instructions closest ready to execute or the closest to
-          * being ready, choose the oldest one.
-          */
-         foreach_list(node, &instructions) {
-            schedule_node *n = (schedule_node *)node;
-
-            if (!chosen || n->unblocked_time < chosen_time) {
-               chosen = n;
-               chosen_time = n->unblocked_time;
-            }
-         }
-      } else {
-         /* Before register allocation, we don't care about the latencies of
-          * instructions.  All we care about is reducing live intervals of
-          * variables so that we can avoid register spilling, or get 16-wide
-          * shaders which naturally do a better job of hiding instruction
-          * latency.
-          *
-          * To do so, schedule our instructions in a roughly LIFO/depth-first
-          * order: when new instructions become available as a result of
-          * scheduling something, choose those first so that our result
-          * hopefully is consumed quickly.
-          *
-          * The exception is messages that generate more than one result
-          * register (AKA texturing).  In those cases, the LIFO search would
-          * normally tend to choose them quickly (because scheduling the
-          * previous message not only unblocked the children using its result,
-          * but also the MRF setup for the next sampler message, which in turn
-          * unblocks the next sampler message).
-          */
-         for (schedule_node *node = (schedule_node *)instructions.get_tail();
-              node != instructions.get_head()->prev;
-              node = (schedule_node *)node->prev) {
-            schedule_node *n = (schedule_node *)node;
+      /* Of the instructions closest ready to execute or the closest to
+       * being ready, choose the oldest one.
+       */
+      foreach_list(node, &instructions) {
+         schedule_node *n = (schedule_node *)node;
 
+         if (!chosen || n->unblocked_time < chosen_time) {
             chosen = n;
-            if (chosen->inst->regs_written() <= 1)
-               break;
+            chosen_time = n->unblocked_time;
          }
-
-         chosen_time = chosen->unblocked_time;
       }
 
       /* Schedule this instruction. */
-- 
1.7.12.4