[Mesa-dev] [PATCH] ra: Disable round-robin strategy for optimistically colorable nodes.

Mon Feb 16 12:02:48 PST 2015

Jason Ekstrand <jason at jlekstrand.net> writes:

> On Mon, Feb 16, 2015 at 10:40 AM, Francisco Jerez <currojerez at riseup.net>
> wrote:
>
>> Jason Ekstrand <jason at jlekstrand.net> writes:
>>
>> > On Feb 16, 2015 9:34 AM, "Francisco Jerez" <currojerez at riseup.net>
>> wrote:
>> >>
>> >> Jason Ekstrand <jason at jlekstrand.net> writes:
>> >>
>> >> > On Feb 16, 2015 8:35 AM, "Francisco Jerez" <currojerez at riseup.net>
>> > wrote:
>> >> >>
>> >> >> The round-robin allocation strategy is expected to decrease the
>> amount
>> >> >> of false dependencies created by the register allocator and give the
>> >> >> post-RA scheduling pass more freedom to move instructions around.  On
>> >> >> the other hand it has the disadvantage of increasing fragmentation
>> and
>> >> >> decreasing the number of equally-colored nearby nodes, what increases
>> >> >> the likelihood of failure in presence of optimistically colorable
>> >> >> nodes.
>> >> >>
>> >> >> This patch disables the round-robin strategy for optimistically
>> >> >> colorable nodes.  These typically arise in situations of high
>> register
>> >> >> pressure or for registers with large live intervals, in both cases
>> the
>> >> >> task of the instruction scheduler shouldn't be constrained
>> excessively
>> >> >> by the dense packing of those nodes, and a spill (or on Intel
>> hardware
>> >> >> a fall-back to SIMD8 mode) is invariably worse than a slightly less
>> >> >> optimal scheduling.
>> >> >
>> >> > Actually, that's not true.  Matt was doing some experiments recently
>> > with a
>> >> > noise shader from synmark and the difference between our 2nd and 3rd
>> > choice
>> >> > schedulers is huge.  In that test he disabled the third choice
>> scheduler
>> >> > and the result was a shader that spilled 6 or 8 times but ran
>> something
>> >> > like 30% faster.  We really need to do some more experimentation with
>> >> > scheduling and figure out better heuristics than "SIMD16 is always
>> > faster"
>> >> > and "spilling is bad".
>> >> >
>> >>
>> >> Yes, I'm aware of rare corner cases like that where e.g. SIMD16 leads to
>> >> higher cache thrashing than SIMD8 leading to decreased overall
>> >> performance, and a case where a shader SIMD16 *with* spills has better
>> >> performance than the SIMD8 version of the same shader without spills.
>> >>
>> >> In any case it's not the register allocator's business to implement such
>> >> heuristics, and that's not an argument against the register allocator
>> >> trying to make a more efficient use of the register file.
>> >
>> > The primary point I was trying to make is that scheduling *does* matter.
>> > It matters a lot.  In fact, Matt and i have talked about throwing away
>> the
>> > SIMD16 program if it ends up using the pessimal schedulong algorithm.
>> > Throwing scheduling to the wind just to gain a few SIMD16 programs is
>> > probably not a good trade-off.
>> >
>> In my experience the exact opposite observation has been far more
>> common.  Running SIMD16 vs SIMD8 has a larger impact on performance than
>> the way you end up scheduling things post-regalloc.  Actually even if
>> you end up causing some unmet instruction dependencies by the way
>> instructions are scheduled post-regalloc, the EU can context-switch to
>> service the next available thread almost for free when a thread stalls
>> on some dependency.  Also the fact that you're doing SIMD16 itself makes
>> post-regalloc scheduling less important because it naturally has an
>> effect in hiding latency.
>>
>> My intuition is that the huge performance improvement Matt observed by
>> disabling the third scheduling heuristic is more likely to have been
>> caused by a decrease in the amount of cache thrashing caused by the fact
>> that he was running less channels concurrently rather than by the
>> scheduling heuristic itself.  Matt, did you rule out that possibility?
>>
>> The other thing is this patch has an effect on the allocation strategy
>> for optimistically colorable nodes *only*.  We're already heavily
>> constrained by register pressure when we get to that point, and assuming
>> allocation succeeds the post-regalloc scheduler is going to have little
>> room for maneuvering anyway.
>>
>> > It could be that this is an good idea, but it's going to take more than
>> > hand-waved theories about register allocation one shader not spilling to
>> > convince me.  Do you actually know what it did to scheduling?  It
>> wouldn't
>> > be hard to hack up the driver and shader-db to collect that information.
>> >
>> 44 shaders going SIMD16 seems like a strong enough argument to me.
>> Could you be more precise about what additional information you want me
>> to collect?
>>
>
> How many shaders go from the first scheduling method to the second or to
> the third.  In other words some sort of metric on which shaders are
> "helped" or "hurt" in their scheduling.

OK, I hacked the driver to output the scheduling heuristic that had been
used when we allocated registers successfully for the program via
KHR_debug and then ran shader-db before and after applying this patch.

Before this patch:
Heuristic       SIMD8   SIMD16
PRE             18924   18598
PRE_NON_LIFO    72      38
PRE_LIFO        8       3

After this patch:
Heuristic       SIMD8   SIMD16
PRE             18939   18643
PRE_NON_LIFO    57      37
PRE_LIFO        8       3

So it actually *decreases* the number of shaders falling back to the
latency-insensitive heuristics because the register allocator is more
likely to succeed with the PRE heuristic.

>
>
>> > --Jason
>> >
>> >> >> Shader-db results on the i965 driver:
>> >> >>
>> >> >> total instructions in shared programs: 5488539 -> 5488489 (-0.00%)
>> >> >> instructions in affected programs:     1121 -> 1071 (-4.46%)
>> >> >> helped:                                1
>> >> >> HURT:                                  0
>> >> >> GAINED:                                49
>> >> >> LOST:                                  5
>> >> >> ---
>> >> >>  src/util/register_allocate.c | 22 +++++++++++++++++++++-
>> >> >>  1 file changed, 21 insertions(+), 1 deletion(-)
>> >> >>
>> >> >> diff --git a/src/util/register_allocate.c
>> > b/src/util/register_allocate.c
>> >> >> index af7a20c..d63d8eb 100644
>> >> >> --- a/src/util/register_allocate.c
>> >> >> +++ b/src/util/register_allocate.c
>> >> >> @@ -168,6 +168,12 @@ struct ra_graph {
>> >> >>
>> >> >>     unsigned int *stack;
>> >> >>     unsigned int stack_count;
>> >> >> +
>> >> >> +   /**
>> >> >> +    * Tracks the start of the set of optimistically-colored
>> registers
>> > in
>> >> > the
>> >> >> +    * stack.
>> >> >> +    */
>> >> >> +   unsigned int stack_optimistic_start;
>> >> >>  };
>> >> >>
>> >> >>  /**
>> >> >> @@ -454,6 +460,7 @@ static void
>> >> >>  ra_simplify(struct ra_graph *g)
>> >> >>  {
>> >> >>     bool progress = true;
>> >> >> +   unsigned int stack_optimistic_start = ~0;
>> >> >>     int i;
>> >> >>
>> >> >>     while (progress) {
>> >> >> @@ -483,12 +490,16 @@ ra_simplify(struct ra_graph *g)
>> >> >>
>> >> >>        if (!progress && best_optimistic_node != ~0U) {
>> >> >>          decrement_q(g, best_optimistic_node);
>> >> >> +         stack_optimistic_start =
>> >> >> +            MIN2(stack_optimistic_start, g->stack_count);
>> >> >>          g->stack[g->stack_count] = best_optimistic_node;
>> >> >>          g->stack_count++;
>> >> >>          g->nodes[best_optimistic_node].in_stack = true;
>> >> >>          progress = true;
>> >> >>        }
>> >> >>     }
>> >> >> +
>> >> >> +   g->stack_optimistic_start = stack_optimistic_start;
>> >> >>  }
>> >> >>
>> >> >>  /**
>> >> >> @@ -542,7 +553,16 @@ ra_select(struct ra_graph *g)
>> >> >>        g->nodes[n].reg = r;
>> >> >>        g->stack_count--;
>> >> >>
>> >> >> -      if (g->regs->round_robin)
>> >> >> +      /* Rotate the starting point except for optimistically
>> colorable
>> >> > nodes.
>> >> >> +       * The likelihood that we will succeed at allocating
>> > optimistically
>> >> >> +       * colorable nodes is highly dependent on the way that the
>> > previous
>> >> >> +       * nodes popped off the stack are laid out.  The round-robin
>> >> > strategy
>> >> >> +       * increases the fragmentation of the register file and
>> > decreases
>> >> > the
>> >> >> +       * number of nearby nodes assigned to the same color, what
>> >> > increases the
>> >> >> +       * likelihood of spilling with respect to the dense packing
>> >> > strategy.
>> >> >> +       */
>> >> >> +      if (g->regs->round_robin &&
>> >> >> +          g->stack_count <= g->stack_optimistic_start)
>> >> >>           start_search_reg = r + 1;
>> >> >>     }
>> >> >>
>> >> >> --
>> >> >> 2.1.3
>> >> >>
>> >> >> _______________________________________________
>> >> >> mesa-dev mailing list
>> >> >> mesa-dev at lists.freedesktop.org
>> >> >> http://lists.freedesktop.org/mailman/listinfo/mesa-dev
>>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 212 bytes
Desc: not available
URL: <http://lists.freedesktop.org/archives/mesa-dev/attachments/20150216/11f8a431/attachment.sig>