[Mesa-dev] [PATCH] ra: Disable round-robin strategy for optimistically colorable nodes.

Jason Ekstrand jason at jlekstrand.net
Mon Feb 16 12:30:47 PST 2015


On Mon, Feb 16, 2015 at 12:02 PM, Francisco Jerez <currojerez at riseup.net>
wrote:

> Jason Ekstrand <jason at jlekstrand.net> writes:
>
> > On Mon, Feb 16, 2015 at 10:40 AM, Francisco Jerez <currojerez at riseup.net>
> > wrote:
> >
> >> Jason Ekstrand <jason at jlekstrand.net> writes:
> >>
> >> > On Feb 16, 2015 9:34 AM, "Francisco Jerez" <currojerez at riseup.net>
> >> wrote:
> >> >>
> >> >> Jason Ekstrand <jason at jlekstrand.net> writes:
> >> >>
> >> >> > On Feb 16, 2015 8:35 AM, "Francisco Jerez" <currojerez at riseup.net>
> >> > wrote:
> >> >> >>
> >> >> >> The round-robin allocation strategy is expected to decrease the
> >> >> >> number of false dependencies created by the register allocator
> >> >> >> and give the post-RA scheduling pass more freedom to move
> >> >> >> instructions around.  On the other hand it has the disadvantage
> >> >> >> of increasing fragmentation and decreasing the number of
> >> >> >> equally-colored nearby nodes, which increases the likelihood of
> >> >> >> failure in the presence of optimistically colorable nodes.
> >> >> >>
> >> >> >> This patch disables the round-robin strategy for optimistically
> >> >> >> colorable nodes.  These typically arise in situations of high
> >> >> >> register pressure or for registers with large live intervals; in
> >> >> >> both cases the task of the instruction scheduler shouldn't be
> >> >> >> constrained excessively by the dense packing of those nodes, and
> >> >> >> a spill (or, on Intel hardware, a fall-back to SIMD8 mode) is
> >> >> >> invariably worse than a slightly less optimal schedule.
> >> >> >
> >> >> > Actually, that's not true.  Matt was doing some experiments
> >> >> > recently with a noise shader from synmark, and the difference
> >> >> > between our 2nd and 3rd choice schedulers is huge.  In that test
> >> >> > he disabled the third-choice scheduler and the result was a shader
> >> >> > that spilled 6 or 8 times but ran something like 30% faster.  We
> >> >> > really need to do some more experimentation with scheduling and
> >> >> > figure out better heuristics than "SIMD16 is always faster" and
> >> >> > "spilling is bad".
> >> >> >
> >> >>
> >> >> Yes, I'm aware of rare corner cases like that where e.g. SIMD16
> >> >> causes more cache thrashing than SIMD8, leading to decreased overall
> >> >> performance, and where a SIMD16 shader *with* spills has better
> >> >> performance than the SIMD8 version of the same shader without
> >> >> spills.
> >> >>
> >> >> In any case it's not the register allocator's business to implement
> >> >> such heuristics, and that's not an argument against the register
> >> >> allocator trying to make more efficient use of the register file.
> >> >
> >> > The primary point I was trying to make is that scheduling *does*
> >> > matter.  It matters a lot.  In fact, Matt and I have talked about
> >> > throwing away the SIMD16 program if it ends up using the pessimal
> >> > scheduling algorithm.  Throwing scheduling to the wind just to gain a
> >> > few SIMD16 programs is probably not a good trade-off.
> >> >
> >> In my experience the exact opposite observation has been far more
> >> common.  Running SIMD16 vs SIMD8 has a larger impact on performance than
> >> the way you end up scheduling things post-regalloc.  Actually even if
> >> you end up causing some unmet instruction dependencies by the way
> >> instructions are scheduled post-regalloc, the EU can context-switch to
> >> service the next available thread almost for free when a thread stalls
> >> on some dependency.  Also the fact that you're doing SIMD16 itself makes
> >> post-regalloc scheduling less important because it naturally helps to
> >> hide latency.
> >>
> >> My intuition is that the huge performance improvement Matt observed by
> >> disabling the third scheduling heuristic is more likely to have been
> >> caused by a decrease in cache thrashing due to his running fewer
> >> channels concurrently, rather than by the scheduling heuristic itself.
> >> Matt, did you rule out that possibility?
> >>
> >> The other thing is this patch has an effect on the allocation strategy
> >> for optimistically colorable nodes *only*.  We're already heavily
> >> constrained by register pressure when we get to that point, and assuming
> >> allocation succeeds the post-regalloc scheduler is going to have little
> >> room for maneuvering anyway.
> >>
> >> > It could be that this is a good idea, but it's going to take more
> >> > than hand-waved theories about register allocation and one shader not
> >> > spilling to convince me.  Do you actually know what it did to
> >> > scheduling?  It wouldn't be hard to hack up the driver and shader-db
> >> > to collect that information.
> >> >
> >> 44 shaders going SIMD16 seems like a strong enough argument to me.
> >> Could you be more precise about what additional information you want me
> >> to collect?
> >>
> >
> > How many shaders go from the first scheduling method to the second or to
> > the third.  In other words some sort of metric on which shaders are
> > "helped" or "hurt" in their scheduling.
>
> OK, I hacked the driver to report via KHR_debug which scheduling
> heuristic had been used when registers were successfully allocated for a
> program, and then ran shader-db before and after applying this patch.
>
> Before this patch:
> Heuristic       SIMD8   SIMD16
> PRE             18924   18598
> PRE_NON_LIFO    72      38
> PRE_LIFO        8       3
>
> After this patch:
> Heuristic       SIMD8   SIMD16
> PRE             18939   18643
> PRE_NON_LIFO    57      37
> PRE_LIFO        8       3
>
> So it actually *decreases* the number of shaders falling back to the
> latency-insensitive heuristics because the register allocator is more
> likely to succeed with the PRE heuristic.
>

Yeah, I guess it only affects post-allocation scheduling and accidental
dependencies (caused by two instructions happening to use the same
register).  Sorry for not thinking about that earlier.  In that case, I
have no problem with the patch but I can't say I've reviewed it either.
I'll try and take a look at it tomorrow and do a real review.
--Jason
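
(To illustrate the "accidental dependency" point, a rough hypothetical
sketch rather than actual i965 or Mesa code: once the allocator packs two
unrelated values into the same physical register, the post-RA scheduler
has to keep the instructions that use it in order.  The struct, helper and
register numbers below are invented for the example.)

/* Hypothetical illustration: two logically independent instructions
 * become order-constrained once the register allocator happens to assign
 * them the same physical register. */
#include <stdbool.h>
#include <stdio.h>

struct insn {
   int dst;                      /* physical register written */
   int src[2];                   /* physical registers read, -1 if unused */
};

/* Returns true if b has to stay ordered after a because of a RAW, WAR or
 * WAW hazard on a physical register. */
static bool
must_follow(const struct insn *a, const struct insn *b)
{
   for (int i = 0; i < 2; i++) {
      if (b->src[i] != -1 && b->src[i] == a->dst)
         return true;            /* read-after-write */
      if (a->src[i] != -1 && a->src[i] == b->dst)
         return true;            /* write-after-read */
   }
   return a->dst == b->dst;      /* write-after-write */
}

int
main(void)
{
   /* Unrelated computations; with round-robin they would likely land in
    * different registers and could be reordered freely, but here both
    * results were packed into r5, so the scheduler must keep them in
    * order. */
   struct insn a = { .dst = 5, .src = { 1,  2 } };
   struct insn b = { .dst = 5, .src = { 3, -1 } };

   printf("accidental dependency: %s\n", must_follow(&a, &b) ? "yes" : "no");
   return 0;
}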


> >
> >
> >> > --Jason
> >> >
> >> >> >> Shader-db results on the i965 driver:
> >> >> >>
> >> >> >> total instructions in shared programs: 5488539 -> 5488489 (-0.00%)
> >> >> >> instructions in affected programs:     1121 -> 1071 (-4.46%)
> >> >> >> helped:                                1
> >> >> >> HURT:                                  0
> >> >> >> GAINED:                                49
> >> >> >> LOST:                                  5
> >> >> >> ---
> >> >> >>  src/util/register_allocate.c | 22 +++++++++++++++++++++-
> >> >> >>  1 file changed, 21 insertions(+), 1 deletion(-)
> >> >> >>
> >> >> >> diff --git a/src/util/register_allocate.c b/src/util/register_allocate.c
> >> >> >> index af7a20c..d63d8eb 100644
> >> >> >> --- a/src/util/register_allocate.c
> >> >> >> +++ b/src/util/register_allocate.c
> >> >> >> @@ -168,6 +168,12 @@ struct ra_graph {
> >> >> >>
> >> >> >>     unsigned int *stack;
> >> >> >>     unsigned int stack_count;
> >> >> >> +
> >> >> >> +   /**
> >> >> >> +    * Tracks the start of the set of optimistically-colored registers in the
> >> >> >> +    * stack.
> >> >> >> +    */
> >> >> >> +   unsigned int stack_optimistic_start;
> >> >> >>  };
> >> >> >>
> >> >> >>  /**
> >> >> >> @@ -454,6 +460,7 @@ static void
> >> >> >>  ra_simplify(struct ra_graph *g)
> >> >> >>  {
> >> >> >>     bool progress = true;
> >> >> >> +   unsigned int stack_optimistic_start = ~0;
> >> >> >>     int i;
> >> >> >>
> >> >> >>     while (progress) {
> >> >> >> @@ -483,12 +490,16 @@ ra_simplify(struct ra_graph *g)
> >> >> >>
> >> >> >>        if (!progress && best_optimistic_node != ~0U) {
> >> >> >>          decrement_q(g, best_optimistic_node);
> >> >> >> +         stack_optimistic_start =
> >> >> >> +            MIN2(stack_optimistic_start, g->stack_count);
> >> >> >>          g->stack[g->stack_count] = best_optimistic_node;
> >> >> >>          g->stack_count++;
> >> >> >>          g->nodes[best_optimistic_node].in_stack = true;
> >> >> >>          progress = true;
> >> >> >>        }
> >> >> >>     }
> >> >> >> +
> >> >> >> +   g->stack_optimistic_start = stack_optimistic_start;
> >> >> >>  }
> >> >> >>
> >> >> >>  /**
> >> >> >> @@ -542,7 +553,16 @@ ra_select(struct ra_graph *g)
> >> >> >>        g->nodes[n].reg = r;
> >> >> >>        g->stack_count--;
> >> >> >>
> >> >> >> -      if (g->regs->round_robin)
> >> >> >> +      /* Rotate the starting point except for optimistically colorable nodes.
> >> >> >> +       * The likelihood that we will succeed at allocating optimistically
> >> >> >> +       * colorable nodes is highly dependent on the way that the previous
> >> >> >> +       * nodes popped off the stack are laid out.  The round-robin strategy
> >> >> >> +       * increases the fragmentation of the register file and decreases the
> >> >> >> +       * number of nearby nodes assigned to the same color, what increases the
> >> >> >> +       * likelihood of spilling with respect to the dense packing strategy.
> >> >> >> +       */
> >> >> >> +      if (g->regs->round_robin &&
> >> >> >> +          g->stack_count <= g->stack_optimistic_start)
> >> >> >>           start_search_reg = r + 1;
> >> >> >>     }
> >> >> >>
> >> >> >> --
> >> >> >> 2.1.3
> >> >> >>
> >>
>
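
(For reference, a minimal self-contained sketch of Briggs-style optimistic
coloring and of the selection-strategy tweak discussed above.  This is
illustrative C only, not the actual src/util/register_allocate.c
implementation; the node and register counts, the toy interference graph
and all identifiers below are invented.)

/* Illustrative simplify/select with optimistic coloring; not Mesa code. */
#include <stdbool.h>
#include <stdio.h>

#define N_NODES 4
#define N_REGS  2

static bool adj[N_NODES][N_NODES];   /* interference graph */
static bool in_stack[N_NODES];
static int  stack[N_NODES], stack_count;
static int  color[N_NODES];
static int  stack_optimistic_start = N_NODES;

static int
degree(int n)
{
   int d = 0;
   for (int i = 0; i < N_NODES; i++)
      if (i != n && !in_stack[i] && adj[n][i])
         d++;
   return d;
}

/* Push trivially colorable nodes (degree < N_REGS); when none remains,
 * optimistically push the lowest-degree node left and remember where the
 * optimistic part of the stack starts. */
static void
simplify(void)
{
   while (stack_count < N_NODES) {
      int best = -1;
      bool optimistic = true;
      for (int n = 0; n < N_NODES; n++) {
         if (in_stack[n])
            continue;
         if (degree(n) < N_REGS) {
            best = n;
            optimistic = false;
            break;
         }
         if (best == -1 || degree(n) < degree(best))
            best = n;
      }
      if (optimistic && stack_count < stack_optimistic_start)
         stack_optimistic_start = stack_count;
      stack[stack_count++] = best;
      in_stack[best] = true;
   }
}

/* Pop nodes and give each the first register no already-colored neighbour
 * uses.  Round-robin rotates the search start to spread values out, but
 * the rotation is suppressed while we are still in the optimistic part of
 * the stack so that those nodes see a densely packed register file. */
static bool
select_colors(bool round_robin)
{
   int start = 0;
   while (stack_count > 0) {
      int n = stack[--stack_count];
      int r = -1;
      for (int i = 0; i < N_REGS && r == -1; i++) {
         int c = (start + i) % N_REGS;
         bool available = true;
         for (int m = 0; m < N_NODES; m++)
            if (adj[n][m] && color[m] == c)
               available = false;
         if (available)
            r = c;
      }
      if (r == -1)
         return false;                /* would have to spill */
      color[n] = r;
      if (round_robin && stack_count <= stack_optimistic_start)
         start = r + 1;
   }
   return true;
}

int
main(void)
{
   /* A 4-cycle: with two registers no node is trivially colorable at
    * first, so node 0 gets pushed optimistically. */
   adj[0][1] = adj[1][0] = true;
   adj[0][2] = adj[2][0] = true;
   adj[3][1] = adj[1][3] = true;
   adj[3][2] = adj[2][3] = true;
   for (int i = 0; i < N_NODES; i++)
      color[i] = -1;

   simplify();
   if (select_colors(true))
      for (int i = 0; i < N_NODES; i++)
         printf("node %d -> r%d\n", i, color[i]);
   else
      printf("allocation failed, would spill\n");
   return 0;
}

In this toy run node 0 still finds a register because its already-colored
neighbours 1 and 2 ended up sharing one, which is exactly the situation
the dense-packing select makes more likely.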