[Mesa-dev] [PATCH] ra: Disable round-robin strategy for optimistically colorable nodes.
Francisco Jerez
currojerez at riseup.net
Mon Feb 16 12:02:48 PST 2015
Jason Ekstrand <jason at jlekstrand.net> writes:
> On Mon, Feb 16, 2015 at 10:40 AM, Francisco Jerez <currojerez at riseup.net>
> wrote:
>
>> Jason Ekstrand <jason at jlekstrand.net> writes:
>>
>> > On Feb 16, 2015 9:34 AM, "Francisco Jerez" <currojerez at riseup.net>
>> wrote:
>> >>
>> >> Jason Ekstrand <jason at jlekstrand.net> writes:
>> >>
>> >> > On Feb 16, 2015 8:35 AM, "Francisco Jerez" <currojerez at riseup.net>
>> > wrote:
>> >> >>
>> >> >> The round-robin allocation strategy is expected to decrease the
>> >> >> amount of false dependencies created by the register allocator and
>> >> >> give the post-RA scheduling pass more freedom to move instructions
>> >> >> around. On the other hand it has the disadvantage of increasing
>> >> >> fragmentation and decreasing the number of equally-colored nearby
>> >> >> nodes, which increases the likelihood of failure in the presence of
>> >> >> optimistically colorable nodes.
>> >> >>
>> >> >> This patch disables the round-robin strategy for optimistically
>> >> >> colorable nodes. These typically arise in situations of high register
>> >> >> pressure or for registers with large live intervals. In both cases
>> >> >> the task of the instruction scheduler shouldn't be constrained
>> >> >> excessively by the dense packing of those nodes, and a spill (or on
>> >> >> Intel hardware a fall-back to SIMD8 mode) is invariably worse than a
>> >> >> slightly less optimal schedule.
>> >> >
>> >> > Actually, that's not true. Matt was doing some experiments recently
>> >> > with a noise shader from synmark and the difference between our 2nd
>> >> > and 3rd choice schedulers is huge. In that test he disabled the third
>> >> > choice scheduler and the result was a shader that spilled 6 or 8 times
>> >> > but ran something like 30% faster. We really need to do some more
>> >> > experimentation with scheduling and figure out better heuristics than
>> >> > "SIMD16 is always faster" and "spilling is bad".
>> >> >
>> >>
>> >> Yes, I'm aware of rare corner cases like that where e.g. SIMD16 leads to
>> >> higher cache thrashing than SIMD8 leading to decreased overall
>> >> performance, and a case where a shader SIMD16 *with* spills has better
>> >> performance than the SIMD8 version of the same shader without spills.
>> >>
>> >> In any case it's not the register allocator's business to implement such
>> >> heuristics, and that's not an argument against the register allocator
>> >> trying to make a more efficient use of the register file.
>> >
>> > The primary point I was trying to make is that scheduling *does* matter.
>> > It matters a lot. In fact, Matt and I have talked about throwing away the
>> > SIMD16 program if it ends up using the pessimal scheduling algorithm.
>> > Throwing scheduling to the wind just to gain a few SIMD16 programs is
>> > probably not a good trade-off.
>> >
>> In my experience the exact opposite observation has been far more
>> common. Running SIMD16 vs SIMD8 has a larger impact on performance than
>> the way you end up scheduling things post-regalloc. Actually even if
>> you end up causing some unmet instruction dependencies by the way
>> instructions are scheduled post-regalloc, the EU can context-switch to
>> service the next available thread almost for free when a thread stalls
>> on some dependency. Also the fact that you're doing SIMD16 itself makes
>> post-regalloc scheduling less important because it naturally has an
>> effect in hiding latency.
>>
>> My intuition is that the huge performance improvement Matt observed by
>> disabling the third scheduling heuristic is more likely to have been
>> caused by a decrease in the amount of cache thrashing caused by the fact
>> that he was running fewer channels concurrently rather than by the
>> scheduling heuristic itself. Matt, did you rule out that possibility?
>>
>> The other thing is this patch has an effect on the allocation strategy
>> for optimistically colorable nodes *only*. We're already heavily
>> constrained by register pressure when we get to that point, and assuming
>> allocation succeeds the post-regalloc scheduler is going to have little
>> room for maneuvering anyway.
>>
>> > It could be that this is a good idea, but it's going to take more than
>> > hand-waved theories about register allocation and one shader not
>> > spilling to convince me. Do you actually know what it did to scheduling?
>> > It wouldn't be hard to hack up the driver and shader-db to collect that
>> > information.
>> >
>> 44 shaders going SIMD16 seems like a strong enough argument to me.
>> Could you be more precise about what additional information you want me
>> to collect?
>>
>
> How many shaders go from the first scheduling method to the second or to
> the third. In other words some sort of metric on which shaders are
> "helped" or "hurt" in their scheduling.

OK, I hacked the driver to report, via KHR_debug, which scheduling
heuristic was in use when register allocation succeeded for each program,
and then ran shader-db before and after applying this patch.

Before this patch:

Heuristic      SIMD8   SIMD16
PRE            18924    18598
PRE_NON_LIFO      72       38
PRE_LIFO           8        3

After this patch:

Heuristic      SIMD8   SIMD16
PRE            18939    18643
PRE_NON_LIFO      57       37
PRE_LIFO           8        3

So it actually *decreases* the number of shaders falling back to the
latency-insensitive heuristics because the register allocator is more
likely to succeed with the PRE heuristic.
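
To make the comparison easier to check, here is a small sketch that just sums the columns of the two tables above. The array and function names are made up for illustration (they are not part of the driver or of shader-db); only the counts come from the tables.

```c
/* Rows and columns of the tables above. */
enum { PRE, PRE_NON_LIFO, PRE_LIFO, NUM_HEURISTICS };
enum { SIMD8, SIMD16, NUM_WIDTHS };

/* Counts copied verbatim from the "before" and "after" tables. */
static const unsigned before_patch[NUM_HEURISTICS][NUM_WIDTHS] = {
   { 18924, 18598 },   /* PRE */
   {    72,    38 },   /* PRE_NON_LIFO */
   {     8,     3 },   /* PRE_LIFO */
};

static const unsigned after_patch[NUM_HEURISTICS][NUM_WIDTHS] = {
   { 18939, 18643 },   /* PRE */
   {    57,    37 },   /* PRE_NON_LIFO */
   {     8,     3 },   /* PRE_LIFO */
};

/* Programs that were register-allocated successfully at the given width,
 * regardless of which heuristic ended up being used. */
static unsigned
total_programs(const unsigned t[NUM_HEURISTICS][NUM_WIDTHS], int width)
{
   return t[PRE][width] + t[PRE_NON_LIFO][width] + t[PRE_LIFO][width];
}

/* Programs that only succeeded with one of the two fall-back heuristics. */
static unsigned
fallback_programs(const unsigned t[NUM_HEURISTICS][NUM_WIDTHS], int width)
{
   return t[PRE_NON_LIFO][width] + t[PRE_LIFO][width];
}
```

With these numbers the SIMD8 total stays at 19004 while the fall-back count goes from 80 to 65, and the SIMD16 total goes from 18639 to 18683 (the 44 additional SIMD16 programs) while the fall-back count goes from 41 to 40.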
>
>
>> > --Jason
>> >
>> >> >> Shader-db results on the i965 driver:
>> >> >>
>> >> >> total instructions in shared programs: 5488539 -> 5488489 (-0.00%)
>> >> >> instructions in affected programs: 1121 -> 1071 (-4.46%)
>> >> >> helped: 1
>> >> >> HURT: 0
>> >> >> GAINED: 49
>> >> >> LOST: 5
>> >> >> ---
>> >> >> src/util/register_allocate.c | 22 +++++++++++++++++++++-
>> >> >> 1 file changed, 21 insertions(+), 1 deletion(-)
>> >> >>
>> >> >> diff --git a/src/util/register_allocate.c b/src/util/register_allocate.c
>> >> >> index af7a20c..d63d8eb 100644
>> >> >> --- a/src/util/register_allocate.c
>> >> >> +++ b/src/util/register_allocate.c
>> >> >> @@ -168,6 +168,12 @@ struct ra_graph {
>> >> >>
>> >> >> unsigned int *stack;
>> >> >> unsigned int stack_count;
>> >> >> +
>> >> >> + /**
>> >> >> + * Tracks the start of the set of optimistically-colored registers in the
>> >> >> + * stack.
>> >> >> + */
>> >> >> + unsigned int stack_optimistic_start;
>> >> >> };
>> >> >>
>> >> >> /**
>> >> >> @@ -454,6 +460,7 @@ static void
>> >> >> ra_simplify(struct ra_graph *g)
>> >> >> {
>> >> >> bool progress = true;
>> >> >> + unsigned int stack_optimistic_start = ~0;
>> >> >> int i;
>> >> >>
>> >> >> while (progress) {
>> >> >> @@ -483,12 +490,16 @@ ra_simplify(struct ra_graph *g)
>> >> >>
>> >> >> if (!progress && best_optimistic_node != ~0U) {
>> >> >> decrement_q(g, best_optimistic_node);
>> >> >> + stack_optimistic_start =
>> >> >> + MIN2(stack_optimistic_start, g->stack_count);
>> >> >> g->stack[g->stack_count] = best_optimistic_node;
>> >> >> g->stack_count++;
>> >> >> g->nodes[best_optimistic_node].in_stack = true;
>> >> >> progress = true;
>> >> >> }
>> >> >> }
>> >> >> +
>> >> >> + g->stack_optimistic_start = stack_optimistic_start;
>> >> >> }
>> >> >>
>> >> >> /**
>> >> >> @@ -542,7 +553,16 @@ ra_select(struct ra_graph *g)
>> >> >> g->nodes[n].reg = r;
>> >> >> g->stack_count--;
>> >> >>
>> >> >> - if (g->regs->round_robin)
>> >> >> + /* Rotate the starting point except for optimistically colorable nodes.
>> >> >> + * The likelihood that we will succeed at allocating optimistically
>> >> >> + * colorable nodes is highly dependent on the way that the previous
>> >> >> + * nodes popped off the stack are laid out. The round-robin strategy
>> >> >> + * increases the fragmentation of the register file and decreases the
>> >> >> + * number of nearby nodes assigned to the same color, what increases the
>> >> >> + * likelihood of spilling with respect to the dense packing strategy.
>> >> >> + */
>> >> >> + if (g->regs->round_robin &&
>> >> >> + g->stack_count <= g->stack_optimistic_start)
>> >> >> start_search_reg = r + 1;
>> >> >> }
>> >> >>
>> >> >> --
>> >> >> 2.1.3
>> >> >>
>> >> >> _______________________________________________
>> >> >> mesa-dev mailing list
>> >> >> mesa-dev at lists.freedesktop.org
>> >> >> http://lists.freedesktop.org/mailman/listinfo/mesa-dev
>>
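
For anyone who wants to see the effect of the quoted change in isolation, here is a minimal toy sketch of the selection policy. It is not the actual ra_select() code: the struct, the function name, the fixed NUM_REGS and the flat busy[] array are all assumptions made for the example. The only part taken from the patch is the condition that decides whether the search start gets rotated.

```c
#include <stdbool.h>

#define NUM_REGS 8   /* hypothetical register file size for the toy */

/* Hypothetical stand-in for the bits of allocator state the policy needs. */
struct toy_select {
   bool round_robin;
   unsigned stack_optimistic_start;  /* lowest stack slot holding an
                                      * optimistically pushed node */
   unsigned start_search_reg;        /* where the next search begins */
};

/* Pick the first register, starting from start_search_reg and wrapping
 * around, that is not marked busy.  'stack_index' is the slot the node
 * being colored occupied in the simplify stack.  Returns NUM_REGS if
 * every register is busy. */
static unsigned
toy_pick_reg(struct toy_select *s, const bool busy[NUM_REGS],
             unsigned stack_index)
{
   for (unsigned i = 0; i < NUM_REGS; i++) {
      unsigned r = (s->start_search_reg + i) % NUM_REGS;

      if (busy[r])
         continue;

      /* Same test as in the patch: rotate the starting point only once
       * we are at or below the start of the optimistic part of the
       * stack.  Nodes above it keep searching from the same place, so
       * they get packed densely. */
      if (s->round_robin && stack_index <= s->stack_optimistic_start)
         s->start_search_reg = r + 1;

      return r;
   }

   return NUM_REGS;
}
```

Since nodes are popped from the top of the stack, the optimistically pushed ones are colored first with the search start left untouched, and round-robin rotation only kicks in for the nodes pushed before the first optimistic one.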