<div dir="ltr"> <div class="gmail_extra"> <div class="gmail_quote">On Mon, Feb 16, 2015 at 12:02 PM, Francisco Jerez <<a href="mailto:currojerez@riseup.net" target="_blank">currojerez@riseup.net</a>> wrote: <blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div class="HOEnZb"><div class="h5">Jason Ekstrand <<a href="mailto:jason@jlekstrand.net">jason@jlekstrand.net</a>> writes: > On Mon, Feb 16, 2015 at 10:40 AM, Francisco Jerez <<a href="mailto:currojerez@riseup.net">currojerez@riseup.net</a>> > wrote: > >> Jason Ekstrand <<a href="mailto:jason@jlekstrand.net">jason@jlekstrand.net</a>> writes: >> >> > On Feb 16, 2015 9:34 AM, "Francisco Jerez" <<a href="mailto:currojerez@riseup.net">currojerez@riseup.net</a>> >> wrote: >> >> >> >> Jason Ekstrand <<a href="mailto:jason@jlekstrand.net">jason@jlekstrand.net</a>> writes: >> >> >> >> > On Feb 16, 2015 8:35 AM, "Francisco Jerez" <<a href="mailto:currojerez@riseup.net">currojerez@riseup.net</a>> >> > wrote: >> >> >> >> >> >> The round-robin allocation strategy is expected to decrease the >> amount >> >> >> of false dependencies created by the register allocator and give the >> >> >> post-RA scheduling pass more freedom to move instructions around. On >> >> >> the other hand it has the disadvantage of increasing fragmentation >> and >> >> >> decreasing the number of equally-colored nearby nodes, what increases >> >> >> the likelihood of failure in presence of optimistically colorable >> >> >> nodes. >> >> >> >> >> >> This patch disables the round-robin strategy for optimistically >> >> >> colorable nodes. These typically arise in situations of high >> register >> >> >> pressure or for registers with large live intervals, in both cases >> the >> >> >> task of the instruction scheduler shouldn't be constrained >> excessively >> >> >> by the dense packing of those nodes, and a spill (or on Intel >> hardware >> >> >> a fall-back to SIMD8 mode) is invariably worse than a slightly less >> >> >> optimal scheduling. >> >> > >> >> > Actually, that's not true. Matt was doing some experiments recently >> > with a >> >> > noise shader from synmark and the difference between our 2nd and 3rd >> > choice >> >> > schedulers is huge. In that test he disabled the third choice >> scheduler >> >> > and the result was a shader that spilled 6 or 8 times but ran >> something >> >> > like 30% faster. We really need to do some more experimentation with >> >> > scheduling and figure out better heuristics than "SIMD16 is always >> > faster" >> >> > and "spilling is bad". >> >> > >> >> >> >> Yes, I'm aware of rare corner cases like that where e.g. SIMD16 leads to >> >> higher cache thrashing than SIMD8 leading to decreased overall >> >> performance, and a case where a shader SIMD16 *with* spills has better >> >> performance than the SIMD8 version of the same shader without spills. >> >> >> >> In any case it's not the register allocator's business to implement such >> >> heuristics, and that's not an argument against the register allocator >> >> trying to make a more efficient use of the register file. >> > >> > The primary point I was trying to make is that scheduling *does* matter. >> > It matters a lot. In fact, Matt and i have talked about throwing away >> the >> > SIMD16 program if it ends up using the pessimal schedulong algorithm. >> > Throwing scheduling to the wind just to gain a few SIMD16 programs is >> > probably not a good trade-off. >> > >> In my experience the exact opposite observation has been far more >> common. Running SIMD16 vs SIMD8 has a larger impact on performance than >> the way you end up scheduling things post-regalloc. Actually even if >> you end up causing some unmet instruction dependencies by the way >> instructions are scheduled post-regalloc, the EU can context-switch to >> service the next available thread almost for free when a thread stalls >> on some dependency. Also the fact that you're doing SIMD16 itself makes >> post-regalloc scheduling less important because it naturally has an >> effect in hiding latency. >> >> My intuition is that the huge performance improvement Matt observed by >> disabling the third scheduling heuristic is more likely to have been >> caused by a decrease in the amount of cache thrashing caused by the fact >> that he was running less channels concurrently rather than by the >> scheduling heuristic itself. Matt, did you rule out that possibility? >> >> The other thing is this patch has an effect on the allocation strategy >> for optimistically colorable nodes *only*. We're already heavily >> constrained by register pressure when we get to that point, and assuming >> allocation succeeds the post-regalloc scheduler is going to have little >> room for maneuvering anyway. >> >> > It could be that this is an good idea, but it's going to take more than >> > hand-waved theories about register allocation one shader not spilling to >> > convince me. Do you actually know what it did to scheduling? It >> wouldn't >> > be hard to hack up the driver and shader-db to collect that information. >> > >> 44 shaders going SIMD16 seems like a strong enough argument to me. >> Could you be more precise about what additional information you want me >> to collect? >> > > How many shaders go from the first scheduling method to the second or to > the third. In other words some sort of metric on which shaders are > "helped" or "hurt" in their scheduling. </div></div>OK, I hacked the driver to output the scheduling heuristic that had been used when we allocated registers successfully for the program via KHR_debug and then ran shader-db before and after applying this patch. Before this patch: Heuristic SIMD8 SIMD16 PRE 18924 18598 PRE_NON_LIFO 72 38 PRE_LIFO 8 3 After this patch: Heuristic SIMD8 SIMD16 PRE 18939 18643 PRE_NON_LIFO 57 37 PRE_LIFO 8 3 So it actually *decreases* the number of shaders falling back to the latency-insensitive heuristics because the register allocator is more likely to succeed with the PRE heuristic. <div class="HOEnZb"><div class="h5"></div></div></blockquote><div> </div><div>Yeah, I guess it only affects post-allocation scheduling and accidental dependencies (caused by two instructions happening to use the same register). Sorry for not thinking about that earlier. In that case, I have no problem with the patch but I can't say I've reviewed it either. I'll try and take a look at it tomorrow and do a real review. </div><div>--Jason </div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div class="HOEnZb"><div class="h5"> > > >> > --Jason >> > >> >> >> Shader-db results on the i965 driver: >> >> >> >> >> >> total instructions in shared programs: 5488539 -> 5488489 (-0.00%) >> >> >> instructions in affected programs: 1121 -> 1071 (-4.46%) >> >> >> helped: 1 >> >> >> HURT: 0 >> >> >> GAINED: 49 >> >> >> LOST: 5 >> >> >> --- >> >> >> src/util/register_allocate.c | 22 +++++++++++++++++++++- >> >> >> 1 file changed, 21 insertions(+), 1 deletion(-) >> >> >> >> >> >> diff --git a/src/util/register_allocate.c >> > b/src/util/register_allocate.c >> >> >> index af7a20c..d63d8eb 100644 >> >> >> --- a/src/util/register_allocate.c >> >> >> +++ b/src/util/register_allocate.c >> >> >> @@ -168,6 +168,12 @@ struct ra_graph { >> >> >> >> >> >> unsigned int *stack; >> >> >> unsigned int stack_count; >> >> >> + >> >> >> + /** >> >> >> + * Tracks the start of the set of optimistically-colored >> registers >> > in >> >> > the >> >> >> + * stack. >> >> >> + */ >> >> >> + unsigned int stack_optimistic_start; >> >> >> }; >> >> >> >> >> >> /** >> >> >> @@ -454,6 +460,7 @@ static void >> >> >> ra_simplify(struct ra_graph *g) >> >> >> { >> >> >> bool progress = true; >> >> >> + unsigned int stack_optimistic_start = ~0; >> >> >> int i; >> >> >> >> >> >> while (progress) { >> >> >> @@ -483,12 +490,16 @@ ra_simplify(struct ra_graph *g) >> >> >> >> >> >> if (!progress && best_optimistic_node != ~0U) { >> >> >> decrement_q(g, best_optimistic_node); >> >> >> + stack_optimistic_start = >> >> >> + MIN2(stack_optimistic_start, g->stack_count); >> >> >> g->stack[g->stack_count] = best_optimistic_node; >> >> >> g->stack_count++; >> >> >> g->nodes[best_optimistic_node].in_stack = true; >> >> >> progress = true; >> >> >> } >> >> >> } >> >> >> + >> >> >> + g->stack_optimistic_start = stack_optimistic_start; >> >> >> } >> >> >> >> >> >> /** >> >> >> @@ -542,7 +553,16 @@ ra_select(struct ra_graph *g) >> >> >> g->nodes[n].reg = r; >> >> >> g->stack_count--; >> >> >> >> >> >> - if (g->regs->round_robin) >> >> >> + /* Rotate the starting point except for optimistically >> colorable >> >> > nodes. >> >> >> + * The likelihood that we will succeed at allocating >> > optimistically >> >> >> + * colorable nodes is highly dependent on the way that the >> > previous >> >> >> + * nodes popped off the stack are laid out. The round-robin >> >> > strategy >> >> >> + * increases the fragmentation of the register file and >> > decreases >> >> > the >> >> >> + * number of nearby nodes assigned to the same color, what >> >> > increases the >> >> >> + * likelihood of spilling with respect to the dense packing >> >> > strategy. >> >> >> + */ >> >> >> + if (g->regs->round_robin && >> >> >> + g->stack_count <= g->stack_optimistic_start) >> >> >> start_search_reg = r + 1; >> >> >> } >> >> >> >> >> >> -- >> >> >> 2.1.3 >> >> >> >> >> >> _______________________________________________ >> >> >> mesa-dev mailing list >> >> >> <a href="mailto:mesa-dev@lists.freedesktop.org">mesa-dev@lists.freedesktop.org</a> >> >> >> <a href="http://lists.freedesktop.org/mailman/listinfo/mesa-dev" target="_blank">http://lists.freedesktop.org/mailman/listinfo/mesa-dev</a> >> </div></div></blockquote></div> </div></div>