<div dir="ltr">On 28 October 2013 15:34, Eric Anholt <<a href="mailto:eric@anholt.net" target="_blank">eric@anholt.net</a>> wrote: <div class="gmail_extra"><div class="gmail_quote"><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"> When faced with a million instructions that all became candidates at the same time (none of which individually reduce register pressure), the ones on the critical path are more likely to be the ones that will free up some candidates soon. shader-db: total instructions in shared programs: 1681070 -> 1681070 (0.00%) instructions in affected programs: 0 -> 0 GAINED: 40 LOST: 74 Fixes indistinguishable-from-hanging behavior in GLES3conform's uniform_buffer_object_max_uniform_block_size test, regressed by c3c9a8c85758796a26b48e484286e6b6f5a5299a. Given that 93bd627d5a6c485948b94488e6cd53a06b7ebdcf was unlocked by that commit, the net effect on 16-wide program count is still quite positive, and I think this should give us more stable scheduling (less dependency on original instruction emit order). Bugzilla: <a href="https://bugs.freedesktop.org/show_bug.cgi?id=70943" target="_blank">https://bugs.freedesktop.org/show_bug.cgi?id=70943</a> --- .../drivers/dri/i965/brw_schedule_instructions.cpp | 77 +++++++++++++++++----- 1 file changed, 62 insertions(+), 15 deletions(-) diff --git a/src/mesa/drivers/dri/i965/brw_schedule_instructions.cpp b/src/mesa/drivers/dri/i965/brw_schedule_instructions.cpp index 06e3a8b..04c875f 100644 --- a/src/mesa/drivers/dri/i965/brw_schedule_instructions.cpp +++ b/src/mesa/drivers/dri/i965/brw_schedule_instructions.cpp @@ -68,6 +68,7 @@ public: this->child_count = 0; this->parent_count = 0; this->unblocked_time = 0; + this->cand_generation = 0; /* We can't measure Gen6 timings directly but expect them to be much * closer to Gen7 than Gen4. @@ -91,6 +92,12 @@ public: int latency; /** + * Which iteration of pushing groups of children onto the candidates list + * this node was a part of. + */ + unsigned cand_generation; + + /** * This is the sum of the instruction's latency plus the maximum delay of * its children, or just the issue_time if it's a leaf node. */ @@ -973,26 +980,63 @@ fs_instruction_scheduler::choose_instruction_to_schedule() * variables so that we can avoid register spilling, or get 16-wide * shaders which naturally do a better job of hiding instruction * latency. - * - * To do so, schedule our instructions in a roughly LIFO/depth-first - * order: when new instructions become available as a result of - * scheduling something, choose those first so that our result - * hopefully is consumed quickly. - * - * The exception is messages that generate more than one result - * register (AKA texturing). In those cases, the LIFO search would - * normally tend to choose them quickly (because scheduling the - * previous message not only unblocked the children using its result, - * but also the MRF setup for the next sampler message, which in turn - * unblocks the next sampler message). */ foreach_list(node, &instructions) { schedule_node *n = (schedule_node *)node; fs_inst *inst = (fs_inst *)n->inst; - chosen = n; - if (v->brw->gen >= 7 || inst->regs_written <= 1) - break; + if (!chosen) { + chosen = n; + continue; + } + + /* Prefer instructions that recently became available for scheduling. + * These are the things that are most likely to (eventually) make a + * variable dead and reduce register pressure. Typical register + * pressure estimates don't work for us because most of our pressure + * comes from texturing, where no single instruction to schedule will + * make a vec4 value dead. + */ + if (n->cand_generation > chosen->cand_generation) { + chosen = n; + continue; + } else if (n->cand_generation < chosen->cand_generation) { + continue; + } + + /* On MRF-using chips, prefer non-SEND instructions. Because we + * prefer instructions that just became candidates, we'll end up in a + * pattern of scheduling a SEND, then the MRFs for the next SEND, + * then the next SEND, then the MRFs, etc., without ever consuming + * the results of a send. </blockquote><div> </div><div>On the first few readings of this comment, I was confused because I didn't realize that the sentence beginning "Because..." was describing a counterfactual. Maybe change this to "If we don't do this, then because we prefer...". </div> <blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"> + */ + if (v->brw->gen < 7) { + fs_inst *chosen_inst = (fs_inst *)chosen->inst; + + if (inst->regs_written <= 1 && chosen_inst->regs_written > 1) { </blockquote><div> </div><div>It would be nice to have a comment here explaining that we're using inst->regs_written > 1 as a proxy for determining whether something is a send instruction--this isn't exactly correct, but it's close enough because the only SENDs that it misses are things like spills and fb_writes, and those aren't involved in the pathological case we're trying to fix. </div><div>With those minor changes, series is: Reviewed-by: Paul Berry <<a href="mailto:stereotype441@gmail.com">stereotype441@gmail.com</a>> </div><div> </div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"> + chosen = n; + continue; + } else if (inst->regs_written > chosen_inst->regs_written) { + continue; + } + } + + /* For instructions pushed on the cands list at the same time, prefer + * the one with the highest delay to the end of the program. This is + * most likely to have its values able to be consumed first (such as + * for a large tree of lowered ubo loads, which appear reversed in + * the instruction stream with respect to when they can be consumed). + */ + if (n->delay > chosen->delay) { + chosen = n; + continue; + } else if (n->delay < chosen->delay) { + continue; + } + + /* If all other metrics are equal, we prefer the first instruction in + * the list (program execution). + */ } } @@ -1048,6 +1092,7 @@ instruction_scheduler::schedule_instructions(backend_instruction *next_block_hea n->remove(); } + unsigned cand_generation = 1; while (!instructions.is_empty()) { schedule_node *chosen = choose_instruction_to_schedule(); @@ -1090,6 +1135,7 @@ instruction_scheduler::schedule_instructions(backend_instruction *next_block_hea bv->dump_instruction(child->inst); } + child->cand_generation = cand_generation; child->parent_count--; if (child->parent_count == 0) { if (debug) { @@ -1098,6 +1144,7 @@ instruction_scheduler::schedule_instructions(backend_instruction *next_block_hea instructions.push_head(child); } } + cand_generation++; /* Shared resource: the mathbox. There's one mathbox per EU on Gen6+ * but it's more limited pre-gen6, so if we send something off to it then -- 1.8.4.rc3 _______________________________________________ mesa-dev mailing list <a href="mailto:mesa-dev@lists.freedesktop.org">mesa-dev@lists.freedesktop.org</a> <a href="http://lists.freedesktop.org/mailman/listinfo/mesa-dev" target="_blank">http://lists.freedesktop.org/mailman/listinfo/mesa-dev</a> </blockquote></div> </div></div>