<div dir="ltr">On 28 October 2013 15:34, Eric Anholt <span dir="ltr"><<a href="mailto:eric@anholt.net" target="_blank">eric@anholt.net</a>></span> wrote:<br><div class="gmail_extra"><div class="gmail_quote"><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
When faced with a million instructions that all became candidates at the<br>
same time (none of which individually reduce register pressure), the ones<br>
on the critical path are more likely to be the ones that will free up some<br>
candidates soon.<br>
<br>
shader-db:<br>
total instructions in shared programs: 1681070 -> 1681070 (0.00%)<br>
instructions in affected programs: 0 -> 0<br>
GAINED: 40<br>
LOST: 74<br>
<br>
Fixes indistinguishable-from-hanging behavior in GLES3conform's<br>
uniform_buffer_object_max_uniform_block_size test, regressed by<br>
c3c9a8c85758796a26b48e484286e6b6f5a5299a. Given that<br>
93bd627d5a6c485948b94488e6cd53a06b7ebdcf was unlocked by that commit, the<br>
net effect on 16-wide program count is still quite positive, and I think<br>
this should give us more stable scheduling (less dependency on original<br>
instruction emit order).<br>
<br>
Bugzilla: <a href="https://bugs.freedesktop.org/show_bug.cgi?id=70943" target="_blank">https://bugs.freedesktop.org/show_bug.cgi?id=70943</a><br>
---<br>
.../drivers/dri/i965/brw_schedule_instructions.cpp | 77 +++++++++++++++++-----<br>
1 file changed, 62 insertions(+), 15 deletions(-)<br>
<br>
diff --git a/src/mesa/drivers/dri/i965/brw_schedule_instructions.cpp b/src/mesa/drivers/dri/i965/brw_schedule_instructions.cpp<br>
index 06e3a8b..04c875f 100644<br>
--- a/src/mesa/drivers/dri/i965/brw_schedule_instructions.cpp<br>
+++ b/src/mesa/drivers/dri/i965/brw_schedule_instructions.cpp<br>
@@ -68,6 +68,7 @@ public:<br>
this->child_count = 0;<br>
this->parent_count = 0;<br>
this->unblocked_time = 0;<br>
+ this->cand_generation = 0;<br>
<br>
/* We can't measure Gen6 timings directly but expect them to be much<br>
* closer to Gen7 than Gen4.<br>
@@ -91,6 +92,12 @@ public:<br>
int latency;<br>
<br>
/**<br>
+ * Which iteration of pushing groups of children onto the candidates list<br>
+ * this node was a part of.<br>
+ */<br>
+ unsigned cand_generation;<br>
+<br>
+ /**<br>
* This is the sum of the instruction's latency plus the maximum delay of<br>
* its children, or just the issue_time if it's a leaf node.<br>
*/<br>
@@ -973,26 +980,63 @@ fs_instruction_scheduler::choose_instruction_to_schedule()<br>
* variables so that we can avoid register spilling, or get 16-wide<br>
* shaders which naturally do a better job of hiding instruction<br>
* latency.<br>
- *<br>
- * To do so, schedule our instructions in a roughly LIFO/depth-first<br>
- * order: when new instructions become available as a result of<br>
- * scheduling something, choose those first so that our result<br>
- * hopefully is consumed quickly.<br>
- *<br>
- * The exception is messages that generate more than one result<br>
- * register (AKA texturing). In those cases, the LIFO search would<br>
- * normally tend to choose them quickly (because scheduling the<br>
- * previous message not only unblocked the children using its result,<br>
- * but also the MRF setup for the next sampler message, which in turn<br>
- * unblocks the next sampler message).<br>
*/<br>
foreach_list(node, &instructions) {<br>
schedule_node *n = (schedule_node *)node;<br>
fs_inst *inst = (fs_inst *)n->inst;<br>
<br>
- chosen = n;<br>
- if (v->brw->gen >= 7 || inst->regs_written <= 1)<br>
- break;<br>
+ if (!chosen) {<br>
+ chosen = n;<br>
+ continue;<br>
+ }<br>
+<br>
+ /* Prefer instructions that recently became available for scheduling.<br>
+ * These are the things that are most likely to (eventually) make a<br>
+ * variable dead and reduce register pressure. Typical register<br>
+ * pressure estimates don't work for us because most of our pressure<br>
+ * comes from texturing, where no single instruction to schedule will<br>
+ * make a vec4 value dead.<br>
+ */<br>
+ if (n->cand_generation > chosen->cand_generation) {<br>
+ chosen = n;<br>
+ continue;<br>
+ } else if (n->cand_generation < chosen->cand_generation) {<br>
+ continue;<br>
+ }<br>
+<br>
+ /* On MRF-using chips, prefer non-SEND instructions. Because we<br>
+ * prefer instructions that just became candidates, we'll end up in a<br>
+ * pattern of scheduling a SEND, then the MRFs for the next SEND,<br>
+ * then the next SEND, then the MRFs, etc., without ever consuming<br>
+ * the results of a send.<br></blockquote><div><br></div><div>On the first few readings of this comment, I was confused because I didn't realize that the sentence beginning "Because..." was describing a counterfactual. Maybe change this to "If we don't do this, then because we prefer...".<br>
</div> <blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
+ */<br>
+ if (v->brw->gen < 7) {<br>
+ fs_inst *chosen_inst = (fs_inst *)chosen->inst;<br>
+<br>
+ if (inst->regs_written <= 1 && chosen_inst->regs_written > 1) {<br></blockquote><div><br></div><div>It would be nice to have a comment here explaining that we're using inst->regs_written > 1 as a proxy for determining whether something is a send instruction--this isn't exactly correct, but it's close enough because the only SENDs that it misses are things like spills and fb_writes, and those aren't involved in the pathological case we're trying to fix.<br>
<br></div><div>With those minor changes, series is:<br><br>Reviewed-by: Paul Berry <<a href="mailto:stereotype441@gmail.com">stereotype441@gmail.com</a>><br></div><div><br></div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
+ chosen = n;<br>
+ continue;<br>
+ } else if (inst->regs_written > chosen_inst->regs_written) {<br>
+ continue;<br>
+ }<br>
+ }<br>
+<br>
+ /* For instructions pushed on the cands list at the same time, prefer<br>
+ * the one with the highest delay to the end of the program. This is<br>
+ * most likely to have its values able to be consumed first (such as<br>
+ * for a large tree of lowered ubo loads, which appear reversed in<br>
+ * the instruction stream with respect to when they can be consumed).<br>
+ */<br>
+ if (n->delay > chosen->delay) {<br>
+ chosen = n;<br>
+ continue;<br>
+ } else if (n->delay < chosen->delay) {<br>
+ continue;<br>
+ }<br>
+<br>
+ /* If all other metrics are equal, we prefer the first instruction in<br>
+ * the list (program execution).<br>
+ */<br>
}<br>
}<br>
<br>
@@ -1048,6 +1092,7 @@ instruction_scheduler::schedule_instructions(backend_instruction *next_block_hea<br>
n->remove();<br>
}<br>
<br>
+ unsigned cand_generation = 1;<br>
while (!instructions.is_empty()) {<br>
schedule_node *chosen = choose_instruction_to_schedule();<br>
<br>
@@ -1090,6 +1135,7 @@ instruction_scheduler::schedule_instructions(backend_instruction *next_block_hea<br>
bv->dump_instruction(child->inst);<br>
}<br>
<br>
+ child->cand_generation = cand_generation;<br>
child->parent_count--;<br>
if (child->parent_count == 0) {<br>
if (debug) {<br>
@@ -1098,6 +1144,7 @@ instruction_scheduler::schedule_instructions(backend_instruction *next_block_hea<br>
instructions.push_head(child);<br>
}<br>
}<br>
+ cand_generation++;<br>
<br>
/* Shared resource: the mathbox. There's one mathbox per EU on Gen6+<br>
* but it's more limited pre-gen6, so if we send something off to it then<br>
<span class="HOEnZb"><font color="#888888">--<br>
1.8.4.rc3<br>
<br>
_______________________________________________<br>
mesa-dev mailing list<br>
<a href="mailto:mesa-dev@lists.freedesktop.org">mesa-dev@lists.freedesktop.org</a><br>
<a href="http://lists.freedesktop.org/mailman/listinfo/mesa-dev" target="_blank">http://lists.freedesktop.org/mailman/listinfo/mesa-dev</a><br>
</font></span></blockquote></div><br></div></div>