<div dir="ltr"><br><div class="gmail_extra"><br><div class="gmail_quote">On Fri, May 27, 2016 at 7:05 PM, Francisco Jerez <span dir="ltr"><<a href="mailto:currojerez@riseup.net" target="_blank">currojerez@riseup.net</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">Skipping the temporary allocation and copy instructions is easy (just<br>
return dst), but the conditions used to find out whether the copy can<br>
be optimized out safely without breaking the program are rather<br>
complex: The destination must be exactly one component of at most the<br>
execution width of the lowered instruction, and all source regions of<br>
the instruction must be either fully disjoint from the destination or<br>
be aligned with it group by group.<br>
---<br>
src/mesa/drivers/dri/i965/brw_fs.cpp | 81 ++++++++++++++++++++++++++++++++++++<br>
1 file changed, 81 insertions(+)<br>
<br>
diff --git a/src/mesa/drivers/dri/i965/brw_fs.cpp b/src/mesa/drivers/dri/i965/brw_fs.cpp<br>
index 8eff27e..5d26c68 100644<br>
--- a/src/mesa/drivers/dri/i965/brw_fs.cpp<br>
+++ b/src/mesa/drivers/dri/i965/brw_fs.cpp<br>
@@ -5054,6 +5054,78 @@ emit_unzip(const fs_builder &lbld, bblock_t *block, fs_inst *inst,<br>
}<br>
<br>
/**<br>
+ * Return true if splitting out the group of channels of instruction \p inst<br>
+ * given by lbld.group() requires allocating a temporary for the destination<br>
+ * of the lowered instruction and copying the data back to the original<br>
+ * destination region.<br>
+ *<br>
+ * Note that the result of this function may be different for different<br>
+ * channel groups in cases where the overlap between a source and destination<br>
+ * region is partial.<br>
+ */<br>
+static inline bool<br>
+needs_dst_copy(const fs_builder &lbld, const fs_inst *inst)<br>
+{<br>
+ /* If the instruction writes more than one component we'll have to shuffle<br>
+ * the results of multiple lowered instructions in order to make sure that<br>
+ * they end up arranged correctly in the original destination region.<br>
+ */<br>
+ if (inst->regs_written * REG_SIZE ><br>
+ inst->dst.component_size(inst->exec_size))<br>
+ return true;<br>
+<br>
+ /* If the lowered execution size is larger than the original the result of<br>
+ * the instruction won't fit in the original destination, so we'll have to<br>
+ * allocate a temporary in any case.<br>
+ */<br>
+ if (lbld.dispatch_width() > inst->exec_size)<br>
+ return true;<br>
+<br>
+ for (unsigned i = 0; i < inst->sources; i++) {<br>
+ /* Index of this lowered instruction and total number of lowered<br>
+ * instructions.<br>
+ */<br>
+ const unsigned j = lbld.group() / lbld.dispatch_width();<br>
+ const unsigned n = DIV_ROUND_UP(inst->exec_size, lbld.dispatch_width());<br>
+<br>
+ /* Specified channels from the source and destination regions. */<br>
+ const fs_reg src = horiz_offset(inst->src[i], lbld.group());<br>
+ const fs_reg dst = horiz_offset(inst->dst, lbld.group());<br>
+<br>
+ /* Lowered component size of the source and destination regions. */<br>
+ const unsigned dsrc = inst->src[i].component_size(lbld.dispatch_width());<br>
+ const unsigned ddst = inst->dst.component_size(lbld.dispatch_width());<br>
+<br>
+ /* If we already made a copy of the source for other reasons there won't<br>
+ * be any overlap with the destination.<br>
+ */<br>
+ if (!is_periodic(inst->src[i], lbld.dispatch_width()) &&<br>
+ inst->components_read(i) != 1)<br></blockquote><div><br></div><div>might be good to have a needs_src_copy function to keep the logic in one place.<br></div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
+ continue;<br>
+<br>
+ /* We need a copy if the specified group of channels of the destination<br>
+ * overlaps either the previous or subsequent groups of the source. We<br>
+ * don't care whether the destination of a given lowered instruction<br>
+ * overlaps its own source because there's arguably no<br>
+ * source/destination hazard for this opcode if the source and<br>
+ * destination of the unlowered instruction were already overlapping.<br>
+ *<br>
+ * Note that it might be possible to avoid emitting a copy in more cases<br>
+ * by making assumptions about the ordering in which the lowered<br>
+ * instructions will be emitted (ugh!), but this has the advantage that<br>
+ * the lowered instructions will be fully independent from each other<br>
+ * regardless of the form of overlap so they can be scheduled freely.<br>
+ */<br>
+ if (regions_overlap(dst, ddst, inst->src[i], dsrc * j) ||<br>
+ regions_overlap(dst, ddst, horiz_offset(src, lbld.dispatch_width()),<br>
+ dsrc * (n - j - 1)))<br>
+ return true;<br></blockquote><br></div><div class="gmail_quote">Wow, this is annoyingly complicated. It also appears to be correct. Would it be sufficient to simply use the following?<br><br></div><div class="gmail_quote">if (regions_overlap(inst->dst, inst->regs_written * REG_SIZE, inst->src[i], inst->regs_read() * REG_SIZE) &&<br></div><div class="gmail_quote"> !inst->dst.equals(inst->src[i]))<br></div><div class="gmail_quote"> return true;<br><br></div><div class="gmail_quote">It's much simpler and covers the two cases we really care about. If that's not sufficient, what you have above does look correct.<br><br></div><div class="gmail_quote">--Jason<br></div><div class="gmail_quote"><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
+ }<br>
+<br>
+ return false;<br>
+}<br>
+<br>
+/**<br>
* Insert data from a packed temporary into the channel group given by<br>
* lbld.group() of the destination region of instruction \p inst and return<br>
* the temporary as result. If any copy instructions are required they will<br>
@@ -5073,6 +5145,8 @@ emit_zip(const fs_builder &lbld, bblock_t *block, fs_inst *inst)<br>
const fs_reg dst = horiz_offset(inst->dst, lbld.group());<br>
const unsigned dst_size = inst->regs_written * REG_SIZE /<br>
inst->dst.component_size(inst->exec_size);<br>
+<br>
+ if (needs_dst_copy(lbld, inst)) {<br>
const fs_reg tmp = lbld.vgrf(inst->dst.type, dst_size);<br>
<br>
if (inst->predicate) {<br>
@@ -5090,6 +5164,13 @@ emit_zip(const fs_builder &lbld, bblock_t *block, fs_inst *inst)<br>
.MOV(offset(dst, inst->exec_size, k), offset(tmp, lbld, k));<br>
<br>
return tmp;<br>
+<br>
+ } else {<br>
+ /* No need to allocate a temporary for the lowered instruction, just<br>
+ * take the right group of channels from the original region.<br>
+ */<br>
+ return dst;<br>
+ }<br>
}<br>
<br>
bool<br>
<span class="HOEnZb"><font color="#888888">--<br>
2.7.3<br>
<br>
_______________________________________________<br>
mesa-dev mailing list<br>
<a href="mailto:mesa-dev@lists.freedesktop.org">mesa-dev@lists.freedesktop.org</a><br>
<a href="https://lists.freedesktop.org/mailman/listinfo/mesa-dev" rel="noreferrer" target="_blank">https://lists.freedesktop.org/mailman/listinfo/mesa-dev</a><br>
</font></span></blockquote></div><br></div></div>