<div dir="ltr"><div class="gmail_quote"><div dir="ltr">On Sat, Aug 25, 2018 at 9:40 AM Jason Ekstrand <<a href="mailto:jason@jlekstrand.net">jason@jlekstrand.net</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr"><div>Sorry I haven't given you any in-line review yet. I've been letting this pass simmer in my brain a bit and thinking about complexity and data structures. Here's a few questions for which I do not expect you to have actual answers:</div><div><br></div><div> 1) How many unique derefs should we expect to see in a shader? Dozens? Hundreds? Thousands? My guess is that, after vars_to_ssa has picked off all the easy ones, it should be on the order of a couple hundred at most but I don't actually know what a reasonable upper limit to assume is.</div><div><br></div><div> 2) Is every deref going to be eventually compared to every other deref or are we going to have separate little groups of derefs?</div><div><br></div><div> I'm asking these questions because I have some concern about how expensive this pass (and copy prop) is going to be. If you have a shader with a huge number of loops each of which has a
very self-contained set of derefs, the current pass as-is probably has an efficiency that's around O(n) in the number of instructions. However, in the case where most derefs are
in play all the time, then I suspect this pass as-is looks more like
O(n^3) in the number of derefs or worse.

If the total number of unique derefs is reasonable AND the graph of deref
comparisons isn't too sparse, then I wonder if it wouldn't be better to have
some intermediate data structure like this:

struct deref_compare_map {
   void *mem_ctx;

   struct hash_table *id_map;

   nir_deref_path *paths;
   BITSET_WORD **alias;
   BITSET_WORD **contains;
   BITSET_WORD **contained_in;
};

void deref_compare_map_init(struct deref_compare_map *map, void *mem_ctx)
{
   memset(map, 0, sizeof(*map));
   map->mem_ctx = ralloc_context(mem_ctx);

   /* I'm pretty sure we have deref hashing and equality functions somewhere
    * already, we just have to pull them out and use them.  It is, however,
    * possible that they were in nir_instr_set.c and got deleted as part of
    * the move to deref instructions.
    */
   map->id_map = _mesa_hash_table_create(map->mem_ctx, deref_hash, deref_equal);
}

void deref_compare_map_finish(struct deref_compare_map *map)
{
   ralloc_free(map->mem_ctx);
}

uint32_t deref_compare_map_get_id(struct deref_compare_map *map,
                                  nir_deref_instr *deref)
{
   struct hash_entry *entry = _mesa_hash_table_search(map->id_map, deref);
   if (entry)
      return (uint32_t)(uintptr_t)entry->data;

   /* We can't be adding anything after we've baked it */
   assert(map->alias == NULL);

   uint32_t id = map->id_map->entries;
   _mesa_hash_table_insert(map->id_map, deref, (void *)(uintptr_t)id);
   assert(map->id_map->entries == id + 1);
   return id;
}

static BITSET_WORD **
alloc_set_of_sets(void *mem_ctx, unsigned entries)
{
   BITSET_WORD **sets = ralloc_array(mem_ctx, BITSET_WORD *, entries);
   BITSET_WORD *data = rzalloc_array(mem_ctx, BITSET_WORD,
                                     entries * BITSET_WORDS(entries));
   for (unsigned i = 0; i < entries; i++)
      sets[i] = data + i * BITSET_WORDS(entries);
   return sets;
}

void deref_compare_map_bake(struct deref_compare_map *map)
{
   const unsigned num_derefs = map->id_map->entries;

   map->paths = ralloc_array(map->mem_ctx, nir_deref_path, num_derefs);
   struct hash_entry *entry;
   hash_table_foreach(map->id_map, entry) {
      uint32_t id = (uintptr_t)entry->data;
      nir_deref_path_init(&map->paths[id], (void *)entry->key, map->mem_ctx);
   }

   map->alias = alloc_set_of_sets(map->mem_ctx, num_derefs);
   map->contains = alloc_set_of_sets(map->mem_ctx, num_derefs);
   map->contained_in = alloc_set_of_sets(map->mem_ctx, num_derefs);

   for (unsigned a = 0; a < num_derefs; a++) {
      /* Handle comparing a with itself */
      BITSET_SET(map->alias[a], a);
      BITSET_SET(map->contains[a], a);
      BITSET_SET(map->contained_in[a], a);

      for (unsigned b = 0; b < a; b++) {
         nir_deref_compare_result comp =
            nir_compare_deref_paths(&map->paths[a], &map->paths[b]);
         if (comp & nir_derefs_may_alias_bit) {
            BITSET_SET(map->alias[a], b);
            BITSET_SET(map->alias[b], a);
         }

         if (comp & nir_derefs_a_contains_b_bit) {
            BITSET_SET(map->contains[a], b);
            BITSET_SET(map->contained_in[b], a);
         }

         if (comp & nir_derefs_b_contains_a_bit) {
            BITSET_SET(map->contains[b], a);
            BITSET_SET(map->contained_in[a], b);
         }
      }
   }
}
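The query side would then just be bit tests.  A rough, untested sketch along
the same lines (helper names made up):

bool deref_compare_map_may_alias(const struct deref_compare_map *map,
                                 uint32_t a, uint32_t b)
{
   /* Only meaningful after deref_compare_map_bake(); a and b are the ids
    * handed out by deref_compare_map_get_id().
    */
   return BITSET_TEST(map->alias[a], b);
}

bool deref_compare_map_a_contains_b(const struct deref_compare_map *map,
                                    uint32_t a, uint32_t b)
{
   return BITSET_TEST(map->contains[a], b);
}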
The idea would be to make it a 5-step pass:

 1) Find all derefs
 2) Bake the deref compare map
 3) Local elimination and building kill etc. sets
 4) Data-flow
 5) Global elimination

Doing that would give us some better run-time guarantees:

 1) Set operations on bitsets are O(n) (where n is actually
    DIV_ROUND_UP(n, 32) thanks to bitsets)
 2) We call nir_deref_path_init exactly num_derefs times for the entire pass
 3) We call nir_compare_deref_paths exactly
    (num_derefs * (num_derefs - 1)) / 2 times for the entire pass

As it is, deref_map_add_map is O(n^2) comparisons and we do that operation
inside our fixed-point loop, which has an unknown number of iterations.  This
data structure could easily be put in nir_derefs.c and shared between this
pass and copy-prop.

Just a thought.  Of course, you've spent far more time playing with this than
I have, so I may be completely off-base.  Before you go off and re-write it,
let's have a little e-mail discussion to decide if we think the rewrite is
worth it. :-)

After giving it more thought, I realized I hadn't done anything with
components like you did.  While local variables don't currently support
write-masks, that should probably change in the not-so-distant future, so we
should handle them.  One idea would be to replace our many arrays with one
array of structs such as

struct deref_comp_info {
   /** Offset into other infos' bitsets */
   uint32_t offset;

   /** Number of components for vector derefs, zero for others
    *
    * If num_components == 0, the deref is represented by the single bit at
    * offset in the bitsets.  If num_components > 0, the deref is represented
    * by num_components bits starting at offset.
    */
   unsigned num_components;

   nir_deref_path path;

   BITSET_WORD *alias;
   BITSET_WORD *contains;
   BITSET_WORD *contained_in;
};

Also, if we're going to share it between passes, it might be good to have
nir_deref_compare_map_init take a shader and do the walk to find all derefs
and the bake in a single step, so we can avoid exposing quite as many details
to the user.
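To make the encoding concrete, a per-component containment test would then
look something like this (hypothetical helper, untested):

bool deref_info_contains_mask(const struct deref_comp_info *a,
                              const struct deref_comp_info *b,
                              unsigned write_mask)
{
   /* Does deref a fully overwrite the components of deref b selected by
    * write_mask, given the offset/num_components encoding above?
    */
   if (b->num_components == 0)
      return BITSET_TEST(a->contains, b->offset);

   for (unsigned c = 0; c < b->num_components; c++) {
      if ((write_mask & (1u << c)) && !BITSET_TEST(a->contains, b->offset + c))
         return false;
   }
   return true;
}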
--Jason

On Wed, Aug 15, 2018 at 4:57 PM Caio Marcelo de Oliveira Filho
<caio.oliveira@intel.com> wrote:

The pass will remove not only writes that are proven to be dead in a
block, but also writes that can only be considered dead by looking at
multiple blocks. Such global analysis is necessary to remove dead<br>
writes such as the one marked below:<br>
<br>
int total = gl_VertexIndex * 10;<br>
float r = gl_VertexIndex;<br>
<br>
for (int i = 0; i < total; i++) {
   arr[i] = i; /* DEAD WRITE. */
   if ((i % 2) == 0) {
      arr[i] = i * 3.0;
   } else {
      arr[i] = i * 5.0;
   }
   r = arr[i];
}
<br>
out_color = vec4(r,r,r,r);<br>
<br>
Without knowing that both successor blocks will overwrite the value,
it was not possible to remove the "arr[i] = i" write.
<br>
The local pass was extended with some extra data collected to allow
an iterative (backward) data-flow analysis to run.  The analysis is
used to find out, for each block, which derefs are used after the
block.  This information is then used to decide which unused writes at
the end of the block can be removed.  In a local-only analysis we
couldn't remove any of them, because they could still be used later.
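
Roughly sketching the analysis for the loop above (tracking only the
arr[] deref, and assuming the body is split into a block with the first
write plus the `if` condition, the two branch blocks and a merge block):

    block_if:     write arr[i]    GEN = {}        KILL = {arr[i]}
    block_then:   write arr[i]    GEN = {}        KILL = {arr[i]}
    block_else:   write arr[i]    GEN = {}        KILL = {arr[i]}
    block_merge:  read  arr[i]    GEN = {arr[i]}  KILL = {}

    IN(block_merge) contains arr[i] (it is in GEN), so
    OUT(block_then) = OUT(block_else) = IN(block_merge) contains arr[i], but
    IN(block_then)  = GEN union (OUT minus KILL) = {}   (same for block_else)
    OUT(block_if)   = IN(block_then) union IN(block_else) = {}

Since arr[i] is not in OUT(block_if), the trailing "arr[i] = i" write in
that block can be removed.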
---<br>
src/compiler/nir/nir_opt_dead_write_vars.c | 546 ++++++++++++++++++++-<br>
1 file changed, 532 insertions(+), 14 deletions(-)<br>
<br>
diff --git a/src/compiler/nir/nir_opt_dead_write_vars.c b/src/compiler/nir/nir_opt_dead_write_vars.c<br>
index 822bfa5595d..fe72ab784a8 100644<br>
--- a/src/compiler/nir/nir_opt_dead_write_vars.c<br>
+++ b/src/compiler/nir/nir_opt_dead_write_vars.c<br>
@@ -27,6 +27,85 @@<br>
<br>
#include "util/u_dynarray.h"<br>
<br>
+/**<br>
+ * Elimination of dead writes based on derefs.<br>
+ *<br>
+ * Dead writes are stores and copies that write to a deref which then gets
+ * another write before it is used (read or sourced for a copy).  Those
+ * writes can be removed since they don't affect anything.
+ *
+ * For derefs that refer to memory that can still be read after the program
+ * ends, the last write is considered used.  The presence of certain
+ * instructions may also cause writes to be considered used, e.g. a memory
+ * barrier (in this case the value must be written as another thread might use it).
+ *
+ * The write mask for store instructions is taken into account, so it is
+ * possible that a store is removed because the combination of other stores
+ * overwrites its value.
+ *<br>
+ * The pass works in three steps:<br>
+ *<br>
+ * (1) For each block, perform a local dead write removal, removing only
+ * writes we are sure are dead and keeping around the last unused writes
+ * at the end of the block.  These last unused writes are revisited in (3).
+ *<br>
+ * (2) Perform a data-flow analysis to identify, for each block, the "uses
+ * before write" of successor blocks.  E.g. given the blocks below (block0
+ * precedes block1, which precedes block2)
+ *
+ *     block0:       block1:       block2:
+ *       write A       load B        load D
+ *       write B       write C       load E
+ *       write C       load C
+ *       write D       write D
+ *       write E
+ *       load A
+ *
+ * the goal is to know that B and E are "used before write" after block0; D
+ * and E are "used before write" after block1.  Note that the usage of E
+ * doesn't come from an immediate successor of block0.  The "used before
+ * write" information will later decide which unused writes are removed.
+ *<br>
+ * The data-flow analysis is implemented by starting with per-block sets
+ * and iteratively fixing the contents of those sets.  The nature of the
+ * "fixing" ensures that the sets eventually become stable (i.e. no infinite loops).
+ * Each block gets two sets representing its local information:<br>
+ *<br>
+ * - GEN: the derefs used before write by the block. E.g. GEN(block1) = {B}<br>
+ * - KILL: the derefs overwritten by the block. E.g. KILL(block1) = {C, D}<br>
+ *<br>
+ * And two other sets representing its current state:<br>
+ *<br>
+ * - IN: the derefs used before write for this block and all successors.<br>
+ * E.g. at the end of the analysis, IN(block1) = {B, E}. Note how E is<br>
+ * present but not D.<br>
+ * - OUT: the derefs used before write for the successors of the block.<br>
+ *<br>
+ * In this pass the iteration happens "backwards", so a block uses the<br>
+ * successor information to improve its own information.  The IN and OUT
+ * sets are updated as follows:<br>
+ *<br>
+ * OUT = union IN of all successors<br>
+ * IN = GEN union (OUT minus KILL)<br>
+ *<br>
+ * (3) Remove the last unused writes for each block that are not used by
+ * successor blocks. After the analysis in (2), this information is<br>
+ * available in the OUT set of each block.<br>
+ *<br>
+ * One extra factor in play is that it is not trivial to decide whether two
+ * derefs refer to the same storage or not.  Currently array indirects are
+ * assumed to alias each other, i.e. "write A[i]" is considered used by a
+ * later "load A[j]".  This logic is mostly handled by the
+ * nir_compare_deref_paths() function.  This prevents us from representing the
+ * data-flow sets as bitsets.
+ *<br>
+ * Data-flow analysis code here is inspired by Muchnick's "Advanced Compiler<br>
+ * Design and Implementation" and by the data-flow analysis implementation in<br>
+ * src/intel/compiler/brw_fs_copy_propagation.cpp.<br>
+ */<br>
+<br>
+static const bool debug = false;<br>
+<br>
struct state {<br>
void *mem_ctx;<br>
<br>
@@ -34,6 +113,24 @@ struct state {<br>
* rebuilding the paths for the same deref. */<br>
struct hash_table *paths;<br>
void *path_lin_ctx;<br>
+<br>
+ struct block_data *block_data;<br>
+};<br>
+<br>
+struct block_data {<br>
+ struct deref_map *in;<br>
+ struct deref_map *out;<br>
+ struct deref_map *gen;<br>
+ struct deref_map *kill;<br>
+<br>
+ struct deref_map *kill_before_memory_barrier;<br>
+ struct deref_map *kill_before_emit_vertex;<br>
+<br>
+ /* All the writes not used by the end of the block. Used after the<br>
+ * data-flow analysis to identify globally unused writes that can be
+ * removed.<br>
+ */<br>
+ struct util_dynarray unused_writes;<br>
};<br>
<br>
static nir_deref_path *<br>
@@ -50,6 +147,23 @@ get_path(struct state *state, nir_deref_instr *deref)<br>
}<br>
}<br>
<br>
+static void<br>
+print_instr(const char *prefix, const nir_instr *instr, const char *suffix)<br>
+{<br>
+ if (prefix) fputs(prefix, stdout);<br>
+ nir_print_instr(instr, stdout);<br>
+ if (suffix) fputs(suffix, stdout);<br>
+}<br>
+<br>
+static void<br>
+print_path(const char *prefix, nir_deref_path *path, const char *suffix)<br>
+{<br>
+ int i;<br>
+ for (i = 0; path->path[i + 1]; i++);<br>
+ nir_deref_instr *deref = path->path[i];<br>
+ print_instr(prefix, &deref->instr, suffix);<br>
+}<br>
+<br>
/* Entry for unused_writes arrays. */<br>
struct write_entry {<br>
/* If NULL indicates the entry is free to be reused. */<br>
@@ -58,6 +172,161 @@ struct write_entry {<br>
nir_deref_path *path;<br>
};<br>
<br>
+struct deref_map_entry {<br>
+ nir_deref_path *path;<br>
+ uintptr_t mask;<br>
+};<br>
+<br>
+/* Maps derefs to a (write) mask value. Used by the various data-flow<br>
+ * analysis sets.<br>
+ *<br>
+ * TODO: Consider a tree-based data structure to represent this instead of<br>
+ * arrays of paths.<br>
+ */<br>
+struct deref_map {<br>
+ struct util_dynarray entries;<br>
+ void *mem_ctx;<br>
+};<br>
+<br>
+static struct deref_map *<br>
+deref_map_create(void *mem_ctx)<br>
+{<br>
+ struct deref_map *map = rzalloc(mem_ctx, struct deref_map);<br>
+ map->mem_ctx = mem_ctx;<br>
+ util_dynarray_init(&map->entries, mem_ctx);<br>
+ return map;<br>
+}<br>
+<br>
+static bool<br>
+deref_map_empty(struct deref_map *map)<br>
+{<br>
+ return map->entries.size == 0;<br>
+}<br>
+<br>
+/* Add a deref (represented by its path) to a deref_map, with a given mask.<br>
+ *<br>
+ * Returns whether the map was changed or not.
+ */<br>
+static bool<br>
+deref_map_add(struct deref_map *map, nir_deref_path *path, uintptr_t mask)<br>
+{<br>
+ util_dynarray_foreach (&map->entries, struct deref_map_entry, map_entry) {<br>
+ nir_deref_compare_result comp = nir_compare_deref_paths(path, map_entry->path);<br>
+ if (comp & nir_derefs_equal_bit) {<br>
+ /* The deref is already present, no new entry needed. */<br>
+ uintptr_t old_mask = map_entry->mask;<br>
+ map_entry->mask |= mask;<br>
+ return old_mask != map_entry->mask;<br>
+ }<br>
+ }<br>
+<br>
+ struct deref_map_entry map_entry = {<br>
+ .path = path,<br>
+ .mask = mask,<br>
+ };<br>
+ util_dynarray_append(&map->entries, struct deref_map_entry, map_entry);<br>
+<br>
+ return true;<br>
+}<br>
+<br>
+static bool<br>
+deref_map_contains(struct deref_map *map, nir_deref_path *path, uintptr_t mask)<br>
+{<br>
+ util_dynarray_foreach (&map->entries, struct deref_map_entry, map_entry) {<br>
+ nir_deref_compare_result comp = nir_compare_deref_paths(path, map_entry->path);<br>
+ /* All the bits in the mask must be present in the map. */<br>
+ if ((comp & nir_derefs_b_contains_a_bit) && ((mask & map_entry->mask) == mask))<br>
+ return true;<br>
+ }<br>
+ return false;<br>
+}<br>
+<br>
+static bool<br>
+deref_map_may_alias(struct deref_map *map, nir_deref_path *path, uintptr_t mask)<br>
+{<br>
+ util_dynarray_foreach (&map->entries, struct deref_map_entry, map_entry) {<br>
+ nir_deref_compare_result comp = nir_compare_deref_paths(path, map_entry->path);<br>
+ /* A single overlapping bit is enough to characterize an alias. */<br>
+ if ((comp & nir_derefs_may_alias_bit) && (mask & map_entry->mask))<br>
+ return true;<br>
+ }<br>
+ return false;<br>
+}<br>
+<br>
+#define deref_map_foreach(map, elem) \<br>
+ util_dynarray_foreach(&map->entries, struct deref_map_entry, elem)<br>
+<br>
+static bool<br>
+deref_map_add_map(struct deref_map *map, struct deref_map *to_add)<br>
+{<br>
+ bool changed = false;<br>
+ deref_map_foreach(to_add, map_entry)<br>
+ changed |= deref_map_add(map, map_entry->path, map_entry->mask);<br>
+ return changed;<br>
+}<br>
+<br>
+<br>
+static struct deref_map *<br>
+deref_map_clone(struct deref_map *from, void *mem_ctx)<br>
+{<br>
+ struct deref_map *map = deref_map_create(mem_ctx);<br>
+ util_dynarray_clone(&map->entries, mem_ctx, &from->entries);<br>
+ return map;<br>
+}<br>
+<br>
+/* A zero mask indicates the entry should be removed. */
+static void<br>
+deref_map_remove_zero_masks(struct deref_map *map)<br>
+{<br>
+ deref_map_foreach(map, map_entry) {<br>
+ if (map_entry->mask)<br>
+ continue;<br>
+<br>
+ /* Replace the deleted entry with the last element. */<br>
+ struct deref_map_entry *last =<br>
+ util_dynarray_pop_ptr(&map->entries, struct deref_map_entry);<br>
+ if (map_entry != last)<br>
+ *map_entry = *last;<br>
+ }<br>
+}<br>
+<br>
+/* Removes from map all derefs that are contained by any deref in to_sub. */<br>
+static bool<br>
+deref_map_sub_map(struct deref_map *map, struct deref_map *to_sub)<br>
+{<br>
+ bool changed = false;<br>
+ deref_map_foreach(to_sub, sub_entry) {<br>
+ nir_deref_path *path = sub_entry->path;<br>
+ uintptr_t mask = sub_entry->mask;<br>
+<br>
+ deref_map_foreach(map, map_entry) {<br>
+ nir_deref_compare_result comp = nir_compare_deref_paths(path, map_entry->path);<br>
+ if (comp & nir_derefs_a_contains_b_bit) {<br>
+ uintptr_t new_mask = map_entry->mask & ~mask;<br>
+ if (new_mask != map_entry->mask)<br>
+ changed = true;<br>
+ map_entry->mask = new_mask;<br>
+ }<br>
+ }<br>
+ }<br>
+<br>
+ deref_map_remove_zero_masks(map);<br>
+<br>
+ return changed;<br>
+}<br>
+<br>
+/* Removes from map all derefs whose variable matches the given modes. */
+static void<br>
+deref_map_sub_modes(struct deref_map *map, nir_variable_mode modes)<br>
+{<br>
+ deref_map_foreach(map, map_entry) {<br>
+ nir_variable *var = map_entry->path->path[0]->var;<br>
+ if (var->data.mode & modes)<br>
+ map_entry->mask = 0;<br>
+ }<br>
+ deref_map_remove_zero_masks(map);<br>
+}<br>
+<br>
static void<br>
clear_unused_for_modes(struct util_dynarray *unused_writes, nir_variable_mode modes)<br>
{<br>
@@ -96,10 +365,15 @@ update_unused_writes_with_dst(struct util_dynarray *unused_writes,<br>
empty_entry = entry;<br>
continue;<br>
}<br>
+<br>
nir_deref_compare_result comp = nir_compare_deref_paths(dst_path, entry->path);<br>
if (comp & nir_derefs_a_contains_b_bit) {<br>
entry->mask &= ~mask;<br>
if (entry->mask == 0) {<br>
+ if (debug) {<br>
+ print_instr("REMOVE: ", &entry->intrin->instr,<br>
+ "\n (value overwritten)\n");<br>
+ }<br>
nir_instr_remove(&entry->intrin->instr);<br>
entry->intrin = NULL;<br>
empty_entry = entry;<br>
@@ -123,13 +397,36 @@ update_unused_writes_with_dst(struct util_dynarray *unused_writes,<br>
return progress;<br>
}<br>
<br>
+static void<br>
+tag_use_before_write(struct deref_map *gen, struct deref_map *kill,<br>
+ nir_deref_path *path)<br>
+{<br>
+ if (deref_map_contains(kill, path, ~(1 << NIR_MAX_VEC_COMPONENTS)))<br>
+ return;<br>
+ deref_map_add(gen, path, ~(1 << NIR_MAX_VEC_COMPONENTS));<br>
+}<br>
+<br>
+static void<br>
+tag_write_before_use(struct deref_map *gen, struct deref_map *kill,<br>
+ nir_deref_path *path, uintptr_t mask)<br>
+{<br>
+ deref_map_add(kill, path, mask);<br>
+}<br>
+<br>
+static const nir_variable_mode memory_barrier_modes =<br>
+ ~(nir_var_local | nir_var_global | nir_var_shader_in | nir_var_uniform);<br>
+<br>
+static const nir_variable_mode emit_vertex_modes = nir_var_shader_out;<br>
+<br>
+static const nir_variable_mode end_of_program_modes =<br>
+ ~(nir_var_local | nir_var_shader_in | nir_var_uniform);<br>
+<br>
static bool<br>
-remove_dead_write_vars_local(struct state *state, nir_block *block)<br>
+remove_dead_write_vars_local(struct state *state, nir_block *block, struct block_data *bd)<br>
{<br>
bool progress = false;<br>
<br>
- struct util_dynarray unused_writes;<br>
- util_dynarray_init(&unused_writes, state->mem_ctx);<br>
+ util_dynarray_init(&bd->unused_writes, state->mem_ctx);<br>
<br>
nir_foreach_instr_safe(instr, block) {<br>
if (instr->type != nir_instr_type_intrinsic)<br>
@@ -139,16 +436,21 @@ remove_dead_write_vars_local(struct state *state, nir_block *block)<br>
switch (intrin->intrinsic) {<br>
case nir_intrinsic_barrier:<br>
case nir_intrinsic_memory_barrier: {<br>
- nir_variable_mode modes = ~(nir_var_local | nir_var_global |<br>
- nir_var_shader_in | nir_var_uniform);<br>
- clear_unused_for_modes(&unused_writes, modes);<br>
+ clear_unused_for_modes(&bd->unused_writes, memory_barrier_modes);<br>
+ if (!bd->kill_before_memory_barrier) {<br>
+ bd->kill_before_memory_barrier = deref_map_clone(bd->kill, state->mem_ctx);<br>
+ deref_map_sub_modes(bd->kill_before_memory_barrier, memory_barrier_modes);<br>
+ }<br>
break;<br>
}<br>
<br>
case nir_intrinsic_emit_vertex:<br>
case nir_intrinsic_emit_vertex_with_counter: {<br>
- nir_variable_mode modes = nir_var_shader_out;<br>
- clear_unused_for_modes(&unused_writes, modes);<br>
+ clear_unused_for_modes(&bd->unused_writes, emit_vertex_modes);<br>
+ if (!bd->kill_before_emit_vertex) {<br>
+ bd->kill_before_emit_vertex = deref_map_clone(bd->kill, state->mem_ctx);<br>
+ deref_map_sub_modes(bd->kill_before_emit_vertex, emit_vertex_modes);<br>
+ }<br>
break;<br>
}<br>
<br>
@@ -156,7 +458,8 @@ remove_dead_write_vars_local(struct state *state, nir_block *block)<br>
nir_deref_instr *src = nir_src_as_deref(intrin->src[0]);<br>
nir_deref_path *src_path = get_path(state, src);<br>
<br>
- clear_unused_for_src(&unused_writes, src_path);<br>
+ clear_unused_for_src(&bd->unused_writes, src_path);<br>
+ tag_use_before_write(bd->gen, bd->kill, src_path);<br>
break;<br>
}<br>
<br>
@@ -165,7 +468,8 @@ remove_dead_write_vars_local(struct state *state, nir_block *block)<br>
nir_deref_path *dst_path = get_path(state, dst);<br>
<br>
uintptr_t mask = nir_intrinsic_write_mask(intrin);<br>
- progress |= update_unused_writes_with_dst(&unused_writes, intrin, dst_path, mask);<br>
+ progress |= update_unused_writes_with_dst(&bd->unused_writes, intrin, dst_path, mask);<br>
+ tag_write_before_use(bd->gen, bd->kill, dst_path, mask);<br>
break;<br>
}<br>
<br>
@@ -178,14 +482,20 @@ remove_dead_write_vars_local(struct state *state, nir_block *block)<br>
<br>
/* Self-copy is removed. */<br>
if (src == dst || (nir_compare_deref_paths(src_path, dst_path) & nir_derefs_equal_bit)) {<br>
+ if (debug) {<br>
+ print_instr("REMOVE: ", instr,<br>
+ "\n (self-copy)\n");<br>
+ }<br>
nir_instr_remove(instr);<br>
progress = true;<br>
break;<br>
}<br>
<br>
uintptr_t mask = ~(1 << NIR_MAX_VEC_COMPONENTS);<br>
- clear_unused_for_src(&unused_writes, src_path);<br>
- progress |= update_unused_writes_with_dst(&unused_writes, intrin, dst_path, mask);<br>
+ clear_unused_for_src(&bd->unused_writes, src_path);<br>
+ progress |= update_unused_writes_with_dst(&bd->unused_writes, intrin, dst_path, mask);<br>
+ tag_use_before_write(bd->gen, bd->kill, src_path);<br>
+ tag_write_before_use(bd->gen, bd->kill, dst_path, mask);<br>
break;<br>
}<br>
<br>
@@ -201,6 +511,60 @@ remove_dead_write_vars_local(struct state *state, nir_block *block)<br>
return progress;<br>
}<br>
<br>
+enum {<br>
+ DUMP_GEN = 1 << 0,<br>
+ DUMP_KILL = 1 << 1,<br>
+ DUMP_IN = 1 << 2,<br>
+ DUMP_OUT = 1 << 3,<br>
+ DUMP_UNUSED_WRITES = 1 << 4,<br>
+};<br>
+<br>
+static void<br>
+dump_block(struct state *state, unsigned block_index, unsigned flags)<br>
+{<br>
+ struct block_data *bd = &state->block_data[block_index];<br>
+ printf("--- block%d\n", block_index);<br>
+<br>
+ if (bd->kill_before_memory_barrier)<br>
+ printf(" has memory barrier\n");<br>
+ if (bd->kill_before_emit_vertex)<br>
+ printf(" has emit vertex\n");<br>
+<br>
+ if ((flags & DUMP_GEN) && !deref_map_empty(bd->gen)) {<br>
+ printf(" gen\n");<br>
+ deref_map_foreach(bd->gen, map_entry)<br>
+ print_path(" ", map_entry->path, "\n");<br>
+ }<br>
+<br>
+ if ((flags & DUMP_KILL) && !deref_map_empty(bd->kill)) {<br>
+ printf(" kill\n");<br>
+ deref_map_foreach(bd->kill, map_entry)<br>
+ print_path(" ", map_entry->path, "\n");<br>
+ }<br>
+<br>
+ if ((flags & DUMP_IN) && !deref_map_empty(bd->in)) {<br>
+ printf(" in\n");<br>
+ deref_map_foreach(bd->in, map_entry)<br>
+ print_path(" ", map_entry->path, "\n");<br>
+ }<br>
+<br>
+ if ((flags & DUMP_OUT) && !deref_map_empty(bd->out)) {<br>
+ printf(" out\n");<br>
+ deref_map_foreach(bd->out, map_entry)<br>
+ print_path(" ", map_entry->path, "\n");<br>
+ }<br>
+<br>
+ if ((flags & DUMP_UNUSED_WRITES) && bd->unused_writes.size != 0) {<br>
+ printf(" unused writes\n");<br>
+ util_dynarray_foreach(&bd->unused_writes, struct write_entry, entry) {<br>
+ if (!entry->intrin)<br>
+ continue;<br>
+ print_instr(" ", &entry->intrin->instr, "");<br>
+ print_path(" -- ", entry->path, "\n");<br>
+ }<br>
+ }<br>
+}<br>
+<br>
static bool<br>
remove_dead_write_vars_impl(void *mem_ctx, nir_function_impl *func)<br>
{<br>
@@ -211,10 +575,161 @@ remove_dead_write_vars_impl(void *mem_ctx, nir_function_impl *func)<br>
<br>
.path_lin_ctx = linear_alloc_parent(mem_ctx, 0),<br>
.paths = _mesa_hash_table_create(mem_ctx, _mesa_hash_pointer, _mesa_key_pointer_equal),<br>
+<br>
+ /* Count one extra data element for the end_block. */<br>
+ .block_data = rzalloc_array(mem_ctx, struct block_data, func->num_blocks + 1),<br>
};<br>
<br>
- nir_foreach_block(block, func)<br>
- progress |= remove_dead_write_vars_local(&state, block);<br>
+ /* Step 1. Perform local optimization and initialize GEN, KILL and<br>
+ * UNUSED_WRITES sets.<br>
+ */<br>
+ bool all_gens_empty = true;<br>
+ nir_foreach_block(block, func) {<br>
+ struct block_data *bd = &state.block_data[block->index];<br>
+ bd->gen = deref_map_create(mem_ctx);<br>
+ bd->kill = deref_map_create(mem_ctx);<br>
+ bd->out = deref_map_create(mem_ctx);<br>
+<br>
+ progress |= remove_dead_write_vars_local(&state, block, bd);<br>
+<br>
+ if (!deref_map_empty(bd->gen))<br>
+ all_gens_empty = false;<br>
+ }<br>
+<br>
+ if (func->num_blocks < 2 || all_gens_empty)<br>
+ return progress;<br>
+<br>
+ /* Identify the derefs that are considered used by other intrinsics (or the<br>
+ * end of the program), then include them as part of the GEN of the blocks<br>
+ * that contain those intrinsics.
+ */<br>
+ struct deref_map *used_by_memory_barrier = deref_map_create(mem_ctx);<br>
+ struct deref_map *used_by_emit_vertex = deref_map_create(mem_ctx);<br>
+ struct deref_map *used_by_end_of_program = deref_map_create(mem_ctx);<br>
+<br>
+ for (int i = 0; i < func->num_blocks; i++) {<br>
+ struct block_data *bd = &state.block_data[i];<br>
+<br>
+ util_dynarray_foreach(&bd->unused_writes, struct write_entry, entry) {<br>
+ if (!entry->intrin)<br>
+ continue;<br>
+<br>
+ nir_deref_path *dst_path = entry->path;<br>
+ nir_variable *var = dst_path->path[0]->var;<br>
+<br>
+ if (var->data.mode & memory_barrier_modes)<br>
+ deref_map_add(used_by_memory_barrier, dst_path, entry->mask);<br>
+<br>
+ if (var->data.mode & emit_vertex_modes)<br>
+ deref_map_add(used_by_emit_vertex, dst_path, entry->mask);<br>
+<br>
+ if (var->data.mode & end_of_program_modes)<br>
+ deref_map_add(used_by_end_of_program, dst_path, entry->mask);<br>
+ }<br>
+ }<br>
+<br>
+ /* Update GEN to include extra uses based on the presence of other<br>
+ * intrinsics.  Make sure not to include uses for derefs we already overwrite
+ * (kill) before the intrinsic in the block.<br>
+ */<br>
+ for (int i = 0; i < func->num_blocks; i++) {<br>
+ struct block_data *bd = &state.block_data[i];<br>
+ if (bd->kill_before_memory_barrier) {<br>
+ struct deref_map *temp = deref_map_clone(used_by_memory_barrier, mem_ctx);<br>
+ deref_map_sub_map(temp, bd->kill_before_memory_barrier);<br>
+ deref_map_add_map(bd->gen, temp);<br>
+ }<br>
+ if (bd->kill_before_emit_vertex) {<br>
+ struct deref_map *temp = deref_map_clone(used_by_emit_vertex, mem_ctx);<br>
+ deref_map_sub_map(temp, bd->kill_before_emit_vertex);<br>
+ deref_map_add_map(bd->gen, temp);<br>
+ }<br>
+<br>
+ /* Initialize the IN set with GEN, since we don't have OUT information yet. */
+ bd->in = deref_map_clone(bd->gen, mem_ctx);<br>
+ }<br>
+<br>
+ /* Since end_block is used as successor for the last blocks executed, we<br>
+ * can conveniently fill its IN set to propagate the usage.<br>
+ */<br>
+ struct block_data *end_bd = &state.block_data[func->end_block->index];<br>
+ end_bd->in = used_by_end_of_program;<br>
+<br>
+ if (debug) {<br>
+ printf("\n=== Before data-flow analysis\n");<br>
+ for (int i = 0; i < func->num_blocks; i++)<br>
+ dump_block(&state, i, DUMP_IN | DUMP_KILL | DUMP_UNUSED_WRITES);<br>
+ dump_block(&state, func->end_block->index, DUMP_IN);<br>
+ }<br>
+<br>
+ /* Step 2. Iterate backwards computing the OUT and IN sets.<br>
+ *<br>
+ * OUT = union IN of all successors<br>
+ * IN = GEN union (OUT minus KILL)<br>
+ *<br>
+ * The sets will always grow, so we can reuse the IN instead of repeatedly<br>
+ * recalculating it.  This helps identify whether the set changed or not.
+ */<br>
+ bool flow_progress;<br>
+ int iterations = 0;<br>
+ do {<br>
+ flow_progress = false;<br>
+<br>
+ if (debug)<br>
+ printf("\n=== Blocks with IN/OUT changed in iteration %d\n", ++iterations);<br>
+<br>
+ nir_foreach_block_reverse(block, func) {<br>
+ if (block == func->end_block)<br>
+ continue;<br>
+<br>
+ struct block_data *bd = &state.block_data[block->index];<br>
+<br>
+ bool out_changed = false;<br>
+ for (int i = 0; i < 2; i++) {<br>
+ nir_block *successor = block->successors[i];<br>
+ if (!successor)<br>
+ continue;<br>
+ struct block_data *successor_bd = &state.block_data[successor->index];<br>
+ out_changed |= deref_map_add_map(bd->out, successor_bd->in);<br>
+ }<br>
+<br>
+ if (!out_changed)<br>
+ continue;<br>
+<br>
+ struct deref_map *temp = deref_map_clone(bd->out, mem_ctx);<br>
+ deref_map_sub_map(temp, bd->kill);<br>
+<br>
+ bool in_changed = deref_map_add_map(bd->in, temp);<br>
+ flow_progress |= in_changed;<br>
+<br>
+ if (debug && in_changed)<br>
+ dump_block(&state, block->index, DUMP_IN | DUMP_OUT);<br>
+ }<br>
+ } while (flow_progress);<br>
+<br>
+ if (debug)<br>
+ printf("\n=== After data flow analysis (%d iterations)\n", iterations);<br>
+<br>
+ /* Step 3. Remove unused writes based on the data-flow analysis. */<br>
+ nir_foreach_block(block, func) {<br>
+ struct block_data *bd = &state.block_data[block->index];<br>
+<br>
+ if (debug)<br>
+ dump_block(&state, block->index, DUMP_OUT | DUMP_UNUSED_WRITES);<br>
+<br>
+ util_dynarray_foreach(&bd->unused_writes, struct write_entry, entry) {<br>
+ if (!entry->intrin)<br>
+ continue;<br>
+ if (!deref_map_may_alias(bd->out, entry->path, entry->mask)) {<br>
+ if (debug) {<br>
+ print_instr("REMOVE: ", &entry->intrin->instr,<br>
+ "\n (unused after data flow analysis)\n");<br>
+ }<br>
+ nir_instr_remove(&entry->intrin->instr);<br>
+ progress = true;<br>
+ }<br>
+ }<br>
+ }<br>
<br>
return progress;<br>
}<br>
@@ -225,6 +740,9 @@ nir_opt_dead_write_vars(nir_shader *shader)<br>
void *mem_ctx = ralloc_context(NULL);<br>
bool progress = false;<br>
<br>
+ if (debug)<br>
+ nir_print_shader(shader, stdout);<br>
+<br>
nir_foreach_function(function, shader) {<br>
if (!function->impl)<br>
continue;<br>
-- <br>
2.18.0<br>
<br>