[Mesa-dev] [PATCH 1/2] glsl: Implement a SSBO load optimization pass
Francisco Jerez
currojerez at riseup.net
Tue Oct 20 04:18:05 PDT 2015
Iago Toral <itoral at igalia.com> writes:
> On Tue, 2015-10-20 at 13:22 +0300, Francisco Jerez wrote:
>> Iago Toral Quiroga <itoral at igalia.com> writes:
>>
>> > This allows us to re-use the results of previous ssbo loads in situations
>> > that are safe (i.e. when there are no stores, atomic operations or
>> > memory barriers in between).
>> >
>> > This is particularly useful for things like matrix multiplications, where
>> > for a mat4 buffer variable we cut the number of loads from 16 (4 reads of
>> > each column) down to 4 (1 read of each column).
>> >
>> > The pass can only cache ssbo loads that involve constant blocks and
>> > offsets, but could be extended to compare sub-expressions for these
>> > as well, similar to a CSE pass.
>> >
>> > The way the cache works is simple: ssbo loads with constant block/offset
>> > are included in a cache as they are seen. Stores invalidate cache entries.
>> > Stores with non-constant offset invalidate all cached loads for the block
>> > and stores with non-constant block invalidate all cache entries. There is
>> > room to improve this by using the actual variable name we are accessing to
>> > limit the entries that should be invalidated. We also need to invalidate
>> > cache entries when we exit the block in which they have been defined
>> > (i.e. inside if/else blocks or loops).
>> >
>> > The cache optimization is built as a separate pass, instead of merging it
>> > inside the lower_ubo_reference pass for a number of reasons:
>> >
>> > 1) The way we process assignments in visitors is that the LHS is
>> > processed before the RHS. This creates a problem for an optimization
>> > such as this when we do things like a = a + 1, since we would see the
>> > store before the read when the actual execution order is reversed.
>> > This could be fixed by re-implementing the logic in the visit_enter
>> > method for ir_assignment in lower_ubo_reference and then returning
>> > visit_continue_with_parent.
>> >
>> > 2) Some writes/reads need to be split into multiple smaller
>> > writes/reads, and we need to handle caching for each one. This happens
>> > deep inside the code that handles the lowering and some
>> > of the information we need to do this is not available. This could also
>> > be fixed by passing more data into the corresponding functions or by
>> > making this data available as class members, but the current implementation
>> > is already complex enough and this would only add to that complexity.
>> >
>> > 3) We can have ssbo loads in the LHS too (e.g. a[a[0]] = ..). In these cases
>> > the current code in lower_ubo_reference would see the store before the read.
>> > Probably fixable, but again it would add more complexity to the lowering.
>> >
>> > On the other hand, a separate pass that runs after the lowering sees
>> > all the individual loads and stores in the correct order (so we don't need
>> > to do any tricks) and it allows us to separate the lowering logic (which
>> > is already complex) from the caching logic. It also gives us a chance to
>> > run it after other optimization passes have run and turned constant
>> > expressions for block/offset into constants, enabling more opportunities
>> > for caching.
>>
>> Seems like a restricted form of CSE that only handles SSBO loads, and
>> only the ones with constant arguments. Why don't we CSE these? (and
>> other memory access operations like image loads)
>
> There is no longer a CSE pass in GLSL IR, so we would have to do it in
> NIR, and some drivers would lose the optimization. Doing it at the GLSL IR
> level seemed like a win from this perspective.
>
> Then there is the fact that we cannot just CSE these. We need to make
> sure that we only CSE them when it is safe to do so (i.e. no
> stores/atomics to the same offsets in between, no memory barriers, etc).
> The current CSE pass in NIR does not support this as far as I can see. I
> suppose that we could look into changing the pass to accommodate
> restrictions such as this if we think it is worth it.
>
Not really sure whether the NIR CSE pass would be adequate, but at least at
the i965 back-end level this could be handled easily in the CSE pass
(for typed and untyped surface read opcodes in general) in roughly the
same way that source variable interference is handled: just kill
potentially overlapping entries from the AEB whenever an atomic or write
instruction for the same surface is seen.
> Iago
>
>>
>> > ---
>> > src/glsl/Makefile.sources | 1 +
>> > src/glsl/ir_optimization.h | 1 +
>> > src/glsl/opt_ssbo_load.cpp | 338 +++++++++++++++++++++++++++++++++++++++++++++
>> > 3 files changed, 340 insertions(+)
>> > create mode 100644 src/glsl/opt_ssbo_load.cpp
>> >
>> > diff --git a/src/glsl/Makefile.sources b/src/glsl/Makefile.sources
>> > index ca87036..73c7514 100644
>> > --- a/src/glsl/Makefile.sources
>> > +++ b/src/glsl/Makefile.sources
>> > @@ -201,6 +201,7 @@ LIBGLSL_FILES = \
>> > opt_noop_swizzle.cpp \
>> > opt_rebalance_tree.cpp \
>> > opt_redundant_jumps.cpp \
>> > + opt_ssbo_load.cpp \
>> > opt_structure_splitting.cpp \
>> > opt_swizzle_swizzle.cpp \
>> > opt_tree_grafting.cpp \
>> > diff --git a/src/glsl/ir_optimization.h b/src/glsl/ir_optimization.h
>> > index ce5c492..26677d7 100644
>> > --- a/src/glsl/ir_optimization.h
>> > +++ b/src/glsl/ir_optimization.h
>> > @@ -125,6 +125,7 @@ bool lower_clip_distance(gl_shader *shader);
>> > void lower_output_reads(unsigned stage, exec_list *instructions);
>> > bool lower_packing_builtins(exec_list *instructions, int op_mask);
>> > void lower_ubo_reference(struct gl_shader *shader, exec_list *instructions);
>> > +bool opt_ssbo_loads(struct gl_shader *shader, exec_list *instructions);
>> > void lower_packed_varyings(void *mem_ctx,
>> > unsigned locations_used, ir_variable_mode mode,
>> > unsigned gs_input_vertices, gl_shader *shader);
>> > diff --git a/src/glsl/opt_ssbo_load.cpp b/src/glsl/opt_ssbo_load.cpp
>> > new file mode 100644
>> > index 0000000..5404907
>> > --- /dev/null
>> > +++ b/src/glsl/opt_ssbo_load.cpp
>> > @@ -0,0 +1,338 @@
>> > +/*
>> > + * Copyright © 2015 Intel Corporation
>> > + *
>> > + * Permission is hereby granted, free of charge, to any person obtaining a
>> > + * copy of this software and associated documentation files (the "Software"),
>> > + * to deal in the Software without restriction, including without limitation
>> > + * the rights to use, copy, modify, merge, publish, distribute, sublicense,
>> > + * and/or sell copies of the Software, and to permit persons to whom the
>> > + * Software is furnished to do so, subject to the following conditions:
>> > + *
>> > + * The above copyright notice and this permission notice (including the next
>> > + * paragraph) shall be included in all copies or substantial portions of the
>> > + * Software.
>> > + *
>> > + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
>> > + * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
>> > + * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
>> > + * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
>> > + * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
>> > + * FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
>> > + * DEALINGS IN THE SOFTWARE.
>> > + */
>> > +
>> > +/**
>> > + * \file opt_ssbo_load.cpp
>> > + *
>> > + * IR optimization pass to reduce the number of SSBO loads by re-using previous
>> > + * loads when it is safe to do so (i.e. no stores have invalidated the load,
>> > + * no memory barriers in between, etc)
>> > + */
>> > +
>> > +#include "ir.h"
>> > +#include "ir_builder.h"
>> > +#include "program/hash_table.h"
>> > +
>> > +using namespace ir_builder;
>> > +
>> > +namespace {
>> > +struct ssbo_load_cache_remove_closure {
>> > + struct hash_table *ht;
>> > + const char *key_prefix;
>> > + unsigned depth;
>> > +};
>> > +
>> > +struct ssbo_load_cache_entry {
>> > + ir_dereference *load;
>> > + unsigned depth;
>> > +};
>> > +
>> > +class ssbo_load_cache_visitor : public ir_hierarchical_visitor {
>> > +public:
>> > + ssbo_load_cache_visitor(gl_shader *shader)
>> > + : current_depth(0), progress(false)
>> > + {
>> > + mem_ctx = ralloc_parent(shader->ir);
>> > + ssbo_load_cache =
>> > + hash_table_ctor(0, hash_table_string_hash, hash_table_string_compare);
>> > + }
>> > +
>> > + ~ssbo_load_cache_visitor()
>> > + {
>> > + hash_table_dtor(ssbo_load_cache);
>> > + }
>> > +
>> > + virtual ir_visitor_status visit_enter(ir_call *);
>> > +
>> > + char *build_ssbo_load_cache_key(unsigned block_index,
>> > + unsigned offset);
>> > + ir_dereference *ssbo_load_cache_find(ir_rvalue *block_index,
>> > + ir_rvalue *offset);
>> > + void ssbo_load_cache_add(ir_dereference *deref,
>> > + ir_rvalue *block_index,
>> > + ir_rvalue *offset,
>> > + unsigned depth);
>> > + void ssbo_load_cache_remove(ir_rvalue *block_index,
>> > + ir_rvalue *offset);
>> > + void ssbo_load_cache_remove_by_key_prefix(const char *prefix);
>> > + void ssbo_load_cache_remove_all();
>> > + void ssbo_load_cache_remove_depth(unsigned depth);
>> > +
>> > + ir_visitor_status visit_enter(ir_if *ir);
>> > + ir_visitor_status visit_enter(ir_loop *ir);
>> > + ir_visitor_status visit_leave(ir_loop *ir);
>> > +
>> > + bool get_progress() { return this->progress; }
>> > +
>> > +private:
>> > + void *mem_ctx;
>> > + unsigned current_depth;
>> > + struct hash_table *ssbo_load_cache;
>> > + bool progress;
>> > +};
>> > +
>> > +ir_visitor_status
>> > +ssbo_load_cache_visitor::visit_enter(ir_call *ir)
>> > +{
>> > + if (!ir->callee->is_intrinsic)
>> > + return visit_continue_with_parent;
>> > +
>> > + if (!strcmp(ir->callee_name(), "__intrinsic_load_ssbo")) {
>> > + exec_node *param = ir->actual_parameters.get_head();
>> > + ir_rvalue *block = ((ir_instruction *)param)->as_rvalue();
>> > +
>> > + param = param->get_next();
>> > + ir_rvalue *offset = ((ir_instruction *)param)->as_rvalue();
>> > +
>> > + ir_dereference *cached_load = ssbo_load_cache_find(block, offset);
>> > + if (cached_load) {
>> > + ir_variable *var = ir->return_deref->variable_referenced();
>> > + base_ir->insert_before(assign(var, cached_load->clone(mem_ctx, NULL)));
>> > + ir->remove();
>> > + this->progress = true;
>> > + } else {
>> > + ssbo_load_cache_add(ir->return_deref, block, offset,
>> > + this->current_depth);
>> > + }
>> > + } else if (!strcmp(ir->callee_name(), "__intrinsic_store_ssbo")) {
>> > + exec_node *param = ir->actual_parameters.get_head();
>> > + ir_rvalue *block = ((ir_instruction *)param)->as_rvalue();
>> > +
>> > + param = param->get_next();
>> > + ir_rvalue *offset = ((ir_instruction *)param)->as_rvalue();
>> > +
>> > + ssbo_load_cache_remove(block, offset);
>> > + } else if (strstr(ir->callee_name(), "__intrinsic_ssbo_atomic") ==
>> > + ir->callee_name()) {
>> > + exec_node *param = ir->actual_parameters.get_head();
>> > + ir_rvalue *block = ((ir_instruction *)param)->as_rvalue();
>> > +
>> > + param = param->get_next();
>> > + ir_rvalue *offset = ((ir_instruction *)param)->as_rvalue();
>> > +
>> > + ssbo_load_cache_remove(block, offset);
>> > + } else if (!strcmp(ir->callee_name(), "__intrinsic_memory_barrier")) {
>> > + ssbo_load_cache_remove_all();
>> > + }
>> > +
>> > + return visit_continue_with_parent;
>> > +}
>> > +
>> > +ir_visitor_status
>> > +ssbo_load_cache_visitor::visit_enter(ir_if *ir)
>> > +{
>> > + ir->condition->accept(this);
>> > +
>> > + this->current_depth++;
>> > +
>> > + if (!ir->then_instructions.is_empty()) {
>> > + visit_list_elements(this, &ir->then_instructions);
>> > + ssbo_load_cache_remove_depth(this->current_depth);
>> > + }
>> > +
>> > + if (!ir->else_instructions.is_empty()) {
>> > + visit_list_elements(this, &ir->else_instructions);
>> > + ssbo_load_cache_remove_depth(this->current_depth);
>> > + }
>> > +
>> > + this->current_depth--;
>> > +
>> > + return visit_continue_with_parent;
>> > +}
>> > +
>> > +ir_visitor_status
>> > +ssbo_load_cache_visitor::visit_enter(ir_loop *ir)
>> > +{
>> > + this->current_depth++;
>> > + return visit_continue;
>> > +}
>> > +
>> > +ir_visitor_status
>> > +ssbo_load_cache_visitor::visit_leave(ir_loop *ir)
>> > +{
>> > + ssbo_load_cache_remove_depth(this->current_depth);
>> > + this->current_depth--;
>> > + return visit_continue;
>> > +}
>> > +
>> > +char *
>> > +ssbo_load_cache_visitor::build_ssbo_load_cache_key(unsigned block_index,
>> > + unsigned offset)
>> > +{
>> > + return ralloc_asprintf(mem_ctx, "%u-%u", block_index, offset);
>> > +}
>> > +
>> > +ir_dereference *
>> > +ssbo_load_cache_visitor::ssbo_load_cache_find(ir_rvalue *block_index,
>> > + ir_rvalue *offset)
>> > +{
>> > + ir_constant *const_block_index = block_index->as_constant();
>> > + if (!const_block_index)
>> > + return NULL;
>> > +
>> > + ir_constant *const_offset = offset->as_constant();
>> > + if (!const_offset)
>> > + return NULL;
>> > +
>> > + char *cache_key =
>> > + build_ssbo_load_cache_key(const_block_index->value.u[0],
>> > + const_offset->value.u[0]);
>> > +
>> > + struct ssbo_load_cache_entry *entry = (struct ssbo_load_cache_entry *)
>> > + hash_table_find(ssbo_load_cache, cache_key);
>> > + return entry ? entry->load : NULL;
>> > +}
>> > +
>> > +void
>> > +ssbo_load_cache_visitor::ssbo_load_cache_add(ir_dereference *deref,
>> > + ir_rvalue *block_index,
>> > + ir_rvalue *offset,
>> > + unsigned depth)
>> > +{
>> > + /* We only support caching SSBO loads with constant block and offset */
>> > + ir_constant *const_block_index = block_index->as_constant();
>> > + if (!const_block_index)
>> > + return;
>> > +
>> > + ir_constant *const_offset = offset->as_constant();
>> > + if (!const_offset)
>> > + return;
>> > +
>> > + char *cache_key =
>> > + build_ssbo_load_cache_key(const_block_index->value.u[0],
>> > + const_offset->value.u[0]);
>> > +
>> > + struct ssbo_load_cache_entry *entry = (struct ssbo_load_cache_entry *)
>> > + ralloc_size(this->mem_ctx, sizeof(struct ssbo_load_cache_entry));
>> > + entry->load = deref;
>> > + entry->depth = depth;
>> > +
>> > + hash_table_replace(ssbo_load_cache, entry, cache_key);
>> > +}
>> > +
>> > +static void
>> > +ssbo_load_cache_remove_callback(const void *key, void *data, void *closure)
>> > +{
>> > + struct ssbo_load_cache_remove_closure *c =
>> > + (struct ssbo_load_cache_remove_closure *) closure;
>> > +
>> > + /* If we have a key_prefix, then we only want to delete entries
>> > + * with that key prefix. Otherwise, if we have depth > 0 we want
>> > + * to delete keys with that depth (or larger). Otherwise (no prefix
>> > + * and depth == 0) we want to delete all keys.
>> > + */
>> > + assert((c->key_prefix && c->depth == 0) ||
>> > + (!c->key_prefix && c->depth > 0) ||
>> > + (!c->key_prefix && c->depth == 0));
>> > +
>> > + if (c->key_prefix && strstr((char *)key, (char *)c->key_prefix) != key)
>> > + return;
>> > +
>> > + if (c->depth > 0) {
>> > + struct ssbo_load_cache_entry *entry =
>> > + (struct ssbo_load_cache_entry *) data;
>> > + if (entry->depth < c->depth)
>> > + return;
>> > + }
>> > +
>> > + hash_table_remove(c->ht, key);
>> > +}
>> > +
>> > +void
>> > +ssbo_load_cache_visitor::ssbo_load_cache_remove_by_key_prefix(const char *prefix)
>> > +{
>> > + struct ssbo_load_cache_remove_closure c;
>> > + c.ht = ssbo_load_cache;
>> > + c.key_prefix = prefix;
>> > + c.depth = 0;
>> > +
>> > + hash_table_call_foreach(ssbo_load_cache,
>> > + ssbo_load_cache_remove_callback, &c);
>> > +}
>> > +
>> > +void
>> > +ssbo_load_cache_visitor::ssbo_load_cache_remove_all()
>> > +{
>> > + struct ssbo_load_cache_remove_closure c;
>> > + c.ht = ssbo_load_cache;
>> > + c.key_prefix = NULL;
>> > + c.depth = 0;
>> > +
>> > + hash_table_call_foreach(ssbo_load_cache,
>> > + ssbo_load_cache_remove_callback, &c);
>> > +}
>> > +
>> > +void
>> > +ssbo_load_cache_visitor::ssbo_load_cache_remove_depth(unsigned depth)
>> > +{
>> > + struct ssbo_load_cache_remove_closure c;
>> > + c.ht = ssbo_load_cache;
>> > + c.key_prefix = NULL;
>> > + c.depth = depth;
>> > +
>> > + hash_table_call_foreach(ssbo_load_cache,
>> > + ssbo_load_cache_remove_callback, &c);
>> > +}
>> > +
>> > +void
>> > +ssbo_load_cache_visitor::ssbo_load_cache_remove(ir_rvalue *block_index,
>> > + ir_rvalue *offset)
>> > +{
>> > + ir_constant *const_block_index = block_index->as_constant();
>> > + if (!const_block_index) {
>> > + /* If we don't know the block index, then invalidate the entire cache.
>> > + * We could try to do better, for example, considering the actual
>> > + * field name we are accessing in the SSBO in the keys so we only
>> > + * invalidate those. This requires some work though.
>> > + */
>> > + ssbo_load_cache_remove_all();
>> > + return;
>> > + }
>> > +
>> > + ir_constant *const_offset = offset->as_constant();
>> > + if (!const_offset) {
>> > + /* We know the block but not the offset, so invalidate all entries
>> > + * for the given block
>> > + */
>> > + ssbo_load_cache_remove_by_key_prefix(
>> > + ralloc_asprintf(mem_ctx, "%u-", const_block_index->value.u[0]));
>> > + return;
>> > + }
>> > +
>> > + /* We know block and offset, so invalidate that particular load only */
>> > + char *cache_key =
>> > + build_ssbo_load_cache_key(const_block_index->value.u[0],
>> > + const_offset->value.u[0]);
>> > +
>> > + hash_table_remove(ssbo_load_cache, cache_key);
>> > +}
>> > +
>> > +} /* Unnamed namespace */
>> > +
>> > +bool
>> > +opt_ssbo_loads(struct gl_shader *shader, exec_list *instructions)
>> > +{
>> > + ssbo_load_cache_visitor v(shader);
>> > + visit_list_elements(&v, instructions);
>> > + return v.get_progress();
>> > +}
>> > --
>> > 1.9.1
>> >
>> > _______________________________________________
>> > mesa-dev mailing list
>> > mesa-dev at lists.freedesktop.org
>> > http://lists.freedesktop.org/mailman/listinfo/mesa-dev