[Mesa-dev] [PATCH 1/2] glsl: Implement a SSBO load optimization pass

Iago Toral itoral at igalia.com
Wed Oct 21 06:51:10 PDT 2015


On Wed, 2015-10-21 at 14:58 +0300, Francisco Jerez wrote:
> Iago Toral <itoral at igalia.com> writes:
> 
> > On Wed, 2015-10-21 at 13:00 +0300, Francisco Jerez wrote:
> >> Iago Toral <itoral at igalia.com> writes:
> >> 
> >> > Hi Curro,
> >> >
> >> > On Tue, 2015-10-20 at 14:18 +0300, Francisco Jerez wrote:
> >> >> Iago Toral <itoral at igalia.com> writes:
> >> >> 
> >> >> > On Tue, 2015-10-20 at 13:22 +0300, Francisco Jerez wrote:
> >> >> >> Iago Toral Quiroga <itoral at igalia.com> writes:
> >> >> >> 
> >> >> >> > This allows us to re-use the results of previous ssbo loads in situations
> >> >> >> > that are safe (i.e. when there are no stores, atomic operations or
> >> >> >> > memory barriers in between).
> >> >> >> >
> >> >> >> > This is particularly useful for things like matrix multiplications, where
> >> >> >> > for a mat4 buffer variable we cut the number of loads from 16 (4 reads of
> >> >> >> > each column) down to 4 (1 read of each column).
> >> >> >> >
> >> >> >> > The pass can only cache ssbo loads that involve constant blocks and
> >> >> >> > offsets, but could be extended to compare sub-expressions for these
> >> >> >> > as well, similar to a CSE pass.
> >> >> >> >
> >> >> >> > The way the cache works is simple: ssbo loads with constant block/offset
> >> >> >> > are included in a cache as they are seen. Stores invalidate cache entries.
> >> >> >> > Stores with non-constant offset invalidate all cached loads for the block
> >> >> >> > and stores with non-constant block invalidate all cache entries. There is
> >> >> >> > room to improve this by using the actual variable name we are accessing to
> >> >> >> > limit the entries that should be invalidated. We also need to invalidate
> >> >> >> > cache entries when we exit the block in which they have been defined
> >> >> >> > (i.e. inside if/else blocks or loops).
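
The invalidation rules described above can be sketched as a toy model (hypothetical C++ just to illustrate the policy, not the actual pass code, which is further down in the patch):

```cpp
// Toy model of the cache policy: loads are keyed by (block, offset);
// a store with known block/offset kills one entry, a store with an
// unknown offset kills the whole block, and a store with an unknown
// block (or a memory barrier) kills everything.
#include <map>
#include <string>
#include <utility>

struct LoadCache {
   // (block index, byte offset) -> name of the temporary holding the load
   std::map<std::pair<unsigned, unsigned>, std::string> entries;

   void add(unsigned block, unsigned offset, std::string tmp) {
      entries[{block, offset}] = std::move(tmp);
   }

   const std::string *find(unsigned block, unsigned offset) const {
      auto it = entries.find({block, offset});
      return it == entries.end() ? nullptr : &it->second;
   }

   // Store with constant block and offset: kill just that entry.
   void store(unsigned block, unsigned offset) {
      entries.erase({block, offset});
   }

   // Store with constant block but non-constant offset: kill the block.
   void store_unknown_offset(unsigned block) {
      for (auto it = entries.begin(); it != entries.end(); )
         it = (it->first.first == block) ? entries.erase(it) : ++it;
   }

   // Store with non-constant block, or a memory barrier: kill everything.
   void invalidate_all() { entries.clear(); }
};
```
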
> >> >> >> >
> >> >> >> > The cache optimization is built as a separate pass, instead of merging it
> >> >> >> > inside the lower_ubo_reference pass for a number of reasons:
> >> >> >> >
> >> >> >> > 1) The way we process assignments in visitors is that the LHS is
> >> >> >> > processed before the RHS. This creates a problem for an optimization
> >> >> >> > such as this when we do things like a = a + 1, since we would see the
> >> >> >> > store before the read when the actual execution order is reversed.
> >> >> >> > This could be fixed by re-implementing the logic in the visit_enter
> >> >> >> > method for ir_assignment in lower_ubo_reference and then returning
> >> >> >> > visit_continue_with_parent.
> >> >> >> >
> >> >> >> > 2) Some writes/reads need to be split into multiple smaller
> >> >> >> > writes/reads, and we need to handle caching for each one. This happens
> >> >> >> > deep inside the code that handles the lowering and some
> >> >> >> > of the information we need to do this is not available. This could also
> >> >> >> > be fixed by passing more data into the corresponding functions or by
> >> >> >> > making this data available as class members, but the current implementation
> >> >> >> > is already complex enough and this would only add to the complexity.
> >> >> >> >
> >> >> >> > 3) We can have ssbo loads in the LHS too (i.e. a[a[0]] = ..). In these cases
> >> >> >> > the current code in lower_ubo_reference would see the store before the read.
> >> >> >> > Probably fixable, but again would add more complexity to the lowering.
> >> >> >> >
> >> >> >> > On the other hand, a separate pass that runs after the lowering sees
> >> >> >> > all the individual loads and stores in the correct order (so we don't need
> >> >> >> > to do any tricks) and it allows us to separate the lowering logic (which
> >> >> >> > is already complex) from the caching logic. It also gives us a chance to
> >> >> >> > run it after other optimization passes have run and turned constant
> >> >> >> > expressions for block/offset into constants, enabling more opportunities
> >> >> >> > for caching.
> >> >> >> 
> >> >> >> Seems like a restricted form of CSE that only handles SSBO loads, and
> >> >> >> only the ones with constant arguments.  Why don't we CSE these? (and
> >> >> >> other memory access operations like image loads)
> >> >> >
> >> >> > There is no longer a CSE pass in GLSL IR, so we would have to do it in
> >> >> > NIR and some drivers would lose the optimization. Doing it at the GLSL
> >> >> > IR level seemed like a win from this perspective.
> >> >> >
> >> >> > Then there is the fact that we cannot just CSE these. We need to make
> >> >> > sure that we only CSE them when it is safe to do so (i.e. no
> >> >> > stores/atomics to the same offsets in between, no memory barriers, etc).
> >> >> > The current CSE pass in NIR does not support this as far as I can see. I
> >> >> > suppose that we could look into changing the pass to accommodate
> >> >> > restrictions such as this if we think it is worth it.
> >> >> >
> >> >> Not really sure if the NIR CSE pass would be adequate, but at least at
> >> >> the i965 back-end level this could be handled easily in the CSE pass
> >> >> (for typed and untyped surface read opcodes in general) in roughly the
> >> >> same way that source variable interference is handled -- Just kill
> >> >> potentially overlapping entries from the AEB whenever an atomic or write
> >> >> instruction for the same surface is seen.
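
The AEB invalidation idea above could look roughly like this (a hypothetical sketch of the policy only, not the actual i965 CSE pass; names are made up):

```cpp
// Sketch of killing available-expression entries on interfering writes:
// each entry for a surface read remembers which surface it read from,
// and an atomic or write conservatively kills every entry that could
// overlap it (an unknown surface index aliases everything).
#include <vector>

struct AEBEntry {
   int surface;   // surface index, or -1 if not known at compile time
   int tmp;       // id of the register holding the cached result
};

struct AEB {
   std::vector<AEBEntry> entries;

   void add_read(int surface, int tmp) { entries.push_back({surface, tmp}); }

   // Called when an atomic or write to `surface` is seen (-1 = unknown).
   void kill_aliasing(int surface) {
      std::vector<AEBEntry> kept;
      for (const AEBEntry &e : entries) {
         bool may_alias = (surface == -1) || (e.surface == -1) ||
                          (e.surface == surface);
         if (!may_alias)
            kept.push_back(e);
      }
      entries = kept;
   }

   bool has_read(int surface) const {
      for (const AEBEntry &e : entries)
         if (e.surface == surface)
            return true;
      return false;
   }
};
```
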
> >> >
> >> > I've been having a quick look at this option but I don't think this is
> >> > going to work for us. The first thing I noticed is that the CSE pass in
> >> > i965 was not currently acting on ssbo loads at all: that is, it would
> >> > add them to the AEB but immediately remove them because of this:
> >> >
> >> > /* Kill any AEB entries using registers that don't get reused any
> >> >  * more -- a sure sign they'll fail operands_match().
> >> >  */
> >> >
> >> > The problem here is that operands_match checks that all operands in the
> >> > untyped reads match, but of course it won't use the values to check for
> >> > this, it'll check that they are actually the same registers. This is not
> >> > what we want to do, we want to identify identical ssbo loads by
> >> > inspecting the block index and offset (i.e. the values inside the
> >> > registers, which we can't even check because they are GRF registers at
> >> > this point). Because of this, the chances of the CSE pass identifying
> >> > matching loads are really slim, which explains why we never saw this
> >> > producing problems even when we never disabled it for SSBO loads or
> >> > images, at least until recently (Jordan found one example where CSE
> >> > kicked in and created a problem).
> >> >
> >> That sounds fine as long as the surface index and offset are eventually
> >> assigned to the same register if they are the same value 
> >
> > Right, the problem is that this is not happening. Here is why:
> >
> > buffer SSBO {
> >     mat4 sm4;
> > };
> >
> > uniform mat4 um4;
> >
> > void main() {
> >     sm4 *= um4;
> > }
> >
> > That multiplication will spawn 16 ssbo loads like this (NIR):
> >
> > vec1 ssa_0 = load_const (0x00000000 /* 0.000000 */)
> > vec4 ssa_1 = intrinsic load_ssbo (ssa_0) () (0)
> > vec4 ssa_2 = intrinsic load_ssbo (ssa_0) () (16)
> > vec4 ssa_3 = intrinsic load_ssbo (ssa_0) () (32)
> > vec4 ssa_4 = intrinsic load_ssbo (ssa_0) () (48)
> > (...)
> > vec4 ssa_22 = intrinsic load_ssbo (ssa_0) () (0)
> > vec4 ssa_23 = intrinsic load_ssbo (ssa_0) () (16)
> > vec4 ssa_24 = intrinsic load_ssbo (ssa_0) () (32)
> > vec4 ssa_25 = intrinsic load_ssbo (ssa_0) () (48)
> > (...)
> > vec4 ssa_43 = intrinsic load_ssbo (ssa_0) () (0)
> > vec4 ssa_44 = intrinsic load_ssbo (ssa_0) () (16)
> > vec4 ssa_45 = intrinsic load_ssbo (ssa_0) () (32)
> > vec4 ssa_46 = intrinsic load_ssbo (ssa_0) () (48)
> > (...)
> > vec4 ssa_64 = intrinsic load_ssbo (ssa_0) () (0)
> > vec4 ssa_65 = intrinsic load_ssbo (ssa_0) () (16)
> > vec4 ssa_66 = intrinsic load_ssbo (ssa_0) () (32)
> > vec4 ssa_67 = intrinsic load_ssbo (ssa_0) () (48)
> > (...)
> >
> > And this is what we see in CSE:
> >
> > untyped_surface_read(8) (mlen:2) vgrf5:UD, vgrf262:UD, vgrf4:UD, 4u 
> > untyped_surface_read(8) (mlen:2) vgrf10:UD, vgrf264:UD, vgrf9:UD, 4u 
> > untyped_surface_read(8) (mlen:2) vgrf15:UD, vgrf266:UD, vgrf14:UD, 4u 
> > untyped_surface_read(8) (mlen:2) vgrf20:UD, vgrf268:UD, vgrf19:UD, 4u 
> >
> > untyped_surface_read(8) (mlen:2) vgrf42:UD, vgrf270:UD, vgrf41:UD, 4u 
> > untyped_surface_read(8) (mlen:2) vgrf47:UD, vgrf272:UD, vgrf46:UD, 4u 
> > untyped_surface_read(8) (mlen:2) vgrf52:UD, vgrf274:UD, vgrf51:UD, 4u 
> > untyped_surface_read(8) (mlen:2) vgrf57:UD, vgrf276:UD, vgrf56:UD, 4u 
> >
> > untyped_surface_read(8) (mlen:2) vgrf79:UD, vgrf278:UD, vgrf78:UD, 4u 
> > untyped_surface_read(8) (mlen:2) vgrf84:UD, vgrf280:UD, vgrf83:UD, 4u 
> > untyped_surface_read(8) (mlen:2) vgrf89:UD, vgrf282:UD, vgrf88:UD, 4u 
> > untyped_surface_read(8) (mlen:2) vgrf94:UD, vgrf284:UD, vgrf93:UD, 4u 
> >
> > untyped_surface_read(8) (mlen:2) vgrf116:UD, vgrf286:UD, vgrf115:UD, 4u 
> > untyped_surface_read(8) (mlen:2) vgrf121:UD, vgrf288:UD, vgrf120:UD, 4u 
> > untyped_surface_read(8) (mlen:2) vgrf126:UD, vgrf290:UD, vgrf125:UD, 4u 
> > untyped_surface_read(8) (mlen:2) vgrf131:UD, vgrf292:UD, vgrf130:UD, 4u 
> >
> > So all are different and won't be optimized. Going deeper, if we analyze
> > the ssbo load from offset 16 specifically, right before optimization I
> > see:
> >
> > mov(8) vgrf7:UD, 16u
> > find_live_channel(8) vgrf8:UD,
> > broadcast(8) vgrf9:UD, 1u, vgrf8:UD
> > untyped_surface_read_logical(8) vgrf10:UD, vgrf7:UD(null):UD, vgrf9:UD,
> > 1u, 4u
> > (...)
> > mov(8) vgrf44:UD, 16u
> > find_live_channel(8) vgrf45:UD,
> > broadcast(8) vgrf46:UD, 1u, vgrf45:UD
> > untyped_surface_read_logical(8) vgrf47:UD, vgrf44:UD(null):UD,
> > vgrf46:UD, 1u, 4u
> >
> > And after all optimization passes we have:
> >
> > mov(8) vgrf2:UD, 1u
> > mov(8) vgrf116:UD, 0d
> > mov(8) vgrf116+0.28:UD, 65535d
> > load_payload(8) vgrf115:UD, vgrf116:UD, 16u
> > untyped_surface_read(8) (mlen: 2) vgrf3:UD, vgrf115:UD, vgrf2:UD, 4u
> > (...)
> > mov(8) vgrf26:UD, 1u
> > mov(8) vgrf124:UD, 0d
> > mov(8) vgrf124+0.28:UD, 65535d
> > load_payload(8) vgrf123:UD, vgrf124:UD, 16u
> > untyped_surface_read(8) (mlen: 2) vgrf27:UD, vgrf123:UD, vgrf26:UD, 4u
> >
> > The only way in which we could make the untyped reads equivalent here
> > would be if the driver recognized that it could kill the MOVs to vgrf26
> > and vgrf124 and simply reuse vgrf2 and vgrf116 instead. For that to
> > happen, I imagine that CSE should kick in for regular MOVs too, however
> > this is explicitly disallowed in the CSE pass because it leads to
> > infinite loops due to clashing with constant propagation (see
> > a28ad9d4c0d4b9).
> >
> vgrf2 and vgrf26 should be constant-propagated into the surface read but
> they are not, due to the known bug in the copy propagation pass.  The
> LOAD_PAYLOAD instructions are not being CSE-ed because CSE mistakenly
> thinks that the headers (vgrf116 and vgrf124) are different, even though
> they are not (this is why partial writes suck).  For the time being you
> could sidestep the problem by applying Kristian's patch to disable
> sending a header together with the UNTYPED_SURFACE_READ message --
> However that will only help with UNTYPED_SURFACE_READ even though an
> analogous optimization would apply to the other five surface opcodes.
> 
> In the long run we'll probably want to use a scalar (i.e. exec_size = 1)
> LOAD_PAYLOAD or something similar to construct the header itself so that
> the whole payload can be CSE-ed easily.

I think I'll try to make this work inside the NIR CSE pass first: that
should help more drivers and save us the hassle of implementing this
separately in both the vec4 and fs backends (besides all these other
problems). The CSE pass in NIR does not naturally accommodate this kind
of restriction, but after a quick experiment it seems it can be done
without too much effort, so it is worth a try.
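
One way a CSE pass can be made safe for memory loads is the general idea sketched below (hypothetical C++ illustrating the scheme only, not actual NIR code): keep a "memory generation" counter that every store, atomic or barrier bumps, and only let a load reuse an earlier one from the same generation.

```cpp
// Loads are keyed by (block, offset) and stamped with the generation in
// which they were seen; any write-side effect bumps the generation, so
// stale loads can never match across it.
#include <map>
#include <utility>

struct MemCSE {
   unsigned generation = 0;
   // (block, offset) -> generation at which the load was last seen
   std::map<std::pair<unsigned, unsigned>, unsigned> seen;

   // Returns true if this load can reuse an earlier, still-valid one.
   bool visit_load(unsigned block, unsigned offset) {
      auto it = seen.find({block, offset});
      bool reusable = (it != seen.end() && it->second == generation);
      seen[{block, offset}] = generation;
      return reusable;
   }

   // Stores, atomics and barriers invalidate all previously seen loads.
   void visit_store_or_barrier() { generation++; }
};
```
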

Iago

> > Iago
> >
> >> -- If either of
> >> them is a copy they should be copy or constant-propagated, otherwise
> >> copy propagation has a bug (Reminds me of some earlier discussion with
> >> Kristian.  Sure it's not useful?). 
> >>  If the payload spans several
> >> registers (e.g. header and several coordinate components) the
> >> LOAD_PAYLOAD instruction that initializes it should be CSE'ed
> >> recursively if its sources can also be determined to be the same,
> >> otherwise CSE has a bug.
> >> 
> >> > Iago
> >> >
> >> >> > Iago
> >> >> >
> >> >> >> 
> >> >> >> > ---
> >> >> >> >  src/glsl/Makefile.sources  |   1 +
> >> >> >> >  src/glsl/ir_optimization.h |   1 +
> >> >> >> >  src/glsl/opt_ssbo_load.cpp | 338 +++++++++++++++++++++++++++++++++++++++++++++
> >> >> >> >  3 files changed, 340 insertions(+)
> >> >> >> >  create mode 100644 src/glsl/opt_ssbo_load.cpp
> >> >> >> >
> >> >> >> > diff --git a/src/glsl/Makefile.sources b/src/glsl/Makefile.sources
> >> >> >> > index ca87036..73c7514 100644
> >> >> >> > --- a/src/glsl/Makefile.sources
> >> >> >> > +++ b/src/glsl/Makefile.sources
> >> >> >> > @@ -201,6 +201,7 @@ LIBGLSL_FILES = \
> >> >> >> >  	opt_noop_swizzle.cpp \
> >> >> >> >  	opt_rebalance_tree.cpp \
> >> >> >> >  	opt_redundant_jumps.cpp \
> >> >> >> > +	opt_ssbo_load.cpp \
> >> >> >> >  	opt_structure_splitting.cpp \
> >> >> >> >  	opt_swizzle_swizzle.cpp \
> >> >> >> >  	opt_tree_grafting.cpp \
> >> >> >> > diff --git a/src/glsl/ir_optimization.h b/src/glsl/ir_optimization.h
> >> >> >> > index ce5c492..26677d7 100644
> >> >> >> > --- a/src/glsl/ir_optimization.h
> >> >> >> > +++ b/src/glsl/ir_optimization.h
> >> >> >> > @@ -125,6 +125,7 @@ bool lower_clip_distance(gl_shader *shader);
> >> >> >> >  void lower_output_reads(unsigned stage, exec_list *instructions);
> >> >> >> >  bool lower_packing_builtins(exec_list *instructions, int op_mask);
> >> >> >> >  void lower_ubo_reference(struct gl_shader *shader, exec_list *instructions);
> >> >> >> > +bool opt_ssbo_loads(struct gl_shader *shader, exec_list *instructions);
> >> >> >> >  void lower_packed_varyings(void *mem_ctx,
> >> >> >> >                             unsigned locations_used, ir_variable_mode mode,
> >> >> >> >                             unsigned gs_input_vertices, gl_shader *shader);
> >> >> >> > diff --git a/src/glsl/opt_ssbo_load.cpp b/src/glsl/opt_ssbo_load.cpp
> >> >> >> > new file mode 100644
> >> >> >> > index 0000000..5404907
> >> >> >> > --- /dev/null
> >> >> >> > +++ b/src/glsl/opt_ssbo_load.cpp
> >> >> >> > @@ -0,0 +1,338 @@
> >> >> >> > +/*
> >> >> >> > + * Copyright © 2015 Intel Corporation
> >> >> >> > + *
> >> >> >> > + * Permission is hereby granted, free of charge, to any person obtaining a
> >> >> >> > + * copy of this software and associated documentation files (the "Software"),
> >> >> >> > + * to deal in the Software without restriction, including without limitation
> >> >> >> > + * the rights to use, copy, modify, merge, publish, distribute, sublicense,
> >> >> >> > + * and/or sell copies of the Software, and to permit persons to whom the
> >> >> >> > + * Software is furnished to do so, subject to the following conditions:
> >> >> >> > + *
> >> >> >> > + * The above copyright notice and this permission notice (including the next
> >> >> >> > + * paragraph) shall be included in all copies or substantial portions of the
> >> >> >> > + * Software.
> >> >> >> > + *
> >> >> >> > + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
> >> >> >> > + * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
> >> >> >> > + * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.  IN NO EVENT SHALL
> >> >> >> > + * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
> >> >> >> > + * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
> >> >> >> > + * FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
> >> >> >> > + * DEALINGS IN THE SOFTWARE.
> >> >> >> > + */
> >> >> >> > +
> >> >> >> > +/**
> >> >> >> > + * \file opt_ssbo_load.cpp
> >> >> >> > + *
> >> >> >> > + * IR optimization pass to reduce the number of SSBO loads by re-using previous
> >> >> >> > + * loads when it is safe to do so (i.e. no stores have invalidated the load,
> >> >> >> > + * no memory barriers in between, etc)
> >> >> >> > + */
> >> >> >> > +
> >> >> >> > +#include "ir.h"
> >> >> >> > +#include "ir_builder.h"
> >> >> >> > +#include "program/hash_table.h"
> >> >> >> > +
> >> >> >> > +using namespace ir_builder;
> >> >> >> > +
> >> >> >> > +namespace {
> >> >> >> > +struct ssbo_load_cache_remove_closure {
> >> >> >> > +   struct hash_table *ht;
> >> >> >> > +   const char *key_prefix;
> >> >> >> > +   unsigned depth;
> >> >> >> > +};
> >> >> >> > +
> >> >> >> > +struct ssbo_load_cache_entry {
> >> >> >> > +   ir_dereference *load;
> >> >> >> > +   unsigned depth;
> >> >> >> > +};
> >> >> >> > +
> >> >> >> > +class ssbo_load_cache_visitor : public ir_hierarchical_visitor {
> >> >> >> > +public:
> >> >> >> > +   ssbo_load_cache_visitor(gl_shader *shader)
> >> >> >> > +   : current_depth(0), progress(false)
> >> >> >> > +   {
> >> >> >> > +      mem_ctx = ralloc_parent(shader->ir);
> >> >> >> > +      ssbo_load_cache =
> >> >> >> > +         hash_table_ctor(0, hash_table_string_hash, hash_table_string_compare);
> >> >> >> > +   }
> >> >> >> > +
> >> >> >> > +   ~ssbo_load_cache_visitor()
> >> >> >> > +   {
> >> >> >> > +      hash_table_dtor(ssbo_load_cache);
> >> >> >> > +   }
> >> >> >> > +
> >> >> >> > +   virtual ir_visitor_status visit_enter(ir_call *);
> >> >> >> > +
> >> >> >> > +   char *build_ssbo_load_cache_key(unsigned block_index,
> >> >> >> > +                                   unsigned offset);
> >> >> >> > +   ir_dereference *ssbo_load_cache_find(ir_rvalue *block_index,
> >> >> >> > +                                        ir_rvalue *offset);
> >> >> >> > +   void ssbo_load_cache_add(ir_dereference *deref,
> >> >> >> > +                            ir_rvalue *block_index,
> >> >> >> > +                            ir_rvalue *offset,
> >> >> >> > +                            unsigned depth);
> >> >> >> > +   void ssbo_load_cache_remove(ir_rvalue *block_index,
> >> >> >> > +                               ir_rvalue *offset);
> >> >> >> > +   void ssbo_load_cache_remove_by_key_prefix(const char *prefix);
> >> >> >> > +   void ssbo_load_cache_remove_all();
> >> >> >> > +   void ssbo_load_cache_remove_depth(unsigned depth);
> >> >> >> > +
> >> >> >> > +   ir_visitor_status visit_enter(ir_if *ir);
> >> >> >> > +   ir_visitor_status visit_enter(ir_loop *ir);
> >> >> >> > +   ir_visitor_status visit_leave(ir_loop *ir);
> >> >> >> > +
> >> >> >> > +   bool get_progress() { return this->progress; }
> >> >> >> > +
> >> >> >> > +private:
> >> >> >> > +   void *mem_ctx;
> >> >> >> > +   unsigned current_depth;
> >> >> >> > +   struct hash_table *ssbo_load_cache;
> >> >> >> > +   bool progress;
> >> >> >> > +};
> >> >> >> > +
> >> >> >> > +ir_visitor_status
> >> >> >> > +ssbo_load_cache_visitor::visit_enter(ir_call *ir)
> >> >> >> > +{
> >> >> >> > +   if (!ir->callee->is_intrinsic)
> >> >> >> > +      return visit_continue_with_parent;
> >> >> >> > +
> >> >> >> > +   if (!strcmp(ir->callee_name(), "__intrinsic_load_ssbo")) {
> >> >> >> > +      exec_node *param = ir->actual_parameters.get_head();
> >> >> >> > +      ir_rvalue *block = ((ir_instruction *)param)->as_rvalue();
> >> >> >> > +
> >> >> >> > +      param = param->get_next();
> >> >> >> > +      ir_rvalue *offset = ((ir_instruction *)param)->as_rvalue();
> >> >> >> > +
> >> >> >> > +      ir_dereference *cached_load = ssbo_load_cache_find(block, offset);
> >> >> >> > +      if (cached_load) {
> >> >> >> > +         ir_variable *var = ir->return_deref->variable_referenced();
> >> >> >> > +         base_ir->insert_before(assign(var, cached_load->clone(mem_ctx, NULL)));
> >> >> >> > +         ir->remove();
> >> >> >> > +         this->progress = true;
> >> >> >> > +      } else {
> >> >> >> > +         ssbo_load_cache_add(ir->return_deref, block, offset,
> >> >> >> > +                             this->current_depth);
> >> >> >> > +      }
> >> >> >> > +   } else if (!strcmp(ir->callee_name(), "__intrinsic_store_ssbo")) {
> >> >> >> > +      exec_node *param = ir->actual_parameters.get_head();
> >> >> >> > +      ir_rvalue *block = ((ir_instruction *)param)->as_rvalue();
> >> >> >> > +
> >> >> >> > +      param = param->get_next();
> >> >> >> > +      ir_rvalue *offset = ((ir_instruction *)param)->as_rvalue();
> >> >> >> > +
> >> >> >> > +      ssbo_load_cache_remove(block, offset);
> >> >> >> > +   } else if (strstr(ir->callee_name(), "__intrinsic_ssbo_atomic") ==
> >> >> >> > +       ir->callee_name()) {
> >> >> >> > +      exec_node *param = ir->actual_parameters.get_head();
> >> >> >> > +      ir_rvalue *block = ((ir_instruction *)param)->as_rvalue();
> >> >> >> > +
> >> >> >> > +      param = param->get_next();
> >> >> >> > +      ir_rvalue *offset = ((ir_instruction *)param)->as_rvalue();
> >> >> >> > +
> >> >> >> > +      ssbo_load_cache_remove(block, offset);
> >> >> >> > +   } else if (!strcmp(ir->callee_name(), "__intrinsic_memory_barrier")) {
> >> >> >> > +      ssbo_load_cache_remove_all();
> >> >> >> > +   }
> >> >> >> > +
> >> >> >> > +   return visit_continue_with_parent;
> >> >> >> > +}
> >> >> >> > +
> >> >> >> > +ir_visitor_status
> >> >> >> > +ssbo_load_cache_visitor::visit_enter(ir_if *ir)
> >> >> >> > +{
> >> >> >> > +   ir->condition->accept(this);
> >> >> >> > +
> >> >> >> > +   this->current_depth++;
> >> >> >> > +
> >> >> >> > +   if (!ir->then_instructions.is_empty()) {
> >> >> >> > +      visit_list_elements(this, &ir->then_instructions);
> >> >> >> > +      ssbo_load_cache_remove_depth(this->current_depth);
> >> >> >> > +   }
> >> >> >> > +
> >> >> >> > +   if (!ir->else_instructions.is_empty()) {
> >> >> >> > +      visit_list_elements(this, &ir->else_instructions);
> >> >> >> > +      ssbo_load_cache_remove_depth(this->current_depth);
> >> >> >> > +   }
> >> >> >> > +
> >> >> >> > +   this->current_depth--;
> >> >> >> > +
> >> >> >> > +   return visit_continue_with_parent;
> >> >> >> > +}
> >> >> >> > +
> >> >> >> > +ir_visitor_status
> >> >> >> > +ssbo_load_cache_visitor::visit_enter(ir_loop *ir)
> >> >> >> > +{
> >> >> >> > +   this->current_depth++;
> >> >> >> > +   return visit_continue;
> >> >> >> > +}
> >> >> >> > +
> >> >> >> > +ir_visitor_status
> >> >> >> > +ssbo_load_cache_visitor::visit_leave(ir_loop *ir)
> >> >> >> > +{
> >> >> >> > +   ssbo_load_cache_remove_depth(this->current_depth);
> >> >> >> > +   this->current_depth--;
> >> >> >> > +   return visit_continue;
> >> >> >> > +}
> >> >> >> > +
> >> >> >> > +char *
> >> >> >> > +ssbo_load_cache_visitor::build_ssbo_load_cache_key(unsigned block_index,
> >> >> >> > +                                                       unsigned offset)
> >> >> >> > +{
> >> >> >> > +   return ralloc_asprintf(mem_ctx, "%u-%u", block_index, offset);
> >> >> >> > +}
> >> >> >> > +
> >> >> >> > +ir_dereference *
> >> >> >> > +ssbo_load_cache_visitor::ssbo_load_cache_find(ir_rvalue *block_index,
> >> >> >> > +                                              ir_rvalue *offset)
> >> >> >> > +{
> >> >> >> > +   ir_constant *const_block_index = block_index->as_constant();
> >> >> >> > +   if (!const_block_index)
> >> >> >> > +      return NULL;
> >> >> >> > +
> >> >> >> > +   ir_constant *const_offset = offset->as_constant();
> >> >> >> > +   if (!const_offset)
> >> >> >> > +      return NULL;
> >> >> >> > +
> >> >> >> > +   char *cache_key =
> >> >> >> > +      build_ssbo_load_cache_key(const_block_index->value.u[0],
> >> >> >> > +                                const_offset->value.u[0]);
> >> >> >> > +
> >> >> >> > +   struct ssbo_load_cache_entry *entry = (struct ssbo_load_cache_entry *)
> >> >> >> > +      hash_table_find(ssbo_load_cache, cache_key);
> >> >> >> > +   return entry ? entry->load : NULL;
> >> >> >> > +}
> >> >> >> > +
> >> >> >> > +void
> >> >> >> > +ssbo_load_cache_visitor::ssbo_load_cache_add(ir_dereference *deref,
> >> >> >> > +                                             ir_rvalue *block_index,
> >> >> >> > +                                             ir_rvalue *offset,
> >> >> >> > +                                             unsigned depth)
> >> >> >> > +{
> >> >> >> > +   /* We only support caching SSBO loads with constant block and offset */
> >> >> >> > +   ir_constant *const_block_index = block_index->as_constant();
> >> >> >> > +   if (!const_block_index)
> >> >> >> > +      return;
> >> >> >> > +
> >> >> >> > +   ir_constant *const_offset = offset->as_constant();
> >> >> >> > +   if (!const_offset)
> >> >> >> > +      return;
> >> >> >> > +
> >> >> >> > +   char *cache_key =
> >> >> >> > +      build_ssbo_load_cache_key(const_block_index->value.u[0],
> >> >> >> > +                                const_offset->value.u[0]);
> >> >> >> > +
> >> >> >> > +   struct ssbo_load_cache_entry *entry = (struct ssbo_load_cache_entry *)
> >> >> >> > +      ralloc_size(this->mem_ctx, sizeof(struct ssbo_load_cache_entry));
> >> >> >> > +   entry->load = deref;
> >> >> >> > +   entry->depth = depth;
> >> >> >> > +
> >> >> >> > +   hash_table_replace(ssbo_load_cache, entry, cache_key);
> >> >> >> > +}
> >> >> >> > +
> >> >> >> > +static void
> >> >> >> > +ssbo_load_cache_remove_callback(const void *key, void *data, void *closure)
> >> >> >> > +{
> >> >> >> > +   struct ssbo_load_cache_remove_closure *c =
> >> >> >> > +      (struct ssbo_load_cache_remove_closure *) closure;
> >> >> >> > +
> >> >> >> > +   /* If we have a key_prefix, then we only want to delete entries
> >> >> >> > +    * with that key prefix. Otherwise, if we have depth > 0 we want
> >> >> >> > +    * to delete keys with that depth (or larger). Otherwise (no prefix
> >> >> >> > +    * and depth == 0) we want to delete all keys.
> >> >> >> > +    */
> >> >> >> > +   assert((c->key_prefix && c->depth == 0) ||
> >> >> >> > +          (!c->key_prefix && c->depth > 0) ||
> >> >> >> > +          (!c->key_prefix && c->depth == 0));
> >> >> >> > +
> >> >> >> > +   if (c->key_prefix && strstr((char *)key, (char *)c->key_prefix) != key)
> >> >> >> > +      return;
> >> >> >> > +
> >> >> >> > +   if (c->depth > 0) {
> >> >> >> > +      struct ssbo_load_cache_entry *entry =
> >> >> >> > +         (struct ssbo_load_cache_entry *) data;
> >> >> >> > +      if (entry->depth < c->depth)
> >> >> >> > +         return;
> >> >> >> > +   }
> >> >> >> > +
> >> >> >> > +   hash_table_remove(c->ht, key);
> >> >> >> > +}
> >> >> >> > +
> >> >> >> > +void
> >> >> >> > +ssbo_load_cache_visitor::ssbo_load_cache_remove_by_key_prefix(const char *prefix)
> >> >> >> > +{
> >> >> >> > +   struct ssbo_load_cache_remove_closure c;
> >> >> >> > +   c.ht = ssbo_load_cache;
> >> >> >> > +   c.key_prefix = prefix;
> >> >> >> > +   c.depth = 0;
> >> >> >> > +
> >> >> >> > +   hash_table_call_foreach(ssbo_load_cache,
> >> >> >> > +                           ssbo_load_cache_remove_callback, &c);
> >> >> >> > +}
> >> >> >> > +
> >> >> >> > +void
> >> >> >> > +ssbo_load_cache_visitor::ssbo_load_cache_remove_all()
> >> >> >> > +{
> >> >> >> > +   struct ssbo_load_cache_remove_closure c;
> >> >> >> > +   c.ht = ssbo_load_cache;
> >> >> >> > +   c.key_prefix = NULL;
> >> >> >> > +   c.depth = 0;
> >> >> >> > +
> >> >> >> > +   hash_table_call_foreach(ssbo_load_cache,
> >> >> >> > +                           ssbo_load_cache_remove_callback, &c);
> >> >> >> > +}
> >> >> >> > +
> >> >> >> > +void
> >> >> >> > +ssbo_load_cache_visitor::ssbo_load_cache_remove_depth(unsigned depth)
> >> >> >> > +{
> >> >> >> > +   struct ssbo_load_cache_remove_closure c;
> >> >> >> > +   c.ht = ssbo_load_cache;
> >> >> >> > +   c.key_prefix = NULL;
> >> >> >> > +   c.depth = depth;
> >> >> >> > +
> >> >> >> > +   hash_table_call_foreach(ssbo_load_cache,
> >> >> >> > +                           ssbo_load_cache_remove_callback, &c);
> >> >> >> > +}
> >> >> >> > +
> >> >> >> > +void
> >> >> >> > +ssbo_load_cache_visitor::ssbo_load_cache_remove(ir_rvalue *block_index,
> >> >> >> > +                                                ir_rvalue *offset)
> >> >> >> > +{
> >> >> >> > +   ir_constant *const_block_index = block_index->as_constant();
> >> >> >> > +   if (!const_block_index) {
> >> >> >> > +      /* If we don't know the block index, then invalidate the entire cache.
> >> >> >> > +       * We could try to do better, for example, considering the actual
> >> >> >> > +       * field name we are accessing in the SSBO in the keys so we only
> >> >> >> > +       * invalidate those. This requires some work though.
> >> >> >> > +       */
> >> >> >> > +      ssbo_load_cache_remove_all();
> >> >> >> > +      return;
> >> >> >> > +   }
> >> >> >> > +
> >> >> >> > +   ir_constant *const_offset = offset->as_constant();
> >> >> >> > +   if (!const_offset) {
> >> >> >> > +      /* We know the block but not the offset, so invalidate all entries
> >> >> >> > +       * for the given block
> >> >> >> > +       */
> >> >> >> > +      ssbo_load_cache_remove_by_key_prefix(
> >> >> >> > +         ralloc_asprintf(mem_ctx, "%u-", const_block_index->value.u[0]));
> >> >> >> > +      return;
> >> >> >> > +   }
> >> >> >> > +
> >> >> >> > +   /* We know block and offset, so invalidate that particular load only */
> >> >> >> > +   char *cache_key =
> >> >> >> > +      build_ssbo_load_cache_key(const_block_index->value.u[0],
> >> >> >> > +                                const_offset->value.u[0]);
> >> >> >> > +
> >> >> >> > +   hash_table_remove(ssbo_load_cache, cache_key);
> >> >> >> > +}
> >> >> >> > +
> >> >> >> > +} /* Unnamed namespace */
> >> >> >> > +
> >> >> >> > +bool
> >> >> >> > +opt_ssbo_loads(struct gl_shader *shader, exec_list *instructions)
> >> >> >> > +{
> >> >> >> > +   ssbo_load_cache_visitor v(shader);
> >> >> >> > +   visit_list_elements(&v, instructions);
> >> >> >> > +   return v.get_progress();
> >> >> >> > +}
> >> >> >> > -- 
> >> >> >> > 1.9.1
> >> >> >> >
> >> >> >> > _______________________________________________
> >> >> >> > mesa-dev mailing list
> >> >> >> > mesa-dev at lists.freedesktop.org
> >> >> >> > http://lists.freedesktop.org/mailman/listinfo/mesa-dev



