[Mesa-dev] [PATCH 1/2] nir: add pass to move load_const
Ian Romanick
idr at freedesktop.org
Thu Jun 7 00:34:15 UTC 2018
On 06/06/2018 04:47 PM, Rob Clark wrote:
> On Wed, Jun 6, 2018 at 6:51 PM, Ian Romanick <idr at freedesktop.org> wrote:
>> On 06/06/2018 07:43 AM, Rob Clark wrote:
>>> Run this pass late (after opt loop) to move load_const instructions back
>>> into the basic blocks which use the result, in cases where a load_const
>>> is only consumed in a single block.
>>
>> If the load_const is used in more than one block, you could move it to
>> the block that dominates all of the blocks that contain a use. I suspect
>> in many cases that will just be the first block, but it's probably worth
>> trying.
>
> hmm, good point.. I was mostly going for the super-obvious wins with
> no downside cases (ie. low hanging fruit), but I guess the block that
> dominates all the uses isn't going to be worse than the first block..
>
> I can play around w/ the common denominator case.. although in at
> least some cases I think it might be better to split the load_const.
> Idk, the low hanging fruit approach cuts register usage by 1/3rd in
> some of these edge case shaders, which by itself seems like a big
> enough win, but I guess nir_shader_compiler_options config params is
> an option for more aggressive approaches that may or may not be a win
> for different drivers.
True. We could land this and incrementally improve it later. If we
want to do that, I'd suggest:
- Put all of our ideas for improving the pass, including the one
already documented in the code, in the file header comment.
- Remove the suggestions for future work from the "if consumed in more
than one block, ignore." comment.
>>> This helps reduce register usage in cases where the backend driver
>>> cannot lower the load_const to a uniform.
>>
>> The trade off is that now you might not be able to hide the latency of
>> the load. I tried this on i965, and the results (below) support this
>> idea a bit. If you have a structured backend IR, this seems like a
>> thing better done there... like, if you can't register allocate without
>> spilling, try moving the load. But that's really just a scheduling
>> heuristic. Hmm...
>
> iirc, i965 had some clever load_const lowering passes so all the threads
> in a warp could share the same load_const (iirc? I might not have
> that description quite right).. so I wasn't expecting this to be
> useful to i965 (but if it is, then, bonus!). For me I can *almost*
> always lower load_const to uniform, so in the common cases it hasn't
> been a problem. The exceptions are cases where an alu instruction
> consumes >1 const (but those get optimized away) or a const plus
> uniform, or things like ssbo/image store. (I noticed the register
> usage blowing up while looking at some deqp tests like
> dEQP-GLES31.functional.ssbo.layout.2_level_array.shared.vec4 ;-))
There are some optimizations for loading constants in the vec4 backend.
There is some code to lower arrays of constants to uniforms. Uniforms
are already shared within the SIMD group. I don't know of anything
other than that, but that doesn't mean such a thing doesn't exist.
Also... at what point do you call this pass? Here's what I did for
i965... basically called it as late as I thought was possible:
diff --git a/src/intel/compiler/brw_nir.c b/src/intel/compiler/brw_nir.c
index dfeea73b06a..2209d6142b3 100644
--- a/src/intel/compiler/brw_nir.c
+++ b/src/intel/compiler/brw_nir.c
@@ -795,6 +795,8 @@ brw_postprocess_nir(nir_shader *nir, const struct brw_compiler *compiler,
    if (devinfo->gen <= 5)
       brw_nir_analyze_boolean_resolves(nir);
 
+   nir_move_load_const(nir);
+
    nir_sweep(nir);
 
    if (unlikely(debug_enabled)) {
Now that the full shader-db run is done, it seems to have some odd
effects on vec4 shaders, so I'll probably tweak this a bit.
> If latency of the load is an issue, perhaps a configuration option
> about the minimum size of destination use_block, with the rough theory
> that if the block is bigger than some threshold the instruction
> scheduling could probably hide the latency?
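As a rough illustration of the block-size threshold idea above (all
names and types here are hypothetical mocks, not real NIR API; in the
pass this would walk the instruction list of use_block and the
threshold would presumably live in nir_shader_compiler_options):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Mock stand-ins for nir_instr / nir_block. */
struct instr {
   struct instr *next;
};

struct block {
   struct instr *first_instr;
};

/* Count the instructions in a block. */
static unsigned
block_num_instrs(const struct block *b)
{
   unsigned n = 0;
   for (const struct instr *i = b->first_instr; i != NULL; i = i->next)
      n++;
   return n;
}

/* Only sink the load_const when the destination block has enough other
 * work for the scheduler to hide the load's latency; below the
 * (driver-chosen) threshold, leave it where the opt loop hoisted it. */
static bool
should_sink_load_const(const struct block *use_block,
                       unsigned min_block_size)
{
   return block_num_instrs(use_block) >= min_block_size;
}
```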
I thought about this a bit more... I think the driver's scheduler should
fix this case. If it can't hide the latency of a load immediate, the
scheduler should move it to the earlier block. That might even help
hide the latency of a compare. Something like
cmp.g.f0(16) null g2<8,8,1>F -g9.4<0,1,0>F
(+f0) if(16) JIP: 40 UIP: 40
END B0 ->B1 ->B2
START B1 <-B0 (40 cycles)
mov(16) g77<1>F 1F
should become
cmp.g.f0(16) null g2<8,8,1>F -g9.4<0,1,0>F
mov(16) g77<1>F 1F
(+f0) if(16) JIP: 40 UIP: 40
END B0 ->B1 ->B2
START B1 <-B0 (40 cycles)
> BR,
> -R
>
>
>> total instructions in shared programs: 14398410 -> 14397918 (<.01%)
>> instructions in affected programs: 134856 -> 134364 (-0.36%)
>> helped: 53
>> HURT: 10
>> helped stats (abs) min: 1 max: 30 x̄: 9.47 x̃: 7
>> helped stats (rel) min: 0.16% max: 2.74% x̄: 0.64% x̃: 0.40%
>> HURT stats (abs) min: 1 max: 1 x̄: 1.00 x̃: 1
>> HURT stats (rel) min: 0.03% max: 0.21% x̄: 0.05% x̃: 0.03%
>> 95% mean confidence interval for instructions value: -9.61 -6.01
>> 95% mean confidence interval for instructions %-change: -0.68% -0.38%
>> Instructions are helped.
>>
>> total cycles in shared programs: 532949823 -> 532851352 (-0.02%)
>> cycles in affected programs: 260693023 -> 260594552 (-0.04%)
>> helped: 122
>> HURT: 88
>> helped stats (abs) min: 1 max: 24933 x̄: 1080.30 x̃: 20
>> helped stats (rel) min: <.01% max: 34.94% x̄: 2.20% x̃: 0.32%
>> HURT stats (abs) min: 1 max: 1884 x̄: 378.69 x̃: 29
>> HURT stats (rel) min: <.01% max: 10.23% x̄: 0.64% x̃: 0.09%
>> 95% mean confidence interval for cycles value: -943.71 5.89
>> 95% mean confidence interval for cycles %-change: -1.71% -0.32%
>> Inconclusive result (value mean confidence interval includes 0).
>>
>> total spills in shared programs: 8044 -> 7975 (-0.86%)
>> spills in affected programs: 2391 -> 2322 (-2.89%)
>> helped: 42
>> HURT: 0
>>
>> total fills in shared programs: 10950 -> 10884 (-0.60%)
>> fills in affected programs: 2999 -> 2933 (-2.20%)
>> helped: 42
>> HURT: 0
>>
>> LOST: 4
>> GAINED: 7
>>
>>> Signed-off-by: Rob Clark <robdclark at gmail.com>
>>> ---
>>>  src/compiler/Makefile.sources          |   1 +
>>>  src/compiler/nir/meson.build           |   1 +
>>>  src/compiler/nir/nir.h                 |   1 +
>>>  src/compiler/nir/nir_move_load_const.c | 131 +++++++++++++++++++++++++
>>>  4 files changed, 134 insertions(+)
>>>  create mode 100644 src/compiler/nir/nir_move_load_const.c
>>>
>>> diff --git a/src/compiler/Makefile.sources b/src/compiler/Makefile.sources
>>> index 3daa2c51334..cb8a2875a84 100644
>>> --- a/src/compiler/Makefile.sources
>>> +++ b/src/compiler/Makefile.sources
>>> @@ -253,6 +253,7 @@ NIR_FILES = \
>>> 	nir/nir_lower_wpos_center.c \
>>> 	nir/nir_lower_wpos_ytransform.c \
>>> 	nir/nir_metadata.c \
>>> +	nir/nir_move_load_const.c \
>>> 	nir/nir_move_vec_src_uses_to_dest.c \
>>> 	nir/nir_normalize_cubemap_coords.c \
>>> 	nir/nir_opt_conditional_discard.c \
>>> diff --git a/src/compiler/nir/meson.build b/src/compiler/nir/meson.build
>>> index 3fec363691d..373a47fd155 100644
>>> --- a/src/compiler/nir/meson.build
>>> +++ b/src/compiler/nir/meson.build
>>> @@ -144,6 +144,7 @@ files_libnir = files(
>>>    'nir_lower_wpos_ytransform.c',
>>>    'nir_lower_bit_size.c',
>>>    'nir_metadata.c',
>>> +  'nir_move_load_const.c',
>>>    'nir_move_vec_src_uses_to_dest.c',
>>>    'nir_normalize_cubemap_coords.c',
>>>    'nir_opt_conditional_discard.c',
>>> diff --git a/src/compiler/nir/nir.h b/src/compiler/nir/nir.h
>>> index 5a1f79515ad..073ab4e82ea 100644
>>> --- a/src/compiler/nir/nir.h
>>> +++ b/src/compiler/nir/nir.h
>>> @@ -2612,6 +2612,7 @@ bool nir_remove_dead_variables(nir_shader *shader, nir_variable_mode modes);
>>>  bool nir_lower_constant_initializers(nir_shader *shader,
>>>                                       nir_variable_mode modes);
>>>
>>> +bool nir_move_load_const(nir_shader *shader);
>>>  bool nir_move_vec_src_uses_to_dest(nir_shader *shader);
>>>  bool nir_lower_vec_to_movs(nir_shader *shader);
>>>  void nir_lower_alpha_test(nir_shader *shader, enum compare_func func,
>>> diff --git a/src/compiler/nir/nir_move_load_const.c b/src/compiler/nir/nir_move_load_const.c
>>> new file mode 100644
>>> index 00000000000..c0750035fbb
>>> --- /dev/null
>>> +++ b/src/compiler/nir/nir_move_load_const.c
>>> @@ -0,0 +1,131 @@
>>> +/*
>>> + * Copyright © 2018 Red Hat
>>> + *
>>> + * Permission is hereby granted, free of charge, to any person obtaining a
>>> + * copy of this software and associated documentation files (the "Software"),
>>> + * to deal in the Software without restriction, including without limitation
>>> + * the rights to use, copy, modify, merge, publish, distribute, sublicense,
>>> + * and/or sell copies of the Software, and to permit persons to whom the
>>> + * Software is furnished to do so, subject to the following conditions:
>>> + *
>>> + * The above copyright notice and this permission notice (including the next
>>> + * paragraph) shall be included in all copies or substantial portions of the
>>> + * Software.
>>> + *
>>> + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
>>> + * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
>>> + * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
>>> + * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
>>> + * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
>>> + * FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS
>>> + * IN THE SOFTWARE.
>>> + *
>>> + * Authors:
>>> + *    Rob Clark <robdclark at gmail.com>
>>> + *
>>> + */
>>> +
>>> +#include "nir.h"
>>> +
>>> +
>>> +/*
>>> + * A simple pass that moves load_consts into the consuming block if
>>> + * they are only consumed in a single block, to try to counteract
>>> + * nir's tendency to move all load_consts to the top of the
>>> + * first block.
>>> + */
>>> +
>>> +/* Iterate an ssa def's uses and, if all uses are within a single
>>> + * block, return it; otherwise return NULL.
>>> + */
>>> +static nir_block *
>>> +get_single_block(nir_ssa_def *def)
>>> +{
>>> +   nir_block *b = NULL;
>>> +
>>> +   /* hmm, probably ignore if-uses: */
>>> +   if (!list_empty(&def->if_uses))
>>> +      return NULL;
>>> +
>>> +   nir_foreach_use(use, def) {
>>> +      nir_instr *instr = use->parent_instr;
>>> +
>>> +      /*
>>> +       * Kind of an ugly special-case, but phi instructions
>>> +       * need to appear first in the block, so by definition
>>> +       * we can't move a load_const into a block where it is
>>> +       * consumed by a phi instruction.  We could conceivably
>>> +       * move it into a dominator block.
>>> +       */
>>> +      if (instr->type == nir_instr_type_phi)
>>> +         return NULL;
>>> +
>>> +      nir_block *use_block = instr->block;
>>> +
>>> +      if (!b) {
>>> +         b = use_block;
>>> +      } else if (b != use_block) {
>>> +         return NULL;
>>> +      }
>>> +   }
>>> +
>>> +   return b;
>>> +}
>>> +
>>> +bool
>>> +nir_move_load_const(nir_shader *shader)
>>> +{
>>> +   bool progress = false;
>>> +
>>> +   nir_foreach_function(function, shader) {
>>> +      if (!function->impl)
>>> +         continue;
>>> +
>>> +      nir_foreach_block(block, function->impl) {
>>> +         nir_foreach_instr_safe(instr, block) {
>>> +            if (instr->type != nir_instr_type_load_const)
>>> +               continue;
>>> +
>>> +            nir_load_const_instr *load =
>>> +               nir_instr_as_load_const(instr);
>>> +            nir_block *use_block =
>>> +               get_single_block(&load->def);
>>> +
>>> +            /* If consumed in more than one block, ignore.  Possibly
>>> +             * there should be a threshold, ie. if used in a small #
>>> +             * of blocks in a large shader, it could be better to
>>> +             * split the load_const.  For now ignore that and focus
>>> +             * on the clear wins:
>>> +             */
>>> +            if (!use_block)
>>> +               continue;
>>> +
>>> +            if (use_block == load->instr.block)
>>> +               continue;
>>> +
>>> +            exec_node_remove(&load->instr.node);
>>> +
>>> +            /* Insert before the first non-phi instruction: */
>>> +            nir_foreach_instr(instr2, use_block) {
>>> +               if (instr2->type == nir_instr_type_phi)
>>> +                  continue;
>>> +
>>> +               exec_node_insert_node_before(&instr2->node,
>>> +                                            &load->instr.node);
>>> +
>>> +               break;
>>> +            }
>>> +
>>> +            load->instr.block = use_block;
>>> +
>>> +            progress = true;
>>> +         }
>>> +      }
>>> +
>>> +      nir_metadata_preserve(function->impl,
>>> +                            nir_metadata_block_index | nir_metadata_dominance);
>>> +   }
>>> +
>>> +   return progress;
>>> +}