[Mesa-dev] [PATCH 1/2] nir: add pass to move load_const

Thu Jun 7 01:25:59 UTC 2018

On Wed, Jun 6, 2018 at 8:34 PM, Ian Romanick <idr at freedesktop.org> wrote:
> On 06/06/2018 04:47 PM, Rob Clark wrote:
>> On Wed, Jun 6, 2018 at 6:51 PM, Ian Romanick <idr at freedesktop.org> wrote:
>>> On 06/06/2018 07:43 AM, Rob Clark wrote:
>>>> Run this pass late (after opt loop) to move load_const instructions back
>>>> into the basic blocks which use the result, in cases where a load_const
>>>> is only consumed in a single block.
>>>
>>> If the load_const is used in more than one block, you could use it to
>>> the block that dominates all of the block that contain a use.  I suspect
>>> in many cases that will just be the first block, but it's probably worth
>>> trying.
>>
>> hmm, good point.. I was mostly going for the super-obvious wins with
>> now downside cases (ie. low hanging fruit), but I guess the block that
>> dominates all the uses isn't going to be worse than the first block..
>>
>> I can play around w/ the common denominator case.. although in at
>> least some cases I think it might be better to split the load_const.
>> Idk, the low hanging fruit approach cuts register usage by 1/3rd in
>> some of these edge case shaders, which by itself seems like a big
>> enough win, but I guess nir_shader_compiler_options config params is
>> an option for more aggressive approaches that may or may not be a win
>> for different drivers.
>
> True.  We could land this and incrementally improve it later.  If we
> want to do that, I'd suggest
>
>  - Put all of our ideas for improving the pass, including the one
>    already documented in the code, to the file header comment.
>
>  - Remove the suggestions for future work from the "if consumed in more
>    than one block, ignore." comment.

sgtm, I'll respin tomorrow.. maybe after playing around a bit w/ the
common denominator block thing (since I see some cases that seems like
it would benefit)

>
>>>> This helps reduce register usage in cases where the backend driver
>>>> cannot lower the load_const to a uniform.
>>>
>>> The trade off is that now you might not be able to hide the latency of
>>> the load.  I tried this on i965, and the results (below) support this
>>> idea a bit.  If you have a structured backend IR, this seems like a
>>> thing better done there... like, if you can't register allocate without
>>> spilling, try moving the load.  But that's really just a scheduling
>>> heuristic.  Hmm...
>>
>> iirc, i965 had some clever load_const lower passes so all the threads
>> in a warp could share the same load_const (iirc?  I might not have
>> that description quite right)..  so I wasn't expecting this to be
>> useful to i965 (but if it is, then, bonus!).  For me I can *almost*
>> always lower load_const to uniform, so in the common cases it hasn't
>> been a problem.  The exceptions are cases where an alu instruction
>> consume >1 const (but those get optimized away) or a const plus
>> uniform, or things like ssbo/image store.  (I noticed the register
>> usage blowing up thing looking at some deqp tests like
>> dEQP-GLES31.functional.ssbo.layout.2_level_array.shared.vec4 ;-))
>
> There are some optimizations for loading constants in the vec4 backend.
> There is some code to lower arrays of constants to uniforms.  Uniforms
> are already shared within the SIMD group.  I don't know of anything
> other than that, but that's doesn't mean such a thing doesn't exist.
>
> Also... at what point do you call this pass?  Here's what I did for
> i965... basically called it as late as I thought was possible:
>
> diff --git a/src/intel/compiler/brw_nir.c b/src/intel/compiler/brw_nir.c
> index dfeea73b06a..2209d6142b3 100644
> --- a/src/intel/compiler/brw_nir.c
> +++ b/src/intel/compiler/brw_nir.c
> @@ -795,6 +795,8 @@ brw_postprocess_nir(nir_shader *nir, const struct
> brw_compiler *compiler,
>     if (devinfo->gen <= 5)
>        brw_nir_analyze_boolean_resolves(nir);
>
> +   nir_move_load_const(nir);
> +

fwiw, this is basically we I called it for ir3, it is intended to be
after opt loop (or really anything that might move load_const back to
the top of the first block)..

but tbh I didn't play with it that much yet, other than seeing that it
was a big with for the pedantic cases (like the deqp test I
mentioned).  I think about the only thing I do after the
nir_move_load_const() pass is nir_lower_locals_to_regs +
nir_convert_from_ssa(phi_webs_only=true)

>     nir_sweep(nir);
>
>     if (unlikely(debug_enabled)) {
>
> Now that the full shader-db run is done, it seems to have some odd
> effects on vec4 shaders, so I'll probably tweak this a bit.
>
>> If latency of the load is an issue, perhaps a configuration option
>> about the minimum size of destination use_block, with the rough theory
>> that if the block is bigger than some threshold the instruction
>> scheduling could probably hide the latency?
>
> I thought about this a bit more... I think the driver's scheduler should
> fix this case.  If it can't hide the latency of a load immediate, the
> scheduler should move it to the earlier block.  That might even help
> hide the latency of a compare.  Something like
>
> cmp.g.f0(16)    null            g2<8,8,1>F      -g9.4<0,1,0>F
> (+f0) if(16)    JIP: 40         UIP: 40
>    END B0 ->B1 ->B2
>    START B1 <-B0 (40 cycles)
> mov(16)         g77<1>F         1F
>
> should become
>
> cmp.g.f0(16)    null            g2<8,8,1>F      -g9.4<0,1,0>F
> mov(16)         g77<1>F         1F
> (+f0) if(16)    JIP: 40         UIP: 40
>    END B0 ->B1 ->B2
>    START B1 <-B0 (40 cycles)

I guess I don't know enough about your shader isa to know what that means..

But also I guess there are some drivers that have less sophisticated
post-nir backends where a rough heuristic might be useful?  For me,
ir3 does scheduling on a per block basis and isn't clever enough to
move things between blocks during scheduling (which at least for ir3
is complicated since scheduling is required for correctness so moving
an instruction requires re-scheduling both blocks).  Probably i965 and
codegen could make better decisions post-nir so they might not care as
much.

BR,
-R

>
>> BR,
>> -R
>>
>>
>>> total instructions in shared programs: 14398410 -> 14397918 (<.01%)
>>> instructions in affected programs: 134856 -> 134364 (-0.36%)
>>> helped: 53
>>> HURT: 10
>>> helped stats (abs) min: 1 max: 30 x̄: 9.47 x̃: 7
>>> helped stats (rel) min: 0.16% max: 2.74% x̄: 0.64% x̃: 0.40%
>>> HURT stats (abs)   min: 1 max: 1 x̄: 1.00 x̃: 1
>>> HURT stats (rel)   min: 0.03% max: 0.21% x̄: 0.05% x̃: 0.03%
>>> 95% mean confidence interval for instructions value: -9.61 -6.01
>>> 95% mean confidence interval for instructions %-change: -0.68% -0.38%
>>> Instructions are helped.
>>>
>>> total cycles in shared programs: 532949823 -> 532851352 (-0.02%)
>>> cycles in affected programs: 260693023 -> 260594552 (-0.04%)
>>> helped: 122
>>> HURT: 88
>>> helped stats (abs) min: 1 max: 24933 x̄: 1080.30 x̃: 20
>>> helped stats (rel) min: <.01% max: 34.94% x̄: 2.20% x̃: 0.32%
>>> HURT stats (abs)   min: 1 max: 1884 x̄: 378.69 x̃: 29
>>> HURT stats (rel)   min: <.01% max: 10.23% x̄: 0.64% x̃: 0.09%
>>> 95% mean confidence interval for cycles value: -943.71 5.89
>>> 95% mean confidence interval for cycles %-change: -1.71% -0.32%
>>> Inconclusive result (value mean confidence interval includes 0).
>>>
>>> total spills in shared programs: 8044 -> 7975 (-0.86%)
>>> spills in affected programs: 2391 -> 2322 (-2.89%)
>>> helped: 42
>>> HURT: 0
>>>
>>> total fills in shared programs: 10950 -> 10884 (-0.60%)
>>> fills in affected programs: 2999 -> 2933 (-2.20%)
>>> helped: 42
>>> HURT: 0
>>>
>>> LOST:   4
>>> GAINED: 7
>>>
>>>> Signed-off-by: Rob Clark <robdclark at gmail.com>
>>>> ---
>>>>  src/compiler/Makefile.sources          |   1 +
>>>>  src/compiler/nir/meson.build           |   1 +
>>>>  src/compiler/nir/nir.h                 |   1 +
>>>>  src/compiler/nir/nir_move_load_const.c | 131 +++++++++++++++++++++++++
>>>>  4 files changed, 134 insertions(+)
>>>>  create mode 100644 src/compiler/nir/nir_move_load_const.c
>>>>
>>>> diff --git a/src/compiler/Makefile.sources b/src/compiler/Makefile.sources
>>>> index 3daa2c51334..cb8a2875a84 100644
>>>> --- a/src/compiler/Makefile.sources
>>>> +++ b/src/compiler/Makefile.sources
>>>> @@ -253,6 +253,7 @@ NIR_FILES = \
>>>>       nir/nir_lower_wpos_center.c \
>>>>       nir/nir_lower_wpos_ytransform.c \
>>>>       nir/nir_metadata.c \
>>>> +     nir/nir_move_load_const.c \
>>>>       nir/nir_move_vec_src_uses_to_dest.c \
>>>>       nir/nir_normalize_cubemap_coords.c \
>>>>       nir/nir_opt_conditional_discard.c \
>>>> diff --git a/src/compiler/nir/meson.build b/src/compiler/nir/meson.build
>>>> index 3fec363691d..373a47fd155 100644
>>>> --- a/src/compiler/nir/meson.build
>>>> +++ b/src/compiler/nir/meson.build
>>>> @@ -144,6 +144,7 @@ files_libnir = files(
>>>>    'nir_lower_wpos_ytransform.c',
>>>>    'nir_lower_bit_size.c',
>>>>    'nir_metadata.c',
>>>> +  'nir_move_load_const.c',
>>>>    'nir_move_vec_src_uses_to_dest.c',
>>>>    'nir_normalize_cubemap_coords.c',
>>>>    'nir_opt_conditional_discard.c',
>>>> diff --git a/src/compiler/nir/nir.h b/src/compiler/nir/nir.h
>>>> index 5a1f79515ad..073ab4e82ea 100644
>>>> --- a/src/compiler/nir/nir.h
>>>> +++ b/src/compiler/nir/nir.h
>>>> @@ -2612,6 +2612,7 @@ bool nir_remove_dead_variables(nir_shader *shader, nir_variable_mode modes);
>>>>  bool nir_lower_constant_initializers(nir_shader *shader,
>>>>                                       nir_variable_mode modes);
>>>>
>>>> +bool nir_move_load_const(nir_shader *shader);
>>>>  bool nir_move_vec_src_uses_to_dest(nir_shader *shader);
>>>>  bool nir_lower_vec_to_movs(nir_shader *shader);
>>>>  void nir_lower_alpha_test(nir_shader *shader, enum compare_func func,
>>>> diff --git a/src/compiler/nir/nir_move_load_const.c b/src/compiler/nir/nir_move_load_const.c
>>>> new file mode 100644
>>>> index 00000000000..c0750035fbb
>>>> --- /dev/null
>>>> +++ b/src/compiler/nir/nir_move_load_const.c
>>>> @@ -0,0 +1,131 @@
>>>> +/*
>>>> + * Copyright © 2018 Red Hat
>>>> + *
>>>> + * Permission is hereby granted, free of charge, to any person obtaining a
>>>> + * copy of this software and associated documentation files (the "Software"),
>>>> + * to deal in the Software without restriction, including without limitation
>>>> + * the rights to use, copy, modify, merge, publish, distribute, sublicense,
>>>> + * and/or sell copies of the Software, and to permit persons to whom the
>>>> + * Software is furnished to do so, subject to the following conditions:
>>>> + *
>>>> + * The above copyright notice and this permission notice (including the next
>>>> + * paragraph) shall be included in all copies or substantial portions of the
>>>> + * Software.
>>>> + *
>>>> + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
>>>> + * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
>>>> + * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.  IN NO EVENT SHALL
>>>> + * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
>>>> + * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
>>>> + * FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS
>>>> + * IN THE SOFTWARE.
>>>> + *
>>>> + * Authors:
>>>> + *    Rob Clark (robdclark at gmail.com>
>>>> + *
>>>> + */
>>>> +
>>>> +#include "nir.h"
>>>> +
>>>> +
>>>> +/*
>>>> + * A simple pass that moves load_const's into consuming block if
>>>> + * they are only consumed in a single block, to try to counter-
>>>> + * act nir's tendency to move all load_const to the top of the
>>>> + * first block.
>>>> + */
>>>> +
>>>> +/* iterate a ssa def's use's and if all uses are within a single
>>>> + * block, return it, otherwise return null.
>>>> + */
>>>> +static nir_block *
>>>> +get_single_block(nir_ssa_def *def)
>>>> +{
>>>> +   nir_block *b = NULL;
>>>> +
>>>> +   /* hmm, probably ignore if-uses: */
>>>> +   if (!list_empty(&def->if_uses))
>>>> +      return NULL;
>>>> +
>>>> +   nir_foreach_use(use, def) {
>>>> +      nir_instr *instr = use->parent_instr;
>>>> +
>>>> +      /*
>>>> +       * Kind of an ugly special-case, but phi instructions
>>>> +       * need to appear first in the block, so by definition
>>>> +       * we can't move a load_immed into a block where it is
>>>> +       * consumed by a phi instruction.  We could conceivably
>>>> +       * move it into a dominator block.
>>>> +       */
>>>> +      if (instr->type == nir_instr_type_phi)
>>>> +         return NULL;
>>>> +
>>>> +      nir_block *use_block = instr->block;
>>>> +
>>>> +      if (!b) {
>>>> +         b = use_block;
>>>> +      } else if (b != use_block) {
>>>> +         return NULL;
>>>> +      }
>>>> +   }
>>>> +
>>>> +   return b;
>>>> +}
>>>> +
>>>> +bool
>>>> +nir_move_load_const(nir_shader *shader)
>>>> +{
>>>> +   bool progress = false;
>>>> +
>>>> +   nir_foreach_function(function, shader) {
>>>> +      if (!function->impl)
>>>> +         continue;
>>>> +
>>>> +      nir_foreach_block(block, function->impl) {
>>>> +         nir_foreach_instr_safe(instr, block) {
>>>> +            if (instr->type != nir_instr_type_load_const)
>>>> +               continue;
>>>> +
>>>> +            nir_load_const_instr *load =
>>>> +                  nir_instr_as_load_const(instr);
>>>> +            nir_block *use_block =
>>>> +                  get_single_block(&load->def);
>>>> +
>>>> +            /* if consumed in more than one block, ignore.  Possibly
>>>> +             * there should be a threshold, ie. if used in a small #
>>>> +             * of blocks in a large shader, it could be better to
>>>> +             * split the load_const.  For now ignore that and focus
>>>> +             * on the clear wins:
>>>> +             */
>>>> +            if (!use_block)
>>>> +               continue;
>>>> +
>>>> +            if (use_block == load->instr.block)
>>>> +               continue;
>>>> +
>>>> +            exec_node_remove(&load->instr.node);
>>>> +
>>>> +            /* insert before first non-phi instruction: */
>>>> +            nir_foreach_instr(instr2, use_block) {
>>>> +               if (instr2->type == nir_instr_type_phi)
>>>> +                  continue;
>>>> +
>>>> +               exec_node_insert_node_before(&instr2->node,
>>>> +                                            &load->instr.node);
>>>> +
>>>> +               break;
>>>> +            }
>>>> +
>>>> +            load->instr.block = use_block;
>>>> +
>>>> +            progress = true;
>>>> +         }
>>>> +      }
>>>> +
>>>> +      nir_metadata_preserve(function->impl,
>>>> +                            nir_metadata_block_index | nir_metadata_dominance);
>>>> +   }
>>>> +
>>>> +
>>>> +   return progress;
>>>> +}