[Mesa-dev] [PATCH 1/2] nir: add pass to move load_const
Ian Romanick
idr at freedesktop.org
Thu Jun 7 00:34:15 UTC 2018
On 06/06/2018 04:47 PM, Rob Clark wrote:
> On Wed, Jun 6, 2018 at 6:51 PM, Ian Romanick <idr at freedesktop.org> wrote:
>> On 06/06/2018 07:43 AM, Rob Clark wrote:
>>> Run this pass late (after opt loop) to move load_const instructions back
>>> into the basic blocks which use the result, in cases where a load_const
>>> is only consumed in a single block.
>>
>> If the load_const is used in more than one block, you could move it to
>> the block that dominates all of the blocks that contain a use. I suspect
>> in many cases that will just be the first block, but it's probably worth
>> trying.
>
> hmm, good point.. I was mostly going for the super-obvious wins with
> no downside cases (ie. low hanging fruit), but I guess the block that
> dominates all the uses isn't going to be worse than the first block..
>
> I can play around w/ the common denominator case.. although in at
> least some cases I think it might be better to split the load_const.
> Idk, the low hanging fruit approach cuts register usage by 1/3rd in
> some of these edge case shaders, which by itself seems like a big
> enough win, but I guess nir_shader_compiler_options config params is
> an option for more aggressive approaches that may or may not be a win
> for different drivers.
True. We could land this and incrementally improve it later. If we
want to do that, I'd suggest:
- Put all of our ideas for improving the pass, including the one
already documented in the code, in the file header comment.
- Remove the suggestions for future work from the "if consumed in more
than one block, ignore." comment.
>>> This helps reduce register usage in cases where the backend driver
>>> cannot lower the load_const to a uniform.
>>
>> The trade off is that now you might not be able to hide the latency of
>> the load. I tried this on i965, and the results (below) support this
>> idea a bit. If you have a structured backend IR, this seems like a
>> thing better done there... like, if you can't register allocate without
>> spilling, try moving the load. But that's really just a scheduling
>> heuristic. Hmm...
>
> iirc, i965 had some clever load_const lowering passes so all the threads
> in a warp could share the same load_const (iirc? I might not have
> that description quite right).. so I wasn't expecting this to be
> useful to i965 (but if it is, then, bonus!). For me I can *almost*
> always lower load_const to uniform, so in the common cases it hasn't
> been a problem. The exceptions are cases where an alu instruction
> consumes >1 const (but those get optimized away) or a const plus
> uniform, or things like ssbo/image store. (I noticed the register
> usage blowing up while looking at some deqp tests like
> dEQP-GLES31.functional.ssbo.layout.2_level_array.shared.vec4 ;-))
There are some optimizations for loading constants in the vec4 backend.
There is some code to lower arrays of constants to uniforms. Uniforms
are already shared within the SIMD group. I don't know of anything
other than that, but that doesn't mean such a thing doesn't exist.
Also... at what point do you call this pass? Here's what I did for
i965... basically called it as late as I thought was possible:
diff --git a/src/intel/compiler/brw_nir.c b/src/intel/compiler/brw_nir.c
index dfeea73b06a..2209d6142b3 100644
--- a/src/intel/compiler/brw_nir.c
+++ b/src/intel/compiler/brw_nir.c
@@ -795,6 +795,8 @@ brw_postprocess_nir(nir_shader *nir, const struct brw_compiler *compiler,
    if (devinfo->gen <= 5)
       brw_nir_analyze_boolean_resolves(nir);
 
+   nir_move_load_const(nir);
+
    nir_sweep(nir);
 
    if (unlikely(debug_enabled)) {
Now that the full shader-db run is done, it seems to have some odd
effects on vec4 shaders, so I'll probably tweak this a bit.
> If latency of the load is an issue, perhaps a configuration option
> about the minimum size of destination use_block, with the rough theory
> that if the block is bigger than some threshold the instruction
> scheduling could probably hide the latency?
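As a rough illustration of the block-size threshold idea above (all
names and types here are hypothetical mocks, not real NIR API; in the
pass this would walk the instruction list of use_block and the
threshold would presumably live in nir_shader_compiler_options):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Mock stand-ins for nir_instr / nir_block. */
struct instr {
   struct instr *next;
};

struct block {
   struct instr *first_instr;
};

/* Count the instructions in a block. */
static unsigned
block_num_instrs(const struct block *b)
{
   unsigned n = 0;
   for (const struct instr *i = b->first_instr; i != NULL; i = i->next)
      n++;
   return n;
}

/* Only sink the load_const when the destination block has enough other
 * work for the scheduler to hide the load's latency; below the
 * (driver-chosen) threshold, leave it where the opt loop hoisted it. */
static bool
should_sink_load_const(const struct block *use_block,
                       unsigned min_block_size)
{
   return block_num_instrs(use_block) >= min_block_size;
}
```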
I thought about this a bit more... I think the driver's scheduler should
fix this case. If it can't hide the latency of a load immediate, the
scheduler should move it to the earlier block. That might even help
hide the latency of a compare. Something like
cmp.g.f0(16) null g2<8,8,1>F -g9.4<0,1,0>F
(+f0) if(16) JIP: 40 UIP: 40
END B0 ->B1 ->B2
START B1 <-B0 (40 cycles)
mov(16) g77<1>F 1F
should become
cmp.g.f0(16) null g2<8,8,1>F -g9.4<0,1,0>F
mov(16) g77<1>F 1F
(+f0) if(16) JIP: 40 UIP: 40
END B0 ->B1 ->B2
START B1 <-B0 (40 cycles)
> BR,
> -R
>
>
>> total instructions in shared programs: 14398410 -> 14397918 (<.01%)
>> instructions in affected programs: 134856 -> 134364 (-0.36%)
>> helped: 53
>> HURT: 10
>> helped stats (abs) min: 1 max: 30 x̄: 9.47 x̃: 7
>> helped stats (rel) min: 0.16% max: 2.74% x̄: 0.64% x̃: 0.40%
>> HURT stats (abs) min: 1 max: 1 x̄: 1.00 x̃: 1
>> HURT stats (rel) min: 0.03% max: 0.21% x̄: 0.05% x̃: 0.03%
>> 95% mean confidence interval for instructions value: -9.61 -6.01
>> 95% mean confidence interval for instructions %-change: -0.68% -0.38%
>> Instructions are helped.
>>
>> total cycles in shared programs: 532949823 -> 532851352 (-0.02%)
>> cycles in affected programs: 260693023 -> 260594552 (-0.04%)
>> helped: 122
>> HURT: 88
>> helped stats (abs) min: 1 max: 24933 x̄: 1080.30 x̃: 20
>> helped stats (rel) min: <.01% max: 34.94% x̄: 2.20% x̃: 0.32%
>> HURT stats (abs) min: 1 max: 1884 x̄: 378.69 x̃: 29
>> HURT stats (rel) min: <.01% max: 10.23% x̄: 0.64% x̃: 0.09%
>> 95% mean confidence interval for cycles value: -943.71 5.89
>> 95% mean confidence interval for cycles %-change: -1.71% -0.32%
>> Inconclusive result (value mean confidence interval includes 0).
>>
>> total spills in shared programs: 8044 -> 7975 (-0.86%)
>> spills in affected programs: 2391 -> 2322 (-2.89%)
>> helped: 42
>> HURT: 0
>>
>> total fills in shared programs: 10950 -> 10884 (-0.60%)
>> fills in affected programs: 2999 -> 2933 (-2.20%)
>> helped: 42
>> HURT: 0
>>
>> LOST: 4
>> GAINED: 7
>>
>>> Signed-off-by: Rob Clark <robdclark at gmail.com>
>>> ---
>>>  src/compiler/Makefile.sources          |   1 +
>>>  src/compiler/nir/meson.build           |   1 +
>>>  src/compiler/nir/nir.h                 |   1 +
>>>  src/compiler/nir/nir_move_load_const.c | 131 +++++++++++++++++++++++++
>>>  4 files changed, 134 insertions(+)
>>>  create mode 100644 src/compiler/nir/nir_move_load_const.c
>>>
>>> diff --git a/src/compiler/Makefile.sources b/src/compiler/Makefile.sources
>>> index 3daa2c51334..cb8a2875a84 100644
>>> --- a/src/compiler/Makefile.sources
>>> +++ b/src/compiler/Makefile.sources
>>> @@ -253,6 +253,7 @@ NIR_FILES = \
>>> 	nir/nir_lower_wpos_center.c \
>>> 	nir/nir_lower_wpos_ytransform.c \
>>> 	nir/nir_metadata.c \
>>> +	nir/nir_move_load_const.c \
>>> 	nir/nir_move_vec_src_uses_to_dest.c \
>>> 	nir/nir_normalize_cubemap_coords.c \
>>> 	nir/nir_opt_conditional_discard.c \
>>> diff --git a/src/compiler/nir/meson.build b/src/compiler/nir/meson.build
>>> index 3fec363691d..373a47fd155 100644
>>> --- a/src/compiler/nir/meson.build
>>> +++ b/src/compiler/nir/meson.build
>>> @@ -144,6 +144,7 @@ files_libnir = files(
>>>    'nir_lower_wpos_ytransform.c',
>>>    'nir_lower_bit_size.c',
>>>    'nir_metadata.c',
>>> +  'nir_move_load_const.c',
>>>    'nir_move_vec_src_uses_to_dest.c',
>>>    'nir_normalize_cubemap_coords.c',
>>>    'nir_opt_conditional_discard.c',
>>> diff --git a/src/compiler/nir/nir.h b/src/compiler/nir/nir.h
>>> index 5a1f79515ad..073ab4e82ea 100644
>>> --- a/src/compiler/nir/nir.h
>>> +++ b/src/compiler/nir/nir.h
>>> @@ -2612,6 +2612,7 @@ bool nir_remove_dead_variables(nir_shader *shader, nir_variable_mode modes);
>>>  bool nir_lower_constant_initializers(nir_shader *shader,
>>>                                       nir_variable_mode modes);
>>>
>>> +bool nir_move_load_const(nir_shader *shader);
>>>  bool nir_move_vec_src_uses_to_dest(nir_shader *shader);
>>>  bool nir_lower_vec_to_movs(nir_shader *shader);
>>>  void nir_lower_alpha_test(nir_shader *shader, enum compare_func func,
>>> diff --git a/src/compiler/nir/nir_move_load_const.c b/src/compiler/nir/nir_move_load_const.c
>>> new file mode 100644
>>> index 00000000000..c0750035fbb
>>> --- /dev/null
>>> +++ b/src/compiler/nir/nir_move_load_const.c
>>> @@ -0,0 +1,131 @@
>>> +/*
>>> + * Copyright © 2018 Red Hat
>>> + *
>>> + * Permission is hereby granted, free of charge, to any person obtaining a
>>> + * copy of this software and associated documentation files (the "Software"),
>>> + * to deal in the Software without restriction, including without limitation
>>> + * the rights to use, copy, modify, merge, publish, distribute, sublicense,
>>> + * and/or sell copies of the Software, and to permit persons to whom the
>>> + * Software is furnished to do so, subject to the following conditions:
>>> + *
>>> + * The above copyright notice and this permission notice (including the next
>>> + * paragraph) shall be included in all copies or substantial portions of the
>>> + * Software.
>>> + *
>>> + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
>>> + * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
>>> + * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
>>> + * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
>>> + * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
>>> + * FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS
>>> + * IN THE SOFTWARE.
>>> + *
>>> + * Authors:
>>> + *    Rob Clark <robdclark at gmail.com>
>>> + *
>>> + */
>>> +
>>> +#include "nir.h"
>>> +
>>> +
>>> +/*
>>> + * A simple pass that moves load_consts into the consuming block if
>>> + * they are only consumed in a single block, to try to counteract
>>> + * nir's tendency to move all load_consts to the top of the
>>> + * first block.
>>> + */
>>> +
>>> +/* Iterate an ssa def's uses and, if all uses are within a single
>>> + * block, return it; otherwise return NULL.
>>> + */
>>> +static nir_block *
>>> +get_single_block(nir_ssa_def *def)
>>> +{
>>> +   nir_block *b = NULL;
>>> +
>>> +   /* hmm, probably ignore if-uses: */
>>> +   if (!list_empty(&def->if_uses))
>>> +      return NULL;
>>> +
>>> +   nir_foreach_use(use, def) {
>>> +      nir_instr *instr = use->parent_instr;
>>> +
>>> +      /*
>>> +       * Kind of an ugly special-case, but phi instructions
>>> +       * need to appear first in the block, so by definition
>>> +       * we can't move a load_const into a block where it is
>>> +       * consumed by a phi instruction.  We could conceivably
>>> +       * move it into a dominator block.
>>> +       */
>>> +      if (instr->type == nir_instr_type_phi)
>>> +         return NULL;
>>> +
>>> +      nir_block *use_block = instr->block;
>>> +
>>> +      if (!b) {
>>> +         b = use_block;
>>> +      } else if (b != use_block) {
>>> +         return NULL;
>>> +      }
>>> +   }
>>> +
>>> +   return b;
>>> +}
>>> +
>>> +bool
>>> +nir_move_load_const(nir_shader *shader)
>>> +{
>>> +   bool progress = false;
>>> +
>>> +   nir_foreach_function(function, shader) {
>>> +      if (!function->impl)
>>> +         continue;
>>> +
>>> +      nir_foreach_block(block, function->impl) {
>>> +         nir_foreach_instr_safe(instr, block) {
>>> +            if (instr->type != nir_instr_type_load_const)
>>> +               continue;
>>> +
>>> +            nir_load_const_instr *load =
>>> +               nir_instr_as_load_const(instr);
>>> +            nir_block *use_block =
>>> +               get_single_block(&load->def);
>>> +
>>> +            /* If consumed in more than one block, ignore.  Possibly
>>> +             * there should be a threshold, ie. if used in a small #
>>> +             * of blocks in a large shader, it could be better to
>>> +             * split the load_const.  For now ignore that and focus
>>> +             * on the clear wins:
>>> +             */
>>> +            if (!use_block)
>>> +               continue;
>>> +
>>> +            if (use_block == load->instr.block)
>>> +               continue;
>>> +
>>> +            exec_node_remove(&load->instr.node);
>>> +
>>> +            /* Insert before the first non-phi instruction: */
>>> +            nir_foreach_instr(instr2, use_block) {
>>> +               if (instr2->type == nir_instr_type_phi)
>>> +                  continue;
>>> +
>>> +               exec_node_insert_node_before(&instr2->node,
>>> +                                            &load->instr.node);
>>> +
>>> +               break;
>>> +            }
>>> +
>>> +            load->instr.block = use_block;
>>> +
>>> +            progress = true;
>>> +         }
>>> +      }
>>> +
>>> +      nir_metadata_preserve(function->impl,
>>> +                            nir_metadata_block_index | nir_metadata_dominance);
>>> +   }
>>> +
>>> +   return progress;
>>> +}