[Mesa-dev] [PATCH 00/31] nir: Use a 1-bit data type for booleans

Tue Oct 30 07:50:15 UTC 2018

Some quick comments on this after experimenting with it:

Intel hardware produces booleans of the same bit-size as the operations
that generates them so I am working on expanding this to add another
lowering pass that can lower them to smaller bit-sizes.

I have a work-in-progress prototype of this and it is significantly
easier, at least for us, to do this as a lowering pass in NIR SSA form
than in the backend. Specifically, doing it in the backend means that
we have to deal with propagating the bit-sizes of the booleans across
operation chains and across CFG blocks, which isn't fun.

There are still a few things that I am figuring out:

Right now I lower instructions as I find them, deciding about the bit-
size of the destination based on the bit-size of the sources. This
usually works fine because by the time we reach an instruction that
reads boolean sources we have already lowered the instructions that
produced them. However, this also means that we need to make decisions
about the bit-sizes of constant booleans before we know which bit-size
is required for them, it could be that we need them as 32-bit booleans
only, 16-bit booleans only, or maybe both. Right now I am assuming that
constants are 32-bit always, which means that at the moment I have to
inject a MOV when they are used in 16-bit instructions. We can probably
do better, at least in some cases, if we inspect all uses of the
constants before we lower them.

Another thing we need to deal with is canonicalization. Some
instructions may have boolean sources of different native bit-sizes and
it looks like we need to choose a canonical bit-size for them in this
case (at least in my experiments this seems to be required for Intel).
For now I am just taking the bit-size of the first source as the
canonical form, but we could try to be smarter about this. In the case
of Intel I think this isn't too bad, because constant operands are
usually second, so for instructions that use a boolean immediate, if we
choose the first source as canonical, we can then adjust the imediate
source without having to emit additional instructions. Unfortunately,
the 16-bit integer tests in CTS are not good to test set for
canonicalization, since they focus very specifically in 16-bit paths
and I don't think we have many cases (if any) of boolean instructions
mixing booleans of different bit-sizzes, so we will probably have to
try and write our more tests to exercise these cases properly. Also, it
might be that some hardware that supports multiple bit-sized booleans
doesn't require canonicalization, in which case we might want to do
this configurable.

Another thing I noticed, is that with phi sources the canonicalization
aspect is a bit trickier, since we can't just insert conversions before
the phi instructions (nir_validate complains that this is not allowed),
so instead we need to get to the instruction that generated the ssa
source and inject the conversion after it.

Last thing I noticed is that our backend optimizer is much better at
optimizing 32-bit booleans that it is at optimizing 16-bit, so there
are some cases where we end up producing worse code. Obviously this is
not a problem of the NIR lowering pass though, it just means that we
need to do better at optimizing 16-bit code in the backend.

So all in all, I am quite happy with this. It is certainly easier than
lowering in the backend and it gives us a more reasonable framework to
work with. We still need to do some extra work if we want to do the
best lowering for things like boolean constants or canonicalization,
but since neither SPIR-V nor GLSL define multiple bit-sizes for
booleans that would always have been something that drivers might need
to address.

Iago

On Mon, 2018-10-22 at 17:13 -0500, Jason Ekstrand wrote:
> This is something that Connor and I have talked about quite a bit
> over the
> last couple of months.  The core idea is to replace NIR's current 32-
> bit
> 0/-1 D3D10-style booleans with a 1-bit data type.  All in all, I
> think it
> worked out pretty well though I really don't like the proliferation
> of
> 32-bit comparison opcodes we now have kicking around for i965.
> 
> Why?  No hardware really has a 1-bit type, right?  Well, sort
> of...  AMD
> actually uses 64-bit scalars for booleans with one bit per
> invocation.
> However, most hardware such as Intel uses some other larger value for
> booleans.  The real benefit of 1-bit booleans and requiring a
> lowering pass
> is that you can do somewhat custom lowering (like AMD wants) and your
> lowering pass can always tell in an instant if a value is a boolean
> based
> on the bit size.  As can be seen in the last patch, this makes it
> really
> easy to implement a bool -> float lowering pass for hardware that
> doesn't
> have real integers where NIR's current booleans are actually rather
> painful.
> 
> On Intel, the situation is a bit more complicated.  It's tempting to
> say
> that we have 32-bit D3D10 booleans.  However, they're not really
> D3D10
> booleans on gen4-5 because the top 31 bits are undefined garbage and,
> while
> iand, ior, ixor, and inot operations work, you have to iand with 1 at
> the
> last minute to clear off all that garbage.  Also, on all generations,
> a
> comparison of two N-bit values results in an N-bit boolean, not a 32-
> bit
> bool.  This has caused the Igalia folks no end of trouble as they've
> been
> working on native 8 and 16-bit support.  If, instead, we have a 1-bit 
> bool
> with a lowering pass and we can lower to whatever we want, then we
> could
> lower to a set of comparison opcodes that return the same bit-size as
> they
> compare and it would match GEN hardware much better.
> 
> But what about performance?  Aren't there all sorts of neat tricks we
> can
> do with D3D10 booleans like b & 1.0f for b2f?  As it turns out, not
> really;
> that's about the only one.  There is some small advantage when
> optimizing
> shaders that come from D3D if your native representation of booleans
> matches that of D3D.  However, penultimate patch in this series adds
> a few
> small optimizations that get us to actually better than we were
> before.
> With the entire series, shader-db on Kaby Lak looks like this:
> 
>     total instructions in shared programs: 15084098 -> 14988578 (-
> 0.63%)
>     instructions in affected programs: 1321114 -> 1225594 (-7.23%)
>     helped: 2340
>     HURT: 23
> 
>     total cycles in shared programs: 369790134 -> 359798399 (-2.70%)
>     cycles in affected programs: 134085452 -> 124093717 (-7.45%)
>     helped: 2149
>     HURT: 720
> 
>     total loops in shared programs: 4393 -> 4393 (0.00%)
>     loops in affected programs: 0 -> 0
>     helped: 0
>     HURT: 0
> 
>     total spills in shared programs: 10158 -> 10051 (-1.05%)
>     spills in affected programs: 1429 -> 1322 (-7.49%)
>     helped: 8
>     HURT: 15
> 
>     total fills in shared programs: 22105 -> 21720 (-1.74%)
>     fills in affected programs: 2853 -> 2468 (-13.49%)
>     helped: 12
>     HURT: 15
> 
> How about ease of use?  Are they a pain to deal with?  Yes, adding
> support
> for 1-bit types was a bit awkward in a few places but most of it was
> dealing with all the places where we have 32-bit booleans baked into
> assumptions.  Getting rid of that baking in solves the problem and
> also
> makes the whole IR more future-proof.
> 
> All in all, I'd say I'm pretty happy with it.  However, I'd like
> other
> people (particularly the AMD folks) to play with it a bit and verify
> that
> it solves their problems as well.  Also, I added a lowering pass and
> tried
> to turn it on in everyone's driver but may not have put it in the
> right
> spot.  Please double-check my work.  For those wishing to take a
> look, you
> can also find the entire series on my gitlab here:
> 
> 
https://gitlab.freedesktop.org/jekstrand/mesa/commits/review/nir-1-bit-bool
> 
> Please review!
> 
> --Jason
> 
> Cc: Connor Abbott <cwabbott0 at gmail.com>
> Cc: Timothy Arceri <tarceri at itsqueeze.com>
> Cc: Eric Anholt <eric at anholt.net>
> Cc: Rob Clark <robdclark at gmail.com>
> Cc: Karol Herbst <karolherbst at gmail.com>
> Cc: Bas Nieuwenhuizen <bas at basnieuwenhuizen.nl>
> Cc: Alyssa Rosenzweig <alyssa at rosenzweig.io>
> 
> 
> Jason Ekstrand (31):
>   nir/validate: Print when the validation failed
>   nir/constant_folding: Add an unreachable to a switch
>   nir/constant_folding: Use nir_src_as_bool for discard_if
>   nir/builder: Add a nir_imm_true/false helpers
>   nir/builder: Handle 16-bit floats in nir_imm_floatN_t
>   nir/search: Use nir_builder
>   nir/opt_if: Rework condition propagation
>   nir/system_values: Use the bit size from the load_deref
>   glsl/nir: Use i2b instead of ine for fixing UBO/SSBO booleans
>   nir/prog: Use nir_bany in kill handling
>   spirv: Use the right bit-size for spec constant ops
>   spirv: Initialize subgroup destinations with the destination type
>   intel/nir: Use the OPT macro for more passes
>   nir/algebraic: Disable b2f lowering and two optimizations
>   nir: Rename boolean-related opcodes to include 32 in the name
>   FIXUP: nir/builder: Generate 32-bit bool opcodes transparently
>   FIXUP: nir/algebraic: Remap boolean opcodes to the 32-bit variant
>   FIXUP: Use 32-bit opcodes in the NIR producers and optimizations
>   FIXUP: Use 32-bit opcodes in the NIR back-ends
>   nir/constant_expressions: Rework boolean handling
>   nir: Add support for 1-bit data types
>   nir: Add 1-bit boolean opcodes
>   nir: Add a bool to int32 lowering pass
>   glsl,spirv: Generate 1-bit booleans
>   FIXUP: Revert "Use 32-bit opcodes in the NIR producers and
>     optimizations"
>   FIXUP: Revert "nir/builder: Generate 32-bit bool opcodes
>     transparently"
>   FIXUP: nir/builder: Generate 1-bit booleans in nir_build_imm_bool
>   nir/algebraic: Optimize 1-bit booleans
>   nir/algebraic: Add back lower_b2f and a b2f optimization
>   nir/algebraic: Add some optimizations for D3D-style booleans
>   nir: Add a bool to float32 lowering pass
> 
>  src/amd/common/ac_nir_to_llvm.c               |  30 +--
>  src/amd/vulkan/radv_shader.c                  |   6 +-
>  src/broadcom/compiler/nir_to_vir.c            |  44 ++---
>  src/broadcom/compiler/vir.c                   |   2 +
>  src/compiler/Makefile.sources                 |   2 +
>  src/compiler/glsl/glsl_to_nir.cpp             |  28 +--
>  src/compiler/nir/meson.build                  |   2 +
>  src/compiler/nir/nir.c                        |  15 +-
>  src/compiler/nir/nir.h                        |  36 ++--
>  src/compiler/nir/nir_algebraic.py             |  22 ++-
>  src/compiler/nir/nir_builder.h                |  51 ++++-
>  src/compiler/nir/nir_constant_expressions.py  |  53 ++---
>  src/compiler/nir/nir_instr_set.c              |  23 ++-
>  src/compiler/nir/nir_lower_alu_to_scalar.c    |   4 +
>  src/compiler/nir/nir_lower_bool_to_float.c    | 181
> ++++++++++++++++++
>  src/compiler/nir/nir_lower_bool_to_int32.c    | 162 ++++++++++++++++
>  src/compiler/nir/nir_lower_int64.c            |   2 +-
>  .../nir/nir_lower_load_const_to_scalar.c      |   3 +
>  src/compiler/nir/nir_lower_returns.c          |   4 +-
>  src/compiler/nir/nir_lower_subgroups.c        |   2 +-
>  src/compiler/nir/nir_lower_system_values.c    |   1 +
>  src/compiler/nir/nir_opcodes.py               |  31 ++-
>  src/compiler/nir/nir_opt_algebraic.py         |  19 +-
>  src/compiler/nir/nir_opt_constant_folding.c   |  18 +-
>  src/compiler/nir/nir_opt_if.c                 |  91 +++------
>  src/compiler/nir/nir_opt_intrinsics.c         |   2 +-
>  src/compiler/nir/nir_opt_large_constants.c    |   5 +
>  src/compiler/nir/nir_print.c                  |   3 +
>  src/compiler/nir/nir_search.c                 | 112 +++--------
>  src/compiler/nir/nir_search.h                 |   9 +-
>  src/compiler/nir/nir_validate.c               |  16 +-
>  src/compiler/nir/tests/vars_tests.cpp         |  38 ++--
>  src/compiler/nir_types.cpp                    |   2 +-
>  src/compiler/nir_types.h                      |   4 +-
>  src/compiler/spirv/spirv_to_nir.c             |  23 ++-
>  src/compiler/spirv/vtn_alu.c                  |   4 +-
>  src/compiler/spirv/vtn_cfg.c                  |  10 +-
>  src/compiler/spirv/vtn_subgroup.c             |  12 +-
>  src/gallium/auxiliary/nir/tgsi_to_nir.c       |  20 +-
>  .../drivers/freedreno/ir3/ir3_compiler_nir.c  |  31 +--
>  src/gallium/drivers/radeonsi/si_shader_nir.c  |   2 +
>  src/gallium/drivers/vc4/vc4_program.c         |  50 ++---
>  src/intel/compiler/brw_fs_nir.cpp             |  80 ++++----
>  src/intel/compiler/brw_nir.c                  |  16 +-
>  .../brw_nir_analyze_boolean_resolves.c        |  24 +--
>  .../compiler/brw_nir_lower_image_load_store.c |   2 +-
>  src/intel/compiler/brw_vec4_nir.cpp           | 122 ++++++------
>  src/intel/vulkan/anv_pipeline.c               |   2 +-
>  src/mesa/drivers/dri/i965/brw_program.c       |   4 +-
>  src/mesa/main/glspirv.c                       |   2 +-
>  src/mesa/program/prog_to_nir.c                |   2 +-
>  src/mesa/state_tracker/st_glsl_to_nir.cpp     |   2 +-
>  52 files changed, 934 insertions(+), 497 deletions(-)
>  create mode 100644 src/compiler/nir/nir_lower_bool_to_float.c
>  create mode 100644 src/compiler/nir/nir_lower_bool_to_int32.c
>