[Mesa-dev] [PATCH 00/31] nir: Use a 1-bit data type for booleans
Jason Ekstrand
jason at jlekstrand.net
Tue Oct 30 15:18:10 UTC 2018
On Tue, Oct 30, 2018 at 2:50 AM Iago Toral <itoral at igalia.com> wrote:
> Some quick comments on this after experimenting with it:
>
> Intel hardware produces booleans of the same bit-size as the operations
> that generate them, so I am working on expanding this to add another
> lowering pass that can lower them to smaller bit-sizes.
>
> I have a work-in-progress prototype of this and it is significantly
> easier, at least for us, to do this as a lowering pass in NIR SSA form
> than in the backend. Specifically, doing it in the backend means that
> we have to deal with propagating the bit-sizes of the booleans across
> operation chains and across CFG blocks, which isn't fun.
>
> There are still a few things that I am figuring out:
>
> Right now I lower instructions as I find them, deciding on the bit-size
> of the destination based on the bit-size of the sources. This usually
> works fine because, by the time we reach an instruction that reads
> boolean sources, we have already lowered the instructions that produced
> them. However, this also means that we need to make decisions about the
> bit-sizes of constant booleans before we know which bit-size is required
> for them: it could be that we need them as 32-bit booleans only, 16-bit
> booleans only, or maybe both. Right now I am assuming that constants are
> always 32-bit, which means that at the moment I have to inject a MOV
> when they are used in 16-bit instructions. We can probably do better, at
> least in some cases, if we inspect all uses of the constants before we
> lower them.
>
If we did a step of constant folding, wouldn't the constant booleans get
nicely folded to the right bit-size?
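As a concrete illustration of the constant-boolean issue above, here is a
standalone C sketch (made-up names, not the actual lowering pass): the
destination takes its bit-size from an already-lowered source, and a
constant that was emitted as 32-bit picks up an extra conversion, the MOV
mentioned above, when it ends up feeding a 16-bit instruction. Folding
constants first, or inspecting their uses before lowering, would let the
constant be emitted at the final size and skip that step.

/* Standalone sketch, not the real NIR lowering pass; names are made up. */
#include <stdbool.h>
#include <stdio.h>

struct bool_src {
   unsigned bit_size;  /* native bit-size after lowering */
   bool     is_const;  /* constant booleans are emitted as 32-bit for now */
};

/* Pick the destination bit-size from the first already-lowered,
 * non-constant source; fall back to 32 bits if there is none. */
static unsigned
pick_dest_bit_size(const struct bool_src *srcs, unsigned num_srcs)
{
   for (unsigned i = 0; i < num_srcs; i++) {
      if (!srcs[i].is_const)
         return srcs[i].bit_size;
   }
   return 32;
}

int
main(void)
{
   /* A 16-bit comparison result ANDed with a constant boolean that was
    * emitted as 32-bit. */
   struct bool_src srcs[] = {
      { .bit_size = 16, .is_const = false },
      { .bit_size = 32, .is_const = true  },
   };
   unsigned dest = pick_dest_bit_size(srcs, 2);

   printf("destination boolean bit-size: %u\n", dest);
   for (unsigned i = 0; i < 2; i++) {
      if (srcs[i].bit_size != dest) {
         /* This is the extra MOV described above; folding constants or
          * inspecting all uses first could avoid it by emitting the
          * constant at the right size in the first place. */
         printf("src%u needs a %u -> %u-bit conversion\n",
                i, srcs[i].bit_size, dest);
      }
   }
   return 0;
}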
> Another thing we need to deal with is canonicalization. Some
> instructions may have boolean sources of different native bit-sizes and
> it looks like we need to choose a canonical bit-size for them in this
> case (at least in my experiments this seems to be required for Intel).
> For now I am just taking the bit-size of the first source as the
> canonical form, but we could try to be smarter about this. In the case
> of Intel I think this isn't too bad, because constant operands are
> usually second, so for instructions that use a boolean immediate, if we
> choose the first source as canonical, we can then adjust the immediate
> source without having to emit additional instructions. Unfortunately,
> the 16-bit integer tests in the CTS are not a good test set for
> canonicalization, since they focus very specifically on 16-bit paths
> and I don't think we have many cases (if any) of boolean instructions
> mixing booleans of different bit-sizes, so we will probably have to try
> to write more tests of our own to exercise these cases properly. Also,
> it might be that some hardware that supports multiple boolean bit-sizes
> doesn't require canonicalization, in which case we might want to make
> this configurable.
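Below is a similar standalone sketch (hypothetical names, not the real
pass) of the first-source-canonical rule described above: sources whose
bit-size differs from the first source's get a conversion, except that a
constant source, typically the second operand on Intel, can simply be
re-emitted at the canonical size with no extra instruction.

/* Standalone toy of first-source canonicalization; not the real pass. */
#include <stdbool.h>
#include <stdio.h>

struct bool_src {
   unsigned bit_size;
   bool     is_const;
};

static void
canonicalize(struct bool_src *srcs, unsigned num_srcs)
{
   if (num_srcs == 0)
      return;

   const unsigned canonical = srcs[0].bit_size;

   for (unsigned i = 1; i < num_srcs; i++) {
      if (srcs[i].bit_size == canonical)
         continue;

      if (srcs[i].is_const) {
         /* Constants can just be re-emitted at the canonical size. */
         printf("src%u: re-emit constant as %u-bit\n", i, canonical);
      } else {
         /* Non-constant sources need an explicit conversion. */
         printf("src%u: insert %u -> %u-bit boolean conversion\n",
                i, srcs[i].bit_size, canonical);
      }
      srcs[i].bit_size = canonical;
   }
}

int
main(void)
{
   /* e.g. an iand of a 16-bit comparison result and a 32-bit immediate */
   struct bool_src srcs[] = {
      { .bit_size = 16, .is_const = false },
      { .bit_size = 32, .is_const = true  },
   };
   canonicalize(srcs, 2);
   return 0;
}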
>
Assuming a back-end that can properly handle 8 and 16-bit booleans, I
suspect "always go down" would be a reasonable choice, for two reasons:
1) It never uses more bits than the smallest source, so it helps with
register pressure.
2) We can do the conversion "for free", either with a mixed-type
operation (an OR of D and W) or, if that doesn't work, by using a
register region to reinterpret the D as a scattered W.
For that matter, there's a part of me that's tempted to always spill to
8-bit for everything other than the comparison operation itself. Again,
this is all dependent on having a competent back-end compiler, which we
currently don't have. :-(
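For contrast with the first-source rule, the "always go down" policy
suggested here just picks the smallest source bit-size as the canonical
one. A minimal sketch under the same caveats; it could stand in for the
size-picking helpers in the sketches above:

/* "Always go down": canonicalize to the smallest boolean bit-size among
 * the sources, so the result never takes more register space than it
 * must.  Hypothetical sketch, not Mesa code; it assumes the back-end can
 * do the narrowing conversions cheaply, as described above. */
unsigned
pick_smallest_bool_bit_size(const unsigned *src_bit_sizes, unsigned num_srcs)
{
   unsigned smallest = 32; /* booleans are at most 32-bit here */
   for (unsigned i = 0; i < num_srcs; i++) {
      if (src_bit_sizes[i] < smallest)
         smallest = src_bit_sizes[i];
   }
   return smallest;
}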
> Another thing I noticed is that with phi sources the canonicalization
> aspect is a bit trickier, since we can't just insert conversions before
> the phi instructions (nir_validate complains that this is not allowed),
> so instead we need to get to the instruction that generated the SSA
> source and inject the conversion after it.
>
There's a nifty nir_before_src cursor builder function that correctly
handles phis.
> Last thing I noticed is that our backend optimizer is much better at
> optimizing 32-bit booleans than it is at optimizing 16-bit ones, so
> there are some cases where we end up producing worse code. Obviously
> this is not a problem with the NIR lowering pass, though; it just means
> that we need to do better at optimizing 16-bit code in the backend.
>
> So all in all, I am quite happy with this. It is certainly easier than
> lowering in the backend and it gives us a more reasonable framework to
> work with. We still need to do some extra work if we want to do the
> best lowering for things like boolean constants or canonicalization,
> but since neither SPIR-V nor GLSL defines multiple bit-sizes for
> booleans, that would always have been something that drivers might
> need to address.
>
> Iago
>
>
> On Mon, 2018-10-22 at 17:13 -0500, Jason Ekstrand wrote:
> > This is something that Connor and I have talked about quite a bit
> > over the last couple of months. The core idea is to replace NIR's
> > current 32-bit 0/-1 D3D10-style booleans with a 1-bit data type. All
> > in all, I think it worked out pretty well though I really don't like
> > the proliferation of 32-bit comparison opcodes we now have kicking
> > around for i965.
> >
> > Why? No hardware really has a 1-bit type, right? Well, sort of... AMD
> > actually uses 64-bit scalars for booleans with one bit per invocation.
> > However, most hardware, such as Intel, uses some other larger value
> > for booleans. The real benefit of 1-bit booleans and requiring a
> > lowering pass is that you can do somewhat custom lowering (like AMD
> > wants) and your lowering pass can always tell in an instant whether a
> > value is a boolean based on the bit size. As can be seen in the last
> > patch, this makes it really easy to implement a bool -> float lowering
> > pass for hardware that doesn't have real integers, where NIR's current
> > booleans are actually rather painful.
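To make the bool -> float point concrete: once booleans are 1-bit, a
lowering pass for float-only hardware can map every boolean-producing or
boolean-consuming ALU op onto float math over 0.0/1.0 values. The
standalone sketch below shows one plausible mapping on plain C scalars;
it is only an illustration of the semantics, not the
nir_lower_bool_to_float pass added in the last patch.

/* Plausible float encodings of 1-bit boolean ops for hardware without
 * real integers (booleans held as 0.0f / 1.0f).  Illustration only; the
 * real pass operates on NIR instructions, not C floats. */
#include <assert.h>
#include <math.h>

static float b_and(float a, float b) { return a * b; }              /* iand  */
static float b_or(float a, float b)  { return fmaxf(a, b); }        /* ior   */
static float b_not(float a)          { return 1.0f - a; }           /* inot  */
static float b_xor(float a, float b) { return fabsf(a - b); }       /* ixor  */
static float b_flt(float a, float b) { return a < b ? 1.0f : 0.0f; } /* flt  */
static float b_sel(float c, float a, float b)                       /* bcsel */
{
   return c * a + (1.0f - c) * b;
}

int
main(void)
{
   assert(b_and(1.0f, 0.0f) == 0.0f);
   assert(b_or(1.0f, 0.0f) == 1.0f);
   assert(b_not(0.0f) == 1.0f);
   assert(b_xor(1.0f, 1.0f) == 0.0f);
   assert(b_flt(1.0f, 2.0f) == 1.0f);
   assert(b_sel(1.0f, 3.0f, 4.0f) == 3.0f);
   return 0;
}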
> >
> > On Intel, the situation is a bit more complicated. It's tempting to
> > say that we have 32-bit D3D10 booleans. However, they're not really
> > D3D10 booleans on gen4-5 because the top 31 bits are undefined garbage
> > and, while iand, ior, ixor, and inot operations work, you have to iand
> > with 1 at the last minute to clear off all that garbage. Also, on all
> > generations, a comparison of two N-bit values results in an N-bit
> > boolean, not a 32-bit bool. This has caused the Igalia folks no end of
> > trouble as they've been working on native 8 and 16-bit support. If,
> > instead, we have a 1-bit bool with a lowering pass and we can lower to
> > whatever we want, then we could lower to a set of comparison opcodes
> > that return the same bit-size as they compare and it would match GEN
> > hardware much better.
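The gen4-5 situation is easy to model in plain C: if only bit 0 of a
"boolean" is meaningful and the upper 31 bits are garbage, the logic ops
still work bit-wise, but the value has to be masked with 1 before it can
be used as a clean 0/1. A standalone illustration, not Intel driver code:

/* Emulate a gen4-5-style boolean: only bit 0 is meaningful, the upper
 * 31 bits are undefined garbage.  Illustration only. */
#include <assert.h>
#include <stdint.h>

static uint32_t
garbage_bool(int value, uint32_t garbage)
{
   /* bit 0 carries the boolean; everything above it is junk */
   return (garbage & ~1u) | (value ? 1u : 0u);
}

int
main(void)
{
   uint32_t a = garbage_bool(1, 0xdeadbeef);
   uint32_t b = garbage_bool(0, 0xcafef00d);

   /* iand/ior/ixor/inot still behave correctly on bit 0... */
   assert(((a & b) & 1u) == 0u);
   assert(((a | b) & 1u) == 1u);

   /* ...but before using the value as a real 0/1 boolean you have to
    * "iand with 1 at the last minute" to clear the garbage. */
   uint32_t clean = a & 1u;
   assert(clean == 1u);

   return 0;
}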
> >
> > But what about performance? Aren't there all sorts of neat tricks we
> > can do with D3D10 booleans like b & 1.0f for b2f? As it turns out, not
> > really; that's about the only one. There is some small advantage when
> > optimizing shaders that come from D3D if your native representation of
> > booleans matches that of D3D. However, the penultimate patch in this
> > series adds a few small optimizations that actually get us to a better
> > place than we were before.
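The one trick mentioned above, b & 1.0f for b2f, works because of the
0/-1 encoding: ANDing an all-ones boolean with the bit pattern of 1.0f
leaves exactly 1.0f, and ANDing zero gives 0.0f. A standalone
demonstration in plain C:

/* Why "b & 1.0f" implements b2f for 0/-1 booleans: the AND either keeps
 * the full bit pattern of 1.0f or clears it to 0.0f.  Standalone demo. */
#include <assert.h>
#include <stdint.h>
#include <string.h>

static float
b2f_via_and(uint32_t b /* 0x00000000 or 0xffffffff */)
{
   uint32_t one_bits;
   float    one = 1.0f, result;

   memcpy(&one_bits, &one, sizeof(one_bits)); /* bit pattern of 1.0f */
   one_bits &= b;                             /* the "b & 1.0f" */
   memcpy(&result, &one_bits, sizeof(result));
   return result;
}

int
main(void)
{
   assert(b2f_via_and(0xffffffffu) == 1.0f);
   assert(b2f_via_and(0x00000000u) == 0.0f);
   return 0;
}

With 1-bit booleans, the equivalent wins presumably come from the b2f and
D3D-style-boolean optimizations re-added in the later algebraic patches
of the series.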
> > With the entire series, shader-db on Kaby Lake looks like this:
> >
> > total instructions in shared programs: 15084098 -> 14988578 (-0.63%)
> > instructions in affected programs: 1321114 -> 1225594 (-7.23%)
> > helped: 2340
> > HURT: 23
> >
> > total cycles in shared programs: 369790134 -> 359798399 (-2.70%)
> > cycles in affected programs: 134085452 -> 124093717 (-7.45%)
> > helped: 2149
> > HURT: 720
> >
> > total loops in shared programs: 4393 -> 4393 (0.00%)
> > loops in affected programs: 0 -> 0
> > helped: 0
> > HURT: 0
> >
> > total spills in shared programs: 10158 -> 10051 (-1.05%)
> > spills in affected programs: 1429 -> 1322 (-7.49%)
> > helped: 8
> > HURT: 15
> >
> > total fills in shared programs: 22105 -> 21720 (-1.74%)
> > fills in affected programs: 2853 -> 2468 (-13.49%)
> > helped: 12
> > HURT: 15
> >
> > How about ease of use? Are they a pain to deal with? Yes, adding
> > support for 1-bit types was a bit awkward in a few places, but most of
> > it was dealing with all the places where we have 32-bit booleans baked
> > into assumptions. Getting rid of that baked-in assumption solves the
> > problem and also makes the whole IR more future-proof.
> >
> > All in all, I'd say I'm pretty happy with it. However, I'd like other
> > people (particularly the AMD folks) to play with it a bit and verify
> > that it solves their problems as well. Also, I added a lowering pass
> > and tried to turn it on in everyone's driver but may not have put it
> > in the right spot. Please double-check my work. For those wishing to
> > take a look, you can also find the entire series on my gitlab here:
> >
> > https://gitlab.freedesktop.org/jekstrand/mesa/commits/review/nir-1-bit-bool
> >
> > Please review!
> >
> > --Jason
> >
> > Cc: Connor Abbott <cwabbott0 at gmail.com>
> > Cc: Timothy Arceri <tarceri at itsqueeze.com>
> > Cc: Eric Anholt <eric at anholt.net>
> > Cc: Rob Clark <robdclark at gmail.com>
> > Cc: Karol Herbst <karolherbst at gmail.com>
> > Cc: Bas Nieuwenhuizen <bas at basnieuwenhuizen.nl>
> > Cc: Alyssa Rosenzweig <alyssa at rosenzweig.io>
> >
> >
> > Jason Ekstrand (31):
> > nir/validate: Print when the validation failed
> > nir/constant_folding: Add an unreachable to a switch
> > nir/constant_folding: Use nir_src_as_bool for discard_if
> > nir/builder: Add a nir_imm_true/false helpers
> > nir/builder: Handle 16-bit floats in nir_imm_floatN_t
> > nir/search: Use nir_builder
> > nir/opt_if: Rework condition propagation
> > nir/system_values: Use the bit size from the load_deref
> > glsl/nir: Use i2b instead of ine for fixing UBO/SSBO booleans
> > nir/prog: Use nir_bany in kill handling
> > spirv: Use the right bit-size for spec constant ops
> > spirv: Initialize subgroup destinations with the destination type
> > intel/nir: Use the OPT macro for more passes
> > nir/algebraic: Disable b2f lowering and two optimizations
> > nir: Rename boolean-related opcodes to include 32 in the name
> > FIXUP: nir/builder: Generate 32-bit bool opcodes transparently
> > FIXUP: nir/algebraic: Remap boolean opcodes to the 32-bit variant
> > FIXUP: Use 32-bit opcodes in the NIR producers and optimizations
> > FIXUP: Use 32-bit opcodes in the NIR back-ends
> > nir/constant_expressions: Rework boolean handling
> > nir: Add support for 1-bit data types
> > nir: Add 1-bit boolean opcodes
> > nir: Add a bool to int32 lowering pass
> > glsl,spirv: Generate 1-bit booleans
> > FIXUP: Revert "Use 32-bit opcodes in the NIR producers and optimizations"
> > FIXUP: Revert "nir/builder: Generate 32-bit bool opcodes transparently"
> > FIXUP: nir/builder: Generate 1-bit booleans in nir_build_imm_bool
> > nir/algebraic: Optimize 1-bit booleans
> > nir/algebraic: Add back lower_b2f and a b2f optimization
> > nir/algebraic: Add some optimizations for D3D-style booleans
> > nir: Add a bool to float32 lowering pass
> >
> > src/amd/common/ac_nir_to_llvm.c | 30 +--
> > src/amd/vulkan/radv_shader.c | 6 +-
> > src/broadcom/compiler/nir_to_vir.c | 44 ++---
> > src/broadcom/compiler/vir.c | 2 +
> > src/compiler/Makefile.sources | 2 +
> > src/compiler/glsl/glsl_to_nir.cpp | 28 +--
> > src/compiler/nir/meson.build | 2 +
> > src/compiler/nir/nir.c | 15 +-
> > src/compiler/nir/nir.h | 36 ++--
> > src/compiler/nir/nir_algebraic.py | 22 ++-
> > src/compiler/nir/nir_builder.h | 51 ++++-
> > src/compiler/nir/nir_constant_expressions.py | 53 ++---
> > src/compiler/nir/nir_instr_set.c | 23 ++-
> > src/compiler/nir/nir_lower_alu_to_scalar.c | 4 +
> > src/compiler/nir/nir_lower_bool_to_float.c | 181 ++++++++++++++++++
> > src/compiler/nir/nir_lower_bool_to_int32.c | 162 ++++++++++++++++
> > src/compiler/nir/nir_lower_int64.c | 2 +-
> > .../nir/nir_lower_load_const_to_scalar.c | 3 +
> > src/compiler/nir/nir_lower_returns.c | 4 +-
> > src/compiler/nir/nir_lower_subgroups.c | 2 +-
> > src/compiler/nir/nir_lower_system_values.c | 1 +
> > src/compiler/nir/nir_opcodes.py | 31 ++-
> > src/compiler/nir/nir_opt_algebraic.py | 19 +-
> > src/compiler/nir/nir_opt_constant_folding.c | 18 +-
> > src/compiler/nir/nir_opt_if.c | 91 +++------
> > src/compiler/nir/nir_opt_intrinsics.c | 2 +-
> > src/compiler/nir/nir_opt_large_constants.c | 5 +
> > src/compiler/nir/nir_print.c | 3 +
> > src/compiler/nir/nir_search.c | 112 +++--------
> > src/compiler/nir/nir_search.h | 9 +-
> > src/compiler/nir/nir_validate.c | 16 +-
> > src/compiler/nir/tests/vars_tests.cpp | 38 ++--
> > src/compiler/nir_types.cpp | 2 +-
> > src/compiler/nir_types.h | 4 +-
> > src/compiler/spirv/spirv_to_nir.c | 23 ++-
> > src/compiler/spirv/vtn_alu.c | 4 +-
> > src/compiler/spirv/vtn_cfg.c | 10 +-
> > src/compiler/spirv/vtn_subgroup.c | 12 +-
> > src/gallium/auxiliary/nir/tgsi_to_nir.c | 20 +-
> > .../drivers/freedreno/ir3/ir3_compiler_nir.c | 31 +--
> > src/gallium/drivers/radeonsi/si_shader_nir.c | 2 +
> > src/gallium/drivers/vc4/vc4_program.c | 50 ++---
> > src/intel/compiler/brw_fs_nir.cpp | 80 ++++----
> > src/intel/compiler/brw_nir.c | 16 +-
> > .../brw_nir_analyze_boolean_resolves.c | 24 +--
> > .../compiler/brw_nir_lower_image_load_store.c | 2 +-
> > src/intel/compiler/brw_vec4_nir.cpp | 122 ++++++------
> > src/intel/vulkan/anv_pipeline.c | 2 +-
> > src/mesa/drivers/dri/i965/brw_program.c | 4 +-
> > src/mesa/main/glspirv.c | 2 +-
> > src/mesa/program/prog_to_nir.c | 2 +-
> > src/mesa/state_tracker/st_glsl_to_nir.cpp | 2 +-
> > 52 files changed, 934 insertions(+), 497 deletions(-)
> > create mode 100644 src/compiler/nir/nir_lower_bool_to_float.c
> > create mode 100644 src/compiler/nir/nir_lower_bool_to_int32.c
> >
>
>