[Mesa-dev] [PATCH 00/31] nir: Use a 1-bit data type for booleans

Thu Nov 8 08:56:24 UTC 2018

I just noticed I never replied to this:
On Tue, 2018-10-30 at 10:18 -0500, Jason Ekstrand wrote:
> On Tue, Oct 30, 2018 at 2:50 AM Iago Toral <itoral at igalia.com> wrote:
> > Some quick comments on this after experimenting with it:
> > 
> > 
> > 
> > Intel hardware produces booleans of the same bit-size as the
> > operations
> > 
> > that generates them so I am working on expanding this to add
> > another
> > 
> > lowering pass that can lower them to smaller bit-sizes.
> > 
> > 
> > 
> > I have a work-in-progress prototype of this and it is significantly
> > 
> > easier, at least for us, to do this as a lowering pass in NIR SSA
> > form
> > 
> > than in the backend. Specifically, doing it in the backend means
> > that
> > 
> > we have to deal with propagating the bit-sizes of the booleans
> > across
> > 
> > operation chains and across CFG blocks, which isn't fun.
> > 
> > 
> > 
> > There are still a few things that I am figuring out:
> > 
> > 
> > 
> > Right now I lower instructions as I find them, deciding about the
> > bit-
> > 
> > size of the destination based on the bit-size of the sources. This
> > 
> > usually works fine because by the time we reach an instruction that
> > 
> > reads boolean sources we have already lowered the instructions that
> > 
> > produced them. However, this also means that we need to make
> > decisions
> > 
> > about the bit-sizes of constant booleans before we know which bit-
> > size
> > 
> > is required for them, it could be that we need them as 32-bit
> > booleans
> > 
> > only, 16-bit booleans only, or maybe both. Right now I am assuming
> > that
> > 
> > constants are 32-bit always, which means that at the moment I have
> > to
> > 
> > inject a MOV when they are used in 16-bit instructions. We can
> > probably
> > 
> > do better, at least in some cases, if we inspect all uses of the
> > 
> > constants before we lower them.
> 
> If we did a step of constant folding, wouldn't the constant booleans
> get nicely folded to the right bit-size?

The lowering pass runs late, so we have already run constant folding by
the time we get here. Anyway, I have double checked and the load_const
instructions seem to come with bit-size > 1 in all the cases I have
available from CTS, so maybe this just works already and we never hit
this. With that said, I am not 100% sure if there are any cases where
we could still end up having to lower a 1-bit bool load const
instruction and figuring our a bit-size for it, but at the very least
this looks like it would not be a common case anyway, so maybe we can
deal with that whenwe find an actual case.
> > Another thing we need to deal with is canonicalization. Some
> > 
> > instructions may have boolean sources of different native bit-sizes 
> > and
> > 
> > it looks like we need to choose a canonical bit-size for them in
> > this
> > 
> > case (at least in my experiments this seems to be required for
> > Intel).
> > 
> > For now I am just taking the bit-size of the first source as the
> > 
> > canonical form, but we could try to be smarter about this. In the
> > case
> > 
> > of Intel I think this isn't too bad, because constant operands are
> > 
> > usually second, so for instructions that use a boolean immediate,
> > if we
> > 
> > choose the first source as canonical, we can then adjust the
> > imediate
> > 
> > source without having to emit additional instructions.
> > Unfortunately,
> > 
> > the 16-bit integer tests in CTS are not good to test set for
> > 
> > canonicalization, since they focus very specifically in 16-bit
> > paths
> > 
> > and I don't think we have many cases (if any) of boolean
> > instructions
> > 
> > mixing booleans of different bit-sizzes, so we will probably have
> > to
> > 
> > try and write our more tests to exercise these cases properly.
> > Also, it
> > 
> > might be that some hardware that supports multiple bit-sized
> > booleans
> > 
> > doesn't require canonicalization, in which case we might want to do
> > 
> > this configurable.
> 
> Assuming a back-end that can properly handle 8 and 16-bit booleans, I
> suspect "always go down" would be a reasonable for two reasons:
> 
>  1) It always uses no more bits than the biggest so it helps with
> register pressure
>  2) We can do the conversion "for free" either with a mixed-type
> operation (or of D and W) or, if that doesn't work, by using a
> register region to just reinterpret the D as a scattered out W.
> 
> For that matter, there's a part of me that's tempted to always spill
> to 8-bit for everything other than the comparison operation itself. 
> Again, this is all dependent on having a competent back-end compiler
> which we currently don't have. :-(

Right, doing 8-bit is probably not a good idea for now. Byte
instructions have this annoying restriction by which destinations need
to have a stride of 2 which I think doesn't play well with the
optimizer in general right now, but beyond that, the hardware doesn't
support byte immediates, which means that we lose algebraic and cmod
optimizations that rely on the is_zero(), is_one() and
is_negative_one() immediate helpers to work. This affects, for example,
SEL instructions, of which we have plenty, where we need emit a cmp
with 0 to right before to prepare the flag register. With the lowering
pass I have right now I see clearly better instruction counts for 16-
bit code but I still get worse counts for 8-bit booleans (compared to
always using 32-bit booleans) because of these reasons. I think that
not having byte immediates might be bad enough that we might want to
avoid emitting 8-bit booleans completely in the end. I still want to
test a few tings related to this though, such as trying to use 16-bit
immediates with 8-bit instructions and see if that works...
> > Another thing I noticed, is that with phi sources the
> > canonicalization
> > 
> > aspect is a bit trickier, since we can't just insert conversions
> > before
> > 
> > the phi instructions (nir_validate complains that this is not
> > allowed),
> > 
> > so instead we need to get to the instruction that generated the ssa
> > 
> > source and inject the conversion after it.
> 
> There's a nifty nir_before_src  cursor builder function that
> correctly handles phis.

Yeah, that's useful, thanks for pointing it out! One question about
this, the helper receives an 'is_if_condition' parameter which I guess
should be false for phi sources, is that correct?
> > Last thing I noticed is that our backend optimizer is much better
> > at
> > 
> > optimizing 32-bit booleans that it is at optimizing 16-bit, so
> > there
> > 
> > are some cases where we end up producing worse code. Obviously this
> > is
> > 
> > not a problem of the NIR lowering pass though, it just means that
> > we
> > 
> > need to do better at optimizing 16-bit code in the backend.
> > 
> > 
> > 
> > So all in all, I am quite happy with this. It is certainly easier
> > than
> > 
> > lowering in the backend and it gives us a more reasonable framework
> > to
> > 
> > work with. We still need to do some extra work if we want to do the
> > 
> > best lowering for things like boolean constants or
> > canonicalization,
> > 
> > but since neither SPIR-V nor GLSL define multiple bit-sizes for
> > 
> > booleans that would always have been something that drivers might
> > need
> > 
> > to address.
> > 
> > 
> > 
> > Iago
> > 
> > 
> > 
> > 
> > 
> > On Mon, 2018-10-22 at 17:13 -0500, Jason Ekstrand wrote:
> > 
> > > This is something that Connor and I have talked about quite a bit
> > 
> > > over the
> > 
> > > last couple of months.  The core idea is to replace NIR's current
> > 32-
> > 
> > > bit
> > 
> > > 0/-1 D3D10-style booleans with a 1-bit data type.  All in all, I
> > 
> > > think it
> > 
> > > worked out pretty well though I really don't like the
> > proliferation
> > 
> > > of
> > 
> > > 32-bit comparison opcodes we now have kicking around for i965.
> > 
> > > 
> > 
> > > Why?  No hardware really has a 1-bit type, right?  Well, sort
> > 
> > > of...  AMD
> > 
> > > actually uses 64-bit scalars for booleans with one bit per
> > 
> > > invocation.
> > 
> > > However, most hardware such as Intel uses some other larger value
> > for
> > 
> > > booleans.  The real benefit of 1-bit booleans and requiring a
> > 
> > > lowering pass
> > 
> > > is that you can do somewhat custom lowering (like AMD wants) and
> > your
> > 
> > > lowering pass can always tell in an instant if a value is a
> > boolean
> > 
> > > based
> > 
> > > on the bit size.  As can be seen in the last patch, this makes it
> > 
> > > really
> > 
> > > easy to implement a bool -> float lowering pass for hardware that
> > 
> > > doesn't
> > 
> > > have real integers where NIR's current booleans are actually
> > rather
> > 
> > > painful.
> > 
> > > 
> > 
> > > On Intel, the situation is a bit more complicated.  It's tempting
> > to
> > 
> > > say
> > 
> > > that we have 32-bit D3D10 booleans.  However, they're not really
> > 
> > > D3D10
> > 
> > > booleans on gen4-5 because the top 31 bits are undefined garbage
> > and,
> > 
> > > while
> > 
> > > iand, ior, ixor, and inot operations work, you have to iand with
> > 1 at
> > 
> > > the
> > 
> > > last minute to clear off all that garbage.  Also, on all
> > generations,
> > 
> > > a
> > 
> > > comparison of two N-bit values results in an N-bit boolean, not a
> > 32-
> > 
> > > bit
> > 
> > > bool.  This has caused the Igalia folks no end of trouble as
> > they've
> > 
> > > been
> > 
> > > working on native 8 and 16-bit support.  If, instead, we have a
> > 1-bit 
> > 
> > > bool
> > 
> > > with a lowering pass and we can lower to whatever we want, then
> > we
> > 
> > > could
> > 
> > > lower to a set of comparison opcodes that return the same bit-
> > size as
> > 
> > > they
> > 
> > > compare and it would match GEN hardware much better.
> > 
> > > 
> > 
> > > But what about performance?  Aren't there all sorts of neat
> > tricks we
> > 
> > > can
> > 
> > > do with D3D10 booleans like b & 1.0f for b2f?  As it turns out,
> > not
> > 
> > > really;
> > 
> > > that's about the only one.  There is some small advantage when
> > 
> > > optimizing
> > 
> > > shaders that come from D3D if your native representation of
> > booleans
> > 
> > > matches that of D3D.  However, penultimate patch in this series
> > adds
> > 
> > > a few
> > 
> > > small optimizations that get us to actually better than we were
> > 
> > > before.
> > 
> > > With the entire series, shader-db on Kaby Lak looks like this:
> > 
> > > 
> > 
> > >     total instructions in shared programs: 15084098 -> 14988578
> > (-
> > 
> > > 0.63%)
> > 
> > >     instructions in affected programs: 1321114 -> 1225594 (-
> > 7.23%)
> > 
> > >     helped: 2340
> > 
> > >     HURT: 23
> > 
> > > 
> > 
> > >     total cycles in shared programs: 369790134 -> 359798399 (-
> > 2.70%)
> > 
> > >     cycles in affected programs: 134085452 -> 124093717 (-7.45%)
> > 
> > >     helped: 2149
> > 
> > >     HURT: 720
> > 
> > > 
> > 
> > >     total loops in shared programs: 4393 -> 4393 (0.00%)
> > 
> > >     loops in affected programs: 0 -> 0
> > 
> > >     helped: 0
> > 
> > >     HURT: 0
> > 
> > > 
> > 
> > >     total spills in shared programs: 10158 -> 10051 (-1.05%)
> > 
> > >     spills in affected programs: 1429 -> 1322 (-7.49%)
> > 
> > >     helped: 8
> > 
> > >     HURT: 15
> > 
> > > 
> > 
> > >     total fills in shared programs: 22105 -> 21720 (-1.74%)
> > 
> > >     fills in affected programs: 2853 -> 2468 (-13.49%)
> > 
> > >     helped: 12
> > 
> > >     HURT: 15
> > 
> > > 
> > 
> > > How about ease of use?  Are they a pain to deal with?  Yes,
> > adding
> > 
> > > support
> > 
> > > for 1-bit types was a bit awkward in a few places but most of it
> > was
> > 
> > > dealing with all the places where we have 32-bit booleans baked
> > into
> > 
> > > assumptions.  Getting rid of that baking in solves the problem
> > and
> > 
> > > also
> > 
> > > makes the whole IR more future-proof.
> > 
> > > 
> > 
> > > All in all, I'd say I'm pretty happy with it.  However, I'd like
> > 
> > > other
> > 
> > > people (particularly the AMD folks) to play with it a bit and
> > verify
> > 
> > > that
> > 
> > > it solves their problems as well.  Also, I added a lowering pass
> > and
> > 
> > > tried
> > 
> > > to turn it on in everyone's driver but may not have put it in the
> > 
> > > right
> > 
> > > spot.  Please double-check my work.  For those wishing to take a
> > 
> > > look, you
> > 
> > > can also find the entire series on my gitlab here:
> > 
> > > 
> > 
> > > 
> > 
> > 
https://gitlab.freedesktop.org/jekstrand/mesa/commits/review/nir-1-bit-bool
> > 
> > > 
> > 
> > > Please review!
> > 
> > > 
> > 
> > > --Jason
> > 
> > > 
> > 
> > > Cc: Connor Abbott <cwabbott0 at gmail.com>
> > 
> > > Cc: Timothy Arceri <tarceri at itsqueeze.com>
> > 
> > > Cc: Eric Anholt <eric at anholt.net>
> > 
> > > Cc: Rob Clark <robdclark at gmail.com>
> > 
> > > Cc: Karol Herbst <karolherbst at gmail.com>
> > 
> > > Cc: Bas Nieuwenhuizen <bas at basnieuwenhuizen.nl>
> > 
> > > Cc: Alyssa Rosenzweig <alyssa at rosenzweig.io>
> > 
> > > 
> > 
> > > 
> > 
> > > Jason Ekstrand (31):
> > 
> > >   nir/validate: Print when the validation failed
> > 
> > >   nir/constant_folding: Add an unreachable to a switch
> > 
> > >   nir/constant_folding: Use nir_src_as_bool for discard_if
> > 
> > >   nir/builder: Add a nir_imm_true/false helpers
> > 
> > >   nir/builder: Handle 16-bit floats in nir_imm_floatN_t
> > 
> > >   nir/search: Use nir_builder
> > 
> > >   nir/opt_if: Rework condition propagation
> > 
> > >   nir/system_values: Use the bit size from the load_deref
> > 
> > >   glsl/nir: Use i2b instead of ine for fixing UBO/SSBO booleans
> > 
> > >   nir/prog: Use nir_bany in kill handling
> > 
> > >   spirv: Use the right bit-size for spec constant ops
> > 
> > >   spirv: Initialize subgroup destinations with the destination
> > type
> > 
> > >   intel/nir: Use the OPT macro for more passes
> > 
> > >   nir/algebraic: Disable b2f lowering and two optimizations
> > 
> > >   nir: Rename boolean-related opcodes to include 32 in the name
> > 
> > >   FIXUP: nir/builder: Generate 32-bit bool opcodes transparently
> > 
> > >   FIXUP: nir/algebraic: Remap boolean opcodes to the 32-bit
> > variant
> > 
> > >   FIXUP: Use 32-bit opcodes in the NIR producers and
> > optimizations
> > 
> > >   FIXUP: Use 32-bit opcodes in the NIR back-ends
> > 
> > >   nir/constant_expressions: Rework boolean handling
> > 
> > >   nir: Add support for 1-bit data types
> > 
> > >   nir: Add 1-bit boolean opcodes
> > 
> > >   nir: Add a bool to int32 lowering pass
> > 
> > >   glsl,spirv: Generate 1-bit booleans
> > 
> > >   FIXUP: Revert "Use 32-bit opcodes in the NIR producers and
> > 
> > >     optimizations"
> > 
> > >   FIXUP: Revert "nir/builder: Generate 32-bit bool opcodes
> > 
> > >     transparently"
> > 
> > >   FIXUP: nir/builder: Generate 1-bit booleans in
> > nir_build_imm_bool
> > 
> > >   nir/algebraic: Optimize 1-bit booleans
> > 
> > >   nir/algebraic: Add back lower_b2f and a b2f optimization
> > 
> > >   nir/algebraic: Add some optimizations for D3D-style booleans
> > 
> > >   nir: Add a bool to float32 lowering pass
> > 
> > > 
> > 
> > >  src/amd/common/ac_nir_to_llvm.c               |  30 +--
> > 
> > >  src/amd/vulkan/radv_shader.c                  |   6 +-
> > 
> > >  src/broadcom/compiler/nir_to_vir.c            |  44 ++---
> > 
> > >  src/broadcom/compiler/vir.c                   |   2 +
> > 
> > >  src/compiler/Makefile.sources                 |   2 +
> > 
> > >  src/compiler/glsl/glsl_to_nir.cpp             |  28 +--
> > 
> > >  src/compiler/nir/meson.build                  |   2 +
> > 
> > >  src/compiler/nir/nir.c                        |  15 +-
> > 
> > >  src/compiler/nir/nir.h                        |  36 ++--
> > 
> > >  src/compiler/nir/nir_algebraic.py             |  22 ++-
> > 
> > >  src/compiler/nir/nir_builder.h                |  51 ++++-
> > 
> > >  src/compiler/nir/nir_constant_expressions.py  |  53 ++---
> > 
> > >  src/compiler/nir/nir_instr_set.c              |  23 ++-
> > 
> > >  src/compiler/nir/nir_lower_alu_to_scalar.c    |   4 +
> > 
> > >  src/compiler/nir/nir_lower_bool_to_float.c    | 181
> > 
> > > ++++++++++++++++++
> > 
> > >  src/compiler/nir/nir_lower_bool_to_int32.c    | 162
> > ++++++++++++++++
> > 
> > >  src/compiler/nir/nir_lower_int64.c            |   2 +-
> > 
> > >  .../nir/nir_lower_load_const_to_scalar.c      |   3 +
> > 
> > >  src/compiler/nir/nir_lower_returns.c          |   4 +-
> > 
> > >  src/compiler/nir/nir_lower_subgroups.c        |   2 +-
> > 
> > >  src/compiler/nir/nir_lower_system_values.c    |   1 +
> > 
> > >  src/compiler/nir/nir_opcodes.py               |  31 ++-
> > 
> > >  src/compiler/nir/nir_opt_algebraic.py         |  19 +-
> > 
> > >  src/compiler/nir/nir_opt_constant_folding.c   |  18 +-
> > 
> > >  src/compiler/nir/nir_opt_if.c                 |  91 +++------
> > 
> > >  src/compiler/nir/nir_opt_intrinsics.c         |   2 +-
> > 
> > >  src/compiler/nir/nir_opt_large_constants.c    |   5 +
> > 
> > >  src/compiler/nir/nir_print.c                  |   3 +
> > 
> > >  src/compiler/nir/nir_search.c                 | 112 +++--------
> > 
> > >  src/compiler/nir/nir_search.h                 |   9 +-
> > 
> > >  src/compiler/nir/nir_validate.c               |  16 +-
> > 
> > >  src/compiler/nir/tests/vars_tests.cpp         |  38 ++--
> > 
> > >  src/compiler/nir_types.cpp                    |   2 +-
> > 
> > >  src/compiler/nir_types.h                      |   4 +-
> > 
> > >  src/compiler/spirv/spirv_to_nir.c             |  23 ++-
> > 
> > >  src/compiler/spirv/vtn_alu.c                  |   4 +-
> > 
> > >  src/compiler/spirv/vtn_cfg.c                  |  10 +-
> > 
> > >  src/compiler/spirv/vtn_subgroup.c             |  12 +-
> > 
> > >  src/gallium/auxiliary/nir/tgsi_to_nir.c       |  20 +-
> > 
> > >  .../drivers/freedreno/ir3/ir3_compiler_nir.c  |  31 +--
> > 
> > >  src/gallium/drivers/radeonsi/si_shader_nir.c  |   2 +
> > 
> > >  src/gallium/drivers/vc4/vc4_program.c         |  50 ++---
> > 
> > >  src/intel/compiler/brw_fs_nir.cpp             |  80 ++++----
> > 
> > >  src/intel/compiler/brw_nir.c                  |  16 +-
> > 
> > >  .../brw_nir_analyze_boolean_resolves.c        |  24 +--
> > 
> > >  .../compiler/brw_nir_lower_image_load_store.c |   2 +-
> > 
> > >  src/intel/compiler/brw_vec4_nir.cpp           | 122 ++++++------
> > 
> > >  src/intel/vulkan/anv_pipeline.c               |   2 +-
> > 
> > >  src/mesa/drivers/dri/i965/brw_program.c       |   4 +-
> > 
> > >  src/mesa/main/glspirv.c                       |   2 +-
> > 
> > >  src/mesa/program/prog_to_nir.c                |   2 +-
> > 
> > >  src/mesa/state_tracker/st_glsl_to_nir.cpp     |   2 +-
> > 
> > >  52 files changed, 934 insertions(+), 497 deletions(-)
> > 
> > >  create mode 100644 src/compiler/nir/nir_lower_bool_to_float.c
> > 
> > >  create mode 100644 src/compiler/nir/nir_lower_bool_to_int32.c
> > 
> > > 
> > 
> > 
> > 
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://lists.freedesktop.org/archives/mesa-dev/attachments/20181108/4b9c2d85/attachment-0001.html>