[Mesa-dev] intel: WIP: Support for using 16-bits for mediump

Wed Nov 7 06:53:41 UTC 2018

On Tue, 2018-11-06 at 11:04 +0200, Pohjolainen, Topi wrote:
> On Tue, Nov 06, 2018 at 09:40:17AM +0100, Iago Toral wrote:
> > On Tue, 2018-11-06 at 08:30 +0200, Topi Pohjolainen wrote:
> > > Here is a version 2 of adding support for 16-bit float
> > > instructions
> > > in
> > > the shader compiler. Unlike the first version which did all the
> > > analysis
> > > at glsl level here one adds the notion of precision to NIR
> > > variables
> > > and
> > > does the analysis and precision lowering in NIR level.
> > > 
> > > This lives in: gitlab.freedesktop.org:tpohjola/mesa and branch
> > > fp16.
> > > 
> > > This is now mature enough to be able to use 16-bit precision for
> > > all
> > > instructions except a few special cases for gfxbench trex and
> > > alu2.
> > > (Unfortunately I'm not seeing any performance benefit.
> > 
> > That is not too surprising. The backend optimizer has been
> > implemented
> > in terms of 32-bit and you are probably losing a lot of
> > optimizations
> > in the generated code for 16-bit paths. I have hit some of that as
> > well
> > while working on the backend aspects of enabling 16-bit. For
> > example,
> > SIMD8 executions (which is all of the geometry pipeline) will not
> > benefit from copy-propagation because is_partial_write() is always
> > true
> > for SIMD8 16-bit instructions with its current semantics. There are
> > other optimization passes that have hard-coded 32-bit conditions,
> > etc. I
> > have addressed a small part of this and have some code available
> > that I
> > expect to send for review soon, but there is clearly work to be
> > done in
> > the backend to optimize things for 16-bit paths, which I hope to
> > work
> > on in the near future.
> 
> I have added the concept of padding registers in virtual space in
> order
> to keep all the optimizations functioning. Comparing side-by-side the
> brw_fs instructions between 32-bit and 16-bit versions (of t-rex at
> least)
> tells me that 16-bit version is equivalent to 32-bit. Only extra
> things
> are conversions from 32-bit input varyings to 16-bits and conversions
> from
> 16-bits to 32-bits texture coordinates. Both of which I'm working on.

That's surprising, because there are optimization paths in the backend
that are clearly hardcoded for 32-bit register types. It could be that
t-rex isn't affected by any of this though.
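
To make the is_partial_write() example concrete, here is a rough Python
model of the size condition. It is only a sketch: it assumes the check
boils down to comparing the bytes an instruction writes against the
32-byte GRF size, and the function name and parameters are made up for
illustration, they are not the backend's API.

```python
REG_SIZE = 32  # bytes in one GRF register

def writes_partial_register(exec_size: int, type_size: int) -> bool:
    """Hypothetical model: a write counts as partial when it covers
    less than one full register (channels * bytes per channel)."""
    return exec_size * type_size < REG_SIZE

# SIMD8 with 32-bit types fills a whole register; SIMD8 with 16-bit
# types fills only half of one, so it is always flagged as partial.
for label, exec_size, type_size in [("SIMD8 F", 8, 4),
                                    ("SIMD8 HF", 8, 2),
                                    ("SIMD16 HF", 16, 2)]:
    print(label, writes_partial_register(exec_size, type_size))
```

Under that assumption any pass that bails out on partial writes, such as
copy-propagation, never fires for SIMD8 half-float code.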

Anyway, if the code for t-rex is similarly optimized (modulo
input/output conversions) it is very disappointing that we're not seeing
any advantage :-/

> > 
> > >  This is not
> > > that surprising as I got to the same point with the glsl-based
> > > solution and was able to measure the performance already back
> > > then).
> > > Hence I thought it is time to share it.
> > > 
> > > While this is still work-in-progress I didn't want to flood the
> > > list
> > > with the full set of patches but instead included the very last
> > > where
> > > I try to outline the logic and its current shortcomings. There is
> > > also
> > > a short list of TODO items.
> > > 
> > > In addition to those I need to examine a couple of Intel-specific
> > > misrenderings. I haven't gotten that deep yet but it looks like I'm
> > > missing
> > > something with 16-bit inot and mad/mac lowered interpolation.
> > > Unfortunately I get corrupted rendering only with hardware while
> > > simulator is happy.
> > 
> > Are you implementing interpolation of 16-bit fragment shader
> > inputs? I
> > have discussed that with Jason in the past and based on my
> > experimentation, I think the hardware doesn't support this
> > natively:
> > the interpolator seems to produce 32-bit deltas only and assumes
> > 32-bit 
> > inputs only as well.
> 
> Matt has added lowering of pln() to mad/mac for gen11+. In addition,
> hardware allows mixed mode operations on mad() and hence we can
> interpolate
> mixing with 16-bit and 32-bit operands producing 16-bit results.
> Therefore
> leveraging both these I can keep the incoming sources to the shader
> as 32-bits
> but interpolate with 16-bit precision. Looking like this in practise:
> 
> mad(8)  g11<1>HF  g4.3<0,1,0>F    g2<4,4,1>F   g4.0<0,1,0>F { align16 1Q };
> mad(8)  g11<1>HF  g11<4,4,1>HF    g3<4,4,1>F   g4.1<0,1,0>F { align16 1Q };
> mad(8)  g12<1>HF  g4.7<0,1,0>F    g2<4,4,1>F   g4.4<0,1,0>F { align16 1Q };
> mad(8)  g12<1>HF  g12<4,4,1>HF    g3<4,4,1>F   g4.5<0,1,0>F { align16 1Q };

Ok, so you are using 32-bit inputs in the FS, that should be fine then.
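
Just to check my reading of the mad() pairs quoted above, this is how I
would model one interpolated component in Python. It is a sketch under
assumptions: I take mad as dst = src0 + src1 * src2, I treat g2/g3 as
the per-pixel barycentrics and the g4 components as the plane deltas
and constant, and I simply round each result once to half-float. How the
hardware rounds internally is not something this models, and all the
function names are made up.

```python
import struct

def to_half(x: float) -> float:
    """Round a value to the nearest IEEE half-float (struct format 'e')."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def mad_hf(src0: float, src1: float, src2: float) -> float:
    """Model of mad with an HF destination: src0 + src1 * src2, then round."""
    return to_half(src0 + src1 * src2)

def interpolate(c: float, dx: float, dy: float, bx: float, by: float) -> float:
    """Two chained mads: 32-bit plane inputs, 16-bit intermediate and result."""
    tmp = mad_hf(c, bx, dx)      # first mad: F sources, HF destination
    return mad_hf(tmp, by, dy)   # second mad: HF accumulator mixed with F sources

print(interpolate(0.5, 0.25, 0.125, 1.0, 2.0))
print(to_half(0.1))  # shows the rounding a 16-bit destination introduces
```

The point being that the intermediate is already rounded to 16 bits
after the first mad, so the final value can differ slightly from a full
32-bit pln().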

> Although there is something amiss with this, I'm getting partially
> corrupted data with hardware while simulator is happy. Working on
> it...
> 
> > 
> > > Mostly I'm afraid how to test all of this properly. I haven't
> > > written
> > > any unit tests but that is high on my list. This is mostly
> > > because
> > > I've
> > > been uncertain about my design choices. So far I've used shader
> > > runner tests that I've written for specific cases. These are
> > > useful
> > > for
> > > development purposes but don't bring much value for regression
> > > testing.
> > 
> > Have you tried dEQP / GLES CTS yet? I figure there should be a lot
> > of
> > mediump shaders there.
> 
> I need to take a look. What I'm afraid of is checking for precision
> of
> results, i.e., checking that compiler doesn't lower too much, I hope
> those
> tests are addressing this.

Yes, also, I don't think that requirements for mediump precision are
very well defined in the GL ES specs either...

> > 
> > Another note on 16-bit booleans, since I see you've been working on
> > that, I don't know if you're aware that Jason has posted relevant
> > patches here:
> > 
> > 
> > https://lists.freedesktop.org/archives/mesa-dev/2018-October/207458.html
> > 
> > This basically introduced the notion of bit-sized booleans in NIR,
> > and
> > it leaves it up to the backend to lower booleans to the bit-size
> > they
> > need before translating to a backend IR.  I have been working on
> > that
> > lowering and have a prototype working for 16-bit booleans built on
> > top
> > of Jason's series and my backend work for half-float. Let me know
> > if
> > you are interested and I'll point you to the code.
> 
> I'll have a look. I had to address that as well, and if you want to
> take a
> look:
> 
> nir: Add lowering pass setting 16-bit boolean destinations
> nir: Add lowering pass turning b2f(i2i32(x)) into b2f(x)
> Revert "intel/compiler: fix 16-bit comparisons"
> 
> I haven't yet tried to rebase on top of Jason's work. In case of GLSL
> I can't
> just lower in the backend. During the analysis one needs to know if
> sources
> are produced in 16-bit precision and mark this somehow down. I've
> chosen to
> sprinkle artificial i2i16 and i2i32 operations for that and to remove
> them
> once validation is done.
> 
> > 
> > Iago
> > 
> > > Alejandro Piñeiro (1):
> > >   intel/compiler/fs: Use half_precision data_format on 16-bit fb writes
> > > 
> > > Jose Maria Casanova Crespo (2):
> > >   intel/compiler/fs: Include support for RT data_format bit
> > >   intel/compiler/disasm: Show half-precision data_format on rt_writes
> > > 
> > > Topi Pohjolainen (58):
> > >   intel/compiler/fs: Set 16-bit sampler return format
> > >   intel/compiler/disasm: Show half-precision for sampler messages
> > >   intel/compiler/fs: Skip tex-inst early in conversion lowering
> > >   intel/compiler/fs: Support for dumping 16-bit IMM values
> > >   intel/compiler: Allow 16-bit math
> > >   intel/compiler/fs: Add helpers for 16-bit null regs
> > >   intel/compiler/fs: Use two SIMD8 instructions for 16-bit math
> > >   intel/compiler/fs: Use 16-bit null dest with 16-bit math
> > >   intel/compiler/fs: Use 16-bit null dest with 16-bit compare
> > >   intel/compiler/fs: Add 16-bit type support for nir_if
> > >   intel/compiler/eu: Prepare 3-src-op for 16-bit sources
> > >   intel/compiler/eu: Prepare 3-src-op for 16-bit dst
> > >   intel/compiler/eu: Allow 3-src-op with mixed precision (HF/F) sources
> > >   intel/compiler/disasm: Print mixed precision 3-src types correctly
> > >   intel/compiler/disasm: Print 16-bit IMM values
> > >   intel/compiler/fs: Support for combining 16-bit immediates
> > >   intel/compiler/fs: Set tex type for generator to flag fp16
> > >   intel/compiler/fs: Use component_size() instead of open coded
> > >   intel/compiler/fs: Add register padding support
> > >   intel/compiler/fs: Pad 16-bit texture return payloads
> > >   intel/compiler/fs: Pad 16-bit output (store/fb write) payloads
> > >   intel/compiler/fs: Pad 16-bit nir vec* components into full reg
> > >   intel/compiler/fs: Pad 16-bit nir intrinsic dest into full reg
> > >   intel/compiler/fs: Pad 16-bit const loads into full regs
> > >   intel/compiler/fs: Pad 16-bit load payload lowering
> > >   nir: Lower also 16-bit lrp() if needed
> > >   intel/compiler: Lower 16-bit lrp()
> > >   nir: Recognize f232(f216(x)) as x
> > >   nir: Recognize f216(f232(x)) as x
> > >   nir: Store variable precision when translating from glsl
> > >   glsl: Set default precision for builtin variables
> > >   i965: Prepare uniform mapping for 16-bit values
> > >   i965: Support for uploading 16-bit uniforms from 32-bit store
> > >   intel/compiler/fs: WIP: Use 32-bit slots for 16-bit uniforms
> > >   intel/compiler: Tell compiler if lower precision is supported
> > >   nir: Add lowering pass for variables marked mediump
> > >   nir: Add pass for deref precision lowering
> > >   nir: Add pass for alu precision lowering
> > >   nir: Add precision conversion for load/store_deref
> > >   nir: Add precision conversion for sources of texturing ops
> > >   nir: Don't set destination size 16 for booleans
> > >   nir: Add precision lowering for texture samples
> > >   nir: Add support for non-fixed precision
> > >   nir: Don't try to alter precision of boolean sources
> > >   nir: Add support for variable sized booleans
> > >   nir: Add support for lowering phi precision
> > >   intel/compiler/fs: Prepare alu dest type for 16-bit booleans
> > >   nir: Add lowering pass setting 16-bit boolean destinations
> > >   nir: Add lowering pass turning b2f(i2i32(x)) into b2f(x)
> > >   nir: Adjust integer precision for alus operating with 16-bit srcs
> > >   nir: Replace b2f(x) with b2f(i2i32(x)) for 16-bit x
> > >   nir: Adjust precision for discard_if
> > >   nir: Allow input varyings to be converted to lower precision
> > >   nir: Replace 16-bit src[0] for bcsel i2i32(src[0])
> > >   nir: Replace 16-bit nir_if condition with i2i32(condition)
> > >   Revert "intel/compiler: fix 16-bit comparisons"
> > >   intel/compiler: Hook in precision lowering pass
> > >   nir: Document precision lowering pass
> > > 
> > >  src/compiler/Makefile.sources                 |   2 +
> > >  src/compiler/glsl/glsl_symbol_table.cpp       |  20 +
> > >  src/compiler/glsl/glsl_symbol_table.h         |   7 +
> > >  src/compiler/glsl/glsl_to_nir.cpp             |   1 +
> > >  src/compiler/nir/meson.build                  |   2 +
> > >  src/compiler/nir/nir.h                        |  18 +
> > >  src/compiler/nir/nir_lower_bool_size.c        | 120 +++
> > >  src/compiler/nir/nir_lower_precision.cpp      | 820 ++++++++++++++++++
> > >  src/compiler/nir/nir_opt_algebraic.py         |   5 +
> > >  src/intel/blorp/blorp.c                       |   4 +-
> > >  src/intel/compiler/brw_compiler.c             |   1 +
> > >  src/intel/compiler/brw_disasm.c               |  28 +-
> > >  src/intel/compiler/brw_eu.h                   |   3 +-
> > >  src/intel/compiler/brw_eu_emit.c              |  83 +-
> > >  src/intel/compiler/brw_fs.cpp                 |  68 +-
> > >  src/intel/compiler/brw_fs.h                   |   4 +-
> > >  src/intel/compiler/brw_fs_builder.h           |  37 +-
> > >  .../compiler/brw_fs_combine_constants.cpp     |  84 +-
> > >  .../compiler/brw_fs_copy_propagation.cpp      |   7 +-
> > >  src/intel/compiler/brw_fs_generator.cpp       |  13 +-
> > >  .../compiler/brw_fs_lower_conversions.cpp     |  42 +
> > >  src/intel/compiler/brw_fs_nir.cpp             | 197 +++--
> > >  src/intel/compiler/brw_fs_surface_builder.cpp |   3 +-
> > >  src/intel/compiler/brw_fs_visitor.cpp         |   6 +
> > >  src/intel/compiler/brw_inst.h                 |   5 +
> > >  src/intel/compiler/brw_ir_fs.h                |  16 +
> > >  src/intel/compiler/brw_nir.c                  |  22 +-
> > >  src/intel/compiler/brw_nir.h                  |   4 +-
> > >  src/intel/compiler/brw_reg_type.c             |   2 +
> > >  src/intel/compiler/brw_shader.h               |   7 +
> > >  src/intel/vulkan/anv_pipeline.c               |   2 +-
> > >  .../drivers/dri/i965/brw_nir_uniforms.cpp     |   8 +-
> > >  src/mesa/drivers/dri/i965/brw_program.c       |  10 +-
> > >  src/mesa/drivers/dri/i965/brw_program.h       |   6 +-
> > >  src/mesa/drivers/dri/i965/brw_tcs.c           |   2 +-
> > >  .../drivers/dri/i965/gen6_constant_state.c    |  14 +-
> > >  36 files changed, 1548 insertions(+), 125 deletions(-)
> > >  create mode 100644 src/compiler/nir/nir_lower_bool_size.c
> > >  create mode 100644 src/compiler/nir/nir_lower_precision.cpp
> > > 
> 
>