[Mesa-dev] i965: Kicking off fp16 glsl support

Fri Nov 24 12:26:27 UTC 2017

After Igalia's work on SPIR-V 16-bit storage, the question arose of
how much more is needed to optimize GLES lowp/mediump with 16-bit
floats. I took glb 2.7 trex as a target and started drafting a GLSL
lowering pass that re-types mediump floats as float16. In parallel,
I added the equivalent support bit by bit to the GLSL -> NIR pass and
to the Intel compiler backend.

This series enables lowering for fragment shaders only. This was
sufficient for trex which doesn't use mediump precision for vertex
shaders.

First of all, this is not complete work. I'd like to think of it more
as an attempt to show what is currently missing, with concrete (if
not ideal) solutions that make each case a little clearer.

On SKL this runs trex pretty much on par with 32-bit. Intel
hardware doesn't have native support for linear interpolation using
16-bit floats, and therefore pln() and lrp() incur additional moves
from 16 bits to 32 bits (and vice versa). Both can be replaced
relatively efficiently using mad() later on.
Comparing shader dumps between 16-bit and 32-bit indicates that all
optimization passes kick in nicely (sampler-eot, mad(), etc). The
only additions are the aforementioned conversion instructions.
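
To make the mad() replacement concrete, here is a small Python sketch
of the algebra only. It assumes lrp() is the usual linear
interpolation x*(1-a) + y*a and that pln() evaluates a plane equation
of the form a*x + b*y + c; the function names are illustrative and do
not come from the series.

```python
def mad(a, b, c):
    """Multiply-add: a * b + c."""
    return a * b + c


def lrp_reference(x, y, a):
    """Linear interpolation written out directly."""
    return x * (1.0 - a) + y * a


def lrp_via_mad(x, y, a):
    """Same interpolation using two multiply-adds.

    x*(1-a) + y*a == (x - a*x) + a*y
    """
    t = mad(-a, x, x)    # x - a*x
    return mad(a, y, t)  # a*y + (x - a*x)


def plane_via_mad(a, b, c, x, y):
    """Plane equation a*x + b*y + c using two multiply-adds."""
    return mad(a, x, mad(b, y, c))
```

Each replacement costs two mad() instructions, which only pays off if
it avoids the 16-bit/32-bit conversion moves around the original
instruction.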

The series starts with miscellaneous bits needed in GLSL and NIR.
This is followed by the equivalent bits in the Intel compiler backend.
These are followed by changes that are subject to more debate:

1) Support for SIMD8 fp16 in liveness analysis, copy propagation,
   dead code elimination, etc.

   In order to tell if one instruction fully overwrites the results
   of another, one needs to examine how much of a register is written.
   Until now this has been done at the granularity of a full or
   partial register, i.e., there is no concept of a "full sub-region
   write". And until now there was no need, as all data types took
   4 bytes per element, resulting in a full 32-byte register even in
   the case of SIMD8. Partial writes were all special and could be
   safely ignored in various analysis passes.
   Half precision floats, however, break this assumption. On SIMD8 a
   full write with 16-bit elements results in half a register.
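
The arithmetic behind this can be sketched in a few lines of Python.
This is a simplified model, not backend code: the 32-byte register
size comes from the text above, and the function names are made up
for illustration.

```python
GRF_BYTES = 32  # size of one full register, as stated in the text


def bytes_written(simd_width, type_size):
    """Bytes covered by one write of simd_width elements."""
    return simd_width * type_size


def is_full_register_write(simd_width, type_size):
    """True if the write covers at least one whole register."""
    return bytes_written(simd_width, type_size) >= GRF_BYTES
```

With 4-byte elements a SIMD8 write covers 8 * 4 = 32 bytes, a whole
register. With 2-byte elements it covers 8 * 2 = 16 bytes, so a write
of every channel still looks like a partial write at register
granularity.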

   I tried patching the different passes that examine partial writes
   one by one, but that started to get out of hand. Moreover, just by
   looking at a register's type size it is not safe to say whether it
   really is a full write or not.
   The solution here is to store this information explicitly in the
   registers: I added a new member, fs_reg::pad_per_component.
   Patching fs_reg::component_size() to take the padding into account
   then propagates the information to all users.
   Patch 28 updates a few users to use component_size() instead of
   open-coded computations, 29 adds the actual support and 30-35
   update NIR -> FS to signal the padding (these are separated just
   for review).
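
As a rough illustration of the idea, the following Python model shows
how a per-component padding field can make a 16-bit SIMD8 component
report a full-register size. It is a sketch under assumptions: the
class is hypothetical, and treating pad_per_component as a byte count
added once per component is a guess at the semantics, not a
description of the actual fs_reg code.

```python
from dataclasses import dataclass

GRF_BYTES = 32  # one full register


@dataclass
class Reg:
    """Toy stand-in for a virtual register (hypothetical)."""
    type_size: int              # bytes per element
    pad_per_component: int = 0  # assumed: padding bytes per component

    def component_size(self, width):
        """Bytes occupied by one component, padding included."""
        return width * self.type_size + self.pad_per_component


def is_partial_write(reg, width):
    """A write is partial if it does not end on a register boundary."""
    return reg.component_size(width) % GRF_BYTES != 0
```

In this model Reg(type_size=2) is a partial write at SIMD8, while
Reg(type_size=2, pad_per_component=16) has a component size of 32
bytes and is treated like any 32-bit register by passes that only
consult component_size().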

   It should be noted that here one deals with virtual registers.
   The final hardware allocator is separate, and using full registers
   in virtual space shouldn't prevent it from using tighter
   packing.

   Chema, this overlaps with your work, I hope you don't mind.

2) Booleans produced from 16-bit sources. Whereas for GLSL and for
   NIR booleans are just booleans, on Intel hardware they are integers,
   and the representation depends on how they are produced. Comparisons
   (flt, fge, feq and fne) with 32-bit sources produce 32-bit results
   (0x00000000/0xFFFFFFFF), while with 16-bit sources one gets 16-bit
   results (0x0000/0xFFFF).
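
A minimal Python model of this behaviour, assuming only what is
stated above (all bits set for true, zero for false, with the result
width following the source width). The function names are
illustrative.

```python
def compare_result(condition, src_bits):
    """Integer boolean produced by a comparison on src_bits-wide sources."""
    all_ones = (1 << src_bits) - 1
    return all_ones if condition else 0


def flt(a, b, src_bits):
    """Model of a 'less than' comparison for the given source width."""
    return compare_result(a < b, src_bits)
```

Here flt(1.0, 2.0, 32) yields 0xFFFFFFFF while flt(1.0, 2.0, 16)
yields 0xFFFF, so the consumer of a boolean has to know which kind of
instruction produced it.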

   I thought about introducing a 16-bit boolean into NIR but that
   felt like too hardware-specific a thing to do. Instead I patched
   NIR -> FS to take the type of the producing instruction into account
   when setting up the SSA values. See patch 39 for the setup and
   patches 36-38 for consulting the backend SSA store instead of
   relying on NIR.

   Another approach left to try is emitting additional moves to
   32 bits (the same way we do for fp64). One could then add an
   optimization pass that removes unnecessary moves and uses
   strided sources instead.

3) Following up on 2), GLSL -> NIR decides to emit integer-typed
   and/or/xor even for originally boolean-typed logic ops. Patch 40
   tries to cope with the case where the booleans are produced with
   non-matching precision.
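
The sketch below shows why mixed widths are a problem for integer
logic ops, and one possible way to reconcile them by sign-extending
the narrower operand. This is an illustration only and not
necessarily what patch 40 does; the helper names are made up.

```python
MASK32 = 0xFFFFFFFF


def sign_extend_16_to_32(value16):
    """Widen a 16-bit integer boolean so that true stays all ones."""
    value16 &= 0xFFFF
    if value16 & 0x8000:
        return value16 | 0xFFFF0000
    return value16


def not32(value):
    """32-bit bitwise NOT."""
    return ~value & MASK32


true16 = 0xFFFF      # true from a 16-bit comparison
true32 = 0xFFFFFFFF  # true from a 32-bit comparison

# Treating the 16-bit value as if it were already 32 bits wide:
naive_and = true16 & true32
# Widening first keeps the canonical 32-bit representation:
widened_and = sign_extend_16_to_32(true16) & true32
```

The naive result is 0x0000FFFF. It is still non-zero, but inverting
it gives 0xFFFF0000, which is non-zero as well, so "not true" would
read as true. After widening, the result is 0xFFFFFFFF and its
inverse is 0.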

4) In the Intel compiler backend and the push/pull constant setup,
   things rely on values being packed in 32-bit slots. Moreover, these
   slots are typeless and the loader doesn't know if it is dealing
   with floats or integers, let alone their precision. Patch 42
   takes the first step and simply adds type information into the
   backend. This is not particularly pretty but I had to start
   somewhere. This allows the loader to convert float values from
   the 32-bit store in the core to 16 bits on the fly. Patch 43
   adjusts the compiler to use 32-bit slots.
   Using 16-bit slots would require substantially more work.
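
The on-the-fly conversion can be pictured with the Python sketch
below. It only models the data transformation: the function name is
hypothetical, and placing the 16-bit value in the low half of the
32-bit slot with the upper half zeroed is an assumption made for the
example.

```python
import struct


def f32_slot_to_f16_slot(slot32):
    """Convert a 32-bit slot holding a float32 bit pattern into a
    32-bit slot holding the float16 bit pattern in its low 16 bits."""
    # Reinterpret the 32-bit integer as an IEEE float32 value.
    value = struct.unpack('<f', struct.pack('<I', slot32))[0]
    # Round to IEEE float16 and read back its 16-bit pattern.
    # struct raises OverflowError if the value does not fit in fp16.
    half_bits = struct.unpack('<H', struct.pack('<e', value))[0]
    return half_bits
```

The loader can only do this if it knows the slot holds a float: an
integer uniform with the same bit pattern must be left untouched,
which is why the type information has to reach the backend.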

   I think there is no question about the core using 32-bit values.
   And even if the values there were 16-bit, the backend would still
   need to know the types.

   My feeling is that we just need to rewrite a fair amount of the
   Intel push/pull constant setup.

5) Patches 44-50 are all about the GLSL lowering pass. This is
   really work in progress. What I have here is a crude attempt to
   do everything in one pass. It also has several hacks working
   around shortcomings in the Intel backend.

   The short story is that there are quite a few things which don't
   have a precision, and the compiler needs to analyze expressions
   recursively in order to know what precision to use.

   Take, for example, variables that don't have a precision but are
   referred to from multiple locations. These require the compiler
   to examine all the expressions involved and use full precision
   for the variable even if only one of the expressions requires it.
   This in turn alters the requirements in the other expressions -
   the compiler would need to emit conversions for them. And I don't
   think this can be done cleanly in one pass.
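
A toy Python model of why a single pass is not enough. The rule used
here is a deliberate simplification and an assumption of this sketch:
an expression needs the highest precision among its operands, and a
variable without a precision qualifier is raised to the precision of
any expression it appears in. The names and data layout are made up.

```python
HALF, FULL = 16, 32


def resolve(expressions, declared):
    """Iterate to a fixed point over variable precisions.

    expressions: list of sets of variable names used together.
    declared: dict of variables with an explicit precision.
    Returns a dict mapping every variable to HALF or FULL.
    """
    precision = dict(declared)
    for variables in expressions:
        for name in variables:
            precision.setdefault(name, HALF)  # optimistic start

    changed = True
    while changed:
        changed = False
        for variables in expressions:
            need = max(precision[name] for name in variables)
            for name in variables:
                # Only variables without a qualifier may be raised.
                if name not in declared and precision[name] < need:
                    precision[name] = need
                    changed = True
    return precision
```

With expressions [{"t", "u"}, {"t", "h"}] and h declared as FULL, the
first sweep leaves u at HALF and raises t to FULL only when it
reaches the second expression. A second sweep is then needed to raise
u, which is the kind of back-propagation described above.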

   I also realized that there may be cases where the compiler
   would need to use full precision instead of half in order to
   emit optimal code. Such shaders sound just evil and
   I don't even want to think about that now. There is more than
   enough work to get even the rules covered...

This series doesn't touch the hardware register allocator - it still
allocates one full register per 16-bit float component even in
the case of SIMD8.

The patches can be found here (rebased on current master and
Igalia's work):

git://people.freedesktop.org/~tpohjola/mesa:16_bit_gles

There are also some simple shader runner tests I wrote along
the way:

git://people.freedesktop.org/~tpohjola/piglit:fp16

All feedback is very welcome. I'm prepared to keep on working on
this if people find it useful. Personally I'd be curious to add
fp16 for pln() and lrp() and see if 16-bit could beat 32-bit
performance-wise. Proper push/pull constant support is another
thing on the list. A hardware register allocator with sub-register
support sounds both interesting and scary.

CC: Jose Maria Casanova Crespo <jmcasanova at igalia.com>
CC: Jason Ekstrand <jason at jlekstrand.net>
CC: Kenneth Graunke <kenneth at whitecape.org>
CC: Matt Turner <mattst88 at gmail.com>
CC: Ian Romanick <idr at freedesktop.org>
CC: Francisco Jerez <currojerez at riseup.net>

Topi Pohjolainen (51):
  nir: Prepare constant folding for 16-bits
  nir: Prepare constant lowering for 16-bits constants
  nir: Add 16-bit float support into algebraic opts
  glsl: Print 16-bit constants
  nir: Print 16-bit constants
  glsl: Add support for 16-bit float constants in nir-conversion
  glsl: Add conversion ops to/from 16-bit floats
  glsl: Add more conversion ops to/from 16-bit floats
  glsl: Allow 16-bit neg() and dot()
  glsl: Allow 16-bit math
  glsl: Enable 16-bit texturing in nir-conversion
  intel/compiler/disasm: Print 16-bit IMM values
  intel/compiler/disasm: Print fp16 also for sampler messages
  intel/compiler/fs: Support for dumping 16-bit IMM values
  intel/compiler: Add support for loading 16-bit constants
  intel/compiler: Move type_size_scalar() into brw_shader.cpp
  intel/compiler: Prepare for glsl mediump float uniforms
  intel/compiler: Allow 16-bit math
  intel/compiler/fs: Add helpers for 16-bit null regs
  intel/compiler/fs: Use two SIMD8 instructions for 16-bit math
  intel/compiler/fs: Use 16-bit null dest with 16-bit math
  intel/compiler/fs: Use 16-bit null dest with 16-bit compare
  intel/compiler: Prepare for 16-bit 3-src ops
  intel/compiler: Add support for negating 16-bit floats
  intel/compiler/fs: Support for combining 16-bit immediates
  intel/compiler/fs: Set 16-bit sampler return format
  intel/compiler/fs: Set tex type for generator to flag fp16
  intel/compiler/fs: Use component_size() instead of open coded
  intel/compiler/fs: Add register padding support
  intel/compiler/fs: Pad 16-bit texture return payloads
  intel/compiler/fs: Pad 16-bit output (store/fb write) payloads
  intel/compiler/fs: Pad 16-bit nir vec* components into full reg
  intel/compiler/fs: Pad 16-bit nir intrinsic dest into full reg
  intel/compiler/fs: Pad 16-bit const loads into full regs
  intel/compiler/fs: Pad 16-bit payload lowering
  intel/compiler/fs: Prepare nir_emit_if() for 16-bit sources
  intel/compiler/fs: Consider original sizes when retyping alu ops
  intel/compiler/fs: Use original reg size when retyping nir src
  intel/compiler/fs: Consider logic ops on 16-bit booleans
  intel/compiler/fs: Prepare 16-bit and/or/xor for 32-bit src
  intel/compiler/eu: Take stride into account in 16-bit ops
  i965: WIP: Support for uploading 16-bit uniforms from 32-bit store
  intel/compiler/fs: WIP: Use 32-bit slots for 16-bit uniforms
  glsl: WIP: Add lowering pass for treating mediump as float16
  glsl: Use 16-bit constants if operation is otherwise 16-bit
  glsl: Lower float conversions to mediump
  glsl: HACK: Force texture return into 16-bits
  glsl: HACK: Treat input varyings as 16-bits by conversion
  glsl: HACK: Lower builtin float outputs to 16-bits by default
  glsl: HACK: Lower all temporary float variables to 16-bits
  i965/fs: Lower gles mediump floats into 16-bits

 src/compiler/Makefile.sources                     |   1 +
 src/compiler/glsl/glsl_to_nir.cpp                 |  20 ++
 src/compiler/glsl/ir.cpp                          |   8 +
 src/compiler/glsl/ir_expression_operation.py      |  17 +
 src/compiler/glsl/ir_optimization.h               |   1 +
 src/compiler/glsl/ir_print_visitor.cpp            |   1 +
 src/compiler/glsl/ir_validate.cpp                 |  48 ++-
 src/compiler/glsl/lower_mediump.cpp               | 405 ++++++++++++++++++++++
 src/compiler/nir/nir_lower_load_const_to_scalar.c |   6 +-
 src/compiler/nir/nir_opt_constant_folding.c       |   2 +
 src/compiler/nir/nir_print.c                      |   5 +
 src/compiler/nir/nir_search.c                     |   4 +
 src/intel/compiler/brw_compiler.h                 |   9 +
 src/intel/compiler/brw_disasm.c                   |   8 +-
 src/intel/compiler/brw_eu_emit.c                  |  27 +-
 src/intel/compiler/brw_eu_validate.c              |   3 +
 src/intel/compiler/brw_fs.cpp                     | 103 +++---
 src/intel/compiler/brw_fs.h                       |   4 +-
 src/intel/compiler/brw_fs_builder.h               |  37 +-
 src/intel/compiler/brw_fs_combine_constants.cpp   |  84 ++++-
 src/intel/compiler/brw_fs_copy_propagation.cpp    |   5 +-
 src/intel/compiler/brw_fs_generator.cpp           |  10 +-
 src/intel/compiler/brw_fs_nir.cpp                 | 220 ++++++++++--
 src/intel/compiler/brw_fs_visitor.cpp             |   1 +
 src/intel/compiler/brw_inst.h                     |   4 +
 src/intel/compiler/brw_ir_fs.h                    |  16 +
 src/intel/compiler/brw_reg_type.c                 |   2 +
 src/intel/compiler/brw_shader.cpp                 |  64 +++-
 src/intel/compiler/brw_vec4.cpp                   |   8 +
 src/intel/compiler/brw_vec4_gs_visitor.cpp        |   8 +
 src/intel/compiler/brw_vec4_visitor.cpp           |   4 +
 src/mesa/drivers/dri/i965/brw_cs.c                |   2 +
 src/mesa/drivers/dri/i965/brw_curbe.c             |   2 +
 src/mesa/drivers/dri/i965/brw_disk_cache.c        |  14 +
 src/mesa/drivers/dri/i965/brw_gs.c                |   2 +
 src/mesa/drivers/dri/i965/brw_link.cpp            |   3 +
 src/mesa/drivers/dri/i965/brw_nir_uniforms.cpp    |  10 +
 src/mesa/drivers/dri/i965/brw_program.c           |  12 +-
 src/mesa/drivers/dri/i965/brw_state.h             |   1 +
 src/mesa/drivers/dri/i965/brw_tcs.c               |   2 +
 src/mesa/drivers/dri/i965/brw_tes.c               |   2 +
 src/mesa/drivers/dri/i965/brw_vs.c                |   2 +
 src/mesa/drivers/dri/i965/brw_wm.c                |   2 +
 src/mesa/drivers/dri/i965/gen6_constant_state.c   |  17 +-
 src/mesa/program/ir_to_mesa.cpp                   |   8 +
 src/mesa/state_tracker/st_glsl_to_tgsi.cpp        |   9 +
 46 files changed, 1112 insertions(+), 111 deletions(-)
 create mode 100644 src/compiler/glsl/lower_mediump.cpp

-- 
2.11.0