[Mesa-dev] i965: Kicking off fp16 glsl support
Topi Pohjolainen
topi.pohjolainen at gmail.com
Fri Nov 24 12:26:27 UTC 2017
After Igalia's work on SPIRV 16-bit storage, the question arose of how
much more is needed in order to optimize GLES lowp/mediump with 16-bit
floats. I took glb 2.7 trex as a target and started drafting a glsl
lowering pass re-typing mediump floats into float16. In parallel,
I added the equivalent support bit by bit into the GLSL -> NIR pass and
into the Intel compiler backend.
This series enables lowering for fragment shaders only. This was
sufficient for trex which doesn't use mediump precision for vertex
shaders.
First of all, this is not complete work. I'd like to think of it more
as an attempt to give an idea of what is currently missing, and to make
each case a little clearer by giving concrete (if not ideal) solutions.
On SKL this runs trex pretty much on par compared to 32-bit. Intel
hardware doesn't have native support for linear interpolation using
16-bit floats and therefore pln() and lrp() incur additional moves
from 16-bits to 32-bits (and vice versa). Both can be replaced
relatively efficiently using mad() later on.
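As a rough illustration of the mad() replacement mentioned above, here is a minimal C++ sketch of the arithmetic only. The helper names (mad, lrp_via_mad, plane_via_mad) are made up for this example and are not backend functions. lrp is assumed to mean the GLSL mix() formula, and the plane evaluation is the generic p*u + q*v + r form, which may differ in operand layout from what the hardware pln() instruction expects.

```cpp
// Scalar model of a multiply-add: a * b + c.
// (Hypothetical helper; stands in for a hardware mad() on one channel.)
constexpr float mad(float a, float b, float c)
{
    return a * b + c;
}

// Linear interpolation x * (1 - a) + y * a expressed as two mads:
// the inner mad computes x - a * x, the outer one adds a * y.
constexpr float lrp_via_mad(float x, float y, float a)
{
    return mad(a, y, mad(-a, x, x));
}

// Plane equation p * u + q * v + r expressed as two mads.
constexpr float plane_via_mad(float p, float q, float r, float u, float v)
{
    return mad(q, v, mad(p, u, r));
}
```

Each replacement costs two mads per channel, which is the price to weigh against the 16-bit <-> 32-bit conversion moves.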
Comparing shader dumps between 16-bit and 32-bit indicates that all
optimization passes kick in nicely (sampler-eot, mad(), etc). The only
additions are the aforementioned conversion instructions.
The series starts with miscellaneous bits needed in glsl and nir.
This is followed by the equivalent bits in the Intel compiler backend.
These are followed by changes that are subject to more debate:
1) Support for SIMD8 fp16 in liveness analysis, copy propagation,
dead code elimination, etc.
In order to tell if one instruction fully overwrites the results
of another, one needs to examine how much of a register is written.
Until now this has been done at the granularity of full or partial
register, i.e., there is no concept of a "full sub-region write".
And until now there was no need, as all data types took 4 bytes
per element, resulting in a full 32-byte register even in the case
of SIMD8. Partial writes were all special cases and could be safely
ignored in the various analysis passes.
Half precision floats, however, break this assumption. On SIMD8 a
full write with 16-bit elements results in half a register.
I tried patching the different passes examining partial writes one
by one but that started to get out of hand. Moreover, just by looking
at a register's type size it is not safe to say if it really is a
full write or not.
Solution here is to explicitly store this information into
registers: added new member fs_reg::pad_per_component.
Subsequently patching fs_reg::component_size() to take the
padding into account propagates the information to all users.
Patch 28 updates a few users to use component_size() instead
of open-coded size calculations, 29 adds the actual support and
30-35 update NIR -> FS to signal the padding (these are separated
just for review).
It should be noted that here one deals with virtual registers.
The final hardware allocator is separate, and using full registers
in virtual space shouldn't prevent it from using tighter
packing.
Chema, this overlaps with your work, I hope you don't mind.
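To make the size arithmetic in 1) concrete, here is a small C++ model. It is only a sketch: ModelReg and occupies_whole_registers() are invented for illustration, the real fs_reg carries far more state, and treating pad_per_component as a byte count added to each component is an assumption about the patch, not a description of it.

```cpp
// Simplified model only; not the actual fs_reg from brw_ir_fs.h.
constexpr unsigned REG_SIZE = 32;   // bytes in one hardware register

struct ModelReg {
    unsigned type_size;          // bytes per element (2 for fp16, 4 for fp32)
    unsigned pad_per_component;  // assumed: padding bytes after each component
};

// Bytes occupied by one component at the given SIMD width, padding included.
constexpr unsigned component_size(ModelReg reg, unsigned simd_width)
{
    return reg.type_size * simd_width + reg.pad_per_component;
}

// True if one component occupies a whole number of registers, so that a
// write of the component can be treated as a full write by analysis passes.
constexpr bool occupies_whole_registers(ModelReg reg, unsigned simd_width)
{
    return component_size(reg, simd_width) % REG_SIZE == 0;
}
```

In this model a SIMD8 fp32 component is 32 bytes, an unpadded SIMD8 fp16 component is 16 bytes (half a register), and the same fp16 component with 16 bytes of padding is back to 32 bytes.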
2) Booleans produced from 16-bit sources. Whereas for GLSL and for
NIR booleans are just booleans, on Intel hardware they are integers.
And the representation depends on how they are produced. Comparisons
(flt, fge, feq and fne) with 32-bit sources produce 32-bit results
(0x00000000/0xFFFFFFFF) while with 16-bit sources one gets 16-bit
results (0x0000/0xFFFF).
I thought about introducing a 16-bit boolean into NIR but that
felt like too hardware-specific a thing to do. Instead I patched
NIR -> FS to take the type of the producing instruction into account
when setting up the SSA values. See patch 39 for the setup and
patches 36-38 for consulting the backend SSA store instead of
relying on NIR.
Another approach left to try is emitting additional moves into
32-bits (the same way we do for fp64). One could then add an
optimization pass that removes unnecessary moves and uses
strided sources instead.
3) Following up 2) GLSL -> NIR decides to emit integer typed
and/or/xor even for originally boolean typed logic ops. Patch 40
tries to cope with the case where the booleans are produced with
non-matching precision.
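The width mismatch described in 2) and 3) can be sketched in plain C++ integer arithmetic. This is an illustration only: the function names are hypothetical, and the widening is written as a select, whereas the hardware would more likely use a conversion move.

```cpp
#include <cstdint>

// Comparison with 32-bit sources: all 32 bits set for true.
constexpr std::uint32_t flt32(float a, float b)
{
    return a < b ? 0xFFFFFFFFu : 0u;
}

// Comparison with 16-bit sources: only 16 bits set for true.
// (The operands are floats here; only the result width is modelled.)
constexpr std::uint16_t flt16(float a, float b)
{
    return static_cast<std::uint16_t>(a < b ? 0xFFFFu : 0u);
}

// Integer-typed "and" applied directly to booleans of different widths.
constexpr std::uint32_t naive_and(std::uint32_t wide, std::uint16_t narrow)
{
    return wide & narrow;
}

// Bring a 16-bit boolean to the 32-bit representation first.
constexpr std::uint32_t widen_bool(std::uint16_t b)
{
    return b != 0 ? 0xFFFFFFFFu : 0u;
}

constexpr std::uint32_t mixed_and(std::uint32_t wide, std::uint16_t narrow)
{
    return wide & widen_bool(narrow);
}
```

With both inputs true, naive_and() yields 0x0000FFFF, which is not a valid 32-bit boolean, while mixed_and() yields 0xFFFFFFFF.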
4) In the Intel compiler backend and in the push/pull constant setup,
things rely on values being packed in 32-bit slots. Moreover, these
slots are typeless and the loader doesn't know if it is dealing
with floats or integers, let alone their precision. Patch 42
takes the first step and simply adds type information into the
backend. This is not particularly pretty but I had to start
somewhere. This allows the loader to convert float values from
the 32-bit store in the core to 16-bits on the fly. Patch 43
adjusts the compiler to use 32-bit slots.
Using 16-bit slots would require substantially more work.
I think there is no question about core using 32-bit values. And
even if the values there were 16-bit, backend would still need to
know types.
My feeling is that we just need to rewrite a fair amount of the
Intel push/pull constant setup.
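The idea in 4), a typed slot that the loader converts on the fly, might look roughly like the following C++14 sketch. Everything here is hypothetical: SlotType and load_slot() are not the names used in patches 42-43, and the float-to-half conversion is a simplified one (round to nearest even, values below the smallest normal half flushed to zero) operating on raw bit patterns, since the slots themselves are typeless.

```cpp
#include <cstdint>

// Hypothetical per-slot type tag; the series adds type information to the
// backend, but not necessarily in this form.
enum class SlotType { Int32, Float32, Float16 };

// Convert the bit pattern of a 32-bit float to the bit pattern of a half.
// Simplification: results below the smallest normal half become zero.
constexpr std::uint16_t float_bits_to_half_bits(std::uint32_t f)
{
    const std::uint32_t sign = (f >> 16) & 0x8000u;
    const std::uint32_t exp  = (f >> 23) & 0xFFu;
    const std::uint32_t man  = f & 0x7FFFFFu;

    if (exp == 0xFFu)   // Inf or NaN
        return static_cast<std::uint16_t>(sign | 0x7C00u | (man ? 0x200u : 0u));
    if (exp > 142u)     // unbiased exponent above 15: overflow to Inf
        return static_cast<std::uint16_t>(sign | 0x7C00u);
    if (exp < 113u)     // unbiased exponent below -14: flush to zero
        return static_cast<std::uint16_t>(sign);

    // Rebias the exponent from 127 to 15 and keep the top 10 mantissa bits.
    std::uint32_t half = ((exp - 112u) << 10) | (man >> 13);

    // Round to nearest even on the 13 dropped bits. A carry out of the
    // mantissa bumps the exponent, which is the desired result.
    const std::uint32_t rem = man & 0x1FFFu;
    if (rem > 0x1000u || (rem == 0x1000u && (half & 1u)))
        ++half;

    return static_cast<std::uint16_t>(sign | half);
}

// Load one 32-bit slot; a Float16 slot gets its value converted and placed
// in the low 16 bits, still occupying a full 32-bit slot.
constexpr std::uint32_t load_slot(std::uint32_t raw, SlotType type)
{
    return type == SlotType::Float16 ? float_bits_to_half_bits(raw) : raw;
}
```

For example, the bit pattern 0x3F800000 (1.0f) tagged as Float16 loads as 0x3C00, while the same pattern tagged as Float32 or Int32 is passed through unchanged.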
5) Patches 44-50 are all about the GLSL lowering pass. This is
really work-in-progress. What I have here is a crude attempt to
do everything in one pass. It also has several hacks working
around shortcomings in the Intel backend.
The short story is that there are quite a few things which don't
have a precision, and the compiler needs to analyze expressions
recursively in order to know what precision to use.
Take, for example, variables that don't have a precision but are
referred to from multiple locations. These require the compiler
to examine all the expressions involved and use full precision
for the variable if even one of the expressions requires it. This
in turn alters the requirements in the other expressions: the
compiler would need to emit conversions for them. And I don't
think this can be done cleanly in one pass.
I also realized that there may be cases where the compiler
would need to use full precision instead of half in order to
emit optimal code. Such shaders sound just evil and
I don't even want to think about that now. There is more than
enough work to get even the rules covered...
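The two-step nature of the problem in 5) can be sketched as follows. This is a toy C++14 model, not the lowering pass from the series: the Precision enum and the helper names are invented here, and real GLSL precision rules have more cases than a simple maximum.

```cpp
// Hypothetical precision lattice: a higher value is a stricter requirement.
enum class Precision { Unspecified = 0, Medium = 1, High = 2 };

// The stricter of two requirements wins.
constexpr Precision join(Precision a, Precision b)
{
    return static_cast<int>(a) > static_cast<int>(b) ? a : b;
}

// Step 1: visit every expression using the variable and combine what they
// require. One High use is enough to make the variable High.
constexpr Precision resolve_variable(const Precision *uses, int count)
{
    Precision result = Precision::Unspecified;
    for (int i = 0; i < count; ++i)
        result = join(result, uses[i]);
    return result;
}

// Step 2: once the variable's precision is known, any use that wanted a
// different precision needs a conversion around the variable reference.
constexpr bool needs_conversion(Precision use, Precision variable)
{
    return use != Precision::Unspecified && use != variable;
}
```

Step 2 depends on the result of step 1 over all uses, which is one way to see why a single pass over the IR struggles: a Medium use visited early cannot know that a later High use will promote the variable.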
This series doesn't touch the hardware register allocator - it still
allocates one full register per 16-bit float component even in the
case of SIMD8.
Patches can be found here (rebased on current master and
Igalia's work):
git://people.freedesktop.org/~tpohjola/mesa:16_bit_gles
There are also some simple shader runner tests I wrote along
the way:
git://people.freedesktop.org/~tpohjola/piglit:fp16
All feedback is very welcome. I'm prepared to keep on working on
this if people find it useful. Personally I'd be curious to add
fp16 for pln() and lrp() and see if 16-bits could beat 32-bits
performance wise. Proper push/pull constant support is another
thing on the list. Hardware register allocator with sub-register
support sounds both interesting and scary.
CC: Jose Maria Casanova Crespo <jmcasanova at igalia.com>
CC: Jason Ekstrand <jason at jlekstrand.net>
CC: Kenneth Graunke <kenneth at whitecape.org>
CC: Matt Turner <mattst88 at gmail.com>
CC: Ian Romanick <idr at freedesktop.org>
CC: Francisco Jerez <currojerez at riseup.net>
Topi Pohjolainen (51):
nir: Prepare constant folding for 16-bits
nir: Prepare constant lowering for 16-bits constants
nir: Add 16-bit float support into algebraic opts
glsl: Print 16-bit constants
nir: Print 16-bit constants
glsl: Add support for 16-bit float constants in nir-conversion
glsl: Add conversion ops to/from 16-bit floats
glsl: Add more conversion ops to/from 16-bit floats
glsl: Allow 16-bit neg() and dot()
glsl: Allow 16-bit math
glsl: Enable 16-bit texturing in nir-conversion
intel/compiler/disasm: Print 16-bit IMM values
intel/compiler/disasm: Print fp16 also for sampler messages
intel/compiler/fs: Support for dumping 16-bit IMM values
intel/compiler: Add support for loading 16-bit constants
intel/compiler: Move type_size_scalar() into brw_shader.cpp
intel/compiler: Prepare for glsl mediump float uniforms
intel/compiler: Allow 16-bit math
intel/compiler/fs: Add helpers for 16-bit null regs
intel/compiler/fs: Use two SIMD8 instructions for 16-bit math
intel/compiler/fs: Use 16-bit null dest with 16-bit math
intel/compiler/fs: Use 16-bit null dest with 16-bit compare
intel/compiler: Prepare for 16-bit 3-src ops
intel/compiler: Add support for negating 16-bit floats
intel/compiler/fs: Support for combining 16-bit immediates
intel/compiler/fs: Set 16-bit sampler return format
intel/compiler/fs: Set tex type for generator to flag fp16
intel/compiler/fs: Use component_size() instead of open coded
intel/compiler/fs: Add register padding support
intel/compiler/fs: Pad 16-bit texture return payloads
intel/compiler/fs: Pad 16-bit output (store/fb write) payloads
intel/compiler/fs: Pad 16-bit nir vec* components into full reg
intel/compiler/fs: Pad 16-bit nir intrinsic dest into full reg
intel/compiler/fs: Pad 16-bit const loads into full regs
intel/compiler/fs: Pad 16-bit payload lowering
intel/compiler/fs: Prepare nir_emit_if() for 16-bit sources
intel/compiler/fs: Consider original sizes when retyping alu ops
intel/compiler/fs: Use original reg size when retyping nir src
intel/compiler/fs: Consider logic ops on 16-bit booleans
intel/compiler/fs: Prepare 16-bit and/or/xor for 32-bit src
intel/compiler/eu: Take stride into account in 16-bit ops
i965: WIP: Support for uploading 16-bit uniforms from 32-bit store
intel/compiler/fs: WIP: Use 32-bit slots for 16-bit uniforms
glsl: WIP: Add lowering pass for treating mediump as float16
glsl: Use 16-bit constants if operation is otherwise 16-bit
glsl: Lower float conversions to mediump
glsl: HACK: Force texture return into 16-bits
glsl: HACK: Treat input varyings as 16-bits by conversion
glsl: HACK: Lower builtin float outputs to 16-bits by default
glsl: HACK: Lower all temporary float variables to 16-bits
i965/fs: Lower gles mediump floats into 16-bits
src/compiler/Makefile.sources | 1 +
src/compiler/glsl/glsl_to_nir.cpp | 20 ++
src/compiler/glsl/ir.cpp | 8 +
src/compiler/glsl/ir_expression_operation.py | 17 +
src/compiler/glsl/ir_optimization.h | 1 +
src/compiler/glsl/ir_print_visitor.cpp | 1 +
src/compiler/glsl/ir_validate.cpp | 48 ++-
src/compiler/glsl/lower_mediump.cpp | 405 ++++++++++++++++++++++
src/compiler/nir/nir_lower_load_const_to_scalar.c | 6 +-
src/compiler/nir/nir_opt_constant_folding.c | 2 +
src/compiler/nir/nir_print.c | 5 +
src/compiler/nir/nir_search.c | 4 +
src/intel/compiler/brw_compiler.h | 9 +
src/intel/compiler/brw_disasm.c | 8 +-
src/intel/compiler/brw_eu_emit.c | 27 +-
src/intel/compiler/brw_eu_validate.c | 3 +
src/intel/compiler/brw_fs.cpp | 103 +++---
src/intel/compiler/brw_fs.h | 4 +-
src/intel/compiler/brw_fs_builder.h | 37 +-
src/intel/compiler/brw_fs_combine_constants.cpp | 84 ++++-
src/intel/compiler/brw_fs_copy_propagation.cpp | 5 +-
src/intel/compiler/brw_fs_generator.cpp | 10 +-
src/intel/compiler/brw_fs_nir.cpp | 220 ++++++++++--
src/intel/compiler/brw_fs_visitor.cpp | 1 +
src/intel/compiler/brw_inst.h | 4 +
src/intel/compiler/brw_ir_fs.h | 16 +
src/intel/compiler/brw_reg_type.c | 2 +
src/intel/compiler/brw_shader.cpp | 64 +++-
src/intel/compiler/brw_vec4.cpp | 8 +
src/intel/compiler/brw_vec4_gs_visitor.cpp | 8 +
src/intel/compiler/brw_vec4_visitor.cpp | 4 +
src/mesa/drivers/dri/i965/brw_cs.c | 2 +
src/mesa/drivers/dri/i965/brw_curbe.c | 2 +
src/mesa/drivers/dri/i965/brw_disk_cache.c | 14 +
src/mesa/drivers/dri/i965/brw_gs.c | 2 +
src/mesa/drivers/dri/i965/brw_link.cpp | 3 +
src/mesa/drivers/dri/i965/brw_nir_uniforms.cpp | 10 +
src/mesa/drivers/dri/i965/brw_program.c | 12 +-
src/mesa/drivers/dri/i965/brw_state.h | 1 +
src/mesa/drivers/dri/i965/brw_tcs.c | 2 +
src/mesa/drivers/dri/i965/brw_tes.c | 2 +
src/mesa/drivers/dri/i965/brw_vs.c | 2 +
src/mesa/drivers/dri/i965/brw_wm.c | 2 +
src/mesa/drivers/dri/i965/gen6_constant_state.c | 17 +-
src/mesa/program/ir_to_mesa.cpp | 8 +
src/mesa/state_tracker/st_glsl_to_tgsi.cpp | 9 +
46 files changed, 1112 insertions(+), 111 deletions(-)
create mode 100644 src/compiler/glsl/lower_mediump.cpp
--
2.11.0