[Mesa-dev] intel: WIP: Support for using 16-bits for mediump

Tue Nov 6 13:08:32 UTC 2018

On Tue, Nov 6, 2018 at 1:45 PM Pohjolainen, Topi
<topi.pohjolainen at gmail.com> wrote:
>
> On Tue, Nov 06, 2018 at 11:31:58AM +0100, Connor Abbott wrote:
> > On Tue, Nov 6, 2018 at 11:14 AM Pohjolainen, Topi
> > <topi.pohjolainen at gmail.com> wrote:
> > >
> > > On Tue, Nov 06, 2018 at 10:45:52AM +0100, Connor Abbott wrote:
> > > > As far as I understand, mediump handling can be split into two parts:
> > > >
> > > > 1. Figuring out which operations (instructions or SSA values in NIR)
> > > > can use relaxed precision.
> > > > 2. Deciding which relaxed-precision operations to actually compute in
> > > > 16-bit precision.
> > > >
> > > > At least for GLSL, #1 is pretty well nailed down by the GLSL spec,
> > > > where it's specified in terms of the source expressions. For example,
> > > > something like:
> > > >
> > > > mediump float a = ...;
> > > > mediump float b = ...;
> > > > float c = a + b;
> > > > float d = c + 2.0;
> > > >
> > > > the last addition must be performed in full precision, whereas for:
> > > >
> > > >
> > > > mediump float a = ...;
> > > > mediump float b = ...;
> > > > float d = (a + b) + 2.0;
> > > >
> > > > it can be lowered to 16-bit. This information gets lost during
> > > > expression grafting in GLSL IR, or vars-to-SSA in NIR, and even the
> > > > AST -> GLSL IR transform will sometimes split up expressions, so it
> > > > seems like both are too low-level for this. The analysis described by
> > > > the spec (the paragraph in section 4.7.3 "Precision Qualifiers" of the
> > > > GLSL ES 3.20 spec) has to happen on the AST after type checking but
> > > > before lowering to GLSL IR in order to be correct and not overly
> > > > conservative. If you want to do it in NIR since #2 is easier with SSA,
> > > > then sure... but we can't mix them up and do both at the same time.
> > > > We'll have to add support for annotating ir_expression's and nir_instr
> > > > (or maybe nir_ssa_def's) with a relaxed precision, and filter that
> > > > information down through the pipeline. Hopefully that also works
> > > > better for SPIR-V, where you can annotate individual instructions as
> > > > being RelaxedPrecision, and afaik (hopefully) #1 is handled by
> > > > glslang.
> > >
> > > I tried to describe the logic I used and my interpretation of the spec in
> > > the accompanying patch:
> > >
> > > https://lists.freedesktop.org/archives/mesa-dev/2018-November/208683.html
> > >
> > > Does it make any sense?
> >
> > It seems incorrect, since it will make the addition in my example
> > operate in 16 bit precision when it shouldn't. As I explained above,
> > it's impossible to do this correctly in NIR.
> >
> > Also, abusing a 16-bit bitsize in NIR to mean mediump is not ok. There
> > are other vulkan/glsl extensions out there that provide actual fp16
> > support, where the result is guaranteed to be calculated as a
> > half-float, and these obviously won't work properly with this pass. We
> > need to add a flag to the SSA def, or Jason's idea a long time ago was
> > to add a fake "24-bit" bitsize. Part of #2 will involve converting the
> > bitsize to be 16-bit and removing the flag.
>
> I wrote small test shader-runner test:
>
> [require]
> GL ES >= 2.0
> GLSL ES >= 1.00
>
> [vertex shader]
> #version 100
>
> precision highp float;
>
> attribute vec4 piglit_vertex;
>
> void main()
> {
>     gl_Position = piglit_vertex;
> }
>
> [fragment shader]
> #version 100
> precision highp float;
>
> uniform mediump float a;
> uniform mediump float b;
>
> void main()
> {
>     float c = a + b;
>     float d = c + 0.4;
>
>     gl_FragColor = vec4(a, b, c, d);
> }
>
> [test]
> uniform float a 0.1
> uniform float b 0.2
> draw rect -1 -1 2 2
> probe all rgba 0.1 0.2 0.3 0.7
>
>
> And that made me realize another short-coming in my implementation - I lowered
> variable precision after running nir_lower_var_copies() loosing precision for
> local temporaries. Moving it before gave me:
>
> NIR (final form) for fragment shader:
> shader: MESA_SHADER_FRAGMENT
> name: GLSL3
> inputs: 0
> outputs: 0
> uniforms: 8
> shared: 0
> decl_var uniform INTERP_MODE_NONE float16_t a (0, 0, 0)
> decl_var uniform INTERP_MODE_NONE float16_t b (1, 4, 0)
> decl_var shader_out INTERP_MODE_NONE vec4 gl_FragColor (FRAG_RESULT_COLOR, 4, 0)
> decl_function main (0 params)
>
> impl main {
>         block block_0:
>         /* preds: */
>         vec1 32 ssa_0 = load_const (0x3ecccccd /* 0.400000 */)
>         vec1 32 ssa_1 = load_const (0x00000000 /* 0.000000 */)
>         vec1 16 ssa_2 = intrinsic load_uniform (ssa_1) (0, 4) /* base=0 */ /* range=4 */        /* a */
>         vec1 16 ssa_3 = intrinsic load_uniform (ssa_1) (4, 4) /* base=4 */ /* range=4 */        /* b */
>         vec1 16 ssa_4 = fadd ssa_2, ssa_3
>         vec1 32 ssa_5 = f2f32 ssa_4
>         vec1 32 ssa_6 = f2f32 ssa_2
>         vec1 32 ssa_7 = f2f32 ssa_3
>         vec1 32 ssa_8 = fadd ssa_5, ssa_0
>         vec4 32 ssa_9 = vec4 ssa_6, ssa_7, ssa_5, ssa_8
>         intrinsic store_output (ssa_9, ssa_1) (4, 15, 0) /* base=4 */ /* wrmask=xyzw */ /* component=0 */       /* gl_FragColor */
>         /* succs: block_1 */
>         block block_1:
> }
>
>
> which looks to do the right thing.

Your piglit test uses c in the output, preventing GLSL IR tree
grafting from kicking in. If you remove that, then it'll be broken
again since GLSL IR will rewrite it to a single expression before you
get a chance to see it.

The GLSL spec really wants you to think of operations like plus,
times, etc. as having a precision qualifier attached to them, as well
as input/output variables. This is also how SPIR-V works. The
propagation algorithm is just a small syntactic sugar, so the deeper
you push it down the pipeline, the more whack-a-mole games like this
you're going to have to play, and if you have to start disabling
optimizations like tree grafting, or you have to choose the final
bitsize before you have all the information SSA gives you (like moving
it before nir_lower_var_copies() and nir_vars_to_ssa()), then
performance will be worse.

>
>
> Now, you may still be very correct in general that my approach is flawed. I've
> been hoping to work on item 1) in my TODO list (found in patch number 61 in
> the list). There has just been too many parts that I had to address (including
> Intel backend) to get performance numbers for benchmarks.
>
> I'm not very good at saying on paper if something works or not - I rather try
> it out to understand it better. My idea is to use deref loads and stores as
> recursion starting points.