[Mesa-dev] intel: WIP: Support for using 16-bits for mediump

Tue Nov 6 13:44:42 UTC 2018

On Tue, Nov 06, 2018 at 02:08:32PM +0100, Connor Abbott wrote:
> On Tue, Nov 6, 2018 at 1:45 PM Pohjolainen, Topi
> <topi.pohjolainen at gmail.com> wrote:
> >
> > On Tue, Nov 06, 2018 at 11:31:58AM +0100, Connor Abbott wrote:
> > > On Tue, Nov 6, 2018 at 11:14 AM Pohjolainen, Topi
> > > <topi.pohjolainen at gmail.com> wrote:
> > > >
> > > > On Tue, Nov 06, 2018 at 10:45:52AM +0100, Connor Abbott wrote:
> > > > > As far as I understand, mediump handling can be split into two parts:
> > > > >
> > > > > 1. Figuring out which operations (instructions or SSA values in NIR)
> > > > > can use relaxed precision.
> > > > > 2. Deciding which relaxed-precision operations to actually compute in
> > > > > 16-bit precision.
> > > > >
> > > > > At least for GLSL, #1 is pretty well nailed down by the GLSL spec,
> > > > > where it's specified in terms of the source expressions. For example,
> > > > > something like:
> > > > >
> > > > > mediump float a = ...;
> > > > > mediump float b = ...;
> > > > > float c = a + b;
> > > > > float d = c + 2.0;
> > > > >
> > > > > the last addition must be performed in full precision, whereas for:
> > > > >
> > > > >
> > > > > mediump float a = ...;
> > > > > mediump float b = ...;
> > > > > float d = (a + b) + 2.0;
> > > > >
> > > > > it can be lowered to 16-bit. This information gets lost during
> > > > > expression grafting in GLSL IR, or vars-to-SSA in NIR, and even the
> > > > > AST -> GLSL IR transform will sometimes split up expressions, so it
> > > > > seems like both are too low-level for this. The analysis described by
> > > > > the spec (the paragraph in section 4.7.3 "Precision Qualifiers" of the
> > > > > GLSL ES 3.20 spec) has to happen on the AST after type checking but
> > > > > before lowering to GLSL IR in order to be correct and not overly
> > > > > conservative. If you want to do it in NIR since #2 is easier with SSA,
> > > > > then sure... but we can't mix them up and do both at the same time.
> > > > > We'll have to add support for annotating ir_expression's and nir_instr
> > > > > (or maybe nir_ssa_def's) with a relaxed precision, and filter that
> > > > > information down through the pipeline. Hopefully that also works
> > > > > better for SPIR-V, where you can annotate individual instructions as
> > > > > being RelaxedPrecision, and afaik (hopefully) #1 is handled by
> > > > > glslang.
> > > >
> > > > I tried to describe the logic I used and my interpretation of the spec in
> > > > the accompanying patch:
> > > >
> > > > https://lists.freedesktop.org/archives/mesa-dev/2018-November/208683.html
> > > >
> > > > Does it make any sense?
> > >
> > > It seems incorrect, since it will make the addition in my example
> > > operate in 16 bit precision when it shouldn't. As I explained above,
> > > it's impossible to do this correctly in NIR.
> > >
> > > Also, abusing a 16-bit bitsize in NIR to mean mediump is not ok. There
> > > are other vulkan/glsl extensions out there that provide actual fp16
> > > support, where the result is guaranteed to be calculated as a
> > > half-float, and these obviously won't work properly with this pass. We
> > > need to add a flag to the SSA def, or Jason's idea a long time ago was
> > > to add a fake "24-bit" bitsize. Part of #2 will involve converting the
> > > bitsize to be 16-bit and removing the flag.
> >
> > I wrote small test shader-runner test:
> >
> > [require]
> > GL ES >= 2.0
> > GLSL ES >= 1.00
> >
> > [vertex shader]
> > #version 100
> >
> > precision highp float;
> >
> > attribute vec4 piglit_vertex;
> >
> > void main()
> > {
> >     gl_Position = piglit_vertex;
> > }
> >
> > [fragment shader]
> > #version 100
> > precision highp float;
> >
> > uniform mediump float a;
> > uniform mediump float b;
> >
> > void main()
> > {
> >     float c = a + b;
> >     float d = c + 0.4;
> >
> >     gl_FragColor = vec4(a, b, c, d);
> > }
> >
> > [test]
> > uniform float a 0.1
> > uniform float b 0.2
> > draw rect -1 -1 2 2
> > probe all rgba 0.1 0.2 0.3 0.7
> >
> >
> > And that made me realize another short-coming in my implementation - I lowered
> > variable precision after running nir_lower_var_copies() loosing precision for
> > local temporaries. Moving it before gave me:
> >
> > NIR (final form) for fragment shader:
> > shader: MESA_SHADER_FRAGMENT
> > name: GLSL3
> > inputs: 0
> > outputs: 0
> > uniforms: 8
> > shared: 0
> > decl_var uniform INTERP_MODE_NONE float16_t a (0, 0, 0)
> > decl_var uniform INTERP_MODE_NONE float16_t b (1, 4, 0)
> > decl_var shader_out INTERP_MODE_NONE vec4 gl_FragColor (FRAG_RESULT_COLOR, 4, 0)
> > decl_function main (0 params)
> >
> > impl main {
> >         block block_0:
> >         /* preds: */
> >         vec1 32 ssa_0 = load_const (0x3ecccccd /* 0.400000 */)
> >         vec1 32 ssa_1 = load_const (0x00000000 /* 0.000000 */)
> >         vec1 16 ssa_2 = intrinsic load_uniform (ssa_1) (0, 4) /* base=0 */ /* range=4 */        /* a */
> >         vec1 16 ssa_3 = intrinsic load_uniform (ssa_1) (4, 4) /* base=4 */ /* range=4 */        /* b */
> >         vec1 16 ssa_4 = fadd ssa_2, ssa_3
> >         vec1 32 ssa_5 = f2f32 ssa_4
> >         vec1 32 ssa_6 = f2f32 ssa_2
> >         vec1 32 ssa_7 = f2f32 ssa_3
> >         vec1 32 ssa_8 = fadd ssa_5, ssa_0
> >         vec4 32 ssa_9 = vec4 ssa_6, ssa_7, ssa_5, ssa_8
> >         intrinsic store_output (ssa_9, ssa_1) (4, 15, 0) /* base=4 */ /* wrmask=xyzw */ /* component=0 */       /* gl_FragColor */
> >         /* succs: block_1 */
> >         block block_1:
> > }
> >
> >
> > which looks to do the right thing.
> 
> Your piglit test uses c in the output, preventing GLSL IR tree
> grafting from kicking in. If you remove that, then it'll be broken
> again since GLSL IR will rewrite it to a single expression before you
> get a chance to see it.

Right, after replacing 

        gl_FragColor = vec4(a, b, c, d);

with

        gl_FragColor = vec4(a, b, b, d);

I still got the right end result:

NIR (SSA form) for fragment shader:
shader: MESA_SHADER_FRAGMENT
name: GLSL3
inputs: 0
outputs: 0
uniforms: 8
shared: 0
decl_var uniform INTERP_MODE_NONE float16_t a (0, 0, 0)
decl_var uniform INTERP_MODE_NONE float16_t b (1, 4, 0)
decl_var shader_out INTERP_MODE_NONE vec4 gl_FragColor (FRAG_RESULT_COLOR, 4, 0)
decl_function main (0 params)

impl main {
        block block_0:
        /* preds: */
        vec1 32 ssa_0 = load_const (0x3ecccccd /* 0.400000 */)
        vec1 32 ssa_1 = load_const (0x00000000 /* 0.000000 */)
        vec1 16 ssa_2 = intrinsic load_uniform (ssa_1) (0, 4) /* base=0 */ /* range=4 */        /* a */
        vec1 32 ssa_3 = f2f32 ssa_2
        vec1 16 ssa_4 = intrinsic load_uniform (ssa_1) (4, 4) /* base=4 */ /* range=4 */        /* b */
        vec1 32 ssa_5 = f2f32 ssa_4
        vec1 16 ssa_6 = fadd ssa_2, ssa_4
        vec1 32 ssa_7 = f2f32 ssa_6
        vec1 32 ssa_8 = fadd ssa_7, ssa_0
        vec4 32 ssa_9 = vec4 ssa_3, ssa_5, ssa_5, ssa_8
        intrinsic store_output (ssa_9, ssa_1) (4, 15, 0) /* base=4 */ /* wrmask=xyzw */ /* component=0 */       /* gl_FragColor */
        /* succs: block_1 */
        block block_1:
}

But I think this is just an example where it happens to work. I need to think
about this some more...

> 
> The GLSL spec really wants you to think of operations like plus,
> times, etc. as having a precision qualifier attached to them, as well
> as input/output variables. This is also how SPIR-V works. The
> propagation algorithm is just a small syntactic sugar, so the deeper
> you push it down the pipeline, the more whack-a-mole games like this
> you're going to have to play, and if you have to start disabling
> optimizations like tree grafting, or you have to choose the final
> bitsize before you have all the information SSA gives you (like moving
> it before nir_lower_var_copies() and nir_vars_to_ssa()), then
> performance will be worse.
> 
> >
> >
> > Now, you may still be very correct in general that my approach is flawed. I've
> > been hoping to work on item 1) in my TODO list (found in patch number 61 in
> > the list). There has just been too many parts that I had to address (including
> > Intel backend) to get performance numbers for benchmarks.
> >
> > I'm not very good at saying on paper if something works or not - I rather try
> > it out to understand it better. My idea is to use deref loads and stores as
> > recursion starting points.