[Mesa-dev] intel: WIP: Support for using 16-bits for mediump

Tue Nov 6 12:45:40 UTC 2018

On Tue, Nov 06, 2018 at 11:31:58AM +0100, Connor Abbott wrote:
> On Tue, Nov 6, 2018 at 11:14 AM Pohjolainen, Topi
> <topi.pohjolainen at gmail.com> wrote:
> >
> > On Tue, Nov 06, 2018 at 10:45:52AM +0100, Connor Abbott wrote:
> > > As far as I understand, mediump handling can be split into two parts:
> > >
> > > 1. Figuring out which operations (instructions or SSA values in NIR)
> > > can use relaxed precision.
> > > 2. Deciding which relaxed-precision operations to actually compute in
> > > 16-bit precision.
> > >
> > > At least for GLSL, #1 is pretty well nailed down by the GLSL spec,
> > > where it's specified in terms of the source expressions. For example,
> > > something like:
> > >
> > > mediump float a = ...;
> > > mediump float b = ...;
> > > float c = a + b;
> > > float d = c + 2.0;
> > >
> > > the last addition must be performed in full precision, whereas for:
> > >
> > >
> > > mediump float a = ...;
> > > mediump float b = ...;
> > > float d = (a + b) + 2.0;
> > >
> > > it can be lowered to 16-bit. This information gets lost during
> > > expression grafting in GLSL IR, or vars-to-SSA in NIR, and even the
> > > AST -> GLSL IR transform will sometimes split up expressions, so it
> > > seems like both are too low-level for this. The analysis described by
> > > the spec (the paragraph in section 4.7.3 "Precision Qualifiers" of the
> > > GLSL ES 3.20 spec) has to happen on the AST after type checking but
> > > before lowering to GLSL IR in order to be correct and not overly
> > > conservative. If you want to do it in NIR since #2 is easier with SSA,
> > > then sure... but we can't mix them up and do both at the same time.
> > > We'll have to add support for annotating ir_expression's and nir_instr
> > > (or maybe nir_ssa_def's) with a relaxed precision, and filter that
> > > information down through the pipeline. Hopefully that also works
> > > better for SPIR-V, where you can annotate individual instructions as
> > > being RelaxedPrecision, and afaik (hopefully) #1 is handled by
> > > glslang.
> >
> > I tried to describe the logic I used and my interpretation of the spec in
> > the accompanying patch:
> >
> > https://lists.freedesktop.org/archives/mesa-dev/2018-November/208683.html
> >
> > Does it make any sense?
> 
> It seems incorrect, since it will make the addition in my example
> operate in 16 bit precision when it shouldn't. As I explained above,
> it's impossible to do this correctly in NIR.
> 
> Also, abusing a 16-bit bitsize in NIR to mean mediump is not ok. There
> are other vulkan/glsl extensions out there that provide actual fp16
> support, where the result is guaranteed to be calculated as a
> half-float, and these obviously won't work properly with this pass. We
> need to add a flag to the SSA def, or Jason's idea a long time ago was
> to add a fake "24-bit" bitsize. Part of #2 will involve converting the
> bitsize to be 16-bit and removing the flag.

I wrote a small shader_runner test:

[require]
GL ES >= 2.0
GLSL ES >= 1.00

[vertex shader]
#version 100

precision highp float;

attribute vec4 piglit_vertex;

void main()
{   
    gl_Position = piglit_vertex;
}

[fragment shader]
#version 100
precision highp float;

uniform mediump float a;
uniform mediump float b;

void main()
{   
    float c = a + b;
    float d = c + 0.4;

    gl_FragColor = vec4(a, b, c, d);
}

[test]
uniform float a 0.1
uniform float b 0.2
draw rect -1 -1 2 2
probe all rgba 0.1 0.2 0.3 0.7

And that made me realize another shortcoming in my implementation - I lowered
variable precision after running nir_lower_var_copies(), losing the precision
of local temporaries. Moving it before gave me:

NIR (final form) for fragment shader:
shader: MESA_SHADER_FRAGMENT
name: GLSL3
inputs: 0
outputs: 0
uniforms: 8
shared: 0
decl_var uniform INTERP_MODE_NONE float16_t a (0, 0, 0)
decl_var uniform INTERP_MODE_NONE float16_t b (1, 4, 0)
decl_var shader_out INTERP_MODE_NONE vec4 gl_FragColor (FRAG_RESULT_COLOR, 4, 0)
decl_function main (0 params)

impl main {
        block block_0:
        /* preds: */
        vec1 32 ssa_0 = load_const (0x3ecccccd /* 0.400000 */)
        vec1 32 ssa_1 = load_const (0x00000000 /* 0.000000 */)
        vec1 16 ssa_2 = intrinsic load_uniform (ssa_1) (0, 4) /* base=0 */ /* range=4 */        /* a */
        vec1 16 ssa_3 = intrinsic load_uniform (ssa_1) (4, 4) /* base=4 */ /* range=4 */        /* b */
        vec1 16 ssa_4 = fadd ssa_2, ssa_3
        vec1 32 ssa_5 = f2f32 ssa_4
        vec1 32 ssa_6 = f2f32 ssa_2
        vec1 32 ssa_7 = f2f32 ssa_3
        vec1 32 ssa_8 = fadd ssa_5, ssa_0
        vec4 32 ssa_9 = vec4 ssa_6, ssa_7, ssa_5, ssa_8
        intrinsic store_output (ssa_9, ssa_1) (4, 15, 0) /* base=4 */ /* wrmask=xyzw */ /* component=0 */       /* gl_FragColor */
        /* succs: block_1 */
        block block_1:
}

which looks to do the right thing.
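
To check the arithmetic by hand, the NIR above can be emulated in a few lines
of Python. This is only a sketch: the helper names (f16, f32, nir_mixed,
all_f32) are made up for illustration, and it assumes that the float16_t
uniforms hold the test values rounded to half precision, which the dump itself
does not show.

```python
import struct

def f16(x):
    """Round a Python float to the nearest IEEE 754 binary16 value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def f32(x):
    """Round a Python float to the nearest IEEE 754 binary32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def nir_mixed(a, b, k=0.4):
    """Follow the NIR dump: 16-bit fadd first, then a 32-bit fadd."""
    ssa_2 = f16(a)                 # load_uniform a (assumed rounded to half)
    ssa_3 = f16(b)                 # load_uniform b (assumed rounded to half)
    ssa_4 = f16(ssa_2 + ssa_3)     # vec1 16 fadd
    ssa_5 = f32(ssa_4)             # f2f32, exact widening
    ssa_8 = f32(ssa_5 + f32(k))    # vec1 32 fadd with the 0.4 constant
    return (f32(ssa_2), f32(ssa_3), ssa_5, ssa_8)

def all_f32(a, b, k=0.4):
    """Reference: every operation carried out in 32-bit precision."""
    a32, b32 = f32(a), f32(b)
    c = f32(a32 + b32)
    d = f32(c + f32(k))
    return (a32, b32, c, d)

if __name__ == "__main__":
    for name, value in zip("abcd", nir_mixed(0.1, 0.2)):
        print(name, value)
    print("reference", all_f32(0.1, 0.2))
```

With a = 0.1 and b = 0.2 the mixed path differs from the all-32-bit path by
roughly 1e-4 per channel, so the two are close but not identical.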

Now, you may still be very correct in general that my approach is flawed. I've
been hoping to work on item 1) in my TODO list (found in patch number 61 in
the list). There have just been too many parts that I had to address (including
the Intel backend) to get performance numbers for benchmarks.

I'm not very good at saying on paper whether something works or not - I'd
rather try it out to understand it better. My idea is to use deref loads and
stores as recursion starting points.
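
One possible reading of that idea, as a toy model rather than actual NIR: start
at each store, recurse through the sources, and stop at loads, which take the
declared precision of their variable. Everything here (the Instr class, the
relaxed_defs function, the op names) is hypothetical. The sketch also
simplifies the spec rules: constants contribute no precision, and an operation
whose operands are all constants is conservatively treated as highp.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Instr:
    name: str
    op: str               # "load", "const", "fadd" or "store"
    srcs: tuple = ()      # names of the source instructions
    precision: str = ""   # declared variable precision (load/store only)

def relaxed_defs(instrs):
    """Return the names of fadd results that may use relaxed precision."""
    by_name = {i.name: i for i in instrs}
    result = {}

    def visit(ins):
        if ins.op == "const":
            return None                   # literals carry no precision
        if ins.op == "load":
            prec = ins.precision          # recursion stops at loads
        else:
            known = [p for p in (visit(by_name[s]) for s in ins.srcs) if p]
            relaxed = known and all(p == "mediump" for p in known)
            prec = "mediump" if relaxed else "highp"
        result[ins.name] = prec
        return prec

    for ins in instrs:                    # stores are the starting points
        if ins.op == "store":
            for src in ins.srcs:
                visit(by_name[src])
    return {n for n, p in result.items()
            if p == "mediump" and by_name[n].op == "fadd"}

a = Instr("a", "load", precision="mediump")
b = Instr("b", "load", precision="mediump")
k = Instr("k", "const")

# float c = a + b; float d = c + 0.4;   (c is a highp temporary)
split = [a, b, Instr("t0", "fadd", ("a", "b")),
         Instr("st_c", "store", ("t0",), "highp"),
         Instr("c", "load", precision="highp"), k,
         Instr("t1", "fadd", ("c", "k")),
         Instr("st_d", "store", ("t1",), "highp")]

# float d = (a + b) + 0.4;
fused = [a, b, Instr("t0", "fadd", ("a", "b")), k,
         Instr("t1", "fadd", ("t0", "k")),
         Instr("st_d", "store", ("t1",), "highp")]

print(relaxed_defs(split))
print(relaxed_defs(fused))
```

In the split form only the first addition is marked relaxed, because the load
of the highp temporary c stops the propagation. That distinction only exists
while the temporary is still a variable with its own loads and stores, which
is why the ordering relative to the variable lowering passes matters.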