[Mesa-dev] [PATCH 1/2] gallium: add TGSI_OPCODE_FMA

Mon Mar 2 10:46:19 PST 2015

On Mon, Mar 2, 2015 at 5:48 PM, Roland Scheidegger <sroland at vmware.com> wrote:
> Am 02.03.2015 um 17:12 schrieb Marek Olšák:
>> On Mon, Mar 2, 2015 at 4:55 PM, Roland Scheidegger <sroland at vmware.com> wrote:
>>> Am 02.03.2015 um 12:52 schrieb Marek Olšák:
>>>> From: Marek Olšák <marek.olsak at amd.com>
>>>>
>>>> Needed by ARB_gpu_shader5.
>>>> ---
>>>>  src/gallium/auxiliary/gallivm/lp_bld_limits.h    |  1 +
>>>>  src/gallium/auxiliary/tgsi/tgsi_exec.h           |  1 +
>>>>  src/gallium/auxiliary/tgsi/tgsi_info.c           |  2 +-
>>>>  src/gallium/auxiliary/tgsi/tgsi_util.c           |  1 +
>>>>  src/gallium/docs/source/screen.rst               |  1 +
>>>>  src/gallium/docs/source/tgsi.rst                 | 23 +++++++++++++++++++++++
>>>>  src/gallium/drivers/freedreno/freedreno_screen.c |  1 +
>>>>  src/gallium/drivers/i915/i915_screen.c           |  1 +
>>>>  src/gallium/drivers/nouveau/nv30/nv30_screen.c   |  2 ++
>>>>  src/gallium/drivers/nouveau/nv50/nv50_screen.c   |  1 +
>>>>  src/gallium/drivers/nouveau/nvc0/nvc0_screen.c   |  1 +
>>>>  src/gallium/drivers/r300/r300_screen.c           |  2 ++
>>>>  src/gallium/drivers/r600/r600_pipe.c             |  1 +
>>>>  src/gallium/drivers/r600/r600_shader.c           |  6 +++---
>>>>  src/gallium/drivers/radeonsi/si_pipe.c           |  1 +
>>>>  src/gallium/drivers/svga/svga_screen.c           |  2 ++
>>>>  src/gallium/drivers/vc4/vc4_screen.c             |  1 +
>>>>  src/gallium/include/pipe/p_defines.h             |  1 +
>>>>  src/gallium/include/pipe/p_shader_tokens.h       |  2 +-
>>>>  src/mesa/state_tracker/st_glsl_to_tgsi.cpp       | 12 ++++++++----
>>>>  20 files changed, 54 insertions(+), 9 deletions(-)
>>>>
>>>> diff --git a/src/gallium/auxiliary/gallivm/lp_bld_limits.h b/src/gallium/auxiliary/gallivm/lp_bld_limits.h
>>>> index 2962360..c5c51c1 100644
>>>> --- a/src/gallium/auxiliary/gallivm/lp_bld_limits.h
>>>> +++ b/src/gallium/auxiliary/gallivm/lp_bld_limits.h
>>>> @@ -129,6 +129,7 @@ gallivm_get_shader_param(enum pipe_shader_cap param)
>>>>     case PIPE_SHADER_CAP_DOUBLES:
>>>>     case PIPE_SHADER_CAP_TGSI_DROUND_SUPPORTED:
>>>>     case PIPE_SHADER_CAP_TGSI_DFRACEXP_DLDEXP_SUPPORTED:
>>>> +   case PIPE_SHADER_CAP_TGSI_FMA_SUPPORTED:
>>>>        return 0;
>>>>     }
>>>>     /* if we get here, we missed a shader cap above (and should have seen
>>>> diff --git a/src/gallium/auxiliary/tgsi/tgsi_exec.h b/src/gallium/auxiliary/tgsi/tgsi_exec.h
>>>> index 609c81b..0e59b88 100644
>>>> --- a/src/gallium/auxiliary/tgsi/tgsi_exec.h
>>>> +++ b/src/gallium/auxiliary/tgsi/tgsi_exec.h
>>>> @@ -459,6 +459,7 @@ tgsi_exec_get_shader_param(enum pipe_shader_cap param)
>>>>     case PIPE_SHADER_CAP_TGSI_DFRACEXP_DLDEXP_SUPPORTED:
>>>>        return 1;
>>>>     case PIPE_SHADER_CAP_TGSI_DROUND_SUPPORTED:
>>>> +   case PIPE_SHADER_CAP_TGSI_FMA_SUPPORTED:
>>>>        return 0;
>>>>     }
>>>>     /* if we get here, we missed a shader cap above (and should have seen
>>>> diff --git a/src/gallium/auxiliary/tgsi/tgsi_info.c b/src/gallium/auxiliary/tgsi/tgsi_info.c
>>>> index 4d838fd..e6e0a60 100644
>>>> --- a/src/gallium/auxiliary/tgsi/tgsi_info.c
>>>> +++ b/src/gallium/auxiliary/tgsi/tgsi_info.c
>>>> @@ -56,7 +56,7 @@ static const struct tgsi_opcode_info opcode_info[TGSI_OPCODE_LAST] =
>>>>     { 1, 3, 0, 0, 0, 0, COMP, "MAD", TGSI_OPCODE_MAD },
>>>>     { 1, 2, 0, 0, 0, 0, COMP, "SUB", TGSI_OPCODE_SUB },
>>>>     { 1, 3, 0, 0, 0, 0, COMP, "LRP", TGSI_OPCODE_LRP },
>>>> -   { 0, 0, 0, 0, 0, 0, NONE, "", 19 },      /* removed */
>>>> +   { 1, 3, 0, 0, 0, 0, COMP, "FMA", TGSI_OPCODE_FMA },
>>>>     { 1, 1, 0, 0, 0, 0, REPL, "SQRT", TGSI_OPCODE_SQRT },
>>>>     { 1, 3, 0, 0, 0, 0, REPL, "DP2A", TGSI_OPCODE_DP2A },
>>>>     { 0, 0, 0, 0, 0, 0, NONE, "", 22 },      /* removed */
>>>> diff --git a/src/gallium/auxiliary/tgsi/tgsi_util.c b/src/gallium/auxiliary/tgsi/tgsi_util.c
>>>> index d572ff0..e5b8427 100644
>>>> --- a/src/gallium/auxiliary/tgsi/tgsi_util.c
>>>> +++ b/src/gallium/auxiliary/tgsi/tgsi_util.c
>>>> @@ -193,6 +193,7 @@ tgsi_util_get_inst_usage_mask(const struct tgsi_full_instruction *inst,
>>>>     case TGSI_OPCODE_MAD:
>>>>     case TGSI_OPCODE_SUB:
>>>>     case TGSI_OPCODE_LRP:
>>>> +   case TGSI_OPCODE_FMA:
>>>>     case TGSI_OPCODE_FRC:
>>>>     case TGSI_OPCODE_CEIL:
>>>>     case TGSI_OPCODE_CLAMP:
>>>> diff --git a/src/gallium/docs/source/screen.rst b/src/gallium/docs/source/screen.rst
>>>> index e0fd1a2..dd7a012 100644
>>>> --- a/src/gallium/docs/source/screen.rst
>>>> +++ b/src/gallium/docs/source/screen.rst
>>>> @@ -336,6 +336,7 @@ to be 0.
>>>>    is supported. If it is, DTRUNC/DCEIL/DFLR/DROUND opcodes may be used.
>>>>  * ``PIPE_SHADER_CAP_TGSI_DFRACEXP_DLDEXP_SUPPORTED``: Whether DFRACEXP and
>>>>    DLDEXP are supported.
>>>> +* ``PIPE_SHADER_CAP_TGSI_FMA_SUPPORTED``: Whether TGSI_OPCODE_FMA is supported.
>>>>
>>>>
>>>>  .. _pipe_compute_cap:
>>>> diff --git a/src/gallium/docs/source/tgsi.rst b/src/gallium/docs/source/tgsi.rst
>>>> index b0a975a..6871676 100644
>>>> --- a/src/gallium/docs/source/tgsi.rst
>>>> +++ b/src/gallium/docs/source/tgsi.rst
>>>> @@ -272,6 +272,29 @@ This instruction replicates its result.
>>>>    dst.w = src0.w \times src1.w + (1 - src0.w) \times src2.w
>>>>
>>>>
>>>> +.. opcode:: FMA - Fused Multiply-Add
>>>> +
>>>> +The results may not be identical to evaluating the expression (a*b)+c,
>>>> +because the computation may be performed in a single operation with
>>>> +intermediate precision different from that used to compute a non-FMA
>>>> +expression.
>>>> +
>>>> +The results of FMA are guaranteed to be invariant given fixed inputs
>>>> +<src0>, <src1>, and <src2>. That means the implementation is not allowed
>>>> +to expand the opcode to MUL+ADD and apply algebraic optimizations affecting
>>>> +the floating-point results.
>>> I think these paragraphs are slightly confusing, especially "because
>>> the computation may be performed in a single operation with intermediate
>>> precision different from that used to compute a non-FMA expression".
>>> Would be more obvious to say something along the lines that (in contrast
>>> to MAD) no intermediate rounding is happening. Otherwise this sounds
>>> like it would be allowed to do some sort of intermediate rounding, as
>>> long as the intermediate precision is larger than what you'd get by
>>> separate mul+add, which I don't think is what you wanted.
>>
>> Well, it's partially copied from the extension spec and it just states
>> that the intermediate precision is different. I guess the main point
>> is that the result is invariant with regard to inputs.
> Hmm, frankly I find the wording confusing, spec or not. It makes me think
> it was worded like that on purpose, though: maybe not all chips can
> actually guarantee "correct" fma results (correct as in the OpenCL fma
> specification, which is a lot better imho: "Returns the correctly rounded
> floating-point representation of the sum of c with the infinitely
> precise product of a and b. Rounding of intermediate products shall not
> occur. Edge case behavior is per the IEEE 754-2008 standard.").
> GLSL also has quite different wording, but there the meaning is
> somewhat different - https://www.opengl.org/sdk/docs/man/html/fma.xhtml.
> In other words, if you don't have the precise attribute, it's just the same
> as a MAD. With precise, though, I think it implies (because it's
> considered a single operation, not "may be performed in a single
> operation" like in ARB_gpu_shader5) that there's no intermediate
> rounding, which is just what OpenCL expects.
>
> Roland
>
>
>
>>
>>> (FWIW I don't think we really clarified MAD wrt intermediate rounding. I
>>> particularly like the OpenCL convention that FMA = no rounding, MUL + ADD =
>>> rounding, MAD = do whatever is fastest (because optimizing backends can
>>> fuse MUL+ADD back into a MAD themselves if the hw can do that with
>>> intermediate rounding), but traditionally of course MAD always did
>>> intermediate rounding.)
>>
>> Also MAD doesn't support denormals (on radeon), while FMA does. IIRC,
>> FMA is the slower one of the two.
>>
>
> Interesting. I thought most gpus wouldn't handle denorms at all for
> single-precision floats, in any operation, so there wouldn't be much
> point supporting them for just fma. Or can you enable that explicitly for
> most operations, just not for MAD?

Yeah, there is a global switch that sets the initial behavior and a
special shader instruction that can change it.

Marek