[Mesa-dev] [PATCH v2 9/9] nv50/ir/tgsi: split mad to mul+add

Tue Jun 13 12:47:26 UTC 2017

On 13/06/17 15:43, Ilia Mirkin wrote:
> On Tue, Jun 13, 2017 at 8:18 AM, Roland Scheidegger <sroland at vmware.com> wrote:
>> On 13.06.2017 at 08:57, Karol Herbst wrote:
>>> On Tue, Jun 13, 2017 at 2:17 AM, Roland Scheidegger <sroland at vmware.com> wrote:
>>>> I am actually also thinking this should be different.
>>>>
>>>> e.g. imho MAD means the operation can be either fused or unfused.
>>>> This is the "traditional" definition of MAD - opencl for instance will
>>>> follow this too, albeit this isn't mentioned in the gallium docs (it
>>>> probably should be).
>>>> (OpenCL says: "Whether or how the product of a * b is rounded and how
>>>> supernormal or subnormal intermediate products are handled is not
>>>> defined. mad is intended to be used where speed is preferred over
>>>> accuracy.")
>>>> I think doing something different here in gallium can only lead to
>>>> madness long term - glsl doesn't have mad in the first place, and as far
>>>> as I can tell d3d10 is ok with fused/unfused mad too (the docs stating
>>>> "Fused operations (such as mad, dp3) produce results that are no less
>>>> accurate than the worst possible serial ordering of evaluation of the
>>>> unfused expansion of the operation.")
>>>>
>>>> This means that mul+add cannot be fused anywhere to a mad if precise is
>>>> specified, and therefore you should never have to worry about doing a
>>>> fused or unfused mul/add in the driver with a mad - it's enough if you
>>>> just don't fuse mul+add in the driver itself (if you can't do unfused mad).
>>>>
>>>> Roland
>>>>
>>>
>>> Well, there is a TGSI peephole doing this mul+add => mad optimisation.
>>> It isn't wrong, because mad != fma and mul+add == mad. But on Fermi+
>>> Nvidia hardware there is no mad, only fma, and because mad != fma we
>>> need to split it up again.
>>>
>>> So either TGSI doesn't merge it if the instruction is flagged as precise
>>> (which it is in the tests mentioned), although merging is correct, or we
>>> lower it in the driver, because the instruction isn't supported by the
>>> hardware in the first place.
>>
>> Yes, I think the TGSI peephole shouldn't merge mul+add to mad with
>> precise. You say this isn't wrong, but imho it clearly is, because no one
>> ever said MAD can't be a fused multiply-add - it is multiply + add, yes, but
>> whether there's intermediate rounding isn't specified. FWIW gallivm code
>> also assumes this, and will use llvm.fmuladd for implementation (which
>> is exactly the same "mul+add" story as opencl mad, and will use fma on
>> cpus supporting it and separate mul+add otherwise, save some bugs in
>> older llvm versions apparently).
>> So we should just clarify that in the tgsi docs - mad is multiply + add,
>> with undefined intermediate rounding, it can be a fused mul+add or an
>> unfused one (technically it could also be something in-between I suppose
>> since the APIs just specify the accuracy isn't worse than an unfused
>> multiply + add). Every driver gets to use what it can do fastest for it,
>> and because there's no specified intermediate rounding for it, precise
>> doesn't change anything there.
>>
>> That's at least my opinion of what TGSI_OPCODE_MAD should be (of course,
>> older gpus always used unfused mad, but this wasn't a requirement).
> 
> BTW, irrespective of how this conversation turns out, I think it's a
> good idea to split MAD into mul + add in the nv50 backend on input,
> unconditionally.

I seem to remember that using MAD introduced a performance regression on 
my nv86 for some benchmarks. I will need to get the setup working again 
for mesa testing.

Martin
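
For anyone following the fused vs. unfused distinction discussed above, here
is a minimal sketch of why the two are not interchangeable under "precise".
It is not Mesa or nv50 code. The function names unfused_mad and fused_mad are
made up for illustration, it uses doubles rather than the 32-bit floats a
shader would see, and the fused variant is only emulated with exact rational
arithmetic (one rounding at the end), which is an assumption standing in for
a hardware fma.

```python
from fractions import Fraction


def unfused_mad(a, b, c):
    """mul followed by add: the product is rounded before the addition."""
    product = a * b          # first rounding
    return product + c       # second rounding


def fused_mad(a, b, c):
    """Emulated fma: compute a*b+c exactly, then round once to a double."""
    exact = Fraction(a) * Fraction(b) + Fraction(c)
    return float(exact)      # single, correctly rounded conversion


# Inputs chosen so the exact product (1 - 2**-54) lies halfway between two
# doubles and gets rounded up to 1.0 in the unfused case.
a = 1.0 + 2.0 ** -27
b = 1.0 - 2.0 ** -27
c = -1.0

if __name__ == "__main__":
    print("unfused:", unfused_mad(a, b, c))
    print("fused:  ", fused_mad(a, b, c))
```

With these inputs the unfused version loses the low bits of the product and
returns zero, while the single-rounding version keeps them, so the two results
differ. That is the kind of difference a "precise" qualifier is meant to rule
out.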