[Mesa-dev] [PATCH v2 9/9] nv50/ir/tgsi: split mad to mul+add

Tue Jun 13 12:43:14 UTC 2017

On Tue, Jun 13, 2017 at 8:18 AM, Roland Scheidegger <sroland at vmware.com> wrote:
> Am 13.06.2017 um 08:57 schrieb Karol Herbst:
>> On Tue, Jun 13, 2017 at 2:17 AM, Roland Scheidegger <sroland at vmware.com> wrote:
>>> I am actually also thinking this should be different.
>>>
>>> e.g. imho MAD means the operation can be either fused or unfused.
>>> This is the "traditional" definition of MAD - opencl for instance will
>>> follow this too, albeit this isn't mentioned in the gallium docs (it
>>> probably should be).
>>> (OpenCL says: "Whether or how the product of a * b is rounded and how
>>> supernormal or subnormal intermediate products are handled is not
>>> defined. mad is intended to be used where speed is preferred over
>>> accuracy.")
>>> I think doing something different here in gallium can only lead to
>>> madness long term - glsl doesn't have mad in the first place, and as far
>>> as I can tell d3d10 is ok with fused/unfused mad too (the docs stating
>>> "Fused operations (such as mad, dp3) produce results that are no less
>>> accurate than the worst possible serial ordering of evaluation of the
>>> unfused expansion of the operation.")
>>>
>>> This means that mul+add cannot be fused anywhere to a mad if precise is
>>> specified, and therefore you should never have to worry about doing a
>>> fused or unfused mul/add in the driver with a mad - it's enough if you
>>> just don't fuse mul+add in the driver itself (if you can't do unfused mad).
>>>
>>> Roland
>>>
>>
>> Well, there is a TGSI peephole doing this mul+add => mad optimisation. It
>> isn't wrong, because mad != fma and mul+add == mad. But on Fermi+ Nvidia
>> hardware there is no mad, only fma, and because mad != fma we need to
>> split it up again.
>>
>> So either TGSI doesn't merge it if the instruction is flagged as precise
>> (which it is in the tests mentioned), although merging is correct, or we
>> lower it in the driver, because the instruction isn't supported by the
>> hardware in the first place.
>
> Yes, I think the TGSI peephole shouldn't merge mul+add to mad with
> precise. You say this isn't wrong, but imho it clearly is, because no one
> ever said MAD can't be a fused add - it is multiply + add, yes, but whether
> there's intermediate rounding isn't specified. FWIW gallivm code
> also assumes this, and will use llvm.fmuladd for implementation (which
> is exactly the same "mul+add" story as opencl mad, and will use fma on
> cpus supporting it and separate mul+add otherwise, save some bugs in
> older llvm versions apparently).
> So we should just clarify that in the tgsi docs - mad is multiply + add,
> with undefined intermediate rounding, it can be a fused mul+add or an
> unfused one (technically it could also be something in-between I suppose
> since the apis just specify the accuracy isn't worse than an unfused
> multiply + add). Every driver gets to use what it can do fastest for it,
> and because there's no specified intermediate rounding for it, precise
> doesn't change anything there.
>
> That's at least my opinion of what TGSI_OPCODE_MAD should be (of course,
> older gpus always used unfused mad, but this wasn't a requirement).

BTW, irrespective of how this conversation turns out, I think it's a
good idea to split MAD into mul + add in the nv50 backend on input,
unconditionally.

Cheers,

  -ilia