[Mesa-dev] [PATCH v2 9/9] nv50/ir/tgsi: split mad to mul+add

Tue Jun 13 13:00:28 UTC 2017

On Tue, Jun 13, 2017 at 8:47 AM, Martin Peres <martin.peres at free.fr> wrote:
>
>
> On 13/06/17 15:43, Ilia Mirkin wrote:
>>
>> On Tue, Jun 13, 2017 at 8:18 AM, Roland Scheidegger <sroland at vmware.com>
>> wrote:
>>>
>>> Am 13.06.2017 um 08:57 schrieb Karol Herbst:
>>>>
>>>> On Tue, Jun 13, 2017 at 2:17 AM, Roland Scheidegger <sroland at vmware.com>
>>>> wrote:
>>>>>
>>>>> I am actually also thinking this should be different.
>>>>>
>>>>> e.g. imho MAD means the operation can be either fused or unfused.
>>>>> This is the "traditional" definition of MAD - opencl for instance will
>>>>> follow this too, albeit this isn't mentioned in the gallium docs (it
>>>>> probably should be).
>>>>> (OpenCL says: "Whether or how the product of a * b is rounded and how
>>>>> supernormal or subnormal intermediate products are handled is not
>>>>> defined. mad is intended to be used where speed is preferred over
>>>>> accuracy.")
>>>>> I think doing something different here in gallium can only lead to
>>>>> madness long term - glsl doesn't have mad in the first place, and as far
>>>>> as I can tell d3d10 is ok with fused/unfused mad too (the docs stating
>>>>> "Fused operations (such as mad, dp3) produce results that are no less
>>>>> accurate than the worst possible serial ordering of evaluation of the
>>>>> unfused expansion of the operation.")
>>>>>
>>>>> This means that mul+add cannot be fused anywhere to a mad if precise is
>>>>> specified, and therefore you should never have to worry about doing a
>>>>> fused or unfused mul/add in the driver with a mad - it's enough if you
>>>>> just don't fuse mul+add in the driver itself (if you can't do unfused
>>>>> mad).
>>>>>
>>>>> Roland
>>>>>
>>>>
>>>> Well, there is a TGSI peephole doing this mul+add => mad optimisation.
>>>> It isn't wrong, because mad != fma and mul+add == mad. But on Fermi+
>>>> Nvidia hardware there is no mad, only fma, and because mad != fma we
>>>> need to split it up again.
>>>>
>>>> So either TGSI doesn't merge it if the instruction is flagged as precise
>>>> (which it is in those tests mentioned), although it is correct, or we
>>>> lower something in the driver, because the instruction isn't supported
>>>> by the hardware all along.
>>>
>>>
>>> Yes, I think the TGSI peephole shouldn't merge mul+add to mad with
>>> precise. You say this isn't wrong, but imho it clearly is, because no one
>>> ever said MAD can't be a fused multiply-add - it is multiply + add, yes,
>>> but whether there's intermediate rounding isn't specified. FWIW gallivm code
>>> also assumes this, and will use llvm.fmuladd for implementation (which
>>> is exactly the same "mul+add" story as opencl mad, and will use fma on
>>> cpus supporting it and separate mul+add otherwise, save some bugs in
>>> older llvm versions apparently).
>>> So we should just clarify that in the tgsi docs - mad is multiply + add,
>>> with undefined intermediate rounding, it can be a fused mul+add or an
>>> unfused one (technically it could also be something in-between I suppose
>>> since the apis just specify the accuracy isn't worse than an unfused
>>> multiply + add). Every driver gets to use whatever it can do fastest,
>>> and because there's no specified intermediate rounding for it, precise
>>> doesn't change anything there.
>>>
>>> That's at least my opinion of what TGSI_OPCODE_MAD should be (of course,
>>> older gpus always used unfused mad, but this wasn't a requirement).
>>
>>
>> BTW, irrespective of how this conversation turns out, I think it's a
>> good idea to split MAD into mul + add in the nv50 backend on input,
>> unconditionally.
>
>
> I seem to remember that using MAD introduced a performance regression on my
> nv86 for some benchmarks. I will need to get the setup working again for
> mesa testing.

It did, but on Fermi or Kepler, I thought. Using IMAD is apparently
not a great idea. But that's entirely separate from what's in the
TGSI.