[Mesa-dev] [PATCH v2 9/9] nv50/ir/tgsi: split mad to mul+add

Wed Jun 14 10:11:43 UTC 2017

On 13/06/17 16:00, Ilia Mirkin wrote:
> On Tue, Jun 13, 2017 at 8:47 AM, Martin Peres <martin.peres at free.fr> wrote:
>>
>>
>> On 13/06/17 15:43, Ilia Mirkin wrote:
>>>
>>> On Tue, Jun 13, 2017 at 8:18 AM, Roland Scheidegger <sroland at vmware.com>
>>> wrote:
>>>>
>>>> Am 13.06.2017 um 08:57 schrieb Karol Herbst:
>>>>>
>>>>> On Tue, Jun 13, 2017 at 2:17 AM, Roland Scheidegger <sroland at vmware.com>
>>>>> wrote:
>>>>>>
>>>>>> I am actually also thinking this should be different.
>>>>>>
>>>>>> e.g. imho MAD means the operation can be either fused or unfused.
>>>>>> This is the "traditional" definition of MAD - opencl for instance will
>>>>>> follow this too, albeit this isn't mentioned in the gallium docs (it
>>>>>> probably should be).
>>>>>> (OpenCL says: "Whether or how the product of a * b is rounded and how
>>>>>> supernormal or subnormal intermediate products are handled is not
>>>>>> defined. mad is intended to be used where speed is preferred over
>>>>>> accuracy.")
>>>>>> I think doing something different here in gallium can only lead to
>>>>>> madness long term - glsl doesn't have mad in the first place, and as far
>>>>>> as I can tell d3d10 is ok with fused/unfused mad too (the docs stating
>>>>>> "Fused operations (such as mad, dp3) produce results that are no less
>>>>>> accurate than the worst possible serial ordering of evaluation of the
>>>>>> unfused expansion of the operation.")
>>>>>>
>>>>>> This means that mul+add cannot be fused anywhere to a mad if precise is
>>>>>> specified, and therefore you should never have to worry about doing a
>>>>>> fused or unfused mul/add in the driver with a mad - it's enough if you
>>>>>> just don't fuse mul+add in the driver itself (if you can't do unfused
>>>>>> mad).
>>>>>>
>>>>>> Roland
>>>>>>
>>>>>
>>>>> Well, there is a TGSI peephole doing this mul+add => mad optimisation.
>>>>> It isn't wrong, because mad != fma and mul+add == mad, but on Fermi+
>>>>> Nvidia hardware there is no mad, only fma, and because mad != fma
>>>>> we need to split it up again.
>>>>>
>>>>> So either TGSI doesn't merge it if the instruction is flagged as precise
>>>>> (which it is in those tests mentioned), although it is correct, or we
>>>>> lower something in the driver, because the instruction isn't supported
>>>>> by the hardware all along.
>>>>
>>>>
>>>> Yes, I think the TGSI peephole shouldn't merge mul+add to mad with
>>>> precise. You say this isn't wrong, but imho it clearly is, because no one
>>>> ever said MAD can't be a fused add - it is multiply + add, yes, but whether
>>>> there's intermediate rounding or not isn't specified. FWIW gallivm code
>>>> also assumes this, and will use llvm.fmuladd for implementation (which
>>>> is exactly the same "mul+add" story as opencl mad, and will use fma on
>>>> cpus supporting it and separate mul+add otherwise, save some bugs in
>>>> older llvm versions apparently).
>>>> So we should just clarify that in the tgsi docs - mad is multiply + add,
>>>> with undefined intermediate rounding, it can be a fused mul+add or an
>>>> unfused one (technically it could also be something in-between I suppose
>>>> since the APIs just specify the accuracy isn't worse than an unfused
>>>> multiply + add). Every driver gets to use what it can do fastest for it,
>>>> and because there's no specified intermediate rounding for it, precise
>>>> doesn't change anything there.
>>>>
>>>> That's at least my opinion of what TGSI_OPCODE_MAD should be (of course,
>>>> older gpus always used unfused mad, but this wasn't a requirement).
>>>
>>>
>>> BTW, irrespective of how this conversation turns out, I think it's a
>>> good idea to split MAD into mul + add in the nv50 backend on input,
>>> unconditionally.
>>
>>
>> I seem to remember that using MAD introduced a performance regression on my
>> nv86 for some benchmarks. I will need to get the setup working again for
>> mesa testing.
> 
> It did, but on Fermi or Kepler, I thought. Using IMAD is apparently
> not a great idea. But that's entirely separate from what's in the
> TGSI.

Oh, right, it may have been on my nvd9.