[Mesa-dev] [PATCH v2 9/9] nv50/ir/tgsi: split mad to mul+add

Tue Jun 13 14:55:55 UTC 2017

On Tue, Jun 13, 2017 at 4:12 PM, Roland Scheidegger <sroland at vmware.com> wrote:
> Am 13.06.2017 um 15:11 schrieb Karol Herbst:
>> On Tue, Jun 13, 2017 at 2:18 PM, Roland Scheidegger <sroland at vmware.com> wrote:
>>> Am 13.06.2017 um 08:57 schrieb Karol Herbst:
>>>> On Tue, Jun 13, 2017 at 2:17 AM, Roland Scheidegger <sroland at vmware.com> wrote:
>>>>> I am actually also thinking this should be different.
>>>>>
>>>>> e.g. imho MAD means the operation can be either fused or unfused.
>>>>> This is the "traditional" definition of MAD - opencl for instance will
>>>>> follow this too, albeit this isn't mentioned in the gallium docs (it
>>>>> probably should be).
>>>>> (OpenCL says: "Whether or how the product of a * b is rounded and how
>>>>> supernormal or subnormal intermediate products are handled is not
>>>>> defined. mad is intended to be used where speed is preferred over
>>>>> accuracy.")
>>>>> I think doing something different here in gallium can only lead to
>>>>> madness long term - glsl doesn't have mad in the first place, and as far
>>>>> as I can tell d3d10 is ok with fused/unfused mad too (the docs stating
>>>>> "Fused operations (such as mad, dp3) produce results that are no less
>>>>> accurate than the worst possible serial ordering of evaluation of the
>>>>> unfused expansion of the operation.")
>>>>>
>>>>> This means that mul+add cannot be fused anywhere to a mad if precise is
>>>>> specified, and therefore you should never have to worry about doing a
>>>>> fused or unfused mul/add in the driver with a mad - it's enough if you
>>>>> just don't fuse mul+add in the driver itself (if you can't do unfused mad).
>>>>>
>>>>> Roland
>>>>>
>>>>
>>>> Well, there is a TGSI peephole doing this mul+add => mad optimisation.
>>>> It isn't wrong, because mad != fma and mul+add == mad. But on Fermi+
>>>> Nvidia hardware there is no mad, only fma, and because mad != fma we
>>>> need to split it up again.
>>>>
>>>> So either TGSI doesn't merge it if the instruction is flagged as precise
>>>> (which it is in the tests mentioned), although merging is correct, or we
>>>> lower something in the driver, because the instruction isn't supported by
>>>> the hardware anyway.
>>>
>>> Yes, I think the TGSI peephole shouldn't merge mul+add to mad with
>>> precise. You say this isn't wrong, but imho it clearly is, because no one
>>> ever said MAD can't be a fused add - it is multiply + add, yes, but whether
>>> there's intermediate rounding or not isn't specified. FWIW gallivm code
>>> also assumes this, and will use llvm.fmuladd for implementation (which
>>> is exactly the same "mul+add" story as opencl mad, and will use fma on
>>> cpus supporting it and separate mul+add otherwise, save some bugs in
>>> older llvm versions apparently).
>>> So we should just clarify that in the tgsi docs - mad is multiply + add,
>>> with undefined intermediate rounding, it can be a fused mul+add or an
>>> unfused one (technically it could also be something in-between I suppose
>>> since the apis just specify the accuracy isn't worse than an unfused
>>> multiply + add). Every driver gets to use what it can do fastest for it,
>>> and because there's no specified intermediate rounding for it, precise
>>> doesn't change anything there.
>>>
>>> That's at least my opinion of what TGSI_OPCODE_MAD should be (of course,
>>> older gpus always used unfused mad, but this wasn't a requirement).
>>>
>>> Roland
>>>
>>
>> I think the best idea would be to specify that:
>> TGSI_OPCODE_MAD is unfused mul+add
>> TGSI_OPCODE_FMA is fused mul+add
>>
>> Having TGSI_OPCODE_MAD being unfused and fused adds an ambiguity
>> without providing any advantages imho.
>>
>> This way it's clear what both are. The backend can still decide that it can use
>> FMA to implement TGSI_OPCODE_MAD or that it can't use MAD and splits it
>> up, but then the backend decides and the choice is explicit and respects
>> limitations of the hardware, which Gallium/TGSI doesn't know about.
>
> I just don't agree with that. There's lots of apis which have such an
> ambiguous mad, with precisely the intention of it being as fast as
> possible, with undefined intermediate rounding. I think there's a reason
> that d3d10 mad, opencl mad, llvm fmuladd all are exactly like that. Why
> should tgsi mad be different?
> It exists because you otherwise cannot say you don't want to allow
> unsafe math generally, but are ok if a mad is either fused or not. If
> you require a fused one, use fma. If you require an unfused
> multiply+add, just use mul and add. If you don't care, use mad.
> Granted, arguably with per-instruction precise modifier, mul + add
> without the modifier works as well.
>

Okay, so I think the sanest thing to do now is to adjust the peephole inside
TGSI to not merge mul+add into a mad if either the mul or the add has the
precise modifier.
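
Roughly, the check could look like the sketch below. This is only an
illustration of the rule, not the actual peephole: struct insn, enum op and
can_merge_to_mad() are hypothetical names made up for this mail, and a real
pass would also have to verify that the mul result has no other users.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified instruction record; NOT Mesa's real TGSI structs. */
enum op { OP_MUL, OP_ADD, OP_MAD };

struct insn {
    enum op op;
    bool precise;  /* per-instruction "precise" modifier */
    int dst;       /* destination register index */
    int src[3];    /* source register indices, unused slots are -1 */
};

/* Returns true if `mul` followed by `add` may be merged into one MAD. */
bool can_merge_to_mad(const struct insn *mul, const struct insn *add)
{
    if (mul->op != OP_MUL || add->op != OP_ADD)
        return false;

    /* MAD leaves the intermediate rounding undefined, so merging may change
     * the result. Refuse if either instruction is marked precise. */
    if (mul->precise || add->precise)
        return false;

    /* The add has to consume the mul result in one of its two operands. */
    return add->src[0] == mul->dst || add->src[1] == mul->dst;
}

/* Usage example: t2 = t0 * t1; t4 = t2 + t3 */
void merge_example(void)
{
    struct insn mul = { OP_MUL, false, 2, { 0, 1, -1 } };
    struct insn add = { OP_ADD, false, 4, { 2, 3, -1 } };

    assert(can_merge_to_mad(&mul, &add));   /* neither is precise */

    add.precise = true;
    assert(!can_merge_to_mad(&mul, &add));  /* precise add blocks the merge */
}
```

With that in place the driver never sees a MAD that came from a precise
mul+add pair, so it is free to implement MAD with fma.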

>>
>> Or we remove TGSI_OPCODE_MAD and let the backends do the opts.
> This would be a possibility, but backends might not be prepared for it
> (e.g. I don't think gallivm would let llvm emit fused fmas for mul + add
> sequence). Plus mad being so common makes the tgsi look nicer.
>
> Roland