[Mesa-dev] [PATCH v2 9/9] nv50/ir/tgsi: split mad to mul+add

Roland Scheidegger sroland at vmware.com
Tue Jun 13 14:12:59 UTC 2017


On 13.06.2017 at 15:11, Karol Herbst wrote:
> On Tue, Jun 13, 2017 at 2:18 PM, Roland Scheidegger <sroland at vmware.com> wrote:
>> On 13.06.2017 at 08:57, Karol Herbst wrote:
>>> On Tue, Jun 13, 2017 at 2:17 AM, Roland Scheidegger <sroland at vmware.com> wrote:
>>>> I am actually also thinking this should be different.
>>>>
>>>> e.g. imho MAD means the operation can be either fused or unfused.
>>>> This is the "traditional" definition of MAD - opencl for instance will
>>>> follow this too, albeit this isn't mentioned in the gallium docs (it
>>>> probably should be).
>>>> (OpenCL says: "Whether or how the product of a * b is rounded and how
>>>> supernormal or subnormal intermediate products are handled is not
>>>> defined. mad is intended to be used where speed is preferred over
>>>> accuracy.")
>>>> I think doing something different here in gallium can only lead to
>>>> madness long term - glsl doesn't have mad in the first place, and as far
>>>> as I can tell d3d10 is ok with fused/unfused mad too (the docs stating
>>>> "Fused operations (such as mad, dp3) produce results that are no less
>>>> accurate than the worst possible serial ordering of evaluation of the
>>>> unfused expansion of the operation.")
>>>>
>>>> This means that mul+add cannot be fused anywhere to a mad if precise is
>>>> specified, and therefore you should never have to worry about doing a
>>>> fused or unfused mul/add in the driver with a mad - it's enough if you
>>>> just don't fuse mul+add in the driver itself (if you can't do unfused mad).
>>>>
>>>> Roland
>>>>
>>>
>>> well there is a TGSI peephole doing this mul+add=>mad optimisation,
>>> because it isn't wrong, because mad != fma and mul+add==mad, but on
>>> Fermi+ Nvidia hardware there is no mad, only fma and because mad != fma,
>>> we need to split it up again.
>>>
>>> So either TGSI doesn't merge it if the instruction is flagged as precise (which
>>> it is in those tests mentioned), although merging is correct, or we lower
>>> something in the driver, because the instruction isn't supported by the
>>> hardware all along.
>>
>> Yes, I think the TGSI peephole shouldn't merge mul+add to mad with
>> precise. You say this isn't wrong, but imho it clearly is, because no one
>> ever said MAD can't be a fused add - it is multiply + add, yes, but if
>> there's intermediate rounding or not isn't specified. FWIW gallivm code
>> also assumes this, and will use llvm.fmuladd for implementation (which
>> is exactly the same "mul+add" story as opencl mad, and will use fma on
>> cpus supporting it and separate mul+add otherwise, save some bugs in
>> older llvm versions apparently).
>> So we should just clarify that in the tgsi docs - mad is multiply + add,
>> with undefined intermediate rounding, it can be a fused mul+add or an
>> unfused one (technically it could also be something in-between I suppose
>> since the apis just specify the accuracy isn't worse than an unfused
>> multiply + add). Every driver gets to use what it can do fastest for it,
>> and because there's no specified intermediate rounding for it, precise
>> doesn't change anything there.
>>
>> That's at least my opinion what TGSI_OPCODE_MAD should be (of course,
>> older gpus always used unfused mad, but this wasn't a requirement).
>>
>> Roland
>>
> 
> I think the best idea would be to specify that:
> TGSI_OPCODE_MAD is unfused mul+add
> TGSI_OPCODE_FMA is fused mul+add
> 
> Having TGSI_OPCODE_MAD being unfused and fused adds an ambiguity
> without providing any advantages imho.
> 
> This way it's clear what both are. The backend can still decide that it can use
> FMA to implement TGSI_OPCODE_MAD or that it can't use MAD and splits it
> up, but then the backend decides and the choice is explicit and respects
> limitations of the hardware, which Gallium/TGSI doesn't know about.

I just don't agree with that. There are lots of apis which have such an
ambiguous mad, with precisely the intention of it being as fast as
possible, with undefined intermediate rounding. I think there's a reason
that d3d10 mad, opencl mad, llvm fmuladd all are exactly like that. Why
should tgsi mad be different?
It exists because you otherwise cannot say you don't want to allow
unsafe math generally, but are ok if a mad is either fused or not. If
you require a fused one, use fma. If you require an unfused
multiply+add, just use mul and add. If you don't care, use mad.
Granted, arguably with per-instruction precise modifier, mul + add
without the modifier works as well.
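
To make the fused/unfused difference concrete, here is a minimal sketch in
plain C (not Mesa or TGSI code). The function names are made up for
illustration, and the "fused" variant is only an emulation that keeps the
float product unrounded by computing it in double; it can differ from a real
hardware fma in rare double-rounding cases.

```c
/* Unfused multiply-add: the product is rounded to float before the add.
 * The volatile store forces that intermediate rounding and keeps the
 * compiler from contracting the two operations into a hardware FMA. */
static float mad_unfused(float a, float b, float c)
{
    volatile float product = a * b;
    return product + c;
}

/* Fused-style multiply-add (emulation, hypothetical helper): the product
 * of two floats needs at most 48 significand bits, so it is exact in
 * double. Only the sum is rounded, first to double and then to float. */
static float mad_fused(float a, float b, float c)
{
    return (float)((double)a * (double)b + (double)c);
}
```

With a = b = 1 + 2^-12 and c = -(1 + 2^-11), the exact product is
1 + 2^-11 + 2^-24. Rounding it to float drops the 2^-24 term, so mad_unfused
returns 0, while mad_fused returns 2^-24. A mad with undefined intermediate
rounding is allowed to produce either result.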

> 
> Or we remove TGSI_OPCODE_MAD and let the backends do the opts.
This would be a possibility, but backends might not be prepared for it
(e.g. I don't think gallivm would let llvm emit fused fmas for a mul + add
sequence). Plus mad being so common makes the tgsi look nicer.

Roland

