[Mesa-dev] [PATCH 00/47] WIP: fp64 support for r600g

Wed Aug 23 13:45:06 UTC 2017

On 23 August 2017 at 14:31, Nicolai Hähnle <nhaehnle at gmail.com> wrote:
> On 23.08.2017 15:26, Emil Velikov wrote:
>>
>> On 23 August 2017 at 13:23, Nicolai Hähnle <nhaehnle at gmail.com> wrote:
>>>
>>> On 23.08.2017 13:07, Elie Tournier wrote:
>>>>
>>>>
>>>> From: Elie Tournier <elie.tournier at collabora.com>
>>>>
>>>> TL;DR
>>>> This series is a "status update" of my work done for adding fp64 support
>>>> on r600g.
>>>> One of the biggest issues is a lack of accuracy in the rcp
>>>> implementation.
>>>> Division relies on rcp.
>>>>
>>>> A branch is available on
>>>> https://github.com/Hopetech/mesa/tree/glsl_arb_gpu_shader_fp64_v3
>>>> Comments and reviews are welcome.
>>>>
>>>> Patches 1-18:
>>>> These few patches implement the basic fp64 operations.
>>>>
>>>> Patches 19-47:
>>>> Lower operations using the builtin functions previously implemented.
>>>>
>>>> Known issues:
>>>> - operations on matrices crash the system.
>>>> - sqrt and d2f are not accurate enough, so the piglit tests are
>>>> failing.
>>>>     But sqrt and d2f are working correctly using softpipe.
>>>>     However, implementing sqrt64 as f2d(sqrt32(d2f())) seems to be good
>>>> enough for Piglit.
>>>> - rcp is defined as pow(pow(x, -0.5), 2)
>>>>     NIR and NV convert the input to fp32, perform an rcp, convert back
>>>> to fp64 and perform some Newton-Raphson steps.
>>>>     This is not possible with GLSL IR because using fma will generate a
>>>> massive builtin_float64.h file.
>>>
>>>
>>>
>>> I don't understand this part. You need multiplication and addition
>>> anyway.
>>> So if it's only fma which is the problem (why?), then why not just use
>>> non-fused multiply-add? It may end up being slightly less accurate, but
>>> we don't give any strong guarantees about rcp accuracy anyway, do we?
>>>
>> Pardon for dropping it like that. I'll try to explain things in a
>> slightly different way.
>>
>> Due to the fp64 <> fp32 conversion the accuracy of RCP is pretty bad.
>>
>> Thus a couple of Newton-Raphson steps are used, each one implemented via
>> fma.
>> There's no native fma, thus we use normal multiply and add.
>>
>> As those get added to the generated file of built-ins
>> (builtin_float64.h), it grows by ~20k LoC, making compilation/linking
>> quite slow.
>> It also noticeably bloats the final binary size (Elie has some crazy
>> numbers from the very first experiments).
>
>
> Oh, I think I get it now. The issue is that the mul+add gets inlined into
> the rcp in builtin_float64.h?
Precisely. Note that pretty much _everything_ gets inlined, which is
why the file is so big at the moment (~20k LoC).
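
For reference, the scheme discussed above (fp32 rcp seed, then a couple of
Newton-Raphson steps done with plain multiply and add instead of fma) can be
sketched in C roughly as below. This is only an illustration, not code from
the series: the name rcp64_sketch is made up, and it assumes the input fits
in the fp32 range (no handling of zero, inf, NaN or denormals).

```c
/* Hypothetical helper, not taken from the patch series.
 * Reciprocal of a double: fp32 seed refined by Newton-Raphson,
 * using non-fused multiply and add (no fma). */
static double rcp64_sketch(double x)
{
    /* Seed from a single-precision reciprocal: about 24 good bits.
     * Assumes x is representable as a normal fp32 value. */
    double y = (double)(1.0f / (float)x);

    /* Newton-Raphson: y' = y + y * (1 - x * y).
     * Each step roughly doubles the number of correct bits,
     * so two steps are enough to reach fp64 precision. */
    for (int i = 0; i < 2; i++) {
        double e = 1.0 - x * y;   /* residual, separate mul and sub */
        y = y + y * e;            /* correction, separate mul and add */
    }
    return y;
}
```

Every multiply and add in there is itself a soft-float builtin in our case,
so each one expands to a large chunk of generated IR once inlined.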

> Can that be avoided?
AFAICT that's not possible atm.

> Although I guess that
> just bloats the final shader, to questionable effects...
>
Haven't looked at the final shader - Elie should have some numbers here.

At some point the binary size of generate_ir.cpp (the one that
includes builtin_float64.h) was ~1/3 of the total driver size.

> Thanks for helping me get it :)
>
Yw. I'm pretty sure Elie will correct me, since I'm no expert in
this stuff.
Just helping him see the light [at the end of the tunnel].

-Emil