[Mesa-dev] [PATCH 00/47] WIP: fp64 support for r600g

Wed Aug 23 13:26:52 UTC 2017

On 23 August 2017 at 13:23, Nicolai Hähnle <nhaehnle at gmail.com> wrote:
> On 23.08.2017 13:07, Elie Tournier wrote:
>>
>> From: Elie Tournier <elie.tournier at collabora.com>
>>
>> TL;DR
>> This series is a "status update" of my work done for adding fp64 support
>> on r600g.
>> One of the biggest issues is a lack of accuracy in the rcp
>> implementation.
>> Divide relies on rcp.
>>
>> A branch is available on
>> https://github.com/Hopetech/mesa/tree/glsl_arb_gpu_shader_fp64_v3
>> Comments and reviews are welcome.
>>
>> Patches 1-18:
>> These few patches implement the basic fp64 operations.
>>
>> Patches 19-47:
>> Lower operations using the builtin functions previously implemented.
>>
>> Known issues:
>> - operations on matrices crash the system.
>> - sqrt and d2f are not accurate enough, so the piglit tests are failing.
>>    But sqrt and d2f are working correctly using softpipe.
>>    However, implementing sqrt64 as f2d(sqrt32(d2f())) seems to be good
>> enough for Piglit.
>> - rcp is defined as pow(pow(x, -0.5), 2)
>>    NIR and NV convert the input to fp32, perform an rcp, convert back to
>> fp64 and perform some Newton-Raphson steps.
>>    This is not possible with GLSL IR because using fma will generate a
>> massive builtin_float64.h file.
>
>
> I don't understand this part. You need multiplication and addition anyway.
> So if it's only fma which is the problem (why?), then why not just use
> non-fused multiply-add? It may end up being slightly less accurate, but we
> don't give any strong guarantees about rcp accuracy anyway, do we?
>
Pardon for dropping it like that. I'll try to explain things in a
slightly different way.

Due to the fp64 <> fp32 conversion, the accuracy of RCP is pretty bad.

Thus a couple of Newton-Raphson steps are used, each one implemented via fma.
There's no native fma, thus we use a normal multiply and add.
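
To make the above concrete, here is a rough sketch in Python (not the
actual GLSL IR lowering). It assumes the fp32 path can be emulated by
rounding through struct's "f" format, and the names to_fp32, rcp_seed and
rcp_refine are made up for illustration. Each step computes
y' = y + y * (1 - x * y) with a separate multiply and add instead of fma.

```python
import struct


def to_fp32(x):
    """Round an fp64 value to the nearest fp32 value (emulates d2f + f2d).

    Assumption: x fits in the fp32 range; struct raises OverflowError otherwise.
    """
    return struct.unpack("f", struct.pack("f", x))[0]


def rcp_seed(x):
    """Low-precision reciprocal: convert to fp32, take rcp, convert back."""
    return to_fp32(1.0 / to_fp32(x))


def rcp_refine(x, steps=2):
    """Refine the fp32 seed with Newton-Raphson steps done in fp64.

    Uses an unfused multiply and add, so each step rounds twice.
    """
    y = rcp_seed(x)
    for _ in range(steps):
        e = 1.0 - x * y   # residual of the current estimate
        y = y + y * e     # multiply, then add (no fma)
    return y


if __name__ == "__main__":
    x = 3.0
    print("seed error:   ", abs(rcp_seed(x) * x - 1.0))
    print("refined error:", abs(rcp_refine(x) * x - 1.0))
```

The seed only carries about fp32 precision, and each step roughly squares
the relative error, which is why a couple of steps are enough to get
close to fp64 precision.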

As those get added to the generated file of built-ins
(builtin_float64.h), it grows by ~20k LoC, making compilation/linking
quite slow.
It also noticeably bloats the final binary size (Elie has some crazy
numbers from the very first experiments).

-Emil