[Mesa-dev] [PATCH 00/47] WIP: fp64 support for r600g

Wed Aug 23 13:31:20 UTC 2017

On 23.08.2017 15:26, Emil Velikov wrote:
> On 23 August 2017 at 13:23, Nicolai Hähnle <nhaehnle at gmail.com> wrote:
>> On 23.08.2017 13:07, Elie Tournier wrote:
>>>
>>> From: Elie Tournier <elie.tournier at collabora.com>
>>>
>>> TL;DR
>>> This series is a "status update" of my work done for adding fp64 support
>>> on r600g.
>>> One of the biggest issues is a lack of accuracy in the rcp
>>> implementation.
>>> Divide relies on rcp.
>>>
>>> A branch is available on
>>> https://github.com/Hopetech/mesa/tree/glsl_arb_gpu_shader_fp64_v3
>>> Comments and reviews are welcome.
>>>
>>> Patches 1-18:
>>> These few patches implement the basic fp64 operations.
>>>
>>> Patches 19-47:
>>> Lower operations using the builtin functions previously implemented.
>>>
>>> Known issues:
>>> - operations on matrices crash the system.
>>> - sqrt and d2f are not accurate enough, so the piglit tests are failing.
>>>     But sqrt and d2f are working correctly using softpipe.
>>>     However, implementing sqrt64 as f2d(sqrt32(d2f())) seems to be good
>>> enough for Piglit.
>>> - rcp is defined as pow(pow(x, -0.5), 2)
>>>     NIR and NV convert the input to fp32, perform an rcp, convert back to
>>> fp64 and perform some Newton-Raphson steps.
>>>     This is not possible with GLSL IR because using fma will generate a
>>> massive builtin_float64.h file.
>>
>>
>> I don't understand this part. You need multiplication and addition anyway.
>> So if it's only fma which is the problem (why?), then why not just use
>> non-fused multiply-add? It may end up being slightly less accurate, but we
>> don't give any strong guarantees about rcp accuracy anyway, do we?
>>
> Pardon for dropping it like that. I'll try to explain things in a
> slightly different way.
> 
> Due to the fp64 <> fp32 conversion the accuracy of RCP is pretty bad.
> 
> Thus a couple of Newton-Raphson steps are used. Each one implemented via fma.
> There's no native fma, thus we use normal multiply and add.
> 
> As those get added to the generated file of built-ins
> (builtin_float64.h), it grows by ~20k LoC making compilation/linking
> quite slow.
> Noticeably bloating the final binary size as well (Elie has some crazy
> numbers from the very first experiments).

Oh, I think I get it now. The issue is that the mul+add gets inlined 
into the rcp in builtin_float64.h? Can that be avoided? Although I guess 
that just bloats the final shader, to questionable effects...
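
For reference, a minimal sketch in plain C of the scheme described above: an
fp32 rcp as the seed, then two Newton-Raphson steps written with separate
multiply and add instead of fma. The name rcp64_sketch is hypothetical and this
is not the code from builtin_float64.h. It assumes the input is finite, nonzero
and within fp32 range.

```c
/* Hypothetical helper for illustration; not taken from builtin_float64.h. */
static double rcp64_sketch(double d)
{
    /* Seed: reciprocal computed in fp32, then widened back to fp64.
     * Assumes d is finite, nonzero and representable in fp32 range. */
    double x = (double)(1.0f / (float)d);

    /* Two Newton-Raphson steps, x = x + x * (1 - d * x),
     * written as separate multiply and add (no fma call). */
    for (int i = 0; i < 2; i++) {
        double e = 1.0 - d * x; /* residual of the current estimate */
        x = x + x * e;          /* correction step */
    }
    return x;
}
```

Each step roughly doubles the number of correct bits, so two steps on top of
an fp32 seed get close to fp64 precision, though without fma the last bits
are not guaranteed.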

Thanks for helping me get it :)

Cheers,
Nicolai

> 
> -Emil
> 

-- 
Learn how the world really is,
but never forget how it should be.