calc: faster sums ...

Tomaž Vajngerl quikee at
Thu Oct 29 13:10:18 PDT 2015


On Thu, Oct 29, 2015 at 3:21 PM, Michael Meeks
<michael.meeks at> wrote:
> Hi Kohei,
>         I'd love some input (if you have a minute) on the attached. The
> punch-line is, that if we want to do really fast arithmetic, we start to
> need to do some odd things; while I suspect that this piece of unrolling
> can be done with the iterator - the next step I'm poking at (SSE3
> assembler ;-) is not going to like that.

You don't need SSE3 assembler for that - just use SSE(3) intrinsics.

SSE uses 128-bit registers, so you can process 2 doubles at the same time.
It's best to keep a twosums accumulator as __m128d and then sum its two
doubles at the end.

__m128d twosums = _mm_set_pd (0.0, 0.0);

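then do a similar unrolled for loop; the two loads below cover 4 values per
iteration (repeat the pattern to cover 8):
__m128d first = _mm_loadu_pd(&p[i]);      // unaligned load of p[i], p[i+1]
__m128d second = _mm_loadu_pd(&p[i + 2]); // unaligned load of p[i+2], p[i+3]

twosums = _mm_add_pd(twosums, first);     // keep the result, it is not in-place
twosums = _mm_add_pd(twosums, second);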

At the end, just sum the two doubles in twosums and handle the leftover
elements (when the count is not a multiple of the unroll width) with a
plain scalar loop.
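
Putting the pieces above together, a minimal sketch could look like the
following. It assumes SSE2 is available (<emmintrin.h>) and makes no
assumption about alignment. The function name sumArraySSE2 and the
parameters p and n are made up for illustration and are not taken from the
attached patch.

```cpp
#include <cstddef>
#include <emmintrin.h> // SSE2 intrinsics: __m128d, _mm_loadu_pd, _mm_add_pd

// Hypothetical helper (not from the patch): sums n doubles starting at p.
double sumArraySSE2(const double* p, std::size_t n)
{
    // Two running partial sums, one per 64-bit lane.
    __m128d twosums = _mm_setzero_pd();

    std::size_t i = 0;
    // Unrolled main loop: 4 doubles per iteration via two unaligned loads.
    for (; i + 4 <= n; i += 4)
    {
        __m128d first = _mm_loadu_pd(p + i);
        __m128d second = _mm_loadu_pd(p + i + 2);
        twosums = _mm_add_pd(twosums, first);
        twosums = _mm_add_pd(twosums, second);
    }

    // Horizontal step: add the two lanes together.
    double lanes[2];
    _mm_storeu_pd(lanes, twosums);
    double sum = lanes[0] + lanes[1];

    // Tail: the 0..3 elements the unrolled loop did not cover.
    for (; i < n; ++i)
        sum += p[i];

    return sum;
}
```

Note that this adds the values in a different order than a plain loop, so
for general floating-point input the result can differ in the last bits.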

It would be even faster if the array is aligned to a 16-byte boundary -
then you can use _mm_load_pd (the aligned load).
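
If the alignment can't be guaranteed up front, one possible approach (an
assumption on my side, not something the patch does) is to add the leading
elements with scalar code until the pointer reaches a 16-byte boundary, and
only then switch to _mm_load_pd. A rough sketch, with the hypothetical name
sumArrayAligned:

```cpp
#include <cstddef>
#include <cstdint>
#include <emmintrin.h> // SSE2 intrinsics: _mm_load_pd requires 16-byte alignment

// Hypothetical helper (not from the patch): sums n doubles starting at p,
// using aligned loads for the bulk of the data.
double sumArrayAligned(const double* p, std::size_t n)
{
    double sum = 0.0;
    std::size_t i = 0;

    // Peel leading elements until p + i sits on a 16-byte boundary.
    // For a normally (8-byte) aligned double array this is at most one element.
    while (i < n && reinterpret_cast<std::uintptr_t>(p + i) % 16 != 0)
        sum += p[i++];

    // Main loop: 4 doubles per iteration, now safe to use aligned loads.
    __m128d twosums = _mm_setzero_pd();
    for (; i + 4 <= n; i += 4)
    {
        twosums = _mm_add_pd(twosums, _mm_load_pd(p + i));
        twosums = _mm_add_pd(twosums, _mm_load_pd(p + i + 2));
    }

    // Add the two lanes to the scalar sum.
    double lanes[2];
    _mm_storeu_pd(lanes, twosums);
    sum += lanes[0] + lanes[1];

    // Tail: whatever the unrolled loop did not cover.
    for (; i < n; ++i)
        sum += p[i];

    return sum;
}
```

Whether the aligned load actually wins over _mm_loadu_pd depends on the
CPU, so it is worth measuring before committing to the extra peeling code.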

>         ATB,
>                 Michael.

Regards, Tomaž
