calc: faster sums ...
quikee at gmail.com
Thu Oct 29 13:10:18 PDT 2015
On Thu, Oct 29, 2015 at 3:21 PM, Michael Meeks
<michael.meeks at collabora.com> wrote:
> Hi Kohei,
> I'd love some input (if you have a minute) on the attached. The
> punch-line is, that if we want to do really fast arithmetic, we start to
> need to do some odd things; while I suspect that this piece of unrolling
> can be done with the iterator - the next step I'm poking at (SSE3
> assembler ;-) is not going to like that.
You don't need SSE3 assembler for that - just use SSE(3) intrinsics.
SSE uses 128-bit registers, so you can process 2 doubles at the same time.
The best approach is to keep a twosums accumulator as a __m128d and then sum its two doubles at the end.
__m128d twosums = _mm_set_pd (0.0, 0.0);
then do a similar unrolled for loop to sum 8 values at a time, loading 2 doubles per register:
__m128d first = _mm_loadu_pd(&p[i]);
__m128d second = _mm_loadu_pd(&p[i + 2]);
At the end, just sum the two doubles in twosums and handle the rest of the array separately.
It would be even faster if the array is aligned to a 16-byte boundary -
then you can use _mm_load_pd.
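
Putting the pieces above together, a minimal sketch could look like the following. It assumes an x86 target with SSE2 available (emmintrin.h) and uses unaligned loads so it works for any pointer. The function name sum_sse2 and the tail handling are illustrative assumptions, not code from the patch being discussed.

```c
#include <emmintrin.h>  /* SSE2 intrinsics: __m128d, _mm_loadu_pd, _mm_add_pd */
#include <stddef.h>     /* size_t */

/* Hypothetical helper (not LibreOffice code): sums n doubles starting at p. */
static double sum_sse2(const double *p, size_t n)
{
    /* Two running partial sums, one per 64-bit lane. */
    __m128d twosums = _mm_setzero_pd();
    size_t i = 0;

    /* Unrolled loop: 8 doubles per iteration, as four 2-double loads.
       _mm_loadu_pd has no alignment requirement. */
    for (; i + 8 <= n; i += 8) {
        __m128d first  = _mm_loadu_pd(p + i);
        __m128d second = _mm_loadu_pd(p + i + 2);
        __m128d third  = _mm_loadu_pd(p + i + 4);
        __m128d fourth = _mm_loadu_pd(p + i + 6);
        twosums = _mm_add_pd(twosums,
                             _mm_add_pd(_mm_add_pd(first, second),
                                        _mm_add_pd(third, fourth)));
    }

    /* Sum the two lanes of the accumulator. */
    double lanes[2];
    _mm_storeu_pd(lanes, twosums);
    double total = lanes[0] + lanes[1];

    /* Scalar loop for the remaining (fewer than 8) values. */
    for (; i < n; ++i)
        total += p[i];

    return total;
}
```

If the data is known to be 16-byte aligned, the _mm_loadu_pd calls can be swapped for _mm_load_pd. Note that the additions happen in a different order than in a plain scalar loop, so the floating-point result can differ slightly in the last bits.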