[Liboil] scalar{min,max}imum_f32
Stephane Fillod
f8cfe at free.fr
Tue Dec 13 12:56:08 PST 2005
On Mon, Dec 12, 2005 at 11:57:03PM -0800, Eric Anholt wrote:
> In libpcg for one function I wanted to clamp values to a minimum of 0.0.
> So I wrote a scalarminimum/maximum, including SSE intrinsics code.
> Patch attached for review to make sure I'm going about adding a new
> function right, am stylistically sane, and that it's something we want.
>
> x t-scalarmax-before
> + t-scalarmax-after
> +--------------------------------------------------------------------------+
> | x + + |
> | x x + + |
> |x + x x x x * x + + + + |
> | |_____________AM__________||______________A_____M_________||
> +--------------------------------------------------------------------------+
>     N           Min           Max        Median           Avg        Stddev
> x  10          57.2          59.2          58.4         58.33    0.53343749
> +  10          57.9          59.9          59.7         59.45    0.62583278
> Difference at 95.0% confidence
> 1.12 +/- 0.54635
> 1.92011% +/- 0.936653%
> (Student's t, pooled s = 0.581473)
Where did you get that nice ASCII graph? Is it from an oil-test-like program?
[...]
> Index: liboil/sse/math_sse.c
> ===================================================================
> RCS file: /cvs/liboil/liboil/liboil/sse/math_sse.c,v
> retrieving revision 1.1
> diff -u -r1.1 math_sse.c
> --- liboil/sse/math_sse.c 13 Dec 2005 06:26:44 -0000 1.1
> +++ liboil/sse/math_sse.c 13 Dec 2005 07:17:46 -0000
> @@ -303,3 +303,68 @@
> }
> OIL_DEFINE_IMPL_FULL (scalarmultiply_f32_ns_sse, scalarmultiply_f32_ns, OIL_IMPL_FLAG_SSE);
>
> +static void
> +scalarminimum_f32_sse (float *dest, float *src1, float *val, int n)
Declaring the inputs as "const float *src1, const float *val" may help some compilers.
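To illustrate the const suggestion, here is a minimal sketch in plain C, without SSE. The name scalarminimum_f32_const is made up for this example and is not part of liboil; the only point is that src1 and val are read-only and can be declared as such.

```c
#include <assert.h>

/* Hypothetical scalar version: dest[i] = min(src1[i], *val).
 * src1 and val are only read, so both are declared const. */
static void
scalarminimum_f32_const (float *dest, const float *src1, const float *val,
    int n)
{
  int i;

  assert (n >= 0);
  for (i = 0; i < n; i++) {
    dest[i] = src1[i] < *val ? src1[i] : *val;
  }
}
```

Callers that pass plain "float *" arguments still compile unchanged, since adding const to a pointee is an implicit conversion in C.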
> +{
> + __m128 xmm1;
> + float valtmp[4];
> + int i;
> +
> + valtmp[0] = *val;
> + valtmp[1] = *val;
> + valtmp[2] = *val;
> + valtmp[3] = *val;
> + xmm1 = _mm_loadu_ps(valtmp);
What about "xmm1 = _mm_load_ps1(val);" instead of the valtmp thing?
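A small sketch comparing the two ways of filling the register, to show they give the same four lanes. The helper name broadcast_matches is hypothetical and exists only for this comparison; _mm_load_ps1 and _mm_loadu_ps are the standard intrinsics from xmmintrin.h.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <xmmintrin.h>

/* Hypothetical helper: returns 1 if both ways of replicating *val
 * into an SSE register produce bit-identical lanes, 0 otherwise. */
static int
broadcast_matches (const float *val)
{
  float valtmp[4];
  float a[4], b[4];
  __m128 x, y;

  assert (val != NULL);

  /* The approach from the patch: fill a temporary array, then load it. */
  valtmp[0] = *val;
  valtmp[1] = *val;
  valtmp[2] = *val;
  valtmp[3] = *val;
  x = _mm_loadu_ps (valtmp);

  /* The suggested alternative: load one float and replicate it. */
  y = _mm_load_ps1 (val);

  /* Store both registers and compare the raw bytes. */
  _mm_storeu_ps (a, x);
  _mm_storeu_ps (b, y);
  return memcmp (a, b, sizeof (a)) == 0;
}
```

The single-intrinsic form drops the temporary array and leaves the choice of instructions to the compiler.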
> + /* Initial operations to align the destination pointer */
> + for (i = 0; i < ((long)dest & 15) >> 2; i++) {
> + *dest++ = *src1 < *val ? *src1 : *val;
> + src1++;
> + }
> + for (; i < n - 3; i += 4) {
> + __m128 xmm0;
> + xmm0 = _mm_loadu_ps(src1);
> + xmm0 = _mm_min_ps(xmm0, xmm1);
> + _mm_store_ps(dest, xmm0);
> + dest += 4;
> + src1 += 4;
> + }
What about unrolling the main loop one or two times?
You could also check whether src1 is aligned too and, with a dual path
(outer if), use _mm_load_ps for src1 in that case. src1 and dest may well
have the same alignment. No idea if it's worth it though.
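Roughly what I have in mind, as an untested-for-speed sketch rather than a finished implementation. The function name is made up, the unroll factor of two is an arbitrary choice, and the prologue here is my own formulation: it walks until dest + i is 16-byte aligned and also stops at n.

```c
#include <assert.h>
#include <stdint.h>
#include <xmmintrin.h>

/* Hypothetical variant of the patch's minimum function; not liboil code. */
static void
scalarminimum_f32_sse_unrolled (float *dest, const float *src1,
    const float *val, int n)
{
  __m128 xmm1 = _mm_load_ps1 (val);
  int i = 0;

  assert (n >= 0);

  /* Scalar prologue: advance until dest + i is 16-byte aligned. */
  while (i < n && ((uintptr_t) (dest + i) & 15) != 0) {
    dest[i] = src1[i] < *val ? src1[i] : *val;
    i++;
  }

  if (((uintptr_t) (src1 + i) & 15) == 0) {
    /* src1 has the same alignment as dest: aligned loads, unrolled x2. */
    for (; i < n - 7; i += 8) {
      __m128 xmm0 = _mm_load_ps (src1 + i);
      __m128 xmm2 = _mm_load_ps (src1 + i + 4);
      _mm_store_ps (dest + i, _mm_min_ps (xmm0, xmm1));
      _mm_store_ps (dest + i + 4, _mm_min_ps (xmm2, xmm1));
    }
  } else {
    /* src1 is not aligned: unaligned loads, still unrolled x2. */
    for (; i < n - 7; i += 8) {
      __m128 xmm0 = _mm_loadu_ps (src1 + i);
      __m128 xmm2 = _mm_loadu_ps (src1 + i + 4);
      _mm_store_ps (dest + i, _mm_min_ps (xmm0, xmm1));
      _mm_store_ps (dest + i + 4, _mm_min_ps (xmm2, xmm1));
    }
  }

  /* Scalar epilogue for the remaining 0 to 7 elements. */
  for (; i < n; i++) {
    dest[i] = src1[i] < *val ? src1[i] : *val;
  }
}
```

Whether the extra branch and the larger loop body pay off would have to be measured, for example with the same before/after comparison as in the quoted benchmark.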
> + for(;i<n;i++){
> + *dest++ = *src1 < *val ? *src1 : *val;
> + src1++;
> + }
> +}
> +OIL_DEFINE_IMPL_FULL (scalarminimum_f32_sse, scalarminimum_f32, OIL_IMPL_FLAG_SSE);
> +
> +static void
> +scalarmaximum_f32_sse (float *dest, float *src1, float *val, int n)
> +{
Same comments as for scalarminimum_f32_sse.
--
Stephane