[Mesa-dev] [PATCH RFC] mesa: add SSE optimisation for glDrawElements
Matt Turner
mattst88 at gmail.com
Wed Oct 22 22:30:19 PDT 2014
On Wed, Oct 22, 2014 at 9:02 PM, Timothy Arceri <t_arceri at yahoo.com.au> wrote:
> Makes use of SSE to speed up compute of min and max elements
>
> Callgrind cpu usage results from pts benchmarks:
>
> Openarena 0.8.8: 3.67% -> 1.03%
> UrbanTerror: 2.36% -> 0.81%
>
> Signed-off-by: Timothy Arceri <t_arceri at yahoo.com.au>
> ---
> src/mesa/Makefile.am | 3 +-
> src/mesa/main/sse_minmax.c | 75 +++++++++++++++++++++++++++++++++++++++++++
> src/mesa/main/sse_minmax.h | 29 +++++++++++++++++
> src/mesa/vbo/vbo_exec_array.c | 8 +++++
> 4 files changed, 114 insertions(+), 1 deletion(-)
> create mode 100644 src/mesa/main/sse_minmax.c
> create mode 100644 src/mesa/main/sse_minmax.h
>
> I almost wasn't going to bother sending this out since it uses SSE4.1
> and its recommended to use glDrawRangeElements anyway. But since these games
> are still ofter used for benchmarking I thought I'd see if anyone is
> interested in this. I only optimised GL_UNSIGNED_INT as that was the
> only place these games were hitting but I guess it wouldn't hurt
> to optimse the other cases too.
I think it's kind of neat!
It might also be fun to try to do this with OpenMP. OpenMP 3.1
(supported since gcc-4.7) supports min/max reduction operators.
I suppose doing that may allow this code to work on more than just SSE
4.1 (e.g., could also use AVX2).
> I think it would probably make sense too to just combine
> streaming-load-memcpy.c and sse_minmax.c into a single file
> something like sse_opt.c for example.
I don't see any reason to do that.
> As far a frame rates go I couldn't see any concrete improments
> on my Ivybridge machine. Maybe it would help more on a high end radeon
> card where cpu apparently is more of a concern??
>
> Finally its seems to run fine but its only the second time I've tried
> these type of optimisations so let me know if there are any obvious
> mistakes or improvements.
Overall the patch looks pretty good. Some comments below.
> diff --git a/src/mesa/Makefile.am b/src/mesa/Makefile.am
> index e71bccb..932db4f 100644
> --- a/src/mesa/Makefile.am
> +++ b/src/mesa/Makefile.am
> @@ -151,7 +151,8 @@ libmesagallium_la_LIBADD = \
> $(ARCH_LIBS)
>
> libmesa_sse41_la_SOURCES = \
> - main/streaming-load-memcpy.c
> + main/streaming-load-memcpy.c \
> + main/sse_minmax.c
> libmesa_sse41_la_CFLAGS = $(AM_CFLAGS) -msse4.1
>
> pkgconfigdir = $(libdir)/pkgconfig
> diff --git a/src/mesa/main/sse_minmax.c b/src/mesa/main/sse_minmax.c
> new file mode 100644
> index 0000000..1625407
> --- /dev/null
> +++ b/src/mesa/main/sse_minmax.c
> @@ -0,0 +1,75 @@
> +/*
> + * Copyright © 2014 Timothy Arceri
> + *
> + * Permission is hereby granted, free of charge, to any person obtaining a
> + * copy of this software and associated documentation files (the "Software"),
> + * to deal in the Software without restriction, including without limitation
> + * the rights to use, copy, modify, merge, publish, distribute, sublicense,
> + * and/or sell copies of the Software, and to permit persons to whom the
> + * Software is furnished to do so, subject to the following conditions:
> + *
> + * The above copyright notice and this permission notice (including the next
> + * paragraph) shall be included in all copies or substantial portions of the
> + * Software.
> + *
> + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
> + * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
> + * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
> + * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
> + * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
> + * FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS
> + * IN THE SOFTWARE.
> + *
> + * Author:
> + * Timothy Arceri <t_arceri at yahoo.com.au>
> + *
> + */
> +
> +#ifdef __SSE4_1__
> +#include "main/glheader.h"
> +#include "main/sse_minmax.h"
> +#include <smmintrin.h>
> +
> +void
> +sse_minmax(const GLuint *ui_indices, GLuint *min_index, GLuint *max_index, const GLuint count)
> +{
> + GLuint i = 0;
> + GLuint max_ui = 0;
> + GLuint min_ui = ~0U;
> + GLuint max_arr[4] = {0};
> + GLuint min_arr[4] = {0};
> + GLuint vec_count;
I probably wouldn't use GL* types here.
> + __m128i max_ui4 = _mm_setzero_si128();
> + __m128i min_ui4 = _mm_set1_epi32(~0U);
> + __m128i ui_indices4;
> + __m128i *ui_indices_ptr;
> +
> + if (count >= 4) {
> + vec_count = count - (count % 4);
vec_count = count & ~0x3;
> + ui_indices_ptr = (__m128i*)ui_indices;
> + for (i = 0; i < vec_count/4; i++) {
Spaces around operators.
> + ui_indices4 = _mm_loadu_si128(&ui_indices_ptr[i]);
> + max_ui4 = _mm_max_epu32(ui_indices4, max_ui4);
> + min_ui4 = _mm_min_epu32(ui_indices4, min_ui4);
> + }
> +
> + _mm_store_ps((float*)max_arr, _mm_castsi128_ps(max_ui4));
> + _mm_store_ps((float*)min_arr, _mm_castsi128_ps(min_ui4));
I think you just want to use _mm_store_si128() here.
Do we have some guarantee that min_arr/max_arr are actually 16-byte
aligned, or do we need to use a compiler attribute to ensure that?
> +
> + for (i = 0; i < 4; i++) {
> + if (max_arr[i] > max_ui) max_ui = max_arr[i];
> + if (min_arr[i] < min_ui) min_ui = min_arr[i];
> + }
> + i = vec_count;
> + }
> +
> + for (; i < count; i++) {
> + if (ui_indices[i] > max_ui) max_ui = ui_indices[i];
> + if (ui_indices[i] < min_ui) min_ui = ui_indices[i];
> + }
> +
> + *min_index = min_ui;
> + *max_index = max_ui;
> +}
> +
> +#endif
> diff --git a/src/mesa/main/sse_minmax.h b/src/mesa/main/sse_minmax.h
> new file mode 100644
> index 0000000..d14db26
> --- /dev/null
> +++ b/src/mesa/main/sse_minmax.h
> @@ -0,0 +1,29 @@
> +/*
> + * Copyright © 2014 Timothy Arceri
> + *
> + * Permission is hereby granted, free of charge, to any person obtaining a
> + * copy of this software and associated documentation files (the "Software"),
> + * to deal in the Software without restriction, including without limitation
> + * the rights to use, copy, modify, merge, publish, distribute, sublicense,
> + * and/or sell copies of the Software, and to permit persons to whom the
> + * Software is furnished to do so, subject to the following conditions:
> + *
> + * The above copyright notice and this permission notice (including the next
> + * paragraph) shall be included in all copies or substantial portions of the
> + * Software.
> + *
> + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
> + * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
> + * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
> + * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
> + * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
> + * FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS
> + * IN THE SOFTWARE.
> + *
> + * Authors:
> + * Timothy Arceri <t_arceri at yahoo.com.au>
> + *
> + */
> +
> +void
> +sse_minmax(const GLuint *ui_indices, GLuint *min_index, GLuint *max_index, const GLuint count);
> diff --git a/src/mesa/vbo/vbo_exec_array.c b/src/mesa/vbo/vbo_exec_array.c
> index 045dbb5..0399348 100644
> --- a/src/mesa/vbo/vbo_exec_array.c
> +++ b/src/mesa/vbo/vbo_exec_array.c
> @@ -36,6 +36,9 @@
> #include "main/enums.h"
> #include "main/macros.h"
> #include "main/transformfeedback.h"
> +#ifdef __SSE4_1__
> +#include "main/sse_minmax.h"
> +#endif
>
> #include "vbo_context.h"
>
> @@ -110,6 +113,7 @@ vbo_get_minmax_index(struct gl_context *ctx,
> const GLuint *ui_indices = (const GLuint *)indices;
> GLuint max_ui = 0;
> GLuint min_ui = ~0U;
> +
Stray whitespace change.
> if (restart) {
> for (i = 0; i < count; i++) {
> if (ui_indices[i] != restartIndex) {
> @@ -119,10 +123,14 @@ vbo_get_minmax_index(struct gl_context *ctx,
> }
> }
> else {
> +#ifdef __SSE4_1__
> + sse_minmax(ui_indices, &min_ui, &max_ui, count);
> +#else
We should do a runtime check here, rather than relying on this file
being built with appropriate CFLAGS.
It's just the cpu_has_sse4_1 macro from x86/common_x86_asm.h. With
that you should remove the __SSE4_1__ checks around the header
inclusion as well.
More information about the mesa-dev
mailing list