[Mesa-dev] [PATCH V4] mesa: add SSE optimisation for glDrawElements

Sat Nov 1 16:15:02 PDT 2014

On Wed, 2014-10-29 at 23:09 +1100, Timothy Arceri wrote:
> On Wed, 2014-10-29 at 16:58 +1100, Timothy Arceri wrote:
> > On Tue, 2014-10-28 at 22:14 +0000, Bruno Jimenez wrote:
> > > Hi,
> > > 
> > > I haven't had time to play with OpenMP yet, but I have looked at the
> > > assembly it produces on my computer. If I enable SSE2 it uses it, and if
> > > I enable SSE4.1 it uses the parallel max. But it uses unaligned loads,
> > > and since we are trying to avoid them I don't know if we want to use just
> > > OpenMP.
> > > 
> > > Processing the first unaligned elements by hand and using
> > > __builtin_assume_aligned, as in the article you linked, allows OpenMP to
> > > use aligned loads.
> > > 
> > > Also, it must be noted that I am using GCC, I don't know what Clang may
> > > produce (for the plain algorithms, as it doesn't support OpenMP if I
> > > recall correctly).
> > > 
> > > When I have time I'll try to collect all the variants and benchmark
> > > them against each other to see if any of them is worth using.
> > > 
> > > Also, do we know if 'count' has an upper bound, or if we can force the
> > > array to be aligned so we don't have to worry about the first items?
> > 
> > I'm not sure about count but the indices array comes from the
> > glDrawRangeElements() call so I don't think we can do anything about the
> > alignment.
> 
> Sorry I meant glDrawElements()

OK, thanks for the information.
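
For reference, the "process the first unaligned elements by hand, then use
__builtin_assume_aligned" idea from the quoted message could look roughly
like the sketch below. This is not the code from the patch or from the
attachment; the function name min_max_peeled and its signature are made up
for illustration, and __builtin_assume_aligned is a GCC/Clang builtin.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not from the patch): min/max of an index array.
 * Leading elements are handled one by one until the address is 16-byte
 * aligned, so the compiler is free to use aligned vector loads after that. */
static void
min_max_peeled(const unsigned *indices, size_t count,
               unsigned *min_out, unsigned *max_out)
{
   unsigned min_ui = ~0U;
   unsigned max_ui = 0;
   size_t i = 0;

   assert(count > 0);

   /* Scalar prologue: runs until indices + i sits on a 16-byte boundary. */
   while (i < count && ((uintptr_t)(indices + i) & 15) != 0) {
      if (indices[i] > max_ui) max_ui = indices[i];
      if (indices[i] < min_ui) min_ui = indices[i];
      i++;
   }

   /* Promise the compiler that the remaining data is 16-byte aligned. */
   const unsigned *aligned = __builtin_assume_aligned(indices + i, 16);
   size_t rest = count - i;

   /* Main loop: a candidate for vectorisation with aligned loads. */
   for (size_t j = 0; j < rest; j++) {
      if (aligned[j] > max_ui) max_ui = aligned[j];
      if (aligned[j] < min_ui) min_ui = aligned[j];
   }

   *min_out = min_ui;
   *max_out = max_ui;
}
```

Whether the main loop actually gets vectorised depends on the compiler and
flags, so the generated assembly still needs to be checked.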

At last I have had time to put everything in place and run some
benchmarks of the naive algorithm, the code OpenMP generates, and the SSE2
and SSE4.1 code, in both aligned and unaligned scenarios.

As a summary for my CPU:

Limited to SSE2:
  - If the array is aligned:
    + For low element counts (~20) the naive algorithm is faster
    + For higher counts the SSE2 code is faster
  - If the array is unaligned:
    + For low element counts (~24) the naive algorithm is faster
    + For higher counts the SSE2 code is faster

Using -march=native (SSE4.1 support):
  - If the array is aligned:
    + For very low element counts (~8) the naive algorithm is faster
    + For low counts (12-32) the SSE4.1 algorithm is faster
    + For higher counts the OpenMP generated code is faster
  - If the array is unaligned:
    + For very low element counts (~12) the naive algorithm is faster
    + For low counts (16-32) the SSE4.1 algorithm is faster
    + For higher counts the OpenMP generated code is faster
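
To make the SSE2 / SSE4.1 distinction above concrete, here is a minimal
sketch of an unaligned-load min/max kernel. It is not the code from the
attachment; the names min_max_simd, max_epu32_sse2 and min_epu32_sse2 are
hypothetical. SSE4.1 has _mm_max_epu32/_mm_min_epu32, while plain SSE2 has
no unsigned 32-bit min/max, so the sketch emulates them with a signed
compare after flipping the sign bit. Without SSE2 it falls back to the
scalar loop.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#if defined(__SSE4_1__)
#include <smmintrin.h>
#elif defined(__SSE2__)
#include <emmintrin.h>
#endif

#if defined(__SSE2__) && !defined(__SSE4_1__)
/* Flipping the sign bit maps unsigned order onto signed order, so the
 * signed compare _mm_cmpgt_epi32 can stand in for an unsigned one. */
static inline __m128i
gt_epu32_sse2(__m128i a, __m128i b)
{
   const __m128i sign = _mm_set1_epi32(INT_MIN);
   return _mm_cmpgt_epi32(_mm_xor_si128(a, sign), _mm_xor_si128(b, sign));
}

static inline __m128i
max_epu32_sse2(__m128i a, __m128i b)
{
   __m128i m = gt_epu32_sse2(a, b);   /* lanes where a > b are all ones */
   return _mm_or_si128(_mm_and_si128(m, a), _mm_andnot_si128(m, b));
}

static inline __m128i
min_epu32_sse2(__m128i a, __m128i b)
{
   __m128i m = gt_epu32_sse2(a, b);
   return _mm_or_si128(_mm_and_si128(m, b), _mm_andnot_si128(m, a));
}
#endif

/* Hypothetical kernel: min/max of count unsigned indices (count > 0). */
static void
min_max_simd(const unsigned *indices, size_t count,
             unsigned *min_out, unsigned *max_out)
{
   unsigned min_ui = ~0U;
   unsigned max_ui = 0;
   size_t i = 0;

   assert(count > 0);

#if defined(__SSE2__)
   if (count >= 4) {
      unsigned max_arr[4], min_arr[4];
      __m128i max_ui4 = _mm_setzero_si128();
      __m128i min_ui4 = _mm_set1_epi32(-1);   /* all bits set per lane */

      for (; i + 4 <= count; i += 4) {
         /* Unaligned load: works for any pointer the caller hands us. */
         __m128i v = _mm_loadu_si128((const __m128i *)(indices + i));
#if defined(__SSE4_1__)
         max_ui4 = _mm_max_epu32(max_ui4, v);
         min_ui4 = _mm_min_epu32(min_ui4, v);
#else
         max_ui4 = max_epu32_sse2(max_ui4, v);
         min_ui4 = min_epu32_sse2(min_ui4, v);
#endif
      }

      /* Horizontal reduction of the four lanes. */
      _mm_storeu_si128((__m128i *)max_arr, max_ui4);
      _mm_storeu_si128((__m128i *)min_arr, min_ui4);
      for (int k = 0; k < 4; k++) {
         if (max_arr[k] > max_ui) max_ui = max_arr[k];
         if (min_arr[k] < min_ui) min_ui = min_arr[k];
      }
   }
#endif

   /* Scalar tail (and the whole array when SSE2 is unavailable). */
   for (; i < count; i++) {
      if (indices[i] > max_ui) max_ui = indices[i];
      if (indices[i] < min_ui) min_ui = indices[i];
   }

   *min_out = min_ui;
   *max_out = max_ui;
}
```

The extra xor/compare/blend work in the SSE2 path is one plausible reason
for the SSE2 crossover point being higher than the SSE4.1 one, but that is
a guess and not something the benchmark itself shows.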

For the OpenMP generated code there is not much difference between the
naive and the aligned variants, although the latter is a little bit faster.
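
As an illustration of what "OpenMP generated code" may refer to, the sketch
below annotates the naive loop with an OpenMP 4.0 simd reduction. This
assumes the OpenMP variant is the plain loop plus "#pragma omp simd"; the
function name min_max_omp_simd is hypothetical and the attachment may do
it differently. With GCC it needs -fopenmp or -fopenmp-simd; without
either flag the pragma is ignored and the loop runs as ordinary scalar
code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical variant: naive min/max loop with an OpenMP simd hint.
 * The reduction clauses tell the compiler that max_ui and min_ui can be
 * accumulated per SIMD lane and combined at the end of the loop. */
static void
min_max_omp_simd(const unsigned *indices, size_t count,
                 unsigned *min_out, unsigned *max_out)
{
   unsigned min_ui = ~0U;
   unsigned max_ui = 0;

   assert(count > 0);

   #pragma omp simd reduction(max:max_ui) reduction(min:min_ui)
   for (size_t i = 0; i < count; i++) {
      if (indices[i] > max_ui) max_ui = indices[i];
      if (indices[i] < min_ui) min_ui = indices[i];
   }

   *min_out = min_ui;
   *max_out = max_ui;
}
```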

That said, I don't know where the thresholds between the different
algorithms will fall on other CPUs.

Also, I ran the benchmarks with GCC 4.9.1; we might get different results
with clang or other compilers.

I hope this is useful. If I can help add SSE to anything else, just ask.

- Bruno

> > > BTW, Thanks for the article!
> > > - Bruno
> > > > > > >> +      unsigned max_arr[4] __attribute__ ((aligned (16)));
> > > > > > >> +      unsigned min_arr[4] __attribute__ ((aligned (16)));
> > > > > > >> +      unsigned vec_count;
> > > > > > >> +      __m128i max_ui4 = _mm_setzero_si128();
> > > > > > >> +      __m128i min_ui4 = _mm_set1_epi32(~0U);
> > > > > > >> +      __m128i ui_indices4;
> > > > > > >> +      __m128i *ui_indices_ptr;
> > > > > > >> +

-------------- next part --------------
A non-text attachment was scrubbed...
Name: min_max_array.c
Type: text/x-csrc
Size: 10161 bytes
Desc: not available
URL: <http://lists.freedesktop.org/archives/mesa-dev/attachments/20141101/1ae410aa/attachment-0001.c>