[Mesa-dev] [PATCH V4] mesa: add SSE optimisation for glDrawElements

Sun Nov 2 03:13:54 PST 2014

On Sat, 2014-11-01 at 23:15 +0000, Bruno Jimenez wrote:
> On Wed, 2014-10-29 at 23:09 +1100, Timothy Arceri wrote:
> > On Wed, 2014-10-29 at 16:58 +1100, Timothy Arceri wrote:
> > > On Tue, 2014-10-28 at 22:14 +0000, Bruno Jimenez wrote:
> > > > Hi,
> > > > 
> > > > I haven't had time to play with OpenMP yet, but I have looked at the
> > > > assembly it produces on my computer. If I enable SSE2 it can use it, and
> > > > if I enable SSE4.1 it uses the parallel max. But it uses unaligned loads,
> > > > and since we are trying to avoid them I don't know if we want to use just
> > > > OpenMP.
> > > > 
> > > > Processing the first unaligned elements by hand and using
> > > > __builtin_assume_aligned, as in the article you linked, allows OpenMP
> > > > to use aligned loads.
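
A minimal sketch of the idea in the paragraph above, assuming GCC (or a compiler that provides __builtin_assume_aligned). The function name max_index is made up for illustration and this is not the code from the patch. The omp simd pragma only takes effect when OpenMP SIMD support is enabled (e.g. -fopenmp-simd); otherwise it is ignored and the loop is left to the normal auto-vectoriser.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: largest value in indices[0..count), or 0 if empty. */
static unsigned max_index(const unsigned *indices, size_t count)
{
   unsigned max_ui = 0;
   size_t i = 0;

   /* Handle the leading elements by hand until we reach a 16-byte boundary. */
   while (i < count && ((uintptr_t)(indices + i) % 16) != 0) {
      if (indices[i] > max_ui)
         max_ui = indices[i];
      i++;
   }
   if (i == count)
      return max_ui;   /* everything was consumed by the scalar loop */

   /* Promise the compiler that the remaining data is 16-byte aligned,
    * so the vectorised loop is free to use aligned loads. */
   const unsigned *aligned = __builtin_assume_aligned(indices + i, 16);
   size_t rest = count - i;

   #pragma omp simd reduction(max:max_ui)
   for (size_t j = 0; j < rest; j++) {
      if (aligned[j] > max_ui)
         max_ui = aligned[j];
   }
   return max_ui;
}
```

Whether the compiler really emits aligned loads here has to be checked in the generated assembly; the builtin only gives it permission to do so.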
> > > > 
> > > > Also, note that I am using GCC; I don't know what Clang may
> > > > produce (for the plain algorithms, as it doesn't support OpenMP if I
> > > > recall correctly).
> > > > 
> > > > When I have time I'll try to collect all the variants and make some kind
> > > > of benchmark between them to see if it is worth using any of them.
> > > > 
> > > > Also, do we know if 'count' has an upper bound? Or if we can force the
> > > > array to be aligned so we don't have to worry about the first items?
> > > 
> > > I'm not sure about count but the indices array comes from the
> > > glDrawRangeElements() call so I don't think we can do anything about the
> > > alignment.
> > 
> > Sorry I meant glDrawElements()
> 
> Ok, Thanks for the information.
> 
> At last I have had time to put everything in place and make some
> benchmarks for the naive algorithm, what OpenMP generates, SSE2 and
> SSE4.1 code, for both aligned and unaligned scenarios.
> 
> As a summary for my CPU:
> 
> Limited to SSE2:
>   - If the array is aligned:
>     + For low values (~20) the naive algorithm is faster
>     + For higher values the SSE2 code is faster
>   - If the array is unaligned:
>     + For low values (~24) the naive algorithm is faster
>     + For higher values, the SSE2 code is faster
> 
> Using -march=native (SSE4.1 support)
>   - If the array is aligned:
>     + For very low values (~8) the naive algorithm is faster
>     + For low values (12-32) the SSE4.1 algorithm is faster
>     + For higher values, the OpenMP generated code is faster
>   - If the array is unaligned:
>     + For very low values (~12) the naive algorithm is faster
>     + For low values (16-32) the SSE4.1 algorithm is faster
>     + For higher values, the OpenMP generated code is faster
> 
> For the OpenMP generated code there isn't much difference between the
> naive and the aligned variants, although the latter is a little faster.
> 
> That said, I don't know where the thresholds between the different
> algorithms will fall on other CPUs.
> 
> Also, I did the benchmark with GCC 4.9.1; we might get different
> results with clang or other compilers.
>  
> Hope this is useful. If I can help add SSE to anything else, just ask.
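
To make the variants compared above a bit more concrete, here is a rough sketch of an SSE2 min/max scan that falls back to the naive loop for small counts. The names minmax_scalar, minmax_sse2 and SIMD_THRESHOLD are made up for illustration, the threshold value is only a placeholder that would have to be measured per CPU, and this is not the code from the patch. SSE2 has no unsigned 32-bit min/max instruction (SSE4.1 adds one), so the sketch flips the sign bit and uses signed compares instead.

```c
#include <emmintrin.h>   /* SSE2 intrinsics */
#include <limits.h>
#include <stddef.h>
#include <stdint.h>

/* Placeholder cut-over point (assumption); measure it on the target CPU. */
#define SIMD_THRESHOLD 24

/* Naive loop: folds idx[0..count) into the running *mn / *mx. */
static void minmax_scalar(const unsigned *idx, size_t count,
                          unsigned *mn, unsigned *mx)
{
   for (size_t i = 0; i < count; i++) {
      if (idx[i] < *mn) *mn = idx[i];
      if (idx[i] > *mx) *mx = idx[i];
   }
}

static void minmax_sse2(const unsigned *idx, size_t count,
                        unsigned *min_out, unsigned *max_out)
{
   unsigned mn = ~0u, mx = 0;
   size_t i = 0;

   if (count >= SIMD_THRESHOLD) {
      /* Peel leading elements until the pointer is 16-byte aligned. */
      while (i < count && ((uintptr_t)(idx + i) & 15) != 0) {
         minmax_scalar(idx + i, 1, &mn, &mx);
         i++;
      }

      /* XOR with the sign bit maps unsigned order onto signed order. */
      const __m128i bias = _mm_set1_epi32(INT_MIN);
      __m128i vmin = _mm_set1_epi32(INT_MAX);   /* biased ~0u */
      __m128i vmax = _mm_set1_epi32(INT_MIN);   /* biased 0   */

      for (; i + 4 <= count; i += 4) {
         __m128i v = _mm_load_si128((const __m128i *)(idx + i));
         v = _mm_xor_si128(v, bias);
         __m128i gt = _mm_cmpgt_epi32(v, vmax);
         vmax = _mm_or_si128(_mm_and_si128(gt, v),
                             _mm_andnot_si128(gt, vmax));
         __m128i lt = _mm_cmplt_epi32(v, vmin);
         vmin = _mm_or_si128(_mm_and_si128(lt, v),
                             _mm_andnot_si128(lt, vmin));
      }

      /* Undo the bias and reduce the four lanes by hand. */
      unsigned lanes_min[4], lanes_max[4];
      _mm_storeu_si128((__m128i *)lanes_min, _mm_xor_si128(vmin, bias));
      _mm_storeu_si128((__m128i *)lanes_max, _mm_xor_si128(vmax, bias));
      for (int k = 0; k < 4; k++) {
         if (lanes_min[k] < mn) mn = lanes_min[k];
         if (lanes_max[k] > mx) mx = lanes_max[k];
      }
   }

   /* Small arrays, or whatever is left after the vector loop. */
   minmax_scalar(idx + i, count - i, &mn, &mx);

   *min_out = mn;
   *max_out = mx;
}
```

An empty array yields min = ~0u and max = 0, matching the initial values used in the quoted patch fragment.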

It's very easy to find places for improvement by running Phoronix Test
Suite benchmarks in a tool like callgrind, and looking at the output in
KCachegrind.

I've just set up a wiki on my website so I can keep track of places that
could benefit from these kinds of optimisations. Feel free to make use of
it, or have a go at improving anything I put on there.

http://www.itsqueeze.com/wiki/index.php?title=Main_Page 
> 
> - Bruno
> 
> > 
> > >  
> > > 
> > > > 
> > > > BTW, Thanks for the article!
> > > > - Bruno
> > > > 
> > > > > 
> > > > > 
> > > > > > 
> > > > > > - Bruno
> > > > > > 
> > > > > > > 
> > > > > > > > - Bruno
> > > > > > > > 
> > > > > > > >> +      unsigned max_arr[4] __attribute__ ((aligned (16)));
> > > > > > > >> +      unsigned min_arr[4] __attribute__ ((aligned (16)));
> > > > > > > >> +      unsigned vec_count;
> > > > > > > >> +      __m128i max_ui4 = _mm_setzero_si128();
> > > > > > > >> +      __m128i min_ui4 = _mm_set1_epi32(~0U);
> > > > > > > >> +      __m128i ui_indices4;
> > > > > > > >> +      __m128i *ui_indices_ptr;
> > > > > > > >> +
> > > > > > [snip]