[Mesa-dev] [PATCH V4] mesa: add SSE optimisation for glDrawElements

Tue Oct 28 15:14:31 PDT 2014

On Wed, 2014-10-29 at 07:49 +1100, Timothy Arceri wrote:
> On Mon, 2014-10-27 at 20:04 +0000, Bruno Jimenez wrote:
> > [snip]
> > > >> +
> > > >> +   if (aligned_count >= 4) {
> > > >                           ^^
> > > > 
> > > > Hi,
> > > > 
> > > > I have been thinking, and I think you can change that 4 to an 8. In
> > > > the case aligned_count == 4 there's no gain from using SSE, as you will
> > > > still have to do a final reduction from 4 elements to 1.
> > > 
> > > Either that or the initial check for whether or not to call the SSE
> > > version could be predicated on size.  Something like
> > > 
> > >         if (cpu_has_sse4_1 && count >= 8) {
> > > 
> > > The actual threshold may be higher than 8 (and is certainly higher than
> > > 4, as you point out).  It would take some careful microbenchmarks and
> > > measurement to find the actual tipping point.  Maybe just use 8 and add
> > > a FINISHME comment for now?
> > 
> > In fact, with enough 'bad' luck the threshold could be 10: 3 unaligned
> > at the beginning, 4 ready for SSE and 3 at the end. Just as a side
> > thought... why don't we force the alignment in cases where we may
> > potentially use SSE and avoid the problem from the beginning?
> > 
> > Also, I have been toying a bit and I have written a couple of functions
> > for finding max values in arrays. They can be easily extended to add a
> > find_min at the same time. The good part is that I have also found a way
> > to only use SSE2, parallelizing a branchless algorithm for finding
> > maximums. Maybe we could ship it for x86_64 at least.
> > 
> > If you want to play with them the code is attached. If you need anything
> > else, just ask.
> 
> I haven't had a chance to play with this yet, but were you able to test
> your SSE2 code against what OpenMP produces? As per Matt's earlier
> suggestion.
> Maybe it would be possible to always enable OpenMP on x86_64? Then the
> code wouldn't have to change much at all to add SSE2 for this. I don't
> know enough about OpenMP though to be sure if enabling it would have any
> other side effects.
>  
> [1] http://locklessinc.com/articles/vectorize/

Hi,

I haven't had time to play with OpenMP yet, but I have looked at the
assembly it produces on my computer. If I enable SSE2 it can use it, and
if I enable SSE4.1 it uses the parallel max. But it uses unaligned loads,
and since we are trying to avoid them I don't know if we want to rely on
OpenMP alone.

Processing the first unaligned elements by hand and then using
__builtin_assume_aligned, as in the article you linked, allows OpenMP to
use aligned loads.

Also, it must be noted that I am using GCC; I don't know what Clang may
produce (for the plain algorithms, as it doesn't support OpenMP if I
recall correctly).

When I have time I'll try to collect all the variants and benchmark
them against each other to see whether any of them is worth using.

Also, do we know if 'count' has an upper bound? Or whether we can force
the array to be aligned so we don't have to worry about the first items?

BTW, Thanks for the article!
- Bruno

> 
> 
> > 
> > - Bruno
> > 
> > > 
> > > > - Bruno
> > > > 
> > > >> +      unsigned max_arr[4] __attribute__ ((aligned (16)));
> > > >> +      unsigned min_arr[4] __attribute__ ((aligned (16)));
> > > >> +      unsigned vec_count;
> > > >> +      __m128i max_ui4 = _mm_setzero_si128();
> > > >> +      __m128i min_ui4 = _mm_set1_epi32(~0U);
> > > >> +      __m128i ui_indices4;
> > > >> +      __m128i *ui_indices_ptr;
> > > >> +
> > [snip]
> 
>