[Mesa-dev] [PATCH V4] mesa: add SSE optimisation for glDrawElements

Tue Oct 28 22:58:28 PDT 2014

On Tue, 2014-10-28 at 22:14 +0000, Bruno Jimenez wrote:
> On Wed, 2014-10-29 at 07:49 +1100, Timothy Arceri wrote:
> > On Mon, 2014-10-27 at 20:04 +0000, Bruno Jimenez wrote:
> > > [snip]
> > > > >> +
> > > > >> +   if (aligned_count >= 4) {
> > > > >                           ^^
> > > > > 
> > > > > Hi,
> > > > > 
> > > > > > I have been thinking and I think that you can change that 4 to an 8. In
> > > > > > the case aligned_count == 4 there's no gain in using SSE, as you will
> > > > > > have to do a final reduction from 4 elements to 1.
> > > > 
> > > > Either that or the initial check for whether or not to call the SSE
> > > > version could be predicated on size.  Something like
> > > > 
> > > >         if (cpu_has_sse4_1 && count >= 8) {
> > > > 
> > > > The actual threshold may be higher than 8 (and is certainly higher than
> > > > 4, as you point out).  It would take some careful microbenchmarks and
> > > > measurement to find the actual tipping point.  Maybe just use 8 and add
> > > > a FINISHME comment for now?
> > > 
> > > > In fact, with enough 'bad' luck the threshold could be 10: 3 unaligned
> > > > at the beginning, 4 ready for SSE and 3 at the end. Just as a side
> > > > thought... why don't we force the alignment in cases when we potentially
> > > > may use SSE and avoid the problem from the beginning?
> > > 
> > > Also, I have been toying a bit and I have written a couple of functions
> > > for finding max values in arrays. They can be easily extended to add a
> > > find_min at the same time. The good part is that I have also found a way
> > > > to only use SSE2, parallelizing a branchless algorithm for finding
> > > > maximums. Maybe we could ship it for x86_64 at least.
> > > 
> > > If you want to play with them the code is attached. If you need anything
> > > else, just ask.
> > 
> > I haven't had a chance to play with this yet, but were you able to test
> > your SSE2 code against what OpenMP produces? As per Matt's earlier
> > suggestion.
> > > Maybe it would be possible to always enable OpenMP on x86_64? Then the
> > > code wouldn't have to change much at all to add SSE2 for this. I don't
> > know enough about OpenMP though to be sure if enabling it would have any
> > other side effects.
> >  
> > [1] http://locklessinc.com/articles/vectorize/
> 
> Hi,
> 
> > I haven't had time to play with OpenMP yet, but I have seen the assembly
> > it produces on my computer. If I enable SSE2 it can use it, and if I
> enable SSE4.1 it uses the parallel max. But it uses unaligned loads, and
> since we are trying to avoid them I don't know if we want to use just
> OpenMP.
> 
> > Processing the first unaligned elements by hand and using
> > __builtin_assume_aligned, as in the article you linked, allows OpenMP to
> > use aligned loads.
> 
> Also, it must be noted that I am using GCC, I don't know what Clang may
> produce (for the plain algorithms, as it doesn't support OpenMP if I
> recall correctly).
> 
> > When I have time I'll try to collect all the variants and make some kind
> > of benchmark between them to see if any of them is worth using.
> 
> > Also, do we know if 'count' has an upper bound, or if we can force the
> > array to be aligned so we don't have to worry about the first items?

I'm not sure about count, but the indices array comes from the
glDrawRangeElements() call, so I don't think we can do anything about the
alignment.
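
Since the alignment of the caller's array can't be controlled, the
"process the first unaligned elements by hand, then use
__builtin_assume_aligned" approach mentioned above might look roughly like
the sketch below. This is only an illustration, not the code from the patch:
the function name find_min_max_sketch is hypothetical, it assumes GCC or a
compatible compiler, and it leaves vectorisation of the main loop to the
compiler instead of using SSE intrinsics.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper for illustration only; not part of the patch. */
void
find_min_max_sketch(const unsigned *indices, size_t count,
                    unsigned *min_out, unsigned *max_out)
{
   unsigned min_ui = ~0U;   /* same starting values as the patch uses */
   unsigned max_ui = 0;
   size_t i = 0;

   /* Scalar prologue: consume elements one at a time until the address
    * of the next element is 16-byte aligned (or the array is exhausted).
    */
   while (i < count && ((uintptr_t)(indices + i) & 15) != 0) {
      if (indices[i] < min_ui) min_ui = indices[i];
      if (indices[i] > max_ui) max_ui = indices[i];
      i++;
   }

   if (i < count) {
      /* The prologue stopped on an aligned address, so the promise made
       * to the compiler here is true and it may emit aligned loads.
       */
      const unsigned *aligned = __builtin_assume_aligned(indices + i, 16);
      size_t remaining = count - i;

      for (size_t j = 0; j < remaining; j++) {
         if (aligned[j] < min_ui) min_ui = aligned[j];
         if (aligned[j] > max_ui) max_ui = aligned[j];
      }
   }

   *min_out = min_ui;
   *max_out = max_ui;
}
```

With a 4-byte-aligned indices pointer the prologue handles at most 3
elements, which matches the "3 unaligned at the beginning" worst case
discussed earlier in the thread. Whether the compiler actually vectorises
the main loop depends on the optimisation flags, so this would still need
benchmarking against the hand-written SSE version.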

> 
> BTW, Thanks for the article!
> - Bruno
> 
> > 
> > 
> > > 
> > > - Bruno
> > > 
> > > > 
> > > > > - Bruno
> > > > > 
> > > > >> +      unsigned max_arr[4] __attribute__ ((aligned (16)));
> > > > >> +      unsigned min_arr[4] __attribute__ ((aligned (16)));
> > > > >> +      unsigned vec_count;
> > > > >> +      __m128i max_ui4 = _mm_setzero_si128();
> > > > >> +      __m128i min_ui4 = _mm_set1_epi32(~0U);
> > > > >> +      __m128i ui_indices4;
> > > > >> +      __m128i *ui_indices_ptr;
> > > > >> +
> > > [snip]
> > 
> > 
> 
>