[Mesa-dev] [PATCH 2/2] mesa: Speedup the xrgb -> argb special case in fast_read_rgba_pixels_memcpy

Mon Mar 11 11:58:08 PDT 2013

----- Original Message -----
> On 03/11/2013 11:30 AM, Jose Fonseca wrote:
> > ----- Original Message -----
> >> On 03/11/2013 07:56 AM, Jose Fonseca wrote:
> >>> I'm surprised this is faster.
> >>>
> >>> In particular, for big things we'll be touching memory twice.
> >>>
> >>> Did you measure the speed up?
> >>
> >> The second hit is cache-hot, so it may not be too expensive.
> >
> > Yes, but the size in question is 1900x1200, i.e., 9MB, which will thrash the
> > L1/L2 caches, and won't even fit in the L3 cache of several processors.
> 
> But it's doing it line-by-line, right?  So 1900 * 4 bytes/pixel is only ~8KB.

Oh I missed that. That looks quite sensible then.
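
For reference, the per-row shape being discussed can be sketched roughly as below. This is a minimal illustration and not the code from the patch: the function name, the parameters, and the strides counted in pixels are my assumptions, and the mask assumes the X/alpha byte is the top byte of each 32-bit pixel (the same 0xff000000 constant used in the intrinsics snippet in this thread).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper, not the actual Mesa function: copy an image of
 * 32-bit XRGB pixels row by row and force the alpha byte to 0xff.
 * Strides are given in pixels (an assumption made for this sketch). */
static void
copy_xrgb_to_argb(uint32_t *dst, size_t dst_stride,
                  const uint32_t *src, size_t src_stride,
                  size_t width, size_t height)
{
   assert(width <= dst_stride && width <= src_stride);

   for (size_t y = 0; y < height; y++) {
      uint32_t *d = dst + y * dst_stride;
      const uint32_t *s = src + y * src_stride;

      /* First pass: bulk copy of one row (7600 bytes at width 1900). */
      memcpy(d, s, width * sizeof *d);

      /* Second pass: revisit the row just written, which should still
       * be cache-resident, and set the alpha byte. */
      for (size_t x = 0; x < width; x++)
         d[x] |= 0xff000000u;
   }
}
```

The point of the sketch is only that the second pass works on one row at a time, so its working set is a few KB rather than the whole 9MB image.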

> > I'm afraid we'd be optimizing some cases at expense of others.
> 
> That is probably true either way.  To optimize this for everything, we'd
> need a lot more tests.
> 
> > I think that at the very least we should do this in 16KB/32KB or so chunks
> > to avoid thrashing the lower level caches.
> >
> >> I suspect
> >> memcpy is optimized to fill the cache in a more efficient manner than
> >> the old loop.  Since the old loop did a read and a bit-wise or, it's
> >> also possible the compiler generated some really dumb code.  We'd have
> >> to look at the assembly output to know.
> >>
> >> As Patrick suggests, there's probably an SSE2 method to do this even
> >> faster.  That may be worth investigating.
> >
> > An SSE2 version is quite easy with intrinsics:
> >
> >    /* could use _mm_load_si128 with some checks */
> >    __m128i pixels = _mm_loadu_si128((const __m128i *)src);
> >    pixels = _mm_or_si128(pixels, _mm_set1_epi32(0xff000000));
> >    _mm_storeu_si128((__m128i *)dst, pixels);
> >    src += sizeof(__m128i) / sizeof *src;
> >    dst += sizeof(__m128i) / sizeof *dst;
> >
> > the hard part is the runtime check for sse2 support...
> 
> We could start by doing something like this for 64-bit builds.  SSE2 is
> always available there. :)  If we're using the intrinsics anyway, it's
> probably even better to use PREFETCHNTA on the read.

Yes, that would avoid polluting the cache with one-time reads.
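
Putting the two ideas together, a row loop might look something like the sketch below. Treat it as untested guesswork on the performance side: the function name, the 64-byte cache line, and the one-line prefetch distance are assumptions, and nobody in this thread has measured whether the prefetch helps. It falls back to a plain scalar loop when the compiler does not define __SSE2__.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__SSE2__)
#include <emmintrin.h>   /* SSE2 intrinsics; also pulls in _mm_prefetch */
#endif

/* Hypothetical helper: copy n 32-bit pixels, setting the alpha byte. */
static void
xrgb_to_argb_row(uint32_t *dst, const uint32_t *src, size_t n)
{
   size_t i = 0;

   assert(n == 0 || (dst != NULL && src != NULL));

#if defined(__SSE2__)
   const __m128i alpha = _mm_set1_epi32((int)0xff000000u);

   for (; i + 4 <= n; i += 4) {
      /* Once per 16 pixels (64 bytes, an assumed cache-line size), hint
       * that the next line of source data will only be read once. */
      if ((i & 15) == 0 && i + 16 < n)
         _mm_prefetch((const char *)(src + i + 16), _MM_HINT_NTA);

      /* Unaligned load/store, as in the snippet quoted above. */
      __m128i pixels = _mm_loadu_si128((const __m128i *)(src + i));
      pixels = _mm_or_si128(pixels, alpha);
      _mm_storeu_si128((__m128i *)(dst + i), pixels);
   }
#endif

   /* Scalar tail, and the whole row when SSE2 is not compiled in. */
   for (; i < n; i++)
      dst[i] = src[i] | 0xff000000u;
}
```

The scalar tail handles widths that are not a multiple of four pixels, which the four-pixel-per-iteration snippet on its own does not.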

> Mesa has some code for detecting CPU capabilities, but I don't think it
> has been updated in ages...  It looks like src/mesa/x86/common_x86.c
> detects MMX and SSE, but there's no code for anything after that.

Gallium too. Should move that into somewhere shareable...
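
As a stopgap until the detection code is shared, something along these lines could gate an SSE2 path. It is only a sketch: cpu_has_sse2 is a made-up name, not an existing Mesa or Gallium function, and the 32-bit branch assumes a GCC or Clang recent enough to provide the __builtin_cpu_supports builtin.

```c
/* Hypothetical helper: returns 1 if SSE2 may be used at run time, else 0. */
static int
cpu_has_sse2(void)
{
#if defined(__x86_64__) || defined(_M_X64)
   /* SSE2 is part of the x86-64 baseline, so no run-time check is needed. */
   return 1;
#elif defined(__GNUC__) && defined(__i386__)
   /* GCC/Clang builtin that queries CPUID; assumes the compiler has it. */
   return __builtin_cpu_supports("sse2") != 0;
#else
   /* Unknown platform or compiler: stay on the scalar path. */
   return 0;
#endif
}
```

On 64-bit builds this collapses to a constant, which matches the suggestion of starting there.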

Jose