[Mesa-dev] [PATCH 2/2] mesa: Speedup the xrgb -> argb special case in fast_read_rgba_pixels_memcpy

Mon Mar 11 11:37:32 PDT 2013

On Mon, Mar 11, 2013 at 1:30 PM, Jose Fonseca <jfonseca at vmware.com> wrote:

> ----- Original Message -----
> > On 03/11/2013 07:56 AM, Jose Fonseca wrote:
> > > I'm surprised this is faster.
> > >
> > > In particular, for big things we'll be touching memory twice.
> > >
> > > Did you measure the speed up?
> >
> > The second hit is cache-hot, so it may not be too expensive.
>
> Yes, but the size in question is 1900x1200, i.e., 9MB, which will thrash
> the L1/L2 caches, and won't even fit in the L3 cache of several processors.
>
> I'm afraid we'd be optimizing some cases at expense of others.
>
> I think that at the very least we should do this in 16KB/32KB or so chunks
> to avoid thrashing the lower level caches.
>
> > I suspect
> > memcpy is optimized to fill the cache in a more efficient manner than
> > the old loop.  Since the old loop did a read and a bit-wise or, it's
> > also possible the compiler generated some really dumb code.  We'd have
> > to look at the assembly output to know.
> >
> > As Patrick suggests, there's probably an SSE2 method to do this even
> > faster.  That may be worth investigating.
>
> An SSE2 version is quite easy with intrinsics:
>
>   // could use _mm_load_si128 with some alignment checks
>   __m128i pixels = _mm_loadu_si128((const __m128i *)src);
>   pixels = _mm_or_si128(pixels, _mm_set1_epi32(0xff000000));
>   _mm_storeu_si128((__m128i *)dst, pixels);
>   src += sizeof(__m128i) / sizeof *src;
>   dst += sizeof(__m128i) / sizeof *dst;
>
> the hard part is the runtime check for sse2 support...
>
>
At least for x86-64, no runtime check is needed, since SSE2 is part of the
baseline instruction set. The mesa/x86 folder already contains runtime CPU
detection code; I was just browsing it.
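
For illustration, here is a minimal sketch of the compile-time case, built
around the intrinsics quoted above. The helper name copy_xrgb_to_argb and the
HAVE_SSE2_AT_COMPILE_TIME macro are hypothetical, not existing Mesa symbols,
and I am assuming 32-bit pixels with alpha in the top byte. The SSE2 path is
only compiled in when the compiler already targets SSE2 (always true on
x86-64); otherwise the plain loop handles everything.

```c
#include <stddef.h>
#include <stdint.h>

/* Enable the SSE2 path only when the compiler already targets SSE2,
 * which is always the case on x86-64. */
#if defined(__SSE2__) || defined(_M_X64) || \
    (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
#include <emmintrin.h>
#define HAVE_SSE2_AT_COMPILE_TIME 1
#endif

/* Hypothetical helper: copy `count` 32-bit pixels from src to dst,
 * forcing the alpha byte (bits 24..31) to 0xff. */
static void
copy_xrgb_to_argb(uint32_t *dst, const uint32_t *src, size_t count)
{
   size_t i = 0;

#ifdef HAVE_SSE2_AT_COMPILE_TIME
   const __m128i alpha = _mm_set1_epi32((int)0xff000000u);

   /* Four pixels per iteration; unaligned loads/stores keep it simple. */
   for (; i + 4 <= count; i += 4) {
      __m128i pixels = _mm_loadu_si128((const __m128i *)(src + i));
      pixels = _mm_or_si128(pixels, alpha);
      _mm_storeu_si128((__m128i *)(dst + i), pixels);
   }
#endif

   /* Scalar tail, and the whole copy when SSE2 is not compiled in. */
   for (; i < count; i++)
      dst[i] = src[i] | 0xff000000u;
}
```

A 32-bit x86 build without -msse2 would still need the runtime detection to
pick between the two paths. I have not measured this against the memcpy
version.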

Patrick