[Mesa-dev] [PATCH 2/4] i965: Provide sse2 version for rgba8 <-> bgra8 swizzle
Roland Scheidegger
sroland at vmware.com
Thu Jan 28 17:18:02 PST 2016
On 29.01.2016 at 00:31, Matt Turner wrote:
> On Sun, Jan 17, 2016 at 1:49 PM, <sroland at vmware.com> wrote:
>> From: Roland Scheidegger <sroland at vmware.com>
>>
>> The existing code used ssse3, and because it isn't built in a separate
>> file compiled with that flag, it is usually not used (that, of course,
>> could be fixed...), whereas sse2 is always present, at least with 64bit builds.
>> It is actually trivial to do with sse2 without pshufb; on some cpus (I'm
>> looking at you, atom!) it might not even be slower.
>> This is compile-tested only, it doesn't actually do what I really want
>> (which is glReadPixels without doing byte access from an uncached region,
>> which is what you'll get on intel chips not having llc, if your cpu doesn't
>> support sse41 (in which case the rb would be copied over with movntdqa instead
>> of mapped, so mesa format_utils byte swapping conversion will then access
>> the cached region instead of the uncached one) - really need sse2-optimized
>> convert_ubyte functions for a proper fix, otherwise google maps in firefox
>> is reduced to below 1 fps), but hey why not. I don't have a gpu which could
>> possibly hit this, albeit I successfully used the exact same code elsewhere.
>
> Bah, wall of text!
>
> I don't think we need all this. The commit title says everything I
> think it needs to say.
>
>> v2: fix andnot argument order, add comments
>> ---
>> src/mesa/drivers/dri/i965/intel_mipmap_tree.c | 18 +++++++
>> src/mesa/drivers/dri/i965/intel_tiled_memcpy.c | 73 +++++++++++++++++++++-----
>> 2 files changed, 79 insertions(+), 12 deletions(-)
>>
>> diff --git a/src/mesa/drivers/dri/i965/intel_mipmap_tree.c b/src/mesa/drivers/dri/i965/intel_mipmap_tree.c
>> index 108dd87..5fc4212 100644
>> --- a/src/mesa/drivers/dri/i965/intel_mipmap_tree.c
>> +++ b/src/mesa/drivers/dri/i965/intel_mipmap_tree.c
>> @@ -2773,6 +2773,24 @@ intel_miptree_map(struct brw_context *brw,
>> } else if (!(mode & GL_MAP_WRITE_BIT) &&
>> !mt->compressed && cpu_has_sse4_1 &&
>> (mt->pitch % 16 == 0)) {
>> + /*
>> + * XXX: without sse4_1, in some situations we still really want to copy
>> + * regardless. Presumably this is not done for performance reasons -
>> + * with movntdqa, thanks to the 64-byte streaming load buffer, the
>> + * uncached->cached copy followed by a cached->cached one later is
>> + * always faster than an "ordinary" uncached->cached copy.
>> + * Without movntdqa an additional copy of course doesn't help, albeit
>> + * it has to be said the uncached->cached one is an order of magnitude
>> + * slower than the latter cached->cached one in any case.
>> + * But mesa may not do a simple memcpy on that memory later - some
>> + * glReadPixels paths, for instance, will hit per-byte access, which
>> + * is a complete performance killer on uncached memory. So in these
>> + * cases we really want to copy regardless, unless the map below could
>> + * play some tricks making the memory cached.
>> + * (Or otherwise ensure mesa can't hit these paths often; for instance,
>> + * glReadPixels requiring conversions could be handled by meta, so in
>> + * the end it really would be just a memcpy.)
>> + */
>
> Walls of text are really hard to read...
>
> I don't think adding this comment to the code is particularly
> valuable, but it does make me wonder if we should just add a memcpy
> fallback after this path? Maybe we don't care once we have this patch.
Ok I'll remove that. It was rather difficult for me to figure out how
mesa ends up doing the per-byte access out of uncached memory, which is
why I wrote it.
I don't know if always doing the memcpy is an acceptable solution (hence
I didn't do anything about it). It would only add a fairly small extra
cost, but OTOH that perf hit would land on the fast path which you
probably expect to hit often (i.e. mesa accessing things with a memcpy
anyway).
My guess is the fastest solution would be trying to remap the memory as
cacheable somehow, but I don't really know much about the driver (and I
know tricks like that get ugly real fast with all the fences, clflushes,
etc.).
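For context, the movntdqa copy being discussed could be sketched roughly as
below. This is only an illustration, not the driver's actual
intel_miptree_map_movntdqa: the name streaming_copy is made up, it assumes
GCC/clang (for the target attribute), 16-byte-aligned pointers, and a size
that is a multiple of 64 bytes. The point is that movntdqa pulls a whole
64-byte chunk from uncached (write-combined) memory through the streaming
load buffer, so the subsequent cached copy can then be read per-byte
cheaply, which is exactly what the glReadPixels paths mentioned above need.

```c
#include <smmintrin.h>  /* SSE4.1: _mm_stream_load_si128 (movntdqa) */
#include <stddef.h>

/* Hypothetical sketch of a movntdqa-based copy. Four 16-byte streaming
 * loads per iteration hit the same buffered 64-byte chunk. dst and src
 * must be 16-byte aligned and size a multiple of 64. */
__attribute__((target("sse4.1")))
static void
streaming_copy(void *dst, const void *src, size_t size)
{
   __m128i *d = (__m128i *)dst;
   /* _mm_stream_load_si128 takes a non-const pointer, so cast const away */
   __m128i *s = (__m128i *)src;

   for (size_t i = 0; i < size / 16; i += 4) {
      __m128i r0 = _mm_stream_load_si128(&s[i + 0]);
      __m128i r1 = _mm_stream_load_si128(&s[i + 1]);
      __m128i r2 = _mm_stream_load_si128(&s[i + 2]);
      __m128i r3 = _mm_stream_load_si128(&s[i + 3]);
      _mm_store_si128(&d[i + 0], r0);
      _mm_store_si128(&d[i + 1], r1);
      _mm_store_si128(&d[i + 2], r2);
      _mm_store_si128(&d[i + 3], r3);
   }
}
```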
>
>> intel_miptree_map_movntdqa(brw, mt, map, level, slice);
>> #endif
>> } else {
>> diff --git a/src/mesa/drivers/dri/i965/intel_tiled_memcpy.c b/src/mesa/drivers/dri/i965/intel_tiled_memcpy.c
>> index 2383401..42fdde1 100644
>> --- a/src/mesa/drivers/dri/i965/intel_tiled_memcpy.c
>> +++ b/src/mesa/drivers/dri/i965/intel_tiled_memcpy.c
>> @@ -36,10 +36,13 @@
>> #include "brw_context.h"
>> #include "intel_tiled_memcpy.h"
>>
>> -#ifdef __SSSE3__
>> +#if defined(__SSSE3__)
>> #include <tmmintrin.h>
>> +#elif defined(__SSE2__)
>> +#include <emmintrin.h>
>> #endif
>>
>> +
>> #define FILE_DEBUG_FLAG DEBUG_TEXTURE
>>
>> #define ALIGN_DOWN(a, b) ROUND_DOWN_TO(a, b)
>> @@ -56,23 +59,69 @@ static const uint32_t ytile_width = 128;
>> static const uint32_t ytile_height = 32;
>> static const uint32_t ytile_span = 16;
>>
>> -#ifdef __SSSE3__
>> +#if defined(__SSSE3__)
>> static const uint8_t rgba8_permutation[16] =
>> { 2,1,0,3, 6,5,4,7, 10,9,8,11, 14,13,12,15 };
>>
>> /* NOTE: dst must be 16-byte aligned. src may be unaligned. */
>> -#define rgba8_copy_16_aligned_dst(dst, src) \
>> - _mm_store_si128((__m128i *)(dst), \
>> - _mm_shuffle_epi8(_mm_loadu_si128((__m128i *)(src)), \
>> - *(__m128i *) rgba8_permutation))
>> +static inline void
>> +rgba8_copy_16_aligned_dst(void *dst, const void *src)
>> +{
>> + __m128i reg;
>> + reg = _mm_loadu_si128((__m128i *)src);
>> + reg = _mm_shuffle_epi8(reg, *(__m128i *)rgba8_permutation);
>> + _mm_store_si128((__m128i *)dst, reg);
>
> Here, and...
>
>> +}
>>
>> /* NOTE: src must be 16-byte aligned. dst may be unaligned. */
>> -#define rgba8_copy_16_aligned_src(dst, src) \
>> - _mm_storeu_si128((__m128i *)(dst), \
>> - _mm_shuffle_epi8(_mm_load_si128((__m128i *)(src)), \
>> - *(__m128i *) rgba8_permutation))
>> +static inline void
>> +rgba8_copy_16_aligned_src(void *dst, const void *src)
>> +{
>> + __m128i reg;
>> + reg = _mm_load_si128((__m128i *)src);
>> + reg = _mm_shuffle_epi8(reg, *(__m128i *)rgba8_permutation);
>> + _mm_storeu_si128((__m128i *)dst, reg);
>
> ... here, I'd just make these single statements like they were before.
> They fit easily within 80 columns.
Ok I can leave them alone. I just got rid of the macros because I hate
them (at least when there's no reason to use them in the first place),
and I find the nested intrinsics not very readable, but maybe that's
just me.
>
>> +}
>> +
>> +#elif defined(__SSE2__)
>> +static inline void
>> +rgba8_copy_16_aligned_dst(void *dst, const void *src)
>> +{
>> + __m128i srcreg, dstreg, agmask, ag, rb, r, b;
>> +
>> + agmask = _mm_set1_epi32(0xFF00FF00);
>> + srcreg = _mm_loadu_si128((__m128i *)src);
>
> Okay, so we're starting with 0xAABBGGRR.
>
>> +
>> + rb = _mm_andnot_si128(agmask, srcreg);
>
> Produces 0x00BB00GG. Looks good.
Surely you meant 0x00BB00RR :-).
>
>> + ag = _mm_and_si128(agmask, srcreg);
>
> Produces 0xAA00GG00. Looks good.
>
>> + r = _mm_srli_epi32(rb, 16);
>
> Produces 0x000000BB. Looks wrong -- should be b, not r.
>
>> + b = _mm_slli_epi32(rb, 16);
>
> Produces 0x00RR0000. Looks wrong -- should be r, not b.
Ahh sharp eyes indeed. Not that it would change the result ;-).
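The mask-and-shift logic under review is easiest to check in scalar form.
A minimal sketch (the helper name rgba8_to_bgra8 is made up) of the same
steps on a single 0xAABBGGRR pixel:

```c
#include <stdint.h>

/* Swap the R and B channels of one 0xAABBGGRR pixel using the same
 * mask-and-shift steps as the SSE2 code above. */
static inline uint32_t
rgba8_to_bgra8(uint32_t src)
{
   uint32_t ag = src & 0xFF00FF00u;  /* 0xAA00GG00: alpha and green stay put */
   uint32_t rb = src & 0x00FF00FFu;  /* 0x00BB00RR: the channels to swap */
   /* Shift rb both ways and OR everything back together: B lands in the
    * R byte and R in the B byte, giving 0xAARRGGBB. */
   return ag | (rb >> 16) | (rb << 16);
}
```

Since both shifted values get OR'd together, it makes no difference which
of the two is named r and which b, which is why the mislabeled variables
in the patch still produce the right result.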
>
>> + dstreg = _mm_or_si128(ag, r);
>> + dstreg = _mm_or_si128(dstreg, b);
>
> I think we can do this better. Instead of two shifts and two ORs,
> recognizing that the R and B values in rb have 8-bits of zero padding,
> we can use _mm_shufflehi_epi16/_mm_shufflelo_epi16 and then a single
> OR:
>
> br = _mm_shufflehi_epi16(_mm_shufflelo_epi16(rb, _MM_SHUFFLE(2, 3, 0, 1)),
>                          _MM_SHUFFLE(2, 3, 0, 1));
> dstreg = _mm_or_si128(ag, br);
>
> (appropriately line wrapped)
Right, I missed the optimal solution. Not that it would make a real
difference, but it's certainly better.
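Putting Matt's suggestion together, the aligned-dst variant would look
roughly like this (a sketch only; the _v2 name is made up, and it assumes
the same surrounding code as the patch). The 16-bit lanes of rb each hold
one zero-padded R or B byte, so swapping adjacent lanes in the low and high
halves swaps R and B in one step:

```c
#include <emmintrin.h>  /* SSE2 */

/* rgba8 -> bgra8 for 4 pixels; dst must be 16-byte aligned. */
static inline void
rgba8_copy_16_aligned_dst_v2(void *dst, const void *src)
{
   __m128i srcreg, dstreg, agmask, ag, rb, br;

   agmask = _mm_set1_epi32(0xFF00FF00);
   srcreg = _mm_loadu_si128((__m128i *)src);

   rb = _mm_andnot_si128(agmask, srcreg);  /* 0x00BB00RR per pixel */
   ag = _mm_and_si128(agmask, srcreg);     /* 0xAA00GG00 per pixel */
   /* Swap adjacent 16-bit lanes in both halves: 0x00BB00RR -> 0x00RR00BB */
   br = _mm_shufflehi_epi16(_mm_shufflelo_epi16(rb, _MM_SHUFFLE(2, 3, 0, 1)),
                            _MM_SHUFFLE(2, 3, 0, 1));
   dstreg = _mm_or_si128(ag, br);          /* 0xAARRGGBB per pixel */

   _mm_store_si128((__m128i *)dst, dstreg);
}
```

Two shuffles and one OR replace the two shifts and two ORs of the original,
at the cost of no extra registers.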
>
>> +
>> + _mm_store_si128((__m128i *)dst, dstreg);
>> +}
>> +
>> +static inline void
>> +rgba8_copy_16_aligned_src(void *dst, const void *src)
>> +{
>> + __m128i srcreg, dstreg, agmask, ag, rb, r, b;
>> +
>> + agmask = _mm_set1_epi32(0xFF00FF00);
>> + srcreg = _mm_load_si128((__m128i *)src);
>> +
>> + rb = _mm_andnot_si128(agmask, srcreg);
>> + ag = _mm_and_si128(agmask, srcreg);
>> + r = _mm_srli_epi32(rb, 16);
>> + b = _mm_slli_epi32(rb, 16);
>> + dstreg = _mm_or_si128(ag, r);
>> + dstreg = _mm_or_si128(dstreg, b);
>
> Same comments here.
>
> With the walls of text removed, and the other couple of things
> addressed, this patch would be
>
> Reviewed-by: Matt Turner <mattst88 at gmail.com>
>
Thanks for the review!
Roland