[Mesa-dev] [PATCH 2/4] i965: Provide sse2 version for rgba8 <-> bgra8 swizzle

Lofstedt, Marta marta.lofstedt at intel.com
Tue Feb 2 08:29:30 UTC 2016


Hi Roland and Matt,

I am sorry, but this patch segfaults/asserts Kronos CTS - ES2-CTS.gtf.GL2FixedTests.scissor.scissor

I created this bug:
https://bugs.freedesktop.org/show_bug.cgi?id=93962

I also added the bug to the OpenGL ES 3.1 tracker bug, since it blocks CTS testing.
https://bugs.freedesktop.org/show_bug.cgi?id=92788


/Marta

> -----Original Message-----
> From: mesa-dev [mailto:mesa-dev-bounces at lists.freedesktop.org] On
> Behalf Of Roland Scheidegger
> Sent: Friday, January 29, 2016 2:18 AM
> To: Matt Turner
> Cc: mesa-dev at lists.freedesktop.org
> Subject: Re: [Mesa-dev] [PATCH 2/4] i965: Provide sse2 version for rgba8 <->
> bgra8 swizzle
> 
> Am 29.01.2016 um 00:31 schrieb Matt Turner:
> > On Sun, Jan 17, 2016 at 1:49 PM,  <sroland at vmware.com> wrote:
> >> From: Roland Scheidegger <sroland at vmware.com>
> >>
> >> The existing code used ssse3, and because it isn't compiled in a
> >> separate file compiled with that, it is usually not used (that, of
> >> course, could be fixed...), whereas sse2 is always present at least with
> 64bit builds.
> >> It is actually trivial to do with sse2 without pshufb, on some cpus
> >> (I'm looking at you, atom!) it might not even be slower.
> >> This is compile-tested only, it doesn't actually do what I really
> >> want (which is glReadPixels without doing byte access from an
> >> uncached region, which is what you'll get on intel chips not having
> >> llc, if your cpu doesn't support sse41 (in which case the rb would be
> >> copied over with movntdqa instead of mapped, so mesa format_utils
> >> byte swapping conversion will then access the cached region instead
> >> of the uncached one) - really need sse2-optimized convert_ubyte
> >> functions for a proper fix, otherwise google maps in firefox is
> >> reduced to fps below 1 fps), but hey why not. I don't have a gpu which
> could possibly hit this, albeit I succesfully used the exact same code
> elsewhere.
> >
> > Bah, wall of text!
> >
> > I don't think we need all this. The commit title says everything I
> > think it needs to say.
> >
> >> v2: fix andnot argument order, add comments
> >> ---
> >>  src/mesa/drivers/dri/i965/intel_mipmap_tree.c  | 18 +++++++
> >> src/mesa/drivers/dri/i965/intel_tiled_memcpy.c | 73
> >> +++++++++++++++++++++-----
> >>  2 files changed, 79 insertions(+), 12 deletions(-)
> >>
> >> diff --git a/src/mesa/drivers/dri/i965/intel_mipmap_tree.c
> >> b/src/mesa/drivers/dri/i965/intel_mipmap_tree.c
> >> index 108dd87..5fc4212 100644
> >> --- a/src/mesa/drivers/dri/i965/intel_mipmap_tree.c
> >> +++ b/src/mesa/drivers/dri/i965/intel_mipmap_tree.c
> >> @@ -2773,6 +2773,24 @@ intel_miptree_map(struct brw_context *brw,
> >>     } else if (!(mode & GL_MAP_WRITE_BIT) &&
> >>                !mt->compressed && cpu_has_sse4_1 &&
> >>                (mt->pitch % 16 == 0)) {
> >> +      /*
> >> +       * XXX: without sse4_1, in some situations still really want to copy
> >> +       * regardless. Presumably this is not done for performance reasons -
> >> +       * with movntdqa thanks to the 64byte streaming load buffer the
> >> +       * uncached->cached copy followed by cached->cached later is always
> >> +       * faster than doing "ordinary" uncached->cached copy.
> >> +       * Without movntdqa, of course an additional copy doesn't help,
> albeit
> >> +       * it has to be said the uncached->cached one is an order of
> magnitude
> >> +       * slower than the later cached->cached one in any case.
> >> +       * But mesa may not do a simple memcpy on that memory later -
> some
> >> +       * glReadPixels paths for instance will well hit per-byte access which
> >> +       * is a complete performance killer on uncached memory. So in these
> >> +       * cases really want to copy regardless, unless the map below could
> >> +       * play some tricks making the memory cached.
> >> +       * (Or otherwise ensure mesa can't hit these paths often, for instance
> >> +       * glReadPixels requiring conversions could be handled by meta, so in
> >> +       * end it really would be just memcpy.)
> >> +       */
> >
> > Walls of text are really hard to read...
> >
> > I don't think adding this comment to the code is particularly
> > valuable, but it does make me wonder if we should just add a memcpy
> > fallback after this path? Maybe we don't care once we have this patch.
> 
> Ok I'll remove that. It was rather difficult for me to figure out how mesa ends
> up doing the per-byte access out of uncached memory, which is why I wrote
> it.
> I don't know if the always memcpy is an acceptable solution (hence didn't do
> anything about it). It will only add a quite small extra cost but OTOH that perf
> hit will be on the fast path which you probably expect to hit often (i.e. mesa
> accessing things with a memcpy anyway).
> My guess is the fastest solution would be trying to remap as cacheable
> somehow, but I don't really know much about the driver (but I know tricks
> like that get ugly real fast with all the fences, clflushes etc.)
> 
> >
> >>        intel_miptree_map_movntdqa(brw, mt, map, level, slice);
> >> #endif
> >>     } else {
> >> diff --git a/src/mesa/drivers/dri/i965/intel_tiled_memcpy.c
> >> b/src/mesa/drivers/dri/i965/intel_tiled_memcpy.c
> >> index 2383401..42fdde1 100644
> >> --- a/src/mesa/drivers/dri/i965/intel_tiled_memcpy.c
> >> +++ b/src/mesa/drivers/dri/i965/intel_tiled_memcpy.c
> >> @@ -36,10 +36,13 @@
> >>  #include "brw_context.h"
> >>  #include "intel_tiled_memcpy.h"
> >>
> >> -#ifdef __SSSE3__
> >> +#if defined(__SSSE3__)
> >>  #include <tmmintrin.h>
> >> +#elif defined(__SSE2__)
> >> +#include <emmintrin.h>
> >>  #endif
> >>
> >> +
> >>  #define FILE_DEBUG_FLAG DEBUG_TEXTURE
> >>
> >>  #define ALIGN_DOWN(a, b) ROUND_DOWN_TO(a, b) @@ -56,23 +59,69
> @@
> >> static const uint32_t ytile_width = 128;  static const uint32_t
> >> ytile_height = 32;  static const uint32_t ytile_span = 16;
> >>
> >> -#ifdef __SSSE3__
> >> +#if defined(__SSSE3__)
> >>  static const uint8_t rgba8_permutation[16] =
> >>     { 2,1,0,3, 6,5,4,7, 10,9,8,11, 14,13,12,15 };
> >>
> >>  /* NOTE: dst must be 16-byte aligned. src may be unaligned. */
> >> -#define rgba8_copy_16_aligned_dst(dst, src)                            \
> >> -   _mm_store_si128((__m128i *)(dst),                                   \
> >> -                   _mm_shuffle_epi8(_mm_loadu_si128((__m128i *)(src)), \
> >> -                                    *(__m128i *) rgba8_permutation))
> >> +static inline void
> >> +rgba8_copy_16_aligned_dst(void *dst, const void *src) {
> >> +   __m128i reg;
> >> +   reg = _mm_loadu_si128((__m128i *)src);
> >> +   reg = _mm_shuffle_epi8(reg, *(__m128i *)rgba8_permutation);
> >> +   _mm_store_si128((__m128i *)dst, reg);
> >
> > Here, and...
> >
> >> +}
> >>
> >>  /* NOTE: src must be 16-byte aligned. dst may be unaligned. */
> >> -#define rgba8_copy_16_aligned_src(dst, src)                            \
> >> -   _mm_storeu_si128((__m128i *)(dst),                                  \
> >> -                    _mm_shuffle_epi8(_mm_load_si128((__m128i *)(src)), \
> >> -                                     *(__m128i *) rgba8_permutation))
> >> +static inline void
> >> +rgba8_copy_16_aligned_src(void *dst, const void *src) {
> >> +   __m128i reg;
> >> +   reg = _mm_load_si128((__m128i *)src);
> >> +   reg = _mm_shuffle_epi8(reg, *(__m128i *)rgba8_permutation);
> >> +   _mm_storeu_si128((__m128i *)dst, reg);
> >
> > ... here, I'd just make these single statements like they were before.
> > They fit easily within 80 columns.
> 
> Ok I can leave them alone. I just got rid of the macros because I hate them
> (at least certainly if there's no reason to use them in the first place), and I just
> find the nested intrinsics not very readable but maybe that's just me.
> 
> 
> >
> >> +}
> >> +
> >> +#elif defined(__SSE2__)
> >> +static inline void
> >> +rgba8_copy_16_aligned_dst(void *dst, const void *src) {
> >> +   __m128i srcreg, dstreg, agmask, ag, rb, r, b;
> >> +
> >> +   agmask = _mm_set1_epi32(0xFF00FF00);
> >> +   srcreg = _mm_loadu_si128((__m128i *)src);
> >
> > Okay, so we're starting with 0xAABBGGRR.
> >
> >> +
> >> +   rb = _mm_andnot_si128(agmask, srcreg);
> >
> > Produces 0x00BB00GG. Looks good.
> Surely you meant 0x00BB00RR :-).
> 
> >
> >> +   ag = _mm_and_si128(agmask, srcreg);
> >
> > Produces 0xAA00GG00. Looks good.
> >
> >> +   r = _mm_srli_epi32(rb, 16);
> >
> > Produces 0x000000BB. Looks wrong -- should be b, not r.
> >
> >> +   b = _mm_slli_epi32(rb, 16);
> >
> > Produces 0x00RR0000. Looks wrong -- should be r, not b.
> Ahh sharp eyes indeed. Not that it would change the result ;-).
> 
> >
> >> +   dstreg = _mm_or_si128(ag, r);
> >> +   dstreg = _mm_or_si128(dstreg, b);
> >
> > I think we can do this better. Instead of two shifts and two ORs,
> > recognizing that the R and B values in rb have 8-bits of zero padding,
> > we can use _mm_shufflehi_epi16/_mm_shufflelo_epi16 and then a single
> > OR:
> >
> > br = _mm_shufflehi_epi16(_mm_shufflelo_epi16(rb, _MM_SHUFFLE(2, 3,
> 0,
> > 1)), _MM_SHUFFLE(2, 3, 0, 1)); dstreg = _mm_or_si128(ag, br);
> >
> > (appropriately line wrapped)
> Right I missed the optimal solution. Not that it would make a real difference,
> but it's certainly better.
> 
> 
> >
> >> +
> >> +   _mm_store_si128((__m128i *)dst, dstreg); }
> >> +
> >> +static inline void
> >> +rgba8_copy_16_aligned_src(void *dst, const void *src) {
> >> +   __m128i srcreg, dstreg, agmask, ag, rb, r, b;
> >> +
> >> +   agmask = _mm_set1_epi32(0xFF00FF00);
> >> +   srcreg = _mm_load_si128((__m128i *)src);
> >> +
> >> +   rb = _mm_andnot_si128(agmask, srcreg);
> >> +   ag = _mm_and_si128(agmask, srcreg);
> >> +   r = _mm_srli_epi32(rb, 16);
> >> +   b = _mm_slli_epi32(rb, 16);
> >> +   dstreg = _mm_or_si128(ag, r);
> >> +   dstreg = _mm_or_si128(dstreg, b);
> >
> > Same comments here.
> >
> > With the walls of text removed, and the other couple of things
> > addressed, this patch would be
> >
> > Reviewed-by: Matt Turner <mattst88 at gmail.com>
> >
> 
> Thanks for the review!
> 
> Roland
> 
> _______________________________________________
> mesa-dev mailing list
> mesa-dev at lists.freedesktop.org
> http://lists.freedesktop.org/mailman/listinfo/mesa-dev


More information about the mesa-dev mailing list