[Mesa-dev] [PATCH 2/4] i965: Provide sse2 version for rgba8 <-> bgra8 swizzle

Roland Scheidegger sroland at vmware.com
Tue Feb 2 14:44:02 UTC 2016


Ok, I'm going to revert it.

Note, though, that this hits an assertion which is completely independent of
the added code. It worked before this patch only because you didn't compile
with SSSE3 enabled - apparently the existing code is broken and just relies
on the compile flags to keep that path from being hit...
This should probably be fixed for real regardless...
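
For illustration, a minimal sketch of gating the pshufb path on a runtime CPU
check rather than on compile flags alone (not the actual Mesa code - the helper
name and signature are made up, it assumes GCC/clang's __builtin_cpu_supports(),
and the intrinsic block still has to be built with -mssse3, e.g. in a separately
compiled file):

   #if defined(__SSSE3__)
   #include <tmmintrin.h>
   #endif
   #include <stddef.h>

   /* Swizzle rgba8 <-> bgra8; bytes must be a multiple of 4,
    * dst/src must not overlap. */
   static void
   rgba8_to_bgra8(unsigned char *dst, const unsigned char *src, size_t bytes)
   {
      size_t i = 0;
   #if defined(__SSSE3__)
      if (__builtin_cpu_supports("ssse3")) {
         static const unsigned char perm[16] =
            { 2,1,0,3, 6,5,4,7, 10,9,8,11, 14,13,12,15 };
         const __m128i shuf = _mm_loadu_si128((const __m128i *)perm);
         for (; i + 16 <= bytes; i += 16) {
            __m128i v = _mm_loadu_si128((const __m128i *)(src + i));
            _mm_storeu_si128((__m128i *)(dst + i), _mm_shuffle_epi8(v, shuf));
         }
      }
   #endif
      /* scalar tail / fallback: swap the R and B byte of each pixel */
      for (; i < bytes; i += 4) {
         dst[i + 0] = src[i + 2];
         dst[i + 1] = src[i + 1];
         dst[i + 2] = src[i + 0];
         dst[i + 3] = src[i + 3];
      }
   }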

Roland



On 02.02.2016 09:29, Lofstedt, Marta wrote:
> Hi Roland and Matt,
> 
> I am sorry, but this patch causes segfaults/assertion failures in the Khronos
> CTS: ES2-CTS.gtf.GL2FixedTests.scissor.scissor
> 
> I created this bug:
> https://bugs.freedesktop.org/show_bug.cgi?id=93962
> 
> I also added it to the OpenGL ES 3.1 tracker bug, since it blocks CTS testing:
> https://bugs.freedesktop.org/show_bug.cgi?id=92788
> 
> 
> /Marta
> 
>> -----Original Message-----
>> From: mesa-dev [mailto:mesa-dev-bounces at lists.freedesktop.org] On
>> Behalf Of Roland Scheidegger
>> Sent: Friday, January 29, 2016 2:18 AM
>> To: Matt Turner
>> Cc: mesa-dev at lists.freedesktop.org
>> Subject: Re: [Mesa-dev] [PATCH 2/4] i965: Provide sse2 version for rgba8 <->
>> bgra8 swizzle
>>
>> On 29.01.2016 00:31, Matt Turner wrote:
>>> On Sun, Jan 17, 2016 at 1:49 PM,  <sroland at vmware.com> wrote:
>>>> From: Roland Scheidegger <sroland at vmware.com>
>>>>
>>>> The existing code uses ssse3, and because it isn't in a separate file
>>>> compiled with that flag, it is usually not used (that, of course, could
>>>> be fixed...), whereas sse2 is always present, at least with 64bit builds.
>>>> It is actually trivial to do this with sse2 without pshufb; on some cpus
>>>> (I'm looking at you, atom!) it might not even be slower.
>>>> This is compile-tested only. It doesn't actually do what I really want,
>>>> which is glReadPixels without byte access from an uncached region - that
>>>> is what you get on intel chips without llc if your cpu doesn't support
>>>> sse41 (with sse41 the rb would be copied over with movntdqa instead of
>>>> mapped, so mesa's format_utils byte-swapping conversion would then access
>>>> the cached region instead of the uncached one). We'd really need
>>>> sse2-optimized convert_ubyte functions for a proper fix, otherwise google
>>>> maps in firefox is reduced to below 1 fps. But hey, why not. I don't have
>>>> a gpu which could possibly hit this, albeit I successfully used the exact
>>>> same code elsewhere.
>>>
>>> Bah, wall of text!
>>>
>>> I don't think we need all this. The commit title says everything I
>>> think it needs to say.
>>>
>>>> v2: fix andnot argument order, add comments
>>>> ---
>>>>  src/mesa/drivers/dri/i965/intel_mipmap_tree.c  | 18 +++++++
>>>>  src/mesa/drivers/dri/i965/intel_tiled_memcpy.c | 73 +++++++++++++++++++++-----
>>>>  2 files changed, 79 insertions(+), 12 deletions(-)
>>>>
>>>> diff --git a/src/mesa/drivers/dri/i965/intel_mipmap_tree.c b/src/mesa/drivers/dri/i965/intel_mipmap_tree.c
>>>> index 108dd87..5fc4212 100644
>>>> --- a/src/mesa/drivers/dri/i965/intel_mipmap_tree.c
>>>> +++ b/src/mesa/drivers/dri/i965/intel_mipmap_tree.c
>>>> @@ -2773,6 +2773,24 @@ intel_miptree_map(struct brw_context *brw,
>>>>     } else if (!(mode & GL_MAP_WRITE_BIT) &&
>>>>                !mt->compressed && cpu_has_sse4_1 &&
>>>>                (mt->pitch % 16 == 0)) {
>>>> +      /*
>>>> +       * XXX: without sse4_1, in some situations we still really want to copy
>>>> +       * regardless. Presumably this is not done for performance reasons -
>>>> +       * with movntdqa, thanks to the 64byte streaming load buffer, the
>>>> +       * uncached->cached copy followed by cached->cached later is always
>>>> +       * faster than doing an "ordinary" uncached->cached copy.
>>>> +       * Without movntdqa, of course, an additional copy doesn't help, albeit
>>>> +       * it has to be said the uncached->cached one is an order of magnitude
>>>> +       * slower than the later cached->cached one in any case.
>>>> +       * But mesa may not do a simple memcpy on that memory later - some
>>>> +       * glReadPixels paths for instance may well hit per-byte access which
>>>> +       * is a complete performance killer on uncached memory. So in these
>>>> +       * cases we really want to copy regardless, unless the map below could
>>>> +       * play some tricks making the memory cached.
>>>> +       * (Or otherwise ensure mesa can't hit these paths often; for instance
>>>> +       * glReadPixels requiring conversions could be handled by meta, so in
>>>> +       * the end it really would be just memcpy.)
>>>> +       */
>>>
>>> Walls of text are really hard to read...
>>>
>>> I don't think adding this comment to the code is particularly
>>> valuable, but it does make me wonder if we should just add a memcpy
>>> fallback after this path? Maybe we don't care once we have this patch.
>>
>> Ok, I'll remove that. It was rather difficult for me to figure out how mesa
>> ends up doing the per-byte access out of uncached memory, which is why I
>> wrote it down.
>> I don't know if always doing the memcpy is an acceptable solution (hence I
>> didn't do anything about it). It would only add a fairly small extra cost,
>> but OTOH that perf hit would be on the fast path, which you probably expect
>> to hit often (i.e. mesa accessing things with a memcpy anyway).
>> My guess is the fastest solution would be trying to remap the memory as
>> cacheable somehow, but I don't really know much about the driver (and I know
>> tricks like that get ugly real fast with all the fences, clflushes etc.).
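>>
>> For context, a minimal sketch of the kind of movntdqa streaming copy being
>> discussed (not the Mesa implementation; assumes SSE4.1, 16-byte aligned
>> pointers and a size that is a multiple of 16):
>>
>>    #include <smmintrin.h>   /* SSE4.1: _mm_stream_load_si128 (movntdqa) */
>>    #include <stddef.h>
>>
>>    /* Copy from an uncached/write-combined mapping into a cached buffer
>>     * using streaming loads, so that later per-byte accesses hit cached
>>     * memory instead of the uncached mapping. */
>>    static void
>>    streaming_copy(void *dst, void *src, size_t bytes)
>>    {
>>       __m128i *d = (__m128i *)dst;
>>       __m128i *s = (__m128i *)src;
>>       for (size_t i = 0; i < bytes / 16; i++)
>>          _mm_store_si128(&d[i], _mm_stream_load_si128(&s[i]));
>>    }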
>>
>>>
>>>>        intel_miptree_map_movntdqa(brw, mt, map, level, slice);
>>>> #endif
>>>>     } else {
>>>> diff --git a/src/mesa/drivers/dri/i965/intel_tiled_memcpy.c b/src/mesa/drivers/dri/i965/intel_tiled_memcpy.c
>>>> index 2383401..42fdde1 100644
>>>> --- a/src/mesa/drivers/dri/i965/intel_tiled_memcpy.c
>>>> +++ b/src/mesa/drivers/dri/i965/intel_tiled_memcpy.c
>>>> @@ -36,10 +36,13 @@
>>>>  #include "brw_context.h"
>>>>  #include "intel_tiled_memcpy.h"
>>>>
>>>> -#ifdef __SSSE3__
>>>> +#if defined(__SSSE3__)
>>>>  #include <tmmintrin.h>
>>>> +#elif defined(__SSE2__)
>>>> +#include <emmintrin.h>
>>>>  #endif
>>>>
>>>> +
>>>>  #define FILE_DEBUG_FLAG DEBUG_TEXTURE
>>>>
>>>>  #define ALIGN_DOWN(a, b) ROUND_DOWN_TO(a, b)
>>>> @@ -56,23 +59,69 @@ static const uint32_t ytile_width = 128;
>>>>  static const uint32_t ytile_height = 32;
>>>>  static const uint32_t ytile_span = 16;
>>>>
>>>> -#ifdef __SSSE3__
>>>> +#if defined(__SSSE3__)
>>>>  static const uint8_t rgba8_permutation[16] =
>>>>     { 2,1,0,3, 6,5,4,7, 10,9,8,11, 14,13,12,15 };
>>>>
>>>>  /* NOTE: dst must be 16-byte aligned. src may be unaligned. */
>>>> -#define rgba8_copy_16_aligned_dst(dst, src)                            \
>>>> -   _mm_store_si128((__m128i *)(dst),                                   \
>>>> -                   _mm_shuffle_epi8(_mm_loadu_si128((__m128i *)(src)), \
>>>> -                                    *(__m128i *) rgba8_permutation))
>>>> +static inline void
>>>> +rgba8_copy_16_aligned_dst(void *dst, const void *src)
>>>> +{
>>>> +   __m128i reg;
>>>> +   reg = _mm_loadu_si128((__m128i *)src);
>>>> +   reg = _mm_shuffle_epi8(reg, *(__m128i *)rgba8_permutation);
>>>> +   _mm_store_si128((__m128i *)dst, reg);
>>>
>>> Here, and...
>>>
>>>> +}
>>>>
>>>>  /* NOTE: src must be 16-byte aligned. dst may be unaligned. */
>>>> -#define rgba8_copy_16_aligned_src(dst, src)                            \
>>>> -   _mm_storeu_si128((__m128i *)(dst),                                  \
>>>> -                    _mm_shuffle_epi8(_mm_load_si128((__m128i *)(src)), \
>>>> -                                     *(__m128i *) rgba8_permutation))
>>>> +static inline void
>>>> +rgba8_copy_16_aligned_src(void *dst, const void *src)
>>>> +{
>>>> +   __m128i reg;
>>>> +   reg = _mm_load_si128((__m128i *)src);
>>>> +   reg = _mm_shuffle_epi8(reg, *(__m128i *)rgba8_permutation);
>>>> +   _mm_storeu_si128((__m128i *)dst, reg);
>>>
>>> ... here, I'd just make these single statements like they were before.
>>> They fit easily within 80 columns.
>>
>> Ok, I can leave them alone. I just got rid of the macros because I hate them
>> (at least when there's no reason to use them in the first place), and I find
>> the nested intrinsics not very readable, but maybe that's just me.
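>>
>> (For reference, one way the single-statement form could look inside the
>> inline functions - just the old macro body folded in, still under 80 columns;
>> this assumes the surrounding file's includes and rgba8_permutation table:)
>>
>>    static inline void
>>    rgba8_copy_16_aligned_dst(void *dst, const void *src)
>>    {
>>       _mm_store_si128((__m128i *)dst,
>>                       _mm_shuffle_epi8(_mm_loadu_si128((__m128i *)src),
>>                                        *(__m128i *)rgba8_permutation));
>>    }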
>>
>>
>>>
>>>> +}
>>>> +
>>>> +#elif defined(__SSE2__)
>>>> +static inline void
>>>> +rgba8_copy_16_aligned_dst(void *dst, const void *src)
>>>> +{
>>>> +   __m128i srcreg, dstreg, agmask, ag, rb, r, b;
>>>> +
>>>> +   agmask = _mm_set1_epi32(0xFF00FF00);
>>>> +   srcreg = _mm_loadu_si128((__m128i *)src);
>>>
>>> Okay, so we're starting with 0xAABBGGRR.
>>>
>>>> +
>>>> +   rb = _mm_andnot_si128(agmask, srcreg);
>>>
>>> Produces 0x00BB00GG. Looks good.
>> Surely you meant 0x00BB00RR :-).
>>
>>>
>>>> +   ag = _mm_and_si128(agmask, srcreg);
>>>
>>> Produces 0xAA00GG00. Looks good.
>>>
>>>> +   r = _mm_srli_epi32(rb, 16);
>>>
>>> Produces 0x000000BB. Looks wrong -- should be b, not r.
>>>
>>>> +   b = _mm_slli_epi32(rb, 16);
>>>
>>> Produces 0x00RR0000. Looks wrong -- should be r, not b.
>> Ahh sharp eyes indeed. Not that it would change the result ;-).
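>>
>> (A tiny scalar self-check, not part of the patch, confirming the result is
>> the same whichever way those two temporaries are named:)
>>
>>    #include <assert.h>
>>    #include <stdint.h>
>>
>>    int main(void)
>>    {
>>       uint32_t src = 0xAABBCCDDu;                  /* A=AA B=BB G=CC R=DD */
>>       uint32_t ag  = src & 0xFF00FF00u;            /* 0xAA00CC00 */
>>       uint32_t rb  = src & 0x00FF00FFu;            /* 0x00BB00DD */
>>       uint32_t swapped = (rb >> 16) | (rb << 16);  /* 0x00DD00BB */
>>       assert((ag | swapped) == 0xAADDCCBBu);       /* R and B exchanged */
>>       return 0;
>>    }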
>>
>>>
>>>> +   dstreg = _mm_or_si128(ag, r);
>>>> +   dstreg = _mm_or_si128(dstreg, b);
>>>
>>> I think we can do this better. Instead of two shifts and two ORs,
>>> recognizing that the R and B values in rb have 8-bits of zero padding,
>>> we can use _mm_shufflehi_epi16/_mm_shufflelo_epi16 and then a single
>>> OR:
>>>
>>> br = _mm_shufflehi_epi16(_mm_shufflelo_epi16(rb, _MM_SHUFFLE(2, 3, 0, 1)),
>>>                          _MM_SHUFFLE(2, 3, 0, 1));
>>> dstreg = _mm_or_si128(ag, br);
>>>
>>> (appropriately line wrapped)
>> Right I missed the optimal solution. Not that it would make a real difference,
>> but it's certainly better.
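>>
>> For reference, a sketch of the _dst variant with that suggestion folded in
>> (same assumptions as the patch: dst 16-byte aligned, src possibly unaligned,
>> emmintrin.h included):
>>
>>    static inline void
>>    rgba8_copy_16_aligned_dst(void *dst, const void *src)
>>    {
>>       __m128i srcreg, dstreg, agmask, ag, rb, br;
>>
>>       agmask = _mm_set1_epi32(0xFF00FF00);
>>       srcreg = _mm_loadu_si128((__m128i *)src);
>>
>>       rb = _mm_andnot_si128(agmask, srcreg);  /* 0x00BB00RR per pixel */
>>       ag = _mm_and_si128(agmask, srcreg);     /* 0xAA00GG00 per pixel */
>>       /* swap the two 16-bit halves of each 32-bit lane: -> 0x00RR00BB */
>>       br = _mm_shufflehi_epi16(
>>               _mm_shufflelo_epi16(rb, _MM_SHUFFLE(2, 3, 0, 1)),
>>               _MM_SHUFFLE(2, 3, 0, 1));
>>       dstreg = _mm_or_si128(ag, br);          /* 0xAARRGGBB per pixel */
>>
>>       _mm_store_si128((__m128i *)dst, dstreg);
>>    }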
>>
>>
>>>
>>>> +
>>>> +   _mm_store_si128((__m128i *)dst, dstreg);
>>>> +}
>>>> +
>>>> +static inline void
>>>> +rgba8_copy_16_aligned_src(void *dst, const void *src)
>>>> +{
>>>> +   __m128i srcreg, dstreg, agmask, ag, rb, r, b;
>>>> +
>>>> +   agmask = _mm_set1_epi32(0xFF00FF00);
>>>> +   srcreg = _mm_load_si128((__m128i *)src);
>>>> +
>>>> +   rb = _mm_andnot_si128(agmask, srcreg);
>>>> +   ag = _mm_and_si128(agmask, srcreg);
>>>> +   r = _mm_srli_epi32(rb, 16);
>>>> +   b = _mm_slli_epi32(rb, 16);
>>>> +   dstreg = _mm_or_si128(ag, r);
>>>> +   dstreg = _mm_or_si128(dstreg, b);
>>>
>>> Same comments here.
>>>
>>> With the walls of text removed, and the other couple of things
>>> addressed, this patch would be
>>>
>>> Reviewed-by: Matt Turner <mattst88 at gmail.com>
>>>
>>
>> Thanks for the review!
>>
>> Roland
>>
>> _______________________________________________
>> mesa-dev mailing list
>> mesa-dev at lists.freedesktop.org
>> http://lists.freedesktop.org/mailman/listinfo/mesa-dev


