[Mesa-dev] [PATCH 3/5] llvmpipe: Optimize do_triangle_ccw for POWER8

Thu Dec 31 10:41:33 PST 2015

Am 31.12.2015 um 10:15 schrieb Oded Gabbay:
> On Thu, Dec 31, 2015 at 4:13 AM, Roland Scheidegger <sroland at vmware.com> wrote:
>>
>> Am 30.12.2015 um 10:59 schrieb Oded Gabbay:
>>> On Wed, Dec 30, 2015 at 1:17 AM, Roland Scheidegger <sroland at vmware.com> wrote:
>>>> The idea looks right to me.
>>>> Though frankly I don't like our current setup code too much - in
>>>> particular the mix between c, assembly, and jit code, with some
>>>> duplication (plus the lots of transpose everywhere). There's likely
>>>> optimization potential to be found there.
>>>>
>>>> Roland
>>>
>>> Agreed, if we can remove the transpose, that would be some 3-5% boost for sure.
>>>
>>
>> I actually forgot about this, but I'm really surprised you could measure
>> a difference as this code (but not the one in rasterization) is
>> effectively disabled in practice ever since the switch to 8 bit subpixel
>> precision (as the max fb size allowed now is really tiny, 128x128...)
>> Unless you were playing with those values...
>> Should have probably added some XXX comment back then...
>>
>> Roland
> 
> Indeed, glxgears and glxspheres don't reach this code. However,
> glmark2 has a couple of tests that do run this code (I put prints to
> check for it).  But I'll add the comment you added in the patch you
> just sent.
Ok, though you quoted increases  for openarena and xonotic as well :-).
The code really doesn't serve much practical purpose (we try to optimize
for the common case...), it's mostly there because it once was useful,
and because I planned to actually fix it up but never got around to do
it... Should be pretty obvious how in theory, albeit it will require
64bit operations - we already do the 32/32->64bit mul in any case for
sse, and would just need to follow up with 64bit subs for the plane c.
And probably then just skip the transpose - as we reload/store the
values manually anyway after that due to the 32/64bit mismatch it
probably didn't buy us much.

Roland


> 
>      Oded
>>
>>
>>>
>>>>
>>>> Am 29.12.2015 um 17:12 schrieb Oded Gabbay:
>>>>> This patch converts the SSE optimization done in do_triangle_ccw to
>>>>> VMX/VSX.
>>>>>
>>>>> I measured the results on POWER8 machine with 32 cores at 3.4GHz and
>>>>> 16GB of RAM.
>>>>>
>>>>>                       FPS/Score
>>>>>   Name            Before     After    Delta
>>>>> ------------------------------------------------
>>>>> glmark2 (score)   136.6      139.8    2.34%
>>>>> openarena         16.14      16.35    1.30%
>>>>> xonotic           4.655      4.707    1.11%
>>>>>
>>>>> Signed-off-by: Oded Gabbay <oded.gabbay at gmail.com>
>>>>> ---
>>>>>  src/gallium/drivers/llvmpipe/lp_setup_tri.c | 96 +++++++++++++++++++++++++++++
>>>>>  1 file changed, 96 insertions(+)
>>>>>
>>>>> diff --git a/src/gallium/drivers/llvmpipe/lp_setup_tri.c b/src/gallium/drivers/llvmpipe/lp_setup_tri.c
>>>>> index b1671dd..cfa9874 100644
>>>>> --- a/src/gallium/drivers/llvmpipe/lp_setup_tri.c
>>>>> +++ b/src/gallium/drivers/llvmpipe/lp_setup_tri.c
>>>>> @@ -46,6 +46,9 @@
>>>>>
>>>>>  #if defined(PIPE_ARCH_SSE)
>>>>>  #include <emmintrin.h>
>>>>> +#elif defined(_ARCH_PWR8)
>>>>> +#include <altivec.h>
>>>>> +#include "util/u_pwr8.h"
>>>>>  #endif
>>>>>
>>>>>  static inline int
>>>>> @@ -462,6 +465,99 @@ do_triangle_ccw(struct lp_setup_context *setup,
>>>>>        STORE_PLANE(plane[2], p2);
>>>>>  #undef STORE_PLANE
>>>>>     } else
>>>>> +#elif defined(_ARCH_PWR8)
>>>>> +   if (setup->fb.width <= MAX_FIXED_LENGTH32 &&
>>>>> +       setup->fb.height <= MAX_FIXED_LENGTH32 &&
>>>>> +       (bbox.x1 - bbox.x0) <= MAX_FIXED_LENGTH32 &&
>>>>> +       (bbox.y1 - bbox.y0) <= MAX_FIXED_LENGTH32) {
>>>>> +      unsigned int bottom_edge;
>>>>> +      __m128i vertx, verty;
>>>>> +      __m128i shufx, shufy;
>>>>> +      __m128i dcdx, dcdy, c;
>>>>> +      __m128i unused;
>>>>> +      __m128i dcdx_neg_mask;
>>>>> +      __m128i dcdy_neg_mask;
>>>>> +      __m128i dcdx_zero_mask;
>>>>> +      __m128i top_left_flag;
>>>>> +      __m128i c_inc_mask, c_inc;
>>>>> +      __m128i eo, p0, p1, p2;
>>>>> +      __m128i_union vshuf_mask;
>>>>> +      __m128i zero = vec_splats((unsigned char) 0);
>>>>> +      PIPE_ALIGN_VAR(16) int32_t temp_vec[4];
>>>>> +
>>>>> +#ifdef PIPE_ARCH_LITTLE_ENDIAN
>>>>> +      vshuf_mask.i[0] = 0x07060504;
>>>>> +      vshuf_mask.i[1] = 0x0B0A0908;
>>>>> +      vshuf_mask.i[2] = 0x03020100;
>>>>> +      vshuf_mask.i[3] = 0x0F0E0D0C;
>>>>> +#else
>>>>> +      vshuf_mask.i[0] = 0x00010203;
>>>>> +      vshuf_mask.i[1] = 0x0C0D0E0F;
>>>>> +      vshuf_mask.i[2] = 0x04050607;
>>>>> +      vshuf_mask.i[3] = 0x08090A0B;
>>>>> +#endif
>>>>> +
>>>>> +      /* vertex x coords */
>>>>> +      vertx = vec_loadu_si128((const uint32_t *) position->x);
>>>>> +      /* vertex y coords */
>>>>> +      verty = vec_loadu_si128((const uint32_t *) position->y);
>>>>> +
>>>>> +      shufx = vec_perm (vertx, vertx, vshuf_mask.m128i);
>>>>> +      shufy = vec_perm (verty, verty, vshuf_mask.m128i);
>>>>> +
>>>>> +      dcdx = vec_sub_epi32(verty, shufy);
>>>>> +      dcdy = vec_sub_epi32(vertx, shufx);
>>>>> +
>>>>> +      dcdx_neg_mask = vec_srai_epi32(dcdx, 31);
>>>>> +      dcdx_zero_mask = vec_cmpeq_epi32(dcdx, zero);
>>>>> +      dcdy_neg_mask = vec_srai_epi32(dcdy, 31);
>>>>> +
>>>>> +      bottom_edge = (setup->bottom_edge_rule == 0) ? ~0 : 0;
>>>>> +      top_left_flag = (__m128i) vec_splats(bottom_edge);
>>>>> +
>>>>> +      c_inc_mask = vec_or(dcdx_neg_mask,
>>>>> +                                vec_and(dcdx_zero_mask,
>>>>> +                                              vec_xor(dcdy_neg_mask,
>>>>> +                                                            top_left_flag)));
>>>>> +
>>>>> +      c_inc = vec_srli_epi32(c_inc_mask, 31);
>>>>> +
>>>>> +      c = vec_sub_epi32(vec_mullo_epi32(dcdx, vertx),
>>>>> +                        vec_mullo_epi32(dcdy, verty));
>>>>> +
>>>>> +      c = vec_add_epi32(c, c_inc);
>>>>> +
>>>>> +      /* Scale up to match c:
>>>>> +       */
>>>>> +      dcdx = vec_slli_epi32(dcdx, FIXED_ORDER);
>>>>> +      dcdy = vec_slli_epi32(dcdy, FIXED_ORDER);
>>>>> +
>>>>> +      /* Calculate trivial reject values:
>>>>> +       */
>>>>> +      eo = vec_sub_epi32(vec_andc(dcdy_neg_mask, dcdy),
>>>>> +                         vec_and(dcdx_neg_mask, dcdx));
>>>>> +
>>>>> +      /* ei = _mm_sub_epi32(_mm_sub_epi32(dcdy, dcdx), eo); */
>>>>> +
>>>>> +      /* Pointless transpose which gets undone immediately in
>>>>> +       * rasterization:
>>>>> +       */
>>>>> +      transpose4_epi32(&c, &dcdx, &dcdy, &eo,
>>>>> +                       &p0, &p1, &p2, &unused);
>>>>> +
>>>>> +#define STORE_PLANE(plane, vec) do {                  \
>>>>> +         vec_store_si128((uint32_t *)&temp_vec, vec); \
>>>>> +         plane.c    = (int64_t)temp_vec[0];           \
>>>>> +         plane.dcdx = temp_vec[1];                    \
>>>>> +         plane.dcdy = temp_vec[2];                    \
>>>>> +         plane.eo   = temp_vec[3];                    \
>>>>> +      } while(0)
>>>>> +
>>>>> +      STORE_PLANE(plane[0], p0);
>>>>> +      STORE_PLANE(plane[1], p1);
>>>>> +      STORE_PLANE(plane[2], p2);
>>>>> +#undef STORE_PLANE
>>>>> +   } else
>>>>>  #endif
>>>>>     {
>>>>>        int i;
>>>>>
>>>>
>>