[Mesa-dev] [PATCH 2/2] i965: add runtime check for SSSE3 rgba8_copy

Thu Nov 6 16:30:00 PST 2014

I tested your patch with the "teximage" program in mesa demos, the
same thing I used to benchmark when I developed this code.
As Matt and Chad point out, the odd-looking _faster functions are
there for a reason.  Your change causes a huge slowdown.
I tested on a sandybridge system with a "Intel(R) Celeron(R) CPU 857 @
1.20GHz."  Mesa compiled with -O2.

original code:
  TexSubImage(RGBA/ubyte 256 x 256): 9660.4 images/sec, 2415.1 MB/sec
  TexSubImage(RGBA/ubyte 1024 x 1024): 821.2 images/sec, 3284.7 MB/sec
  TexSubImage(RGBA/ubyte 4096 x 4096): 76.3 images/sec, 4884.9 MB/sec

  TexSubImage(BGRA/ubyte 256 x 256): 11307.1 images/sec, 2826.8 MB/sec
  TexSubImage(BGRA/ubyte 1024 x 1024): 944.6 images/sec, 3778.6 MB/sec
  TexSubImage(BGRA/ubyte 4096 x 4096): 76.7 images/sec, 4908.3 MB/sec

  TexSubImage(L/ubyte 256 x 256): 17847.5 images/sec, 1115.5 MB/sec
  TexSubImage(L/ubyte 1024 x 1024): 3068.2 images/sec, 3068.2 MB/sec
  TexSubImage(L/ubyte 4096 x 4096): 224.6 images/sec, 3593.0 MB/sec

your code:
  TexSubImage(RGBA/ubyte 256 x 256): 3271.6 images/sec, 817.9 MB/sec
  TexSubImage(RGBA/ubyte 1024 x 1024): 232.3 images/sec, 929.2 MB/sec
  TexSubImage(RGBA/ubyte 4096 x 4096): 47.5 images/sec, 3038.6 MB/sec

  TexSubImage(BGRA/ubyte 256 x 256): 2426.5 images/sec, 606.6 MB/sec
  TexSubImage(BGRA/ubyte 1024 x 1024): 164.1 images/sec, 656.4 MB/sec
  TexSubImage(BGRA/ubyte 4096 x 4096): 13.4 images/sec, 854.8 MB/sec

  TexSubImage(L/ubyte 256 x 256): 9514.5 images/sec, 594.7 MB/sec
  TexSubImage(L/ubyte 1024 x 1024): 864.1 images/sec, 864.1 MB/sec
  TexSubImage(L/ubyte 4096 x 4096): 59.7 images/sec, 955.2 MB/sec

This is just one run, not an average, but you can see it's slower
across the board up to a factor of around 6.
Also I couldn't configure the build after your patch.  I think you
left out a change to configure.ac to define SSSE3_SUPPORTED.

On Thu, Nov 6, 2014 at 6:08 PM, Chad Versace <chad.versace at intel.com> wrote:
> On Thu 06 Nov 2014, Timothy Arceri wrote:
>>
>> Also cleans up some if statements in the *faster functions.
>
>
> I have comments about the cleanup below.
>
>> diff --git a/src/mesa/drivers/dri/i965/intel_tex_subimage.c
>> b/src/mesa/drivers/dri/i965/intel_tex_subimage.c
>> index cb5738a..0deeb75 100644
>> --- a/src/mesa/drivers/dri/i965/intel_tex_subimage.c
>> +++ b/src/mesa/drivers/dri/i965/intel_tex_subimage.c
>
>
> /**
> * Copy texture data from linear to X tile layout, faster.
> *
> * Same as \ref xtile_copy but faster, because it passes constant parameters
> * for common cases, allowing the compiler to inline code optimized for those
> * cases.
> *
> * \copydoc tile_copy_fn
> */
> static FLATTEN void
> xtile_copy_faster(...)
>
>> @@ -352,19 +316,8 @@ xtile_copy_faster(uint32_t x0, uint32_t x1, uint32_t
>> x2, uint32_t x3,
>>                   mem_copy_fn mem_copy)
>
>
>
>> {
>>    if (x0 == 0 && x3 == xtile_width && y0 == 0 && y1 == xtile_height) {
>> -      if (mem_copy == memcpy)
>> -         return xtile_copy(0, 0, xtile_width, xtile_width, 0,
>> xtile_height,
>> -                           dst, src, src_pitch, swizzle_bit, memcpy);
>> -      else if (mem_copy == rgba8_copy)
>> -         return xtile_copy(0, 0, xtile_width, xtile_width, 0,
>> xtile_height,
>> -                           dst, src, src_pitch, swizzle_bit, rgba8_copy);
>> -   } else {
>> -      if (mem_copy == memcpy)
>> -         return xtile_copy(x0, x1, x2, x3, y0, y1,
>> -                           dst, src, src_pitch, swizzle_bit, memcpy);
>> -      else if (mem_copy == rgba8_copy)
>> -         return xtile_copy(x0, x1, x2, x3, y0, y1,
>> -                           dst, src, src_pitch, swizzle_bit, rgba8_copy);
>> +      return xtile_copy(0, 0, xtile_width, xtile_width, 0, xtile_height,
>> +                        dst, src, src_pitch, swizzle_bit, mem_copy);
>>    }
>>    xtile_copy(x0, x1, x2, x3, y0, y1,
>>               dst, src, src_pitch, swizzle_bit, mem_copy);
>
>
> The "cleanup" of this if tree concerns me. Accoring the function
> comment, the original author of this function, fjhenigman, clearly created
> the weird 'if' tree with the intentation that the compiler would "inline
> code optimized for those cases".
>
> Without one of the following, I object to this cleanup:
>    - Frank's approval, or
>    - Proof that gcc never does the desired optimizations, or
>    - Proof that this change does not harm's Chrome's texture upload
> performance.