[Mesa-dev] Building with -fno-builtin-memcmp for improved performance

Tue Sep 20 07:46:49 PDT 2011

On 20.09.2011 16:35, Roland Scheidegger wrote:
> On 20.09.2011 16:15, Keith Whitwell wrote:
>> On Tue, 2011-09-20 at 16:02 +0200, Roland Scheidegger wrote:
>>> On 20.09.2011 12:35, Keith Whitwell wrote:
>>>> On Tue, 2011-09-20 at 10:59 +0200, Fabio wrote:
>>>>> There was a discussion some months ago about using -fno-builtin-memcmp for 
>>>>> improving memcmp performance:
>>>>> http://lists.freedesktop.org/archives/mesa-dev/2011-June/009078.html
>>>>>
>>>>> Since then, has it been properly addressed in mesa, or is the flag still
>>>>> recommended? If so, what about adding it in configure.ac?
>>>>
>>>> I've been meaning to follow up on this too.  I don't know the answer,
>>>> but pinging Roland in case he does.
>>>
>>> I guess it is still recommended.
>>> Ideally this is really something which should be fixed in gcc - the
>>> compiler has all the knowledge about fixed alignment and size (if any)
>>> (and more importantly knows if only a binary answer is needed which
>>> makes this much easier) and doesn't need to do any function call.
>>> If you enable that flag and some platform just has the same primitive
>>> repz cmpsb sequence in the system library it will just get even slower,
>>> though I guess chances of that happening are slim (with the possible
>>> exception of windows).
>>> I think in most cases it won't make much difference, so nobody cared to
>>> implement that change. It is most likely still a good idea unless gcc
>>> addressed that in the meantime...
>>
>> Hmm, it seemed like it made a big difference in the earlier
>> discussion...
> Yes for llvmpipe and one app at least.
> But that struct being compared there is most likely the biggest (by far)
> anywhere (at least which is compared in a regular fashion).
> 
>> I should take a look at reducing the size of the struct (as mentioned
>> before), but surely there's some way to pull in a better memcmp??
> 
> Well, apart from using -fno-builtin-memcmp we could build our own
> memcmpxx, though the version I did there (returning binary only result
> and assuming 32bit alignment/size allowing gcc to optimize it) was still
> slower for large sizes than -fno-builtin-memcmp. Of course we could
> optimize it more (e.g. for 64bit aligned/sized things, or using
> hand-coded sse2 versions using 128bit at-a-time comparisons) but then it
> gets more complicated, so I wasn't sure it was worth it.
> 
> For reference here are the earlier numbers (ipers with llvmpipe):
> original ipers: 12.1 fps
> optimized struct compare: 16.8 fps
> -fno-builtin-memcmp: 18.1 fps
> 
> And this was the function I used for getting the numbers:
> 
> static INLINE int util_cmp_struct(const void *src1, const void *src2,
> unsigned count)
> {
>   /* hmm pointer casting is evil */
>   const uint32_t *src1_ptr = (const uint32_t *)src1;
>   const uint32_t *src2_ptr = (const uint32_t *)src2;
>   unsigned i;
>   assert(count % 4 == 0);
>   for (i = 0; i < count/4; i++) {
>     if (*src1_ptr != *src2_ptr) {
>       return 1;
>     }
>     src1_ptr++;
>     src2_ptr++;
>   }
>   return 0;
> }

Oh, I also forgot to mention that it would probably get a bit more
complicated, since this was just a quick hack. We might only use this to
compare structs, but IIRC they are not guaranteed to always have a size or
alignment which is a multiple of 32 bits (I think it is legal for a struct
which only has, say, shorts in it to be only 16-bit aligned and for its
size to be a multiple of 16 bits). Of course we could just code this
explicitly, and the compiler should (hopefully) be smart enough to
optimize it away (well, unless optimization is disabled).
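
Coding the size and alignment cases explicitly might look roughly like the
sketch below. This is only an illustration and has not been benchmarked: the
name util_cmp_struct_any is hypothetical (it is not an existing Mesa
function), and using memcpy for the 32-bit loads is an assumption made here to
avoid the alignment and pointer-casting concerns, not something measured in
the numbers above. When the size is a compile-time constant that is a multiple
of 4, an optimizing compiler can drop the byte-wise tail loop.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper (not part of Mesa): returns 0 if the two buffers
 * are equal and 1 otherwise. Only a binary result is produced, unlike
 * memcmp, which also reports ordering. */
static inline int util_cmp_struct_any(const void *src1, const void *src2,
                                      size_t count)
{
   const unsigned char *p1 = (const unsigned char *)src1;
   const unsigned char *p2 = (const unsigned char *)src2;
   size_t i = 0;

   /* Main loop: compare 32 bits at a time. memcpy makes no assumption
    * about the alignment of the inputs; optimizing compilers typically
    * lower a fixed 4-byte memcpy to a single load. */
   for (; i + 4 <= count; i += 4) {
      uint32_t a, b;
      memcpy(&a, p1 + i, sizeof a);
      memcpy(&b, p2 + i, sizeof b);
      if (a != b)
         return 1;
   }

   /* Tail: the 0 to 3 bytes left over when count is not a multiple of 4,
    * e.g. a struct containing only shorts. */
   for (; i < count; i++) {
      if (p1[i] != p2[i])
         return 1;
   }
   return 0;
}
```

A caller would pass sizeof(the struct) as count, as with the quick-hack
version, but without the count % 4 == 0 restriction.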

Roland