[Mesa-dev] Building with -fno-builtin-memcmp for improved performance

Roland Scheidegger sroland at vmware.com
Tue Sep 20 07:46:49 PDT 2011

On 20.09.2011 16:35, Roland Scheidegger wrote:
> On 20.09.2011 16:15, Keith Whitwell wrote:
>> On Tue, 2011-09-20 at 16:02 +0200, Roland Scheidegger wrote:
>>> On 20.09.2011 12:35, Keith Whitwell wrote:
>>>> On Tue, 2011-09-20 at 10:59 +0200, Fabio wrote:
>>>>> There was a discussion some months ago about using -fno-builtin-memcmp for 
>>>>> improving memcmp performance:
>>>>> http://lists.freedesktop.org/archives/mesa-dev/2011-June/009078.html
>>>>> Since then, was it properly addressed in mesa, or is the flag still
>>>>> recommended? If so, what about adding it to configure.ac?
>>>> I've been meaning to follow up on this too.  I don't know the answer,
>>>> but pinging Roland in case he does.
>>> I guess it is still recommended.
>>> Ideally this is really something which should be fixed in gcc - the
>>> compiler has all the knowledge about fixed alignment and size (if any)
>>> (and more importantly knows if only a binary answer is needed which
>>> makes this much easier) and doesn't need to do any function call.
>>> If you enable that flag and some platform just has the same primitive
>>> repz cmpsb sequence in the system library it will just get even slower,
>>> though I guess chances of that happening are slim (with the possible
>>> exception of windows).
>>> I think in most cases it won't make much difference, so nobody cared to
>>> implement that change. It is most likely still a good idea unless gcc
>>> addressed that in the meantime...
>> Hmm, it seemed like it made a big difference in the earlier
>> discussion...
> Yes for llvmpipe and one app at least.
> But that struct being compared there is most likely the biggest (by far)
> anywhere (at least which is compared in a regular fashion).
>> I should take a look at reducing the size of the struct (as mentioned
>> before), but surely there's some way to pull in a better memcmp??
> Well, apart from using -fno-builtin-memcmp we could build our own
> memcmpxx, though the version I did there (returning binary only result
> and assuming 32bit alignment/size allowing gcc to optimize it) was still
> slower for large sizes than -fno-builtin-memcmp. Of course we could
> optimize it more (e.g. for 64bit aligned/sized things, or using
> hand-coded sse2 versions using 128bit at-a-time comparisons) but then it
> gets more complicated, so I wasn't sure it was worth it.
> For reference here are the earlier numbers (ipers with llvmpipe):
> original ipers: 12.1 fps
> optimized struct compare: 16.8 fps
> -fno-builtin-memcmp: 18.1 fps
> And this was the function I used for getting the numbers:
> static INLINE int util_cmp_struct(const void *src1, const void *src2,
> unsigned count)
> {
>   /* hmm pointer casting is evil */
>   const uint32_t *src1_ptr = (const uint32_t *)src1;
>   const uint32_t *src2_ptr = (const uint32_t *)src2;
>   unsigned i;
>   assert(count % 4 == 0);
>   for (i = 0; i < count/4; i++) {
>     if (*src1_ptr != *src2_ptr) {
>       return 1;
>     }
>     src1_ptr++;
>     src2_ptr++;
>   }
>   return 0;
> }

Oh, I also forgot to mention that a proper version would probably get a
bit more complicated; this was just a quick hack. We might only use this
to compare structs, but IIRC they are not guaranteed to always have a
size or alignment which is a multiple of 32 bits (I think it is legal
for a struct which only has, say, shorts in it to be only 16-bit aligned
and for its size to be a multiple of 16 bits). Of course we could just
handle those cases explicitly, and the compiler should (hopefully) be
smart enough to optimize the extra checks away when the size is known at
compile time (well, unless optimization is disabled).
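
As a rough illustration of what coding this explicitly might look like,
here is a minimal sketch. The name cmp_struct_any_size is hypothetical
and not something that exists in Mesa. It assumes only an equal/not-equal
answer is needed, loads 32 bits at a time through memcpy so that no
particular alignment is required, and finishes with a byte loop for sizes
that are not a multiple of 4. This has not been benchmarked, so whether
it comes anywhere near -fno-builtin-memcmp is an open question.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper (not part of Mesa): returns 0 if the two buffers
 * are byte-wise equal, 1 otherwise. No ordering information is given. */
static inline int cmp_struct_any_size(const void *src1, const void *src2,
                                      size_t count)
{
   const unsigned char *p1 = (const unsigned char *)src1;
   const unsigned char *p2 = (const unsigned char *)src2;
   size_t i;

   /* Main loop: compare 32 bits at a time. Using memcpy for the loads
    * avoids assuming 32-bit alignment and avoids pointer-cast aliasing
    * issues; compilers typically turn a fixed 4-byte memcpy into a
    * plain load when optimization is enabled. */
   for (i = 0; i + 4 <= count; i += 4) {
      uint32_t a, b;
      memcpy(&a, p1 + i, sizeof a);
      memcpy(&b, p2 + i, sizeof b);
      if (a != b)
         return 1;
   }

   /* Tail: handle the remaining 0..3 bytes, e.g. for a struct whose
    * size is only a multiple of 16 bits. */
   for (; i < count; i++) {
      if (p1[i] != p2[i])
         return 1;
   }
   return 0;
}
```

When count is a compile-time constant such as sizeof(struct foo), the
tail loop has a known trip count, so an optimizing compiler has a good
chance of dropping it entirely for sizes that are a multiple of 4.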

