[Mesa-dev] R: Re: Building with -fno-builtin-memcmp for improved performance

Wed Sep 21 08:15:10 PDT 2011

On 21.09.2011 14:24, Fabio wrote:
> 
> 
>> ----Original message----
>> From: keithw at vmware.com
>> Date: 20/09/2011 16.45
>> To: "Roland Scheidegger"<sroland at vmware.com>
>> Cc: "Fabio"<fabio.ped at libero.it>, <mesa-dev at lists.freedesktop.org>
>> Subject: Re: [Mesa-dev] Building with -fno-builtin-memcmp for improved performance
>>
>> On Tue, 2011-09-20 at 16:35 +0200, Roland Scheidegger wrote:
>>> On 20.09.2011 16:15, Keith Whitwell wrote:
>>>> On Tue, 2011-09-20 at 16:02 +0200, Roland Scheidegger wrote:
>>>>> On 20.09.2011 12:35, Keith Whitwell wrote:
>>>>>> On Tue, 2011-09-20 at 10:59 +0200, Fabio wrote:
>>>>>>> There was a discussion some months ago about using
>>>>>>> -fno-builtin-memcmp for improving memcmp performance:
>>>>>>> http://lists.freedesktop.org/archives/mesa-dev/2011-June/009078.html
>>>>>>>
>>>>>>> Since then, has it been properly addressed in Mesa, or is the flag
>>>>>>> still recommended? If so, what about adding it to configure.ac?
>>>>>>
>>>>>> I've been meaning to follow up on this too.  I don't know the answer,
>>>>>> but pinging Roland in case he does.
>>>>>
>>>>> I guess it is still recommended.
>>>>> Ideally this should really be fixed in gcc: the compiler knows the
>>>>> alignment and size when they are fixed and, more importantly, knows
>>>>> when only an equal/not-equal answer is needed, which makes this much
>>>>> easier, and it doesn't need to make a function call.
>>>>> If you enable that flag on a platform whose system library uses the
>>>>> same primitive repz cmpsb sequence, it will just get even slower,
>>>>> though I guess the chances of that happening are slim (with the
>>>>> possible exception of Windows).
>>>>> I think in most cases it won't make much difference, so nobody cared to
>>>>> implement that change. It is most likely still a good idea unless gcc
>>>>> addressed that in the meantime...
>>>>
>>>> Hmm, it seemed like it made a big difference in the earlier
>>>> discussion...
>>> Yes for llvmpipe and one app at least.
>>> But the struct being compared there is most likely by far the biggest
>>> anywhere (at least among those that are compared regularly).
>>>
>>>> I should take a look at reducing the size of the struct (as mentioned
>>>> before), but surely there's some way to pull in a better memcmp??
>>>
>>> Well, apart from using -fno-builtin-memcmp we could build our own
>>> memcmpxx, though the version I did there (returning binary only result
>>> and assuming 32bit alignment/size allowing gcc to optimize it) was still
>>> slower for large sizes than -fno-builtin-memcmp. Of course we could
>>> optimize it more (e.g. for 64bit aligned/sized things, or using
>>> hand-coded sse2 versions using 128bit at-a-time comparisons) but then it
>>> gets more complicated, so I wasn't sure it was worth it.
>>>
>>> For reference here are the earlier numbers (ipers with llvmpipe):
>>> original ipers: 12.1 fps
>>> optimized struct compare: 16.8 fps
>>> -fno-builtin-memcmp: 18.1 fps
>>>
>>> And this was the function I used for getting the numbers:
>>>
>>> static INLINE int util_cmp_struct(const void *src1, const void *src2,
>>> unsigned count)
>>> {
>>>   /* hmm pointer casting is evil */
>>>   const uint32_t *src1_ptr = (const uint32_t *)src1;
>>>   const uint32_t *src2_ptr = (const uint32_t *)src2;
>>>   unsigned i;
>>>   assert(count % 4 == 0);
>>>   for (i = 0; i < count/4; i++) {
>>>     if (*src1_ptr != *src2_ptr) {
>>>       return 1;
>>>     }
>>>     src1_ptr++;
>>>     src2_ptr++;
>>>   }
>>>   return 0;
>>> }
>>
>> OK, maybe the first thing to do is fix the compared struct, then let's
>> see if there's anything significant left for a better memcmp to extract.
>>
>> I can find some time to do that in the next few days.
> 
> Was this problem ever reported to gcc devs BTW? I did a quick search and 
> didn't find anything. Maybe it could be properly fixed there.

There was already a bug about slow memcmp in the gcc bugzilla (referenced
in the earlier thread), so there is no need to report our own:
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43052

Roland
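
For illustration only, here is a rough sketch of the "64bit aligned/sized"
variant mentioned in the quoted message. It is not code from Mesa or from
this thread: the name cmp_struct_equal64 is hypothetical, and it assumes the
caller pads its structs to a multiple of 8 bytes. It returns only an
equal/not-equal answer, like util_cmp_struct above, and loads each word
through a fixed-size memcpy to sidestep the pointer casting.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper, not part of Mesa: returns 0 if the two buffers are
 * equal and 1 otherwise. Unlike memcmp it gives no ordering information. */
static inline int cmp_struct_equal64(const void *a, const void *b, size_t size)
{
   const unsigned char *pa = (const unsigned char *)a;
   const unsigned char *pb = (const unsigned char *)b;
   size_t i;

   /* Assumption: callers pad their structs to a multiple of 8 bytes. */
   assert(size % sizeof(uint64_t) == 0);

   for (i = 0; i < size; i += sizeof(uint64_t)) {
      uint64_t wa, wb;
      /* A fixed-size memcpy loads a word without relying on pointer
       * casts; optimizing compilers usually reduce it to a plain load. */
      memcpy(&wa, pa + i, sizeof wa);
      memcpy(&wb, pb + i, sizeof wb);
      if (wa != wb)
         return 1;
   }
   return 0;
}
```

Whether something like this would beat the libc memcmp reached via
-fno-builtin-memcmp would have to be measured; the numbers quoted above
cover only the 32bit version.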