[Mesa-dev] [PATCH] llvmpipe: Optimize new fs state setup

Wed Jun 29 18:36:41 PDT 2011

Ok in fact there's a gcc bug about memcmp:
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43052
In short gcc's memcmp builtin is totally lame and loses to glibc's
memcmp (including call overhead, no knowledge about alignment etc.) even
when comparing only very few bytes (and loses BIG time for lots of bytes
to compare). Oops. Well at least if the strings are the same (I'd guess
if the first byte is different it's hard to beat the gcc builtin...).
So this is really a gcc bug. The bug is quite old though with no fix in
sight apparently so might need to think about some workaround (but just
not doing the comparison doesn't look like the right idea, since
apparently it would be faster with the comparison if gcc's memcmp got
fixed).

Roland

Am 30.06.2011 01:47, schrieb Roland Scheidegger:
> I didn't even look at that was just curious why the memcmp (which is
> used a lot in other places) is slow. However, none of the other memcmp
> seem to show up prominently (cso functions are quite low in profiles,
> _mesa_search_program_cache uses memcmp too but it's not that high
> neither). So I guess those functions either aren't called that often or
> the sizes they compare are small.
> So should maybe just file a gcc bug for memcmp and look at that
> particular llvmpipe issue again :-).
> 
> Roland
> 
> Am 30.06.2011 01:16, schrieb Corbin Simpson:
>> Okay, so maybe I'm failing to recognize the exact situation here, but
>> wouldn't it be possible to mark the FS state with a serial number and
>> just compare those? Or are these FS states not CSO-cached?
>>
>> ~ C.
>>
>> On Wed, Jun 29, 2011 at 3:44 PM, Roland Scheidegger <sroland at vmware.com> wrote:
>>> Actually I ran some numbers here and tried out a optimized struct compare:
>>> original ipers: 12.1 fps
>>> ajax patch: 15.5 fps
>>> optimized struct compare: 16.8 fps
>>>
>>>
>>> This is the function I used for that (just enabled in that lp_setup
>>> function):
>>>
>>> static INLINE int util_cmp_struct(const void *src1, const void *src2,
>>> unsigned count)
>>> {
>>>  /* hmm pointer casting is evil */
>>>  const uint32_t *src1_ptr = (uint32_t *)src1;
>>>  const uint32_t *src2_ptr = (uint32_t *)src2;
>>>  unsigned i;
>>>  assert(count % 4 == 0);
>>>  for (i = 0; i < count/4; i++) {
>>>    if (*src1_ptr != *src2_ptr) {
>>>      return 1;
>>>    }
>>>    src1_ptr++;
>>>    src2_ptr++;
>>>  }
>>>  return 0;
>>> }
>>>
>>> (And no this doesn't use repz cmpsd here.)
>>>
>>> So, unless I made some mistake, memcmp is just dead slow (*), most of
>>> the slowness probably coming from the bytewise comparison (and
>>> apparently I was wrong in assuming the comparison there might never be
>>> the same for ipers).
>>> Of course, the optimized struct compare relies on structs really being
>>> dword aligned (I think this is always the case), and additionally it
>>> relies on the struct size being a whole multiple of dwords - likely
>>> struct needs padding to ensure that (at least I don't think this is
>>> always guaranteed for all structs).
>>> But since memcmp is used extensively (cso for one) maybe some
>>> optimization along these lines might be worth it (though of course for
>>> small structs the win isn't going to be as big - and can't beat the repz
>>> cmps in code size...).
>>>
>>> Roland
>>>
>>> (*) I actually found some references some implementations might be
>>> better they don't just use repz cmpsb but they split this up in parts
>>> which do dword (or qword even - well for really large structs could use
>>> sse2) comparisons for the parts where it's possible and only byte
>>> comparisons for the remaining bytes (and if the compiler does that it
>>> probably would know the size at compile time here hence could leave out
>>> much of the code). Of course memcmp requires that the return value isn't
>>> just a true or false value, hence there's more code needed once an
>>> unequal dword is found, though the compiler could optimize that away too
>>> in case it's not needed. Much the same as memcpy is optimized usually
>>> really, so blame gcc :-).
>>>
>>>
>>>
>>> Am 29.06.2011 20:33, schrieb Roland Scheidegger:
>>>> Ohh that's interesting, you'd think the comparison shouldn't be that
>>>> expensive (though I guess in ipers case the comparison is never true).
>>>> memcmp is quite extensively used everywhere. Maybe we could replace that
>>>> with something faster (since we only ever care if the blocks are the
>>>> same but not care about the lexographic ordering and always compare
>>>> whole structs, should compare dwords instead of bytes for a 4 time
>>>> speedup)? Or isn't that the reason cmpsb instead of cmpsd is used?
>>>> Also I guess it would help if the values which are more likely to be
>>>> unequal are first in the struct (if we can tell that).
>>>> Of course though if it's unlikely to be the same as the compared value
>>>> anyway not comparing at all still might be a win (here).
>>>>
>>>> Roland
>>>>
>>>> Am 29.06.2011 19:19, schrieb Adam Jackson:
>>>>> Perversely, do this by eliminating the comparison between stored and
>>>>> current fs state.  On ipers, a perf trace showed try_update_scene_state
>>>>> using 31% of a CPU, and 98% of that was in 'repz cmpsb', ie, the memcmp.
>>>>> Taking that out takes try_update_scene_state down to 6.5% of the
>>>>> profile; more importantly, ipers goes from 10 to 14fps and gears goes
>>>>> from 790 to 830fps.
>>>>>
>>>>> Signed-off-by: Adam Jackson <ajax at redhat.com>
>>>>> ---
>>>>>  src/gallium/drivers/llvmpipe/lp_setup.c |   61 ++++++++++++++-----------------
>>>>>  1 files changed, 27 insertions(+), 34 deletions(-)
>>>>>
>>>>> diff --git a/src/gallium/drivers/llvmpipe/lp_setup.c b/src/gallium/drivers/llvmpipe/lp_setup.c
>>>>> index cbe06e5..9118db5 100644
>>>>> --- a/src/gallium/drivers/llvmpipe/lp_setup.c
>>>>> +++ b/src/gallium/drivers/llvmpipe/lp_setup.c
>>>>> @@ -839,42 +839,35 @@ try_update_scene_state( struct lp_setup_context *setup )
>>>>>        setup->dirty |= LP_SETUP_NEW_FS;
>>>>>     }
>>>>>
>>>>> -
>>>>>     if (setup->dirty & LP_SETUP_NEW_FS) {
>>>>> -      if (!setup->fs.stored ||
>>>>> -          memcmp(setup->fs.stored,
>>>>> -                 &setup->fs.current,
>>>>> -                 sizeof setup->fs.current) != 0)
>>>>> -      {
>>>>> -         struct lp_rast_state *stored;
>>>>> -         uint i;
>>>>> -
>>>>> -         /* The fs state that's been stored in the scene is different from
>>>>> -          * the new, current state.  So allocate a new lp_rast_state object
>>>>> -          * and append it to the bin's setup data buffer.
>>>>> -          */
>>>>> -         stored = (struct lp_rast_state *) lp_scene_alloc(scene, sizeof *stored);
>>>>> -         if (!stored) {
>>>>> -            assert(!new_scene);
>>>>> -            return FALSE;
>>>>> -         }
>>>>> +      struct lp_rast_state *stored;
>>>>> +      uint i;
>>>>> +
>>>>> +      /* The fs state that's been stored in the scene is different from
>>>>> +       * the new, current state.  So allocate a new lp_rast_state object
>>>>> +       * and append it to the bin's setup data buffer.
>>>>> +       */
>>>>> +      stored = (struct lp_rast_state *) lp_scene_alloc(scene, sizeof *stored);
>>>>> +      if (!stored) {
>>>>> +         assert(!new_scene);
>>>>> +         return FALSE;
>>>>> +      }
>>>>>
>>>>> -         memcpy(stored,
>>>>> -                &setup->fs.current,
>>>>> -                sizeof setup->fs.current);
>>>>> -         setup->fs.stored = stored;
>>>>> -
>>>>> -         /* The scene now references the textures in the rasterization
>>>>> -          * state record.  Note that now.
>>>>> -          */
>>>>> -         for (i = 0; i < Elements(setup->fs.current_tex); i++) {
>>>>> -            if (setup->fs.current_tex[i]) {
>>>>> -               if (!lp_scene_add_resource_reference(scene,
>>>>> -                                                    setup->fs.current_tex[i],
>>>>> -                                                    new_scene)) {
>>>>> -                  assert(!new_scene);
>>>>> -                  return FALSE;
>>>>> -               }
>>>>> +      memcpy(stored,
>>>>> +             &setup->fs.current,
>>>>> +             sizeof setup->fs.current);
>>>>> +      setup->fs.stored = stored;
>>>>> +
>>>>> +      /* The scene now references the textures in the rasterization
>>>>> +       * state record.  Note that now.
>>>>> +       */
>>>>> +      for (i = 0; i < Elements(setup->fs.current_tex); i++) {
>>>>> +         if (setup->fs.current_tex[i]) {
>>>>> +            if (!lp_scene_add_resource_reference(scene,
>>>>> +                                                 setup->fs.current_tex[i],
>>>>> +                                                 new_scene)) {
>>>>> +               assert(!new_scene);
>>>>> +               return FALSE;
>>>>>              }
>>>>>           }
>>>>>        }
>>>>
>>>> _______________________________________________
>>>> mesa-dev mailing list
>>>> mesa-dev at lists.freedesktop.org
>>>> http://lists.freedesktop.org/mailman/listinfo/mesa-dev
>>>
>>> _______________________________________________
>>> mesa-dev mailing list
>>> mesa-dev at lists.freedesktop.org
>>> http://lists.freedesktop.org/mailman/listinfo/mesa-dev
>>>
>>
>>
>>
>