[Mesa-dev] [PATCH] llvmpipe: Optimize new fs state setup

Thomas Hellstrom thellstrom at vmware.com
Thu Jun 30 00:56:56 PDT 2011


Hmm.
Forgive my ignorance, but isn't memcmp() on structs pretty prone to giving 
incorrect != results, given that there may be padding between members in 
structs, and that IIRC gcc struct assignment is member-wise?

What happens if there's padding between the jit_context and variant 
members of struct lp_rast_state?

I seem to recall hitting similar issues a number of times in the past.

/Thomas
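The padding hazard Thomas describes can be demonstrated with a minimal sketch (a hypothetical struct, not the actual lp_rast_state; whether a hole exists depends on the ABI):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical struct with a hole: most 64-bit ABIs insert 4 bytes of
 * padding after 'tag' so that 'ptr' is 8-byte aligned. */
struct padded {
   uint32_t tag;
   void *ptr;
};

/* Returns 1 if two member-wise equal values still memcmp() as different. */
static int padding_can_break_memcmp(void)
{
   struct padded a, b;
   memset(&a, 0xAA, sizeof a);   /* fill any padding with different garbage */
   memset(&b, 0x55, sizeof b);
   a.tag = b.tag = 42;           /* member-wise stores: members now equal, */
   a.ptr = b.ptr = NULL;         /* but padding bytes are left untouched   */
   return memcmp(&a, &b, sizeof a) != 0;
}
```

On a target where the struct has a hole, the function returns 1 even though every member compares equal; on a target with no padding it returns 0.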

On 06/30/2011 03:36 AM, Roland Scheidegger wrote:
> Ok in fact there's a gcc bug about memcmp:
> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43052
> In short gcc's memcmp builtin is totally lame and loses to glibc's
> memcmp (including call overhead, no knowledge about alignment etc.) even
> when comparing only very few bytes (and loses BIG time for lots of bytes
> to compare). Oops. Well at least if the strings are the same (I'd guess
> if the first byte is different it's hard to beat the gcc builtin...).
> So this is really a gcc bug. The bug is quite old though with no fix in
> sight apparently so might need to think about some workaround (but just
> not doing the comparison doesn't look like the right idea, since
> apparently it would be faster with the comparison if gcc's memcmp got
> fixed).
>
>
> Roland
>
>
>
> Am 30.06.2011 01:47, schrieb Roland Scheidegger:
>    
>> I didn't even look at that was just curious why the memcmp (which is
>> used a lot in other places) is slow. However, none of the other memcmp
>> seem to show up prominently (cso functions are quite low in profiles,
>> _mesa_search_program_cache uses memcmp too but it's not that high
>> either). So I guess those functions either aren't called that often or
>> the sizes they compare are small.
>> So should maybe just file a gcc bug for memcmp and look at that
>> particular llvmpipe issue again :-).
>>
>> Roland
>>
>> Am 30.06.2011 01:16, schrieb Corbin Simpson:
>>      
>>> Okay, so maybe I'm failing to recognize the exact situation here, but
>>> wouldn't it be possible to mark the FS state with a serial number and
>>> just compare those? Or are these FS states not CSO-cached?
>>>
>>> ~ C.
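Corbin's serial-number idea would look roughly like this (hypothetical names; llvmpipe's actual CSO caching works differently):

```c
#include <stdint.h>

/* Hypothetical: stamp each fs state variant with a unique serial when it
 * is created, then compare serials instead of memcmp'ing whole structs. */
struct fs_state {
   uint64_t serial;   /* assigned once at creation, never changes */
   /* ... the actual state members would follow ... */
};

static uint64_t fs_serial_counter;

static void fs_state_init(struct fs_state *s)
{
   s->serial = ++fs_serial_counter;   /* O(1) identity for later compares */
}

/* O(1) replacement for a byte-wise compare over the whole struct. */
static int fs_state_same(const struct fs_state *a, const struct fs_state *b)
{
   return a->serial == b->serial;
}
```

The trade-off is that two states built independently with identical contents get different serials, so this only works when states are deduplicated (cached) at creation time.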
>>>
>>> On Wed, Jun 29, 2011 at 3:44 PM, Roland Scheidegger <sroland at vmware.com> wrote:
>>>        
>>>> Actually I ran some numbers here and tried out an optimized struct compare:
>>>> original ipers: 12.1 fps
>>>> ajax patch: 15.5 fps
>>>> optimized struct compare: 16.8 fps
>>>>
>>>>
>>>> This is the function I used for that (just enabled in that lp_setup
>>>> function):
>>>>
>>>> static INLINE int util_cmp_struct(const void *src1, const void *src2,
>>>>                                   unsigned count)
>>>> {
>>>>   /* hmm pointer casting is evil */
>>>>   const uint32_t *src1_ptr = (const uint32_t *)src1;
>>>>   const uint32_t *src2_ptr = (const uint32_t *)src2;
>>>>   unsigned i;
>>>>   assert(count % 4 == 0);
>>>>   for (i = 0; i < count / 4; i++) {
>>>>     if (*src1_ptr != *src2_ptr) {
>>>>       return 1;
>>>>     }
>>>>     src1_ptr++;
>>>>     src2_ptr++;
>>>>   }
>>>>   return 0;
>>>> }
>>>>
>>>> (And no this doesn't use repz cmpsd here.)
>>>>
>>>> So, unless I made some mistake, memcmp is just dead slow (*), most of
>>>> the slowness probably coming from the bytewise comparison (and
>>>> apparently I was wrong in assuming the comparison there might never be
>>>> the same for ipers).
>>>> Of course, the optimized struct compare relies on structs really being
>>>> dword aligned (I think this is always the case), and additionally it
>>>> relies on the struct size being a whole multiple of dwords - likely
>>>> the struct needs padding to ensure that (at least I don't think this
>>>> is always guaranteed for all structs).
>>>> But since memcmp is used extensively (cso for one) maybe some
>>>> optimization along these lines might be worth it (though of course for
>>>> small structs the win isn't going to be as big - and can't beat the repz
>>>> cmps in code size...).
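The dword-multiple requirement mentioned above could be enforced at compile time, e.g. (a sketch using C11 _Static_assert on a made-up struct; Gallium's own assert macros may differ):

```c
#include <stdint.h>

/* Compile-time check that a struct's size is a whole number of dwords,
 * so a word-wise compare like util_cmp_struct never reads a partial
 * trailing word. Older C would use the negative-array-size trick instead. */
struct example_state {
   uint32_t a;
   uint32_t b;
   float c[4];
};

_Static_assert(sizeof(struct example_state) % sizeof(uint32_t) == 0,
               "struct size must be a multiple of 4 for word-wise compare");
```

If someone later adds, say, a lone uint8_t member, the build fails at this line instead of the compare silently reading past the struct.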
>>>>
>>>> Roland
>>>>
>>>> (*) I actually found some references saying some implementations do
>>>> better: they don't just use repz cmpsb but split the comparison up into
>>>> parts, doing dword (or even qword - well, for really large structs they
>>>> could use sse2) comparisons where possible and byte comparisons only for
>>>> the remaining bytes (and if the compiler did that it would probably know
>>>> the size at compile time here, hence could leave out much of the code).
>>>> Of course memcmp requires that the return value isn't just a true or
>>>> false value, hence more code is needed once an unequal dword is found,
>>>> though the compiler could optimize that away too in case it's not
>>>> needed. Much the same way memcpy is usually optimized really, so blame
>>>> gcc :-).
>>>>
>>>>
>>>>
>>>> Am 29.06.2011 20:33, schrieb Roland Scheidegger:
>>>>          
>>>>> Ohh that's interesting, you'd think the comparison shouldn't be that
>>>>> expensive (though I guess in ipers' case the comparison is never true).
>>>>> memcmp is used quite extensively everywhere. Maybe we could replace it
>>>>> with something faster (since we only ever care whether the blocks are
>>>>> the same, not about the lexicographic ordering, and always compare
>>>>> whole structs, we could compare dwords instead of bytes for a 4x
>>>>> speedup)? Or isn't that the reason cmpsb instead of cmpsd is used?
>>>>> Also I guess it would help if the values which are more likely to be
>>>>> unequal are first in the struct (if we can tell that).
>>>>> Of course though if it's unlikely to be the same as the compared value
>>>>> anyway not comparing at all still might be a win (here).
>>>>>
>>>>> Roland
>>>>>
>>>>> Am 29.06.2011 19:19, schrieb Adam Jackson:
>>>>>            
>>>>>> Perversely, do this by eliminating the comparison between stored and
>>>>>> current fs state.  On ipers, a perf trace showed try_update_scene_state
>>>>>> using 31% of a CPU, and 98% of that was in 'repz cmpsb', ie, the memcmp.
>>>>>> Taking that out takes try_update_scene_state down to 6.5% of the
>>>>>> profile; more importantly, ipers goes from 10 to 14fps and gears goes
>>>>>> from 790 to 830fps.
>>>>>>
>>>>>> Signed-off-by: Adam Jackson <ajax at redhat.com>
>>>>>> ---
>>>>>>   src/gallium/drivers/llvmpipe/lp_setup.c |   61 ++++++++++++++-----------------
>>>>>>   1 files changed, 27 insertions(+), 34 deletions(-)
>>>>>>
>>>>>> diff --git a/src/gallium/drivers/llvmpipe/lp_setup.c b/src/gallium/drivers/llvmpipe/lp_setup.c
>>>>>> index cbe06e5..9118db5 100644
>>>>>> --- a/src/gallium/drivers/llvmpipe/lp_setup.c
>>>>>> +++ b/src/gallium/drivers/llvmpipe/lp_setup.c
>>>>>> @@ -839,42 +839,35 @@ try_update_scene_state( struct lp_setup_context *setup )
>>>>>>         setup->dirty |= LP_SETUP_NEW_FS;
>>>>>>      }
>>>>>>
>>>>>> -
>>>>>>      if (setup->dirty & LP_SETUP_NEW_FS) {
>>>>>> -      if (!setup->fs.stored ||
>>>>>> -          memcmp(setup->fs.stored,
>>>>>> -                 &setup->fs.current,
>>>>>> -                 sizeof setup->fs.current) != 0)
>>>>>> -      {
>>>>>> -         struct lp_rast_state *stored;
>>>>>> -         uint i;
>>>>>> -
>>>>>> -         /* The fs state that's been stored in the scene is different from
>>>>>> -          * the new, current state.  So allocate a new lp_rast_state object
>>>>>> -          * and append it to the bin's setup data buffer.
>>>>>> -          */
>>>>>> -         stored = (struct lp_rast_state *) lp_scene_alloc(scene, sizeof *stored);
>>>>>> -         if (!stored) {
>>>>>> -            assert(!new_scene);
>>>>>> -            return FALSE;
>>>>>> -         }
>>>>>> +      struct lp_rast_state *stored;
>>>>>> +      uint i;
>>>>>> +
>>>>>> +      /* The fs state that's been stored in the scene is different from
>>>>>> +       * the new, current state.  So allocate a new lp_rast_state object
>>>>>> +       * and append it to the bin's setup data buffer.
>>>>>> +       */
>>>>>> +      stored = (struct lp_rast_state *) lp_scene_alloc(scene, sizeof *stored);
>>>>>> +      if (!stored) {
>>>>>> +         assert(!new_scene);
>>>>>> +         return FALSE;
>>>>>> +      }
>>>>>>
>>>>>> -         memcpy(stored,
>>>>>> -                &setup->fs.current,
>>>>>> -                sizeof setup->fs.current);
>>>>>> -         setup->fs.stored = stored;
>>>>>> -
>>>>>> -         /* The scene now references the textures in the rasterization
>>>>>> -          * state record.  Note that now.
>>>>>> -          */
>>>>>> -         for (i = 0; i < Elements(setup->fs.current_tex); i++) {
>>>>>> -            if (setup->fs.current_tex[i]) {
>>>>>> -               if (!lp_scene_add_resource_reference(scene,
>>>>>> -                                                    setup->fs.current_tex[i],
>>>>>> -                                                    new_scene)) {
>>>>>> -                  assert(!new_scene);
>>>>>> -                  return FALSE;
>>>>>> -               }
>>>>>> +      memcpy(stored,
>>>>>> +             &setup->fs.current,
>>>>>> +             sizeof setup->fs.current);
>>>>>> +      setup->fs.stored = stored;
>>>>>> +
>>>>>> +      /* The scene now references the textures in the rasterization
>>>>>> +       * state record.  Note that now.
>>>>>> +       */
>>>>>> +      for (i = 0; i < Elements(setup->fs.current_tex); i++) {
>>>>>> +         if (setup->fs.current_tex[i]) {
>>>>>> +            if (!lp_scene_add_resource_reference(scene,
>>>>>> +                                                 setup->fs.current_tex[i],
>>>>>> +                                                 new_scene)) {
>>>>>> +               assert(!new_scene);
>>>>>> +               return FALSE;
>>>>>>               }
>>>>>>            }
>>>>>>         }
>>>>>>              
>>>>> _______________________________________________
>>>>> mesa-dev mailing list
>>>>> mesa-dev at lists.freedesktop.org
>>>>> http://lists.freedesktop.org/mailman/listinfo/mesa-dev
>>>>>            
>>>
>>>
>>>        
>>      



More information about the mesa-dev mailing list