[Mesa-dev] [PATCH] llvmpipe: Optimize new fs state setup

Jose Fonseca jfonseca at vmware.com
Thu Jun 30 03:06:23 PDT 2011


Great work Roland! And thanks Ajax to finding this hot spot.

We use memcmp a lot -- all CSO caching, so we should use this everywhere.

We should also code a sse2 version with intrinsics for x86-64, which is guaranteed to always have SSE2.

Jose

----- Original Message -----
> Actually I ran some numbers here and tried out a optimized struct
> compare:
> original ipers: 12.1 fps
> ajax patch: 15.5 fps
> optimized struct compare: 16.8 fps
> 
> 
> This is the function I used for that (just enabled in that lp_setup
> function):
> 
> static INLINE int util_cmp_struct(const void *src1, const void *src2,
> unsigned count)
> {
>   /* hmm pointer casting is evil */
>   const uint32_t *src1_ptr = (uint32_t *)src1;
>   const uint32_t *src2_ptr = (uint32_t *)src2;
>   unsigned i;
>   assert(count % 4 == 0);
>   for (i = 0; i < count/4; i++) {
>     if (*src1_ptr != *src2_ptr) {
>       return 1;
>     }
>     src1_ptr++;
>     src2_ptr++;
>   }
>   return 0;
> }
> 
> (And no this doesn't use repz cmpsd here.)
> 
> So, unless I made some mistake, memcmp is just dead slow (*), most of
> the slowness probably coming from the bytewise comparison (and
> apparently I was wrong in assuming the comparison there might never
> be
> the same for ipers).
> Of course, the optimized struct compare relies on structs really
> being
> dword aligned (I think this is always the case), and additionally it
> relies on the struct size being a whole multiple of dwords - likely
> struct needs padding to ensure that (at least I don't think this is
> always guaranteed for all structs).
> But since memcmp is used extensively (cso for one) maybe some
> optimization along these lines might be worth it (though of course
> for
> small structs the win isn't going to be as big - and can't beat the
> repz
> cmps in code size...).
> 
> Roland
> 
> (*) I actually found some references some implementations might be
> better they don't just use repz cmpsb but they split this up in parts
> which do dword (or qword even - well for really large structs could
> use
> sse2) comparisons for the parts where it's possible and only byte
> comparisons for the remaining bytes (and if the compiler does that it
> probably would know the size at compile time here hence could leave
> out
> much of the code). Of course memcmp requires that the return value
> isn't
> just a true or false value, hence there's more code needed once an
> unequal dword is found, though the compiler could optimize that away
> too
> in case it's not needed. Much the same as memcpy is optimized usually
> really, so blame gcc :-).
> 
> 
> 
> Am 29.06.2011 20:33, schrieb Roland Scheidegger:
> > Ohh that's interesting, you'd think the comparison shouldn't be
> > that
> > expensive (though I guess in ipers case the comparison is never
> > true).
> > memcmp is quite extensively used everywhere. Maybe we could replace
> > that
> > with something faster (since we only ever care if the blocks are
> > the
> > same but not care about the lexographic ordering and always compare
> > whole structs, should compare dwords instead of bytes for a 4 time
> > speedup)? Or isn't that the reason cmpsb instead of cmpsd is used?
> > Also I guess it would help if the values which are more likely to
> > be
> > unequal are first in the struct (if we can tell that).
> > Of course though if it's unlikely to be the same as the compared
> > value
> > anyway not comparing at all still might be a win (here).
> > 
> > Roland
> > 
> > Am 29.06.2011 19:19, schrieb Adam Jackson:
> >> Perversely, do this by eliminating the comparison between stored
> >> and
> >> current fs state.  On ipers, a perf trace showed
> >> try_update_scene_state
> >> using 31% of a CPU, and 98% of that was in 'repz cmpsb', ie, the
> >> memcmp.
> >> Taking that out takes try_update_scene_state down to 6.5% of the
> >> profile; more importantly, ipers goes from 10 to 14fps and gears
> >> goes
> >> from 790 to 830fps.
> >>
> >> Signed-off-by: Adam Jackson <ajax at redhat.com>
> >> ---
> >>  src/gallium/drivers/llvmpipe/lp_setup.c |   61
> >>  ++++++++++++++-----------------
> >>  1 files changed, 27 insertions(+), 34 deletions(-)
> >>
> >> diff --git a/src/gallium/drivers/llvmpipe/lp_setup.c
> >> b/src/gallium/drivers/llvmpipe/lp_setup.c
> >> index cbe06e5..9118db5 100644
> >> --- a/src/gallium/drivers/llvmpipe/lp_setup.c
> >> +++ b/src/gallium/drivers/llvmpipe/lp_setup.c
> >> @@ -839,42 +839,35 @@ try_update_scene_state( struct
> >> lp_setup_context *setup )
> >>        setup->dirty |= LP_SETUP_NEW_FS;
> >>     }
> >>  
> >> -
> >>     if (setup->dirty & LP_SETUP_NEW_FS) {
> >> -      if (!setup->fs.stored ||
> >> -          memcmp(setup->fs.stored,
> >> -                 &setup->fs.current,
> >> -                 sizeof setup->fs.current) != 0)
> >> -      {
> >> -         struct lp_rast_state *stored;
> >> -         uint i;
> >> -
> >> -         /* The fs state that's been stored in the scene is
> >> different from
> >> -          * the new, current state.  So allocate a new
> >> lp_rast_state object
> >> -          * and append it to the bin's setup data buffer.
> >> -          */
> >> -         stored = (struct lp_rast_state *) lp_scene_alloc(scene,
> >> sizeof *stored);
> >> -         if (!stored) {
> >> -            assert(!new_scene);
> >> -            return FALSE;
> >> -         }
> >> +      struct lp_rast_state *stored;
> >> +      uint i;
> >> +
> >> +      /* The fs state that's been stored in the scene is
> >> different from
> >> +       * the new, current state.  So allocate a new lp_rast_state
> >> object
> >> +       * and append it to the bin's setup data buffer.
> >> +       */
> >> +      stored = (struct lp_rast_state *) lp_scene_alloc(scene,
> >> sizeof *stored);
> >> +      if (!stored) {
> >> +         assert(!new_scene);
> >> +         return FALSE;
> >> +      }
> >>  
> >> -         memcpy(stored,
> >> -                &setup->fs.current,
> >> -                sizeof setup->fs.current);
> >> -         setup->fs.stored = stored;
> >> -
> >> -         /* The scene now references the textures in the
> >> rasterization
> >> -          * state record.  Note that now.
> >> -          */
> >> -         for (i = 0; i < Elements(setup->fs.current_tex); i++) {
> >> -            if (setup->fs.current_tex[i]) {
> >> -               if (!lp_scene_add_resource_reference(scene,
> >> -
> >>                                                    setup->fs.current_tex[i],
> >> -                                                    new_scene)) {
> >> -                  assert(!new_scene);
> >> -                  return FALSE;
> >> -               }
> >> +      memcpy(stored,
> >> +             &setup->fs.current,
> >> +             sizeof setup->fs.current);
> >> +      setup->fs.stored = stored;
> >> +
> >> +      /* The scene now references the textures in the
> >> rasterization
> >> +       * state record.  Note that now.
> >> +       */
> >> +      for (i = 0; i < Elements(setup->fs.current_tex); i++) {
> >> +         if (setup->fs.current_tex[i]) {
> >> +            if (!lp_scene_add_resource_reference(scene,
> >> +
> >>                                                 setup->fs.current_tex[i],
> >> +                                                 new_scene)) {
> >> +               assert(!new_scene);
> >> +               return FALSE;
> >>              }
> >>           }
> >>        }
> > 
> > _______________________________________________
> > mesa-dev mailing list
> > mesa-dev at lists.freedesktop.org
> > http://lists.freedesktop.org/mailman/listinfo/mesa-dev
> 
> _______________________________________________
> mesa-dev mailing list
> mesa-dev at lists.freedesktop.org
> http://lists.freedesktop.org/mailman/listinfo/mesa-dev
> 


More information about the mesa-dev mailing list