[Mesa-dev] [PATCH] llvmpipe: Optimize new fs state setup
Jose Fonseca
jfonseca at vmware.com
Thu Jun 30 03:06:23 PDT 2011
Great work Roland! And thanks Ajax to finding this hot spot.
We use memcmp a lot -- all CSO caching, so we should use this everywhere.
We should also code a sse2 version with intrinsics for x86-64, which is guaranteed to always have SSE2.
Jose
----- Original Message -----
> Actually I ran some numbers here and tried out a optimized struct
> compare:
> original ipers: 12.1 fps
> ajax patch: 15.5 fps
> optimized struct compare: 16.8 fps
>
>
> This is the function I used for that (just enabled in that lp_setup
> function):
>
> static INLINE int util_cmp_struct(const void *src1, const void *src2,
> unsigned count)
> {
> /* hmm pointer casting is evil */
> const uint32_t *src1_ptr = (uint32_t *)src1;
> const uint32_t *src2_ptr = (uint32_t *)src2;
> unsigned i;
> assert(count % 4 == 0);
> for (i = 0; i < count/4; i++) {
> if (*src1_ptr != *src2_ptr) {
> return 1;
> }
> src1_ptr++;
> src2_ptr++;
> }
> return 0;
> }
>
> (And no this doesn't use repz cmpsd here.)
>
> So, unless I made some mistake, memcmp is just dead slow (*), most of
> the slowness probably coming from the bytewise comparison (and
> apparently I was wrong in assuming the comparison there might never
> be
> the same for ipers).
> Of course, the optimized struct compare relies on structs really
> being
> dword aligned (I think this is always the case), and additionally it
> relies on the struct size being a whole multiple of dwords - likely
> struct needs padding to ensure that (at least I don't think this is
> always guaranteed for all structs).
> But since memcmp is used extensively (cso for one) maybe some
> optimization along these lines might be worth it (though of course
> for
> small structs the win isn't going to be as big - and can't beat the
> repz
> cmps in code size...).
>
> Roland
>
> (*) I actually found some references some implementations might be
> better they don't just use repz cmpsb but they split this up in parts
> which do dword (or qword even - well for really large structs could
> use
> sse2) comparisons for the parts where it's possible and only byte
> comparisons for the remaining bytes (and if the compiler does that it
> probably would know the size at compile time here hence could leave
> out
> much of the code). Of course memcmp requires that the return value
> isn't
> just a true or false value, hence there's more code needed once an
> unequal dword is found, though the compiler could optimize that away
> too
> in case it's not needed. Much the same as memcpy is optimized usually
> really, so blame gcc :-).
>
>
>
> Am 29.06.2011 20:33, schrieb Roland Scheidegger:
> > Ohh that's interesting, you'd think the comparison shouldn't be
> > that
> > expensive (though I guess in ipers case the comparison is never
> > true).
> > memcmp is quite extensively used everywhere. Maybe we could replace
> > that
> > with something faster (since we only ever care if the blocks are
> > the
> > same but not care about the lexographic ordering and always compare
> > whole structs, should compare dwords instead of bytes for a 4 time
> > speedup)? Or isn't that the reason cmpsb instead of cmpsd is used?
> > Also I guess it would help if the values which are more likely to
> > be
> > unequal are first in the struct (if we can tell that).
> > Of course though if it's unlikely to be the same as the compared
> > value
> > anyway not comparing at all still might be a win (here).
> >
> > Roland
> >
> > Am 29.06.2011 19:19, schrieb Adam Jackson:
> >> Perversely, do this by eliminating the comparison between stored
> >> and
> >> current fs state. On ipers, a perf trace showed
> >> try_update_scene_state
> >> using 31% of a CPU, and 98% of that was in 'repz cmpsb', ie, the
> >> memcmp.
> >> Taking that out takes try_update_scene_state down to 6.5% of the
> >> profile; more importantly, ipers goes from 10 to 14fps and gears
> >> goes
> >> from 790 to 830fps.
> >>
> >> Signed-off-by: Adam Jackson <ajax at redhat.com>
> >> ---
> >> src/gallium/drivers/llvmpipe/lp_setup.c | 61
> >> ++++++++++++++-----------------
> >> 1 files changed, 27 insertions(+), 34 deletions(-)
> >>
> >> diff --git a/src/gallium/drivers/llvmpipe/lp_setup.c
> >> b/src/gallium/drivers/llvmpipe/lp_setup.c
> >> index cbe06e5..9118db5 100644
> >> --- a/src/gallium/drivers/llvmpipe/lp_setup.c
> >> +++ b/src/gallium/drivers/llvmpipe/lp_setup.c
> >> @@ -839,42 +839,35 @@ try_update_scene_state( struct
> >> lp_setup_context *setup )
> >> setup->dirty |= LP_SETUP_NEW_FS;
> >> }
> >>
> >> -
> >> if (setup->dirty & LP_SETUP_NEW_FS) {
> >> - if (!setup->fs.stored ||
> >> - memcmp(setup->fs.stored,
> >> - &setup->fs.current,
> >> - sizeof setup->fs.current) != 0)
> >> - {
> >> - struct lp_rast_state *stored;
> >> - uint i;
> >> -
> >> - /* The fs state that's been stored in the scene is
> >> different from
> >> - * the new, current state. So allocate a new
> >> lp_rast_state object
> >> - * and append it to the bin's setup data buffer.
> >> - */
> >> - stored = (struct lp_rast_state *) lp_scene_alloc(scene,
> >> sizeof *stored);
> >> - if (!stored) {
> >> - assert(!new_scene);
> >> - return FALSE;
> >> - }
> >> + struct lp_rast_state *stored;
> >> + uint i;
> >> +
> >> + /* The fs state that's been stored in the scene is
> >> different from
> >> + * the new, current state. So allocate a new lp_rast_state
> >> object
> >> + * and append it to the bin's setup data buffer.
> >> + */
> >> + stored = (struct lp_rast_state *) lp_scene_alloc(scene,
> >> sizeof *stored);
> >> + if (!stored) {
> >> + assert(!new_scene);
> >> + return FALSE;
> >> + }
> >>
> >> - memcpy(stored,
> >> - &setup->fs.current,
> >> - sizeof setup->fs.current);
> >> - setup->fs.stored = stored;
> >> -
> >> - /* The scene now references the textures in the
> >> rasterization
> >> - * state record. Note that now.
> >> - */
> >> - for (i = 0; i < Elements(setup->fs.current_tex); i++) {
> >> - if (setup->fs.current_tex[i]) {
> >> - if (!lp_scene_add_resource_reference(scene,
> >> -
> >> setup->fs.current_tex[i],
> >> - new_scene)) {
> >> - assert(!new_scene);
> >> - return FALSE;
> >> - }
> >> + memcpy(stored,
> >> + &setup->fs.current,
> >> + sizeof setup->fs.current);
> >> + setup->fs.stored = stored;
> >> +
> >> + /* The scene now references the textures in the
> >> rasterization
> >> + * state record. Note that now.
> >> + */
> >> + for (i = 0; i < Elements(setup->fs.current_tex); i++) {
> >> + if (setup->fs.current_tex[i]) {
> >> + if (!lp_scene_add_resource_reference(scene,
> >> +
> >> setup->fs.current_tex[i],
> >> + new_scene)) {
> >> + assert(!new_scene);
> >> + return FALSE;
> >> }
> >> }
> >> }
> >
> > _______________________________________________
> > mesa-dev mailing list
> > mesa-dev at lists.freedesktop.org
> > http://lists.freedesktop.org/mailman/listinfo/mesa-dev
>
> _______________________________________________
> mesa-dev mailing list
> mesa-dev at lists.freedesktop.org
> http://lists.freedesktop.org/mailman/listinfo/mesa-dev
>
More information about the mesa-dev
mailing list