[Mesa-dev] [PATCH] llvmpipe: Optimize new fs state setup
Jose Fonseca
jfonseca at vmware.com
Thu Jun 30 03:14:52 PDT 2011
----- Original Message -----
> Hmm.
> Forgive my ignorance, but isn't memcmp() on structs pretty prone to
> give
> incorrect != results, given that there may be padding between members
> in
> structs and that IIRC gcc struct assignment is member-wise.
There's no alternative to bitwise comparison on C:
$ cat cmp.c
struct foo {
int a;
int b;
};
int cmp(const struct foo *a, const struct foo *b) {
return *a == *b;
}
$ gcc -c cmp.c
cmp.c: In function ‘cmp’:
cmp.c:7:12: error: invalid operands to binary == (have ‘const struct foo’ and ‘const struct foo’)
> What happens if there's padding between the jit_context and variant
> members of struct lp_rast_state?
>
> I seem to recall hitting similar issues a number of times in the
> past.
I recall that as well, but my memory is the other way around: struct assignment is considerer harmful because it breaks memcmp. Instead all structures should be initialized with memset(0) first, and always copied with memcpy. This should ensure that padding doesn't get clobbered.
But now that you mention it, I'm not still 100% that unsed bits on bitfields are preserved like that or not, when being assigned. We probably should ensure that the all bits in bitfields are used, using "reserved" members, so that the zeros there stay zero.
Jose
> /Thomas
>
> On 06/30/2011 03:36 AM, Roland Scheidegger wrote:
> > Ok in fact there's a gcc bug about memcmp:
> > http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43052
> > In short gcc's memcmp builtin is totally lame and loses to glibc's
> > memcmp (including call overhead, no knowledge about alignment etc.)
> > even
> > when comparing only very few bytes (and loses BIG time for lots of
> > bytes
> > to compare). Oops. Well at least if the strings are the same (I'd
> > guess
> > if the first byte is different it's hard to beat the gcc
> > builtin...).
> > So this is really a gcc bug. The bug is quite old though with no
> > fix in
> > sight apparently so might need to think about some workaround (but
> > just
> > not doing the comparison doesn't look like the right idea, since
> > apparently it would be faster with the comparison if gcc's memcmp
> > got
> > fixed).
> >
> >
> > Roland
> >
> >
> >
> > Am 30.06.2011 01:47, schrieb Roland Scheidegger:
> >
> >> I didn't even look at that was just curious why the memcmp (which
> >> is
> >> used a lot in other places) is slow. However, none of the other
> >> memcmp
> >> seem to show up prominently (cso functions are quite low in
> >> profiles,
> >> _mesa_search_program_cache uses memcmp too but it's not that high
> >> neither). So I guess those functions either aren't called that
> >> often or
> >> the sizes they compare are small.
> >> So should maybe just file a gcc bug for memcmp and look at that
> >> particular llvmpipe issue again :-).
> >>
> >> Roland
> >>
> >> Am 30.06.2011 01:16, schrieb Corbin Simpson:
> >>
> >>> Okay, so maybe I'm failing to recognize the exact situation here,
> >>> but
> >>> wouldn't it be possible to mark the FS state with a serial number
> >>> and
> >>> just compare those? Or are these FS states not CSO-cached?
> >>>
> >>> ~ C.
> >>>
> >>> On Wed, Jun 29, 2011 at 3:44 PM, Roland
> >>> Scheidegger<sroland at vmware.com> wrote:
> >>>
> >>>> Actually I ran some numbers here and tried out a optimized
> >>>> struct compare:
> >>>> original ipers: 12.1 fps
> >>>> ajax patch: 15.5 fps
> >>>> optimized struct compare: 16.8 fps
> >>>>
> >>>>
> >>>> This is the function I used for that (just enabled in that
> >>>> lp_setup
> >>>> function):
> >>>>
> >>>> static INLINE int util_cmp_struct(const void *src1, const void
> >>>> *src2,
> >>>> unsigned count)
> >>>> {
> >>>> /* hmm pointer casting is evil */
> >>>> const uint32_t *src1_ptr = (uint32_t *)src1;
> >>>> const uint32_t *src2_ptr = (uint32_t *)src2;
> >>>> unsigned i;
> >>>> assert(count % 4 == 0);
> >>>> for (i = 0; i< count/4; i++) {
> >>>> if (*src1_ptr != *src2_ptr) {
> >>>> return 1;
> >>>> }
> >>>> src1_ptr++;
> >>>> src2_ptr++;
> >>>> }
> >>>> return 0;
> >>>> }
> >>>>
> >>>> (And no this doesn't use repz cmpsd here.)
> >>>>
> >>>> So, unless I made some mistake, memcmp is just dead slow (*),
> >>>> most of
> >>>> the slowness probably coming from the bytewise comparison (and
> >>>> apparently I was wrong in assuming the comparison there might
> >>>> never be
> >>>> the same for ipers).
> >>>> Of course, the optimized struct compare relies on structs really
> >>>> being
> >>>> dword aligned (I think this is always the case), and
> >>>> additionally it
> >>>> relies on the struct size being a whole multiple of dwords -
> >>>> likely
> >>>> struct needs padding to ensure that (at least I don't think this
> >>>> is
> >>>> always guaranteed for all structs).
> >>>> But since memcmp is used extensively (cso for one) maybe some
> >>>> optimization along these lines might be worth it (though of
> >>>> course for
> >>>> small structs the win isn't going to be as big - and can't beat
> >>>> the repz
> >>>> cmps in code size...).
> >>>>
> >>>> Roland
> >>>>
> >>>> (*) I actually found some references some implementations might
> >>>> be
> >>>> better they don't just use repz cmpsb but they split this up in
> >>>> parts
> >>>> which do dword (or qword even - well for really large structs
> >>>> could use
> >>>> sse2) comparisons for the parts where it's possible and only
> >>>> byte
> >>>> comparisons for the remaining bytes (and if the compiler does
> >>>> that it
> >>>> probably would know the size at compile time here hence could
> >>>> leave out
> >>>> much of the code). Of course memcmp requires that the return
> >>>> value isn't
> >>>> just a true or false value, hence there's more code needed once
> >>>> an
> >>>> unequal dword is found, though the compiler could optimize that
> >>>> away too
> >>>> in case it's not needed. Much the same as memcpy is optimized
> >>>> usually
> >>>> really, so blame gcc :-).
> >>>>
> >>>>
> >>>>
> >>>> Am 29.06.2011 20:33, schrieb Roland Scheidegger:
> >>>>
> >>>>> Ohh that's interesting, you'd think the comparison shouldn't be
> >>>>> that
> >>>>> expensive (though I guess in ipers case the comparison is never
> >>>>> true).
> >>>>> memcmp is quite extensively used everywhere. Maybe we could
> >>>>> replace that
> >>>>> with something faster (since we only ever care if the blocks
> >>>>> are the
> >>>>> same but not care about the lexographic ordering and always
> >>>>> compare
> >>>>> whole structs, should compare dwords instead of bytes for a 4
> >>>>> time
> >>>>> speedup)? Or isn't that the reason cmpsb instead of cmpsd is
> >>>>> used?
> >>>>> Also I guess it would help if the values which are more likely
> >>>>> to be
> >>>>> unequal are first in the struct (if we can tell that).
> >>>>> Of course though if it's unlikely to be the same as the
> >>>>> compared value
> >>>>> anyway not comparing at all still might be a win (here).
> >>>>>
> >>>>> Roland
> >>>>>
> >>>>> Am 29.06.2011 19:19, schrieb Adam Jackson:
> >>>>>
> >>>>>> Perversely, do this by eliminating the comparison between
> >>>>>> stored and
> >>>>>> current fs state. On ipers, a perf trace showed
> >>>>>> try_update_scene_state
> >>>>>> using 31% of a CPU, and 98% of that was in 'repz cmpsb', ie,
> >>>>>> the memcmp.
> >>>>>> Taking that out takes try_update_scene_state down to 6.5% of
> >>>>>> the
> >>>>>> profile; more importantly, ipers goes from 10 to 14fps and
> >>>>>> gears goes
> >>>>>> from 790 to 830fps.
> >>>>>>
> >>>>>> Signed-off-by: Adam Jackson<ajax at redhat.com>
> >>>>>> ---
> >>>>>> src/gallium/drivers/llvmpipe/lp_setup.c | 61
> >>>>>> ++++++++++++++-----------------
> >>>>>> 1 files changed, 27 insertions(+), 34 deletions(-)
> >>>>>>
> >>>>>> diff --git a/src/gallium/drivers/llvmpipe/lp_setup.c
> >>>>>> b/src/gallium/drivers/llvmpipe/lp_setup.c
> >>>>>> index cbe06e5..9118db5 100644
> >>>>>> --- a/src/gallium/drivers/llvmpipe/lp_setup.c
> >>>>>> +++ b/src/gallium/drivers/llvmpipe/lp_setup.c
> >>>>>> @@ -839,42 +839,35 @@ try_update_scene_state( struct
> >>>>>> lp_setup_context *setup )
> >>>>>> setup->dirty |= LP_SETUP_NEW_FS;
> >>>>>> }
> >>>>>>
> >>>>>> -
> >>>>>> if (setup->dirty& LP_SETUP_NEW_FS) {
> >>>>>> - if (!setup->fs.stored ||
> >>>>>> - memcmp(setup->fs.stored,
> >>>>>> -&setup->fs.current,
> >>>>>> - sizeof setup->fs.current) != 0)
> >>>>>> - {
> >>>>>> - struct lp_rast_state *stored;
> >>>>>> - uint i;
> >>>>>> -
> >>>>>> - /* The fs state that's been stored in the scene is
> >>>>>> different from
> >>>>>> - * the new, current state. So allocate a new
> >>>>>> lp_rast_state object
> >>>>>> - * and append it to the bin's setup data buffer.
> >>>>>> - */
> >>>>>> - stored = (struct lp_rast_state *)
> >>>>>> lp_scene_alloc(scene, sizeof *stored);
> >>>>>> - if (!stored) {
> >>>>>> - assert(!new_scene);
> >>>>>> - return FALSE;
> >>>>>> - }
> >>>>>> + struct lp_rast_state *stored;
> >>>>>> + uint i;
> >>>>>> +
> >>>>>> + /* The fs state that's been stored in the scene is
> >>>>>> different from
> >>>>>> + * the new, current state. So allocate a new
> >>>>>> lp_rast_state object
> >>>>>> + * and append it to the bin's setup data buffer.
> >>>>>> + */
> >>>>>> + stored = (struct lp_rast_state *) lp_scene_alloc(scene,
> >>>>>> sizeof *stored);
> >>>>>> + if (!stored) {
> >>>>>> + assert(!new_scene);
> >>>>>> + return FALSE;
> >>>>>> + }
> >>>>>>
> >>>>>> - memcpy(stored,
> >>>>>> -&setup->fs.current,
> >>>>>> - sizeof setup->fs.current);
> >>>>>> - setup->fs.stored = stored;
> >>>>>> -
> >>>>>> - /* The scene now references the textures in the
> >>>>>> rasterization
> >>>>>> - * state record. Note that now.
> >>>>>> - */
> >>>>>> - for (i = 0; i< Elements(setup->fs.current_tex);
> >>>>>> i++) {
> >>>>>> - if (setup->fs.current_tex[i]) {
> >>>>>> - if (!lp_scene_add_resource_reference(scene,
> >>>>>> -
> >>>>>> setup->fs.current_tex[i],
> >>>>>> -
> >>>>>> new_scene))
> >>>>>> {
> >>>>>> - assert(!new_scene);
> >>>>>> - return FALSE;
> >>>>>> - }
> >>>>>> + memcpy(stored,
> >>>>>> +&setup->fs.current,
> >>>>>> + sizeof setup->fs.current);
> >>>>>> + setup->fs.stored = stored;
> >>>>>> +
> >>>>>> + /* The scene now references the textures in the
> >>>>>> rasterization
> >>>>>> + * state record. Note that now.
> >>>>>> + */
> >>>>>> + for (i = 0; i< Elements(setup->fs.current_tex); i++) {
> >>>>>> + if (setup->fs.current_tex[i]) {
> >>>>>> + if (!lp_scene_add_resource_reference(scene,
> >>>>>> +
> >>>>>> setup->fs.current_tex[i],
> >>>>>> + new_scene))
> >>>>>> {
> >>>>>> + assert(!new_scene);
> >>>>>> + return FALSE;
> >>>>>> }
> >>>>>> }
> >>>>>> }
> >>>>>>
> >>>>> _______________________________________________
> >>>>> mesa-dev mailing list
> >>>>> mesa-dev at lists.freedesktop.org
> >>>>> http://lists.freedesktop.org/mailman/listinfo/mesa-dev
> >>>>>
> >>>> _______________________________________________
> >>>> mesa-dev mailing list
> >>>> mesa-dev at lists.freedesktop.org
> >>>> http://lists.freedesktop.org/mailman/listinfo/mesa-dev
> >>>>
> >>>>
> >>>
> >>>
> >>>
> >>
> > _______________________________________________
> > mesa-dev mailing list
> > mesa-dev at lists.freedesktop.org
> > http://lists.freedesktop.org/mailman/listinfo/mesa-dev
> >
>
> _______________________________________________
> mesa-dev mailing list
> mesa-dev at lists.freedesktop.org
> http://lists.freedesktop.org/mailman/listinfo/mesa-dev
>
More information about the mesa-dev
mailing list