[Mesa-dev] [PATCH] llvmpipe: Optimize new fs state setup
Roland Scheidegger
sroland at vmware.com
Wed Jun 29 16:47:55 PDT 2011
I didn't even look at that; I was just curious why the memcmp (which is
used a lot in other places) is slow. However, none of the other memcmp
calls seem to show up prominently (the cso functions are quite low in
profiles, and _mesa_search_program_cache uses memcmp too but it's not
that high either). So I guess those functions either aren't called that
often or the sizes they compare are small.
So maybe we should just file a gcc bug for memcmp and look at that
particular llvmpipe issue again :-).
Roland
On 30.06.2011 01:16, Corbin Simpson wrote:
> Okay, so maybe I'm failing to recognize the exact situation here, but
> wouldn't it be possible to mark the FS state with a serial number and
> just compare those? Or are these FS states not CSO-cached?
>
> ~ C.
>
> On Wed, Jun 29, 2011 at 3:44 PM, Roland Scheidegger <sroland at vmware.com> wrote:
>> Actually I ran some numbers here and tried out an optimized struct compare:
>> original ipers: 12.1 fps
>> ajax patch: 15.5 fps
>> optimized struct compare: 16.8 fps
>>
>>
>> This is the function I used for that (just enabled in that lp_setup
>> function):
>>
>> static INLINE int util_cmp_struct(const void *src1, const void *src2,
>>                                   unsigned count)
>> {
>>    /* hmm pointer casting is evil */
>>    const uint32_t *src1_ptr = (uint32_t *)src1;
>>    const uint32_t *src2_ptr = (uint32_t *)src2;
>>    unsigned i;
>>    assert(count % 4 == 0);
>>    for (i = 0; i < count / 4; i++) {
>>       if (*src1_ptr != *src2_ptr) {
>>          return 1;
>>       }
>>       src1_ptr++;
>>       src2_ptr++;
>>    }
>>    return 0;
>> }
>>
>> (And no this doesn't use repz cmpsd here.)
>>
>> So, unless I made some mistake, memcmp is just dead slow (*), most of
>> the slowness probably coming from the bytewise comparison (and
>> apparently I was wrong in assuming the comparison there might never be
>> the same for ipers).
>> Of course, the optimized struct compare relies on the structs really
>> being dword aligned (I think this is always the case), and additionally
>> it relies on the struct size being a whole multiple of dwords - the
>> struct likely needs padding to ensure that (at least I don't think this
>> is guaranteed for all structs).
>> But since memcmp is used extensively (cso for one) maybe some
>> optimization along these lines might be worth it (though of course for
>> small structs the win isn't going to be as big - and can't beat the repz
>> cmps in code size...).
>>
>> Roland
>>
>> (*) I actually found some references saying some implementations might
>> be better: they don't just use repz cmpsb but split the comparison up,
>> doing dword (or even qword - for really large structs one could even
>> use sse2) comparisons for the parts where that's possible and byte
>> comparisons only for the remaining bytes (and if the compiler does
>> that, it probably knows the size at compile time here, hence could
>> leave out much of the code). Of course memcmp requires that the return
>> value isn't just a true or false value, hence more code is needed once
>> an unequal dword is found, though the compiler could optimize that away
>> too in case it's not needed. Much the same way memcpy is usually
>> optimized, really - so blame gcc :-).
>>
>>
>>
>> On 29.06.2011 20:33, Roland Scheidegger wrote:
>>> Ohh that's interesting, you'd think the comparison shouldn't be that
>>> expensive (though I guess in ipers' case the comparison is never true).
>>> memcmp is used quite extensively everywhere. Maybe we could replace it
>>> with something faster (since we only ever care whether the blocks are
>>> the same, not about their lexicographic ordering, and we always compare
>>> whole structs, we could compare dwords instead of bytes for a 4x
>>> speedup)? Or is that the reason cmpsb is used instead of cmpsd?
>>> Also I guess it would help if the values which are more likely to be
>>> unequal came first in the struct (if we can tell that).
>>> Of course, if the compared values are unlikely to be the same anyway,
>>> not comparing at all might still be a win (here).
>>>
>>> Roland
>>>
>>> On 29.06.2011 19:19, Adam Jackson wrote:
>>>> Perversely, do this by eliminating the comparison between stored and
>>>> current fs state. On ipers, a perf trace showed try_update_scene_state
>>>> using 31% of a CPU, and 98% of that was in 'repz cmpsb', ie, the memcmp.
>>>> Taking that out takes try_update_scene_state down to 6.5% of the
>>>> profile; more importantly, ipers goes from 10 to 14fps and gears goes
>>>> from 790 to 830fps.
>>>>
>>>> Signed-off-by: Adam Jackson <ajax at redhat.com>
>>>> ---
>>>> src/gallium/drivers/llvmpipe/lp_setup.c | 61 ++++++++++++++-----------------
>>>> 1 files changed, 27 insertions(+), 34 deletions(-)
>>>>
>>>> diff --git a/src/gallium/drivers/llvmpipe/lp_setup.c b/src/gallium/drivers/llvmpipe/lp_setup.c
>>>> index cbe06e5..9118db5 100644
>>>> --- a/src/gallium/drivers/llvmpipe/lp_setup.c
>>>> +++ b/src/gallium/drivers/llvmpipe/lp_setup.c
>>>> @@ -839,42 +839,35 @@ try_update_scene_state( struct lp_setup_context *setup )
>>>> setup->dirty |= LP_SETUP_NEW_FS;
>>>> }
>>>>
>>>> -
>>>> if (setup->dirty & LP_SETUP_NEW_FS) {
>>>> - if (!setup->fs.stored ||
>>>> - memcmp(setup->fs.stored,
>>>> - &setup->fs.current,
>>>> - sizeof setup->fs.current) != 0)
>>>> - {
>>>> - struct lp_rast_state *stored;
>>>> - uint i;
>>>> -
>>>> - /* The fs state that's been stored in the scene is different from
>>>> - * the new, current state. So allocate a new lp_rast_state object
>>>> - * and append it to the bin's setup data buffer.
>>>> - */
>>>> - stored = (struct lp_rast_state *) lp_scene_alloc(scene, sizeof *stored);
>>>> - if (!stored) {
>>>> - assert(!new_scene);
>>>> - return FALSE;
>>>> - }
>>>> + struct lp_rast_state *stored;
>>>> + uint i;
>>>> +
>>>> + /* The fs state that's been stored in the scene is different from
>>>> + * the new, current state. So allocate a new lp_rast_state object
>>>> + * and append it to the bin's setup data buffer.
>>>> + */
>>>> + stored = (struct lp_rast_state *) lp_scene_alloc(scene, sizeof *stored);
>>>> + if (!stored) {
>>>> + assert(!new_scene);
>>>> + return FALSE;
>>>> + }
>>>>
>>>> - memcpy(stored,
>>>> - &setup->fs.current,
>>>> - sizeof setup->fs.current);
>>>> - setup->fs.stored = stored;
>>>> -
>>>> - /* The scene now references the textures in the rasterization
>>>> - * state record. Note that now.
>>>> - */
>>>> - for (i = 0; i < Elements(setup->fs.current_tex); i++) {
>>>> - if (setup->fs.current_tex[i]) {
>>>> - if (!lp_scene_add_resource_reference(scene,
>>>> - setup->fs.current_tex[i],
>>>> - new_scene)) {
>>>> - assert(!new_scene);
>>>> - return FALSE;
>>>> - }
>>>> + memcpy(stored,
>>>> + &setup->fs.current,
>>>> + sizeof setup->fs.current);
>>>> + setup->fs.stored = stored;
>>>> +
>>>> + /* The scene now references the textures in the rasterization
>>>> + * state record. Note that now.
>>>> + */
>>>> + for (i = 0; i < Elements(setup->fs.current_tex); i++) {
>>>> + if (setup->fs.current_tex[i]) {
>>>> + if (!lp_scene_add_resource_reference(scene,
>>>> + setup->fs.current_tex[i],
>>>> + new_scene)) {
>>>> + assert(!new_scene);
>>>> + return FALSE;
>>>> }
>>>> }
>>>> }
>>>
>>> _______________________________________________
>>> mesa-dev mailing list
>>> mesa-dev at lists.freedesktop.org
>>> http://lists.freedesktop.org/mailman/listinfo/mesa-dev