[Mesa-dev] [PATCH] llvmpipe: Optimize new fs state setup
Thomas Hellstrom
thellstrom at vmware.com
Thu Jun 30 00:56:56 PDT 2011
Hmm.
Forgive my ignorance, but isn't memcmp() on structs prone to giving
incorrect != results, given that there may be padding between members in
structs and that, IIRC, gcc struct assignment is member-wise?
What happens if there's padding between the jit_context and variant
members of struct lp_rast_state?
I seem to recall hitting similar issues a number of times in the past.
/Thomas
On 06/30/2011 03:36 AM, Roland Scheidegger wrote:
> Ok, in fact there's a gcc bug about memcmp:
> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43052
> In short, gcc's memcmp builtin is totally lame and loses to glibc's
> memcmp (including call overhead, no knowledge about alignment, etc.) even
> when comparing only very few bytes (and loses BIG time for lots of bytes
> to compare). Oops. Well, at least if the strings are the same (I'd guess
> if the first byte is different it's hard to beat the gcc builtin...).
> So this is really a gcc bug. The bug is quite old, though, with no fix in
> sight apparently, so we might need to think about some workaround (but just
> not doing the comparison doesn't look like the right idea, since
> apparently it would be faster with the comparison if gcc's memcmp got
> fixed).
>
>
> Roland
>
>
>
> Am 30.06.2011 01:47, schrieb Roland Scheidegger:
>
>> I didn't even look at that; I was just curious why the memcmp (which is
>> used a lot in other places) is slow. However, none of the other memcmp
>> calls seem to show up prominently (cso functions are quite low in profiles;
>> _mesa_search_program_cache uses memcmp too but it's not that high
>> either). So I guess those functions either aren't called that often or
>> the sizes they compare are small.
>> So we should maybe just file a gcc bug for memcmp and look at that
>> particular llvmpipe issue again :-).
>>
>> Roland
>>
>> Am 30.06.2011 01:16, schrieb Corbin Simpson:
>>
>>> Okay, so maybe I'm failing to recognize the exact situation here, but
>>> wouldn't it be possible to mark the FS state with a serial number and
>>> just compare those? Or are these FS states not CSO-cached?
>>>
>>> ~ C.
>>>
>>> On Wed, Jun 29, 2011 at 3:44 PM, Roland Scheidegger <sroland at vmware.com> wrote:
>>>
>>>> Actually I ran some numbers here and tried out an optimized struct compare:
>>>> original ipers: 12.1 fps
>>>> ajax patch: 15.5 fps
>>>> optimized struct compare: 16.8 fps
>>>>
>>>>
>>>> This is the function I used for that (just enabled in that lp_setup
>>>> function):
>>>>
>>>> static INLINE int util_cmp_struct(const void *src1, const void *src2,
>>>>                                   unsigned count)
>>>> {
>>>>    /* hmm pointer casting is evil */
>>>>    const uint32_t *src1_ptr = (const uint32_t *)src1;
>>>>    const uint32_t *src2_ptr = (const uint32_t *)src2;
>>>>    unsigned i;
>>>>    assert(count % 4 == 0);
>>>>    for (i = 0; i < count / 4; i++) {
>>>>       if (*src1_ptr != *src2_ptr) {
>>>>          return 1;
>>>>       }
>>>>       src1_ptr++;
>>>>       src2_ptr++;
>>>>    }
>>>>    return 0;
>>>> }
>>>>
>>>> (And no this doesn't use repz cmpsd here.)
>>>>
>>>> So, unless I made some mistake, memcmp is just dead slow (*), with most of
>>>> the slowness probably coming from the bytewise comparison (and
>>>> apparently I was wrong in assuming the comparison there might never be
>>>> the same for ipers).
>>>> Of course, the optimized struct compare relies on structs really being
>>>> dword aligned (I think this is always the case), and additionally it
>>>> relies on the struct size being a whole multiple of dwords - likely the
>>>> struct needs padding to ensure that (at least I don't think this is
>>>> always guaranteed for all structs).
>>>> But since memcmp is used extensively (cso for one), maybe some
>>>> optimization along these lines might be worth it (though of course for
>>>> small structs the win isn't going to be as big - and it can't beat the repz
>>>> cmps in code size...).
>>>>
>>>> Roland
>>>>
>>>> (*) I actually found some references saying some implementations might be
>>>> better: they don't just use repz cmpsb but split the comparison into parts,
>>>> doing dword (or even qword - well, for really large structs you could use
>>>> sse2) comparisons where possible and only byte
>>>> comparisons for the remaining bytes (and if the compiler does that, it
>>>> probably would know the size at compile time here, hence could leave out
>>>> much of the code). Of course, memcmp requires that the return value isn't
>>>> just a true or false value, hence more code is needed once an
>>>> unequal dword is found, though the compiler could optimize that away too
>>>> in case it's not needed. Much the same way memcpy is usually optimized,
>>>> really, so blame gcc :-).
>>>>
>>>>
>>>>
>>>> Am 29.06.2011 20:33, schrieb Roland Scheidegger:
>>>>
>>>>> Ohh, that's interesting; you'd think the comparison shouldn't be that
>>>>> expensive (though I guess in the ipers case the comparison is never true).
>>>>> memcmp is quite extensively used everywhere. Maybe we could replace it
>>>>> with something faster (since we only ever care whether the blocks are the
>>>>> same, don't care about the lexicographic ordering, and always compare
>>>>> whole structs, we could compare dwords instead of bytes for a 4x
>>>>> speedup)? Or isn't that the reason cmpsb instead of cmpsd is used?
>>>>> Also, I guess it would help if the values which are more likely to be
>>>>> unequal came first in the struct (if we can tell that).
>>>>> Of course, if it's unlikely to be the same as the compared value
>>>>> anyway, not comparing at all still might be a win (here).
>>>>>
>>>>> Roland
>>>>>
>>>>> Am 29.06.2011 19:19, schrieb Adam Jackson:
>>>>>
>>>>>> Perversely, do this by eliminating the comparison between stored and
>>>>>> current fs state. On ipers, a perf trace showed try_update_scene_state
>>>>>> using 31% of a CPU, and 98% of that was in 'repz cmpsb', ie, the memcmp.
>>>>>> Taking that out takes try_update_scene_state down to 6.5% of the
>>>>>> profile; more importantly, ipers goes from 10 to 14fps and gears goes
>>>>>> from 790 to 830fps.
>>>>>>
>>>>>> Signed-off-by: Adam Jackson<ajax at redhat.com>
>>>>>> ---
>>>>>> src/gallium/drivers/llvmpipe/lp_setup.c | 61 ++++++++++++++-----------------
>>>>>> 1 files changed, 27 insertions(+), 34 deletions(-)
>>>>>>
>>>>>> diff --git a/src/gallium/drivers/llvmpipe/lp_setup.c b/src/gallium/drivers/llvmpipe/lp_setup.c
>>>>>> index cbe06e5..9118db5 100644
>>>>>> --- a/src/gallium/drivers/llvmpipe/lp_setup.c
>>>>>> +++ b/src/gallium/drivers/llvmpipe/lp_setup.c
>>>>>> @@ -839,42 +839,35 @@ try_update_scene_state( struct lp_setup_context *setup )
>>>>>> setup->dirty |= LP_SETUP_NEW_FS;
>>>>>> }
>>>>>>
>>>>>> -
>>>>>> if (setup->dirty & LP_SETUP_NEW_FS) {
>>>>>> - if (!setup->fs.stored ||
>>>>>> - memcmp(setup->fs.stored,
>>>>>> - &setup->fs.current,
>>>>>> - sizeof setup->fs.current) != 0)
>>>>>> - {
>>>>>> - struct lp_rast_state *stored;
>>>>>> - uint i;
>>>>>> -
>>>>>> - /* The fs state that's been stored in the scene is different from
>>>>>> - * the new, current state. So allocate a new lp_rast_state object
>>>>>> - * and append it to the bin's setup data buffer.
>>>>>> - */
>>>>>> - stored = (struct lp_rast_state *) lp_scene_alloc(scene, sizeof *stored);
>>>>>> - if (!stored) {
>>>>>> - assert(!new_scene);
>>>>>> - return FALSE;
>>>>>> - }
>>>>>> + struct lp_rast_state *stored;
>>>>>> + uint i;
>>>>>> +
>>>>>> + /* The fs state that's been stored in the scene is different from
>>>>>> + * the new, current state. So allocate a new lp_rast_state object
>>>>>> + * and append it to the bin's setup data buffer.
>>>>>> + */
>>>>>> + stored = (struct lp_rast_state *) lp_scene_alloc(scene, sizeof *stored);
>>>>>> + if (!stored) {
>>>>>> + assert(!new_scene);
>>>>>> + return FALSE;
>>>>>> + }
>>>>>>
>>>>>> - memcpy(stored,
>>>>>> - &setup->fs.current,
>>>>>> - sizeof setup->fs.current);
>>>>>> - setup->fs.stored = stored;
>>>>>> -
>>>>>> - /* The scene now references the textures in the rasterization
>>>>>> - * state record. Note that now.
>>>>>> - */
>>>>>> - for (i = 0; i < Elements(setup->fs.current_tex); i++) {
>>>>>> - if (setup->fs.current_tex[i]) {
>>>>>> - if (!lp_scene_add_resource_reference(scene,
>>>>>> - setup->fs.current_tex[i],
>>>>>> - new_scene)) {
>>>>>> - assert(!new_scene);
>>>>>> - return FALSE;
>>>>>> - }
>>>>>> + memcpy(stored,
>>>>>> + &setup->fs.current,
>>>>>> + sizeof setup->fs.current);
>>>>>> + setup->fs.stored = stored;
>>>>>> +
>>>>>> + /* The scene now references the textures in the rasterization
>>>>>> + * state record. Note that now.
>>>>>> + */
>>>>>> + for (i = 0; i < Elements(setup->fs.current_tex); i++) {
>>>>>> + if (setup->fs.current_tex[i]) {
>>>>>> + if (!lp_scene_add_resource_reference(scene,
>>>>>> + setup->fs.current_tex[i],
>>>>>> + new_scene)) {
>>>>>> + assert(!new_scene);
>>>>>> + return FALSE;
>>>>>> }
>>>>>> }
>>>>>> }
>>>>>>
>>>>> _______________________________________________
>>>>> mesa-dev mailing list
>>>>> mesa-dev at lists.freedesktop.org
>>>>> http://lists.freedesktop.org/mailman/listinfo/mesa-dev
>>>>>
>>>>
>>>>
>>>
>>>
>>>
>>