Mesa (master): llvmpipe: try to do more of rast_tri_3_16 with intrinsics
Keith Whitwell
keithw at vmware.com
Tue Oct 12 12:58:47 UTC 2010
Cool. Isosurf is a good benchmark for these rasterization functions.
I found one reason why main is slower than master. Master has this commit:
Author: Keith Whitwell <keithw at vmware.com> 2010-09-12 14:29:00
Committer: Keith Whitwell <keithw at vmware.com> 2010-10-11 23:43:53
Parent: 5d9de332bfef20c2e8b8942980f3e085915df251 (llvmpipe: add debug helpers for epi32 etc)
Child: 309d7bb01bdc306bd4f1964768e78f5479deb5ab (llvmpipe: remove perspective-divide-per-quad code)
Branch: main
Follows: mesa-7.8.1
Precedes:
llvmpipe: fix wierd performance regression in isosurf
I really don't understand the mechanism behind this, but it
seems like the way data blocks for a scene are malloced, and in
particular whether we treat them as stack or a queue, and whether
we retain the most recently allocated or least recently allocated
has a real affect (~5%) on isosurf framerates...
This is probably specific to my distro or even just my machine,
but none the less, it's nicer not to see the framerates go in the
wrong direction.
which fixed a problem I thought didn't exist on main - turns out it does.
The "problem" as such is that although malloc/free are very fast, brk() and mmap() are not fast at all, and unfortunately it's hard to predict whether malloc() will end up calling brk() or not. The "fix" above just removes a few mallocs & gets better behaviour out of the linux malloc implementation. Without this, we're doing a brk() every scene, it seems, which is responsible for (some of) the slowdown.
All of this is linux-specific & may not be relevant on more complex demos, but it is fairly interesting.
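A minimal sketch of the idea behind the fix (the names here are hypothetical, not the actual Mesa helpers): instead of returning a scene's data block to free() at the end of every frame, retain the most recently freed block and hand it back on the next allocation, so steady-state rendering stops growing and shrinking the heap via brk()/mmap() every scene.

```c
#include <stdlib.h>

#define BLOCK_SIZE (64 * 1024)

/* Most recently freed scene block, retained for reuse. */
static void *cached_block;

static void *scene_block_alloc(void)
{
    if (cached_block) {
        /* Reuse the retained block; no call into malloc at all. */
        void *b = cached_block;
        cached_block = NULL;
        return b;
    }
    return malloc(BLOCK_SIZE);
}

static void scene_block_free(void *b)
{
    if (!cached_block)
        cached_block = b;   /* retain for the next scene */
    else
        free(b);
}
```

With a one-block cache like this, a render loop that allocates and frees one block per scene touches malloc only on the first frame; whether glibc's allocator would otherwise call brk() depends on the block size relative to its trim/mmap thresholds, which is what makes the effect so machine-specific.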
Keith
________________________________________
From: José Fonseca [jfonseca at vmware.com]
Sent: Tuesday, October 12, 2010 1:57 PM
To: Keith Whitwell
Cc: mesa-commit at lists.freedesktop.org
Subject: Re: Mesa (master): llvmpipe: try to do more of rast_tri_3_16 with intrinsics
> + __m128i b4a4_mask = _mm_and_si128(b4a4, mask);
> + __m128i b4a4_mask_shift = _mm_slli_si128(b4a4_mask, 4);
I think you could replace these two calls with _mm_slli_epi64(b4a4_mask, 32).
I also think you should replace the two _mm_srli_si128 above with
_mm_srli_epi64, as _mm_srli_si128 could be up to 4x slower, according
to Intel's intrinsics guide.
There's a patch attached that does this, but I'm not sure how to
benchmark such a tiny optimization.
> + __m128i result = _mm_or_si128(ba_mask, b4a4_mask_shift);
> +#endif
> +
> + return result;
> +}
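As a sanity check on the first substitution: and-with-mask followed by a 4-byte whole-register shift produces the same result as a single 32-bit per-qword shift, provided the mask keeps only the low dword of each 64-bit lane (which is what makes the two instructions collapse into one). A small sketch, with a hypothetical helper name:

```c
#include <emmintrin.h>
#include <string.h>

/* Returns nonzero iff the masked byte-shift sequence and the single
 * per-qword bit-shift agree for v.  The mask keeps only the low dword
 * of each 64-bit lane. */
static int shifts_agree(__m128i v)
{
    __m128i mask = _mm_set_epi32(0, -1, 0, -1);
    /* Two-instruction version: mask, then shift the whole register
     * left by 4 bytes. */
    __m128i a = _mm_slli_si128(_mm_and_si128(v, mask), 4);
    /* One-instruction version: shift each 64-bit lane left by 32 bits,
     * which zeroes the low dword and discards the high dword in one go. */
    __m128i b = _mm_slli_epi64(v, 32);
    return memcmp(&a, &b, sizeof a) == 0;
}
```

The same reasoning applies to the srli direction, as long as the dwords that _mm_srli_epi64 zeroes (rather than shifting across the lane boundary) are masked off afterwards anyway.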
Jose