[Pixman] [cairo] [PATCH] Added MIPS32R2 and MIPS DSP ASE optimized functions

Sun Nov 21 09:47:53 PST 2010

On Monday 13 September 2010 19:21:13 Georgi Beloev wrote:
> On Sep 12, 2010, at 5:08 AM, Soeren Sandmann wrote:
>> I'm hoping someone with more experience with MIPS than I have can give
>> more detailed comments on the assembly.
>
> Thanks for all the comments! I'll update the code per your comments and
> submit another patch. Should I mail it just to the pixman list or CC both
> cairo and pixman?

Hi, are there any updates on your pixman MIPS optimization patches? Maybe some
parts are closer to ready than others and can be submitted separately?

On Monday 13 September 2010 19:56:16 Georgi Beloev wrote:
> On Sep 13, 2010, at 6:46 AM, Siarhei Siamashka wrote:
> > On Thursday 09 September 2010 18:30:28 Georgi Beloev wrote:
> >> +// pixman_bool_t
> >> +// mips32r2_pixman_fill32(uint32_t *bits, int stride, int x, int y,
> >> +//	int width, int height, uint32_t  xor)
> >> +
> >> +	.global		mips32r2_pixman_fill32
> >> +	.ent		mips32r2_pixman_fill32
> >> +
> > 
> >> +mips32r2_pixman_fill32:
> > This looks mostly like a simple loop unrolling without any extra tricks.
> > Though assembly code may surely be faster than C (maybe saving one
> > instruction per loop iteration) because gcc generally has trouble
> > optimizing simple loops:
> > http://gcc.gnu.org/bugzilla/show_bug.cgi?id=37734
> > 
> > It's not directly related to your patch. But I wonder if it makes sense
> > to also add manual loop unrolling to the C variant of pixman_fill32
> > and the other similar functions in order to get better general
> > performance on most simple SIMD-less processors such as MIPS32R2,
> > older ARMs and the others (who knows, maybe even opencores.org ones)?
> 
> Yes, it is just simple loop unrolling. The code may also benefit from using
> "restrict" pointers to tell the compiler that it is safe to unroll the
> loops. Unfortunately, this is a C99 keyword and we are not compiling in
> C99 mode. Another useful optimization is adding prefetch-for-store
> instructions. However, in some cases these instructions can degrade
> performance rather than improve it.
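
To make the manual unrolling idea concrete, here is a minimal C sketch. The
name fill32_unrolled and the unroll factor of 8 are assumptions made for
illustration; this is not code from the patch or from pixman itself:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not a pixman function): store `value` into
 * `count` consecutive 32-bit pixels, eight pixels per loop iteration. */
static void fill32_unrolled(uint32_t *dst, size_t count, uint32_t value)
{
    assert(dst != NULL || count == 0);

    /* Unrolled body: one counter update and branch per eight stores. */
    while (count >= 8) {
        dst[0] = value;
        dst[1] = value;
        dst[2] = value;
        dst[3] = value;
        dst[4] = value;
        dst[5] = value;
        dst[6] = value;
        dst[7] = value;
        dst += 8;
        count -= 8;
    }

    /* Tail: the remaining 0..7 pixels. */
    while (count > 0) {
        *dst++ = value;
        count--;
    }
}
```

Whether this actually beats what the compiler generates from a plain loop
depends on the compiler version and flags, so it would need benchmarking.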

Since then, I have also acquired a MIPS-based device for my home gadget
collection :) It's a router with a plain and boring "MIPS 24Kc V7.4" CPU, which
only supports the MIPS32R2 ISA (without any DSP or SIMD extensions).

Indeed, using a PrepareForStore prefetch for the destination buffer is *very*
useful, providing approximately a 1.5x-2x improvement for memcpy-like code
(blits and composite operations) and approximately a 3x performance
improvement for memset-like code (fills).

Because the "Write-back with write allocation" caching policy seems to be used
by default in MIPS Linux, bulk linear stores of pixel data to memory are slow
(each cache line is first *read* into the cache, modified there and written
back later). The PrepareForStore prefetch solves this problem by allocating
the line without the unnecessary read from memory. However, if the caching
policy is changed to "Write-through, no write allocate", then the
PrepareForStore prefetch actually hurts performance, which is kind of sad.
Nevertheless, the following
(admittedly very old) post implies that only modes 2 and 3 are relevant for
MIPS ("Uncached" and "Write-back with write allocation"):
http://www.spinics.net/lists/mips/msg11750.html
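
As a rough illustration of the PrepareForStore idea, a fill loop could look
like the sketch below. The function names, the 32-byte cache line size and the
inline assembly wrapper are assumptions made for illustration, not code from
the patch; hint 30 is the PrepareForStore hint of the MIPS32 PREF instruction.
Because the hint may hand back a cache line without reading its old contents,
the sketch only issues it for lines that lie completely inside the destination
and are completely overwritten:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: 32-byte data cache lines; this must match the real CPU. */
#define CACHE_LINE 32u
#define PIXELS_PER_LINE (CACHE_LINE / sizeof(uint32_t))

/* Hypothetical wrapper: PREF with hint 30 (PrepareForStore) on MIPS,
 * a no-op on every other target. */
static inline void prepare_for_store(void *line)
{
#if defined(__mips__) && defined(__GNUC__)
    __asm__ volatile (
        ".set push\n\t"
        ".set mips32r2\n\t"
        "pref 30, 0(%0)\n\t"
        ".set pop"
        : : "r" (line) : "memory");
#else
    (void) line;
#endif
}

/* Hypothetical fill: plain stores for the unaligned head and tail,
 * PrepareForStore only for whole cache lines in between. */
static void fill32_pfs(uint32_t *dst, size_t count, uint32_t value)
{
    size_t i;

    assert(dst != NULL || count == 0);

    /* Head: advance until dst is cache-line aligned. */
    while (count > 0 && ((uintptr_t) dst & (CACHE_LINE - 1)) != 0) {
        *dst++ = value;
        count--;
    }

    /* Body: each line is fully overwritten right after the hint. */
    while (count >= PIXELS_PER_LINE) {
        prepare_for_store(dst);
        for (i = 0; i < PIXELS_PER_LINE; i++)
            dst[i] = value;
        dst += PIXELS_PER_LINE;
        count -= PIXELS_PER_LINE;
    }

    /* Tail: a partial line must not be touched by the hint. */
    while (count > 0) {
        *dst++ = value;
        count--;
    }
}
```

On non-MIPS builds the wrapper compiles to nothing, so the function degrades
to a plain fill. Real code would probably issue the hint some distance ahead
of the stores, and would have to skip it entirely under a write-through
caching policy.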

So what are the next steps? Realistically, optimizing prefetches is the only
source of major performance improvement that I would expect from pixman
assembly optimizations on MIPS32R2.

PS. It's interesting how all this write-allocate stuff is handled on different
architectures: x86 can use special non-temporal store instructions and ARM
just implements delayed allocation in Cortex-A8.

-- 
Best regards,
Siarhei Siamashka