[Pixman] [PATCH 0/3] Pixman MIPS DSPASE1
Siarhei Siamashka
siarhei.siamashka at gmail.com
Thu Feb 24 11:06:01 PST 2011
On Thursday 24 February 2011 19:17:38 Soeren Sandmann wrote:
> Hi,
>
> Thanks for picking up the MIPS work. There are some comments from last
> time from Siarhei and myself that I don't think have been addressed. See
> these mails:
>
> http://lists.freedesktop.org/archives/pixman/2010-December/000773.html
> http://lists.freedesktop.org/archives/pixman/2010-September/000496.html
>
> - In Siarhei's testing, the new over_n_8_8888() on MIPS32r2 was slower
> than the C fast path. From
> http://lists.freedesktop.org/archives/pixman/2010-December/000773.html :
>
> "One of the reasons for such a slowdown in gnome-system-monitor test is
> that it uses 'over_n_8_8888' operation with the mask where 96.5% of
> values are zero. And your MIPS32R2 optimized code does not handle
> these special cases, always taking the slowest path [1]."
>
> Ie., the way to make over_n_8_8888() fast is to skip compositing
> whenever the mask is 0x00 or 0xff.
I'll try to add some more information here. A short summary of my review of
the proposed MIPS32r2 optimizations is the following:
1. The fill operation is just an unrolled loop which is only 1 instruction
shorter than the code generated by gcc when the same level of loop unrolling is
done in C code. If the loop is unrolled in C code (which would be beneficial
for all simple embedded processors), then the assembly code is going to be only
marginally faster when working with data in L1 cache. In more realistic
scenarios there will be no difference at all, because memory is slow.
2. The 'over_n_8_8888' MIPS32r2 fast path is practically equivalent to the C
code with the branches responsible for handling special cases removed. It might
show better results in a synthetic benchmark like 'lowlevel-blt-bench' exactly
because of the removed branches and because this benchmark tests the
translucent case only. But on real workloads it may be (and is) a loss. Even
considering the translucent case alone for this operation, there is one
possible optimization which brings a much bigger performance improvement (yes,
it was exactly the MIPS32r2 code from Georgi Beloev that inspired me to propose
this patch):
http://lists.freedesktop.org/archives/pixman/2010-September/000494.html
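To illustrate the special cases mentioned in point 2, here is a minimal sketch
of one 'over_n_8_8888' scanline that skips mask 0x00 and takes a shortcut for
mask 0xff. This is not the actual pixman code: the helper names mul_un8x4(),
over() and over_n_8_8888_line() are made up for this example, and the source
color is assumed to be premultiplied a8r8g8b8.

```c
#include <stddef.h>
#include <stdint.h>

/* Multiply all four 8-bit channels of x by a/255, with rounding.
 * Hypothetical helper, two channels are processed per 32-bit multiply. */
static uint32_t mul_un8x4(uint32_t x, uint8_t a)
{
    uint32_t rb = (x & 0x00ff00ff) * a + 0x00800080;
    uint32_t ag = ((x >> 8) & 0x00ff00ff) * a + 0x00800080;

    rb = ((rb + ((rb >> 8) & 0x00ff00ff)) >> 8) & 0x00ff00ff;
    ag = (ag + ((ag >> 8) & 0x00ff00ff)) & 0xff00ff00;
    return rb | ag;
}

/* OVER operator: src + dst * (255 - src_alpha) / 255.
 * Assumes premultiplied alpha, so the per-channel sum cannot overflow. */
static uint32_t over(uint32_t src, uint32_t dst)
{
    uint8_t inv_alpha = (uint8_t)(255 - (src >> 24));
    return src + mul_un8x4(dst, inv_alpha);
}

/* One scanline of a solid source composited through an a8 mask. */
static void over_n_8_8888_line(uint32_t *dst, const uint8_t *mask,
                               uint32_t src, size_t width)
{
    uint8_t src_alpha = (uint8_t)(src >> 24);

    for (size_t i = 0; i < width; i++) {
        uint8_t m = mask[i];

        if (m == 0x00)
            continue;                      /* transparent: dst untouched */
        if (m == 0xff)                     /* opaque mask: no mask multiply */
            dst[i] = (src_alpha == 0xff) ? src : over(src, dst[i]);
        else                               /* translucent: full computation */
            dst[i] = over(mul_un8x4(src, m), dst[i]);
    }
}
```

With a mask that is mostly zero (as in the gnome-system-monitor case quoted
above), almost all iterations take the first branch and never read or write
the destination.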
But if somebody really cares about pixman performance on MIPS32r2 and wants to
do something really impressive, then the use of prefetch should be considered.
I have already explained it in
http://lists.freedesktop.org/archives/pixman/2010-November/000749.html
I even attached a simple benchmark program which can demonstrate the
effect of using prefetch. Running it on MIPS 24Kc provides the
following results:
# gcc -O2 -march=mips32r2 testmemspeed.c
# time ./a.out
real 0m3.355s
user 0m3.330s
sys 0m0.020s
# gcc -O2 -march=mips32r2 -DTEST_PREFETCH testmemspeed.c
# time ./a.out
real 0m1.178s
user 0m1.150s
sys 0m0.020s
# gcc -O2 -march=mips32r2 -DTEST_COPY testmemspeed.c
# time ./a.out
real 0m5.425s
user 0m5.390s
sys 0m0.030s
# gcc -O2 -march=mips32r2 -DTEST_COPY -DTEST_PREFETCH testmemspeed.c
# time ./a.out
real 0m2.744s
user 0m2.710s
sys 0m0.030s
This confirms a roughly 3x speedup for memset-like code (3.355s vs 1.178s) and
a 2x speedup for memcpy-like code (5.425s vs 2.744s).
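For readers who do not want to open the attachment, the general idea of
prefetching in memset-like and memcpy-like loops can be sketched as below. This
is not the content of testmemspeed.c. It is only an illustration using the gcc
__builtin_prefetch() builtin, and the 32-byte cache line size and the prefetch
distance are assumptions which would need tuning on real hardware.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumptions for this sketch: 32-byte cache lines, prefetch 4 lines ahead. */
#define LINE_WORDS  8
#define AHEAD_BYTES (4 * LINE_WORDS * sizeof(uint32_t))

/* Address some bytes ahead of p, computed as an integer so that no
 * out-of-bounds pointer arithmetic is done. A prefetch does not fault. */
#define AHEAD(p) ((const void *)((uintptr_t)(p) + AHEAD_BYTES))

/* memset-like loop: one cache line per iteration, prefetch for write. */
static void fill32_prefetch(uint32_t *dst, uint32_t value, size_t n)
{
    size_t i = 0;

    for (; i + LINE_WORDS <= n; i += LINE_WORDS) {
        __builtin_prefetch(AHEAD(dst + i), 1, 0);
        for (size_t j = 0; j < LINE_WORDS; j++)
            dst[i + j] = value;
    }
    for (; i < n; i++)                     /* leftover words */
        dst[i] = value;
}

/* memcpy-like loop: prefetch the source for read. */
static void copy32_prefetch(uint32_t *dst, const uint32_t *src, size_t n)
{
    size_t i = 0;

    for (; i + LINE_WORDS <= n; i += LINE_WORDS) {
        __builtin_prefetch(AHEAD(src + i), 0, 0);
        for (size_t j = 0; j < LINE_WORDS; j++)
            dst[i + j] = src[i + j];
    }
    for (; i < n; i++)                     /* leftover words */
        dst[i] = src[i];
}
```

Which prefetch hint the compiler emits for MIPS, and how far ahead to prefetch,
has to be checked on the target. The numbers above are placeholders.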
To sum it up: in their current state, the MIPS32r2 assembly optimizations are
better left out of pixman. Veli-Matti apparently also came to exactly the same
conclusion, so there is no disagreement here.
> The same is likely also worthwhile even in the SIMD versions since
> memory access is so expensive.
Agreed, especially considering that the SIMD provided by MIPS DSP ASE is not
particularly wide. But I would not make it a strict requirement. Even being
faster than C code in some practical use cases should be good enough to get
this MIPS port of pixman started.
> And finally, while the lowlevel-blt benchmarks are convenient to use,
> they are also synthetic, it is also important to test the performance
> with real-world workloads such as those found in the cairo perf traces.
I think that 'lowlevel-blt-bench' can simply be extended to run 3 benchmarks
for each function: 'transparent', 'translucent' and 'opaque'. This will cover
many of the possible use cases. Of course it may be impossible to make all of
these cases fast at the same time. That's why the final decision should indeed
be based on benchmarking real-world workloads, which are approximated by the
cairo perf traces.
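A rough sketch of what the three benchmark cases could look like for a masked
operation is below. It is not taken from lowlevel-blt-bench. The names
mask_case, scanline_fn, fill_mask() and time_case() are invented for this
example, and the timing via clock() is deliberately simplistic.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <time.h>

/* The three proposed cases (hypothetical names). */
enum mask_case { MASK_TRANSPARENT, MASK_TRANSLUCENT, MASK_OPAQUE };

/* Signature assumed for the scanline function under test. */
typedef void (*scanline_fn)(uint32_t *dst, const uint8_t *mask,
                            uint32_t src, size_t width);

/* Fill the a8 mask with the value which is characteristic for the case. */
static void fill_mask(uint8_t *mask, size_t n, enum mask_case c)
{
    uint8_t v = 0x80;                      /* translucent by default */

    if (c == MASK_TRANSPARENT)
        v = 0x00;
    else if (c == MASK_OPAQUE)
        v = 0xff;
    memset(mask, v, n);
}

/* Run fn 'repeats' times for one case and return the CPU time in seconds. */
static double time_case(scanline_fn fn, uint32_t *dst, uint8_t *mask,
                        size_t width, enum mask_case c, int repeats)
{
    clock_t start;

    fill_mask(mask, width, c);
    start = clock();
    for (int i = 0; i < repeats; i++)
        fn(dst, mask, 0x80402010u, width); /* premultiplied test color */
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}
```

Reporting the three numbers side by side for each function would make it
visible when an optimization for the translucent case is paid for by the
transparent or opaque case.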
--
Best regards,
Siarhei Siamashka
-------------- next part --------------
A non-text attachment was scrubbed...
Name: testmemspeed.c
Type: text/x-csrc
Size: 2574 bytes
Desc: not available
URL: <http://lists.freedesktop.org/archives/pixman/attachments/20110224/2ea99a59/attachment-0001.c>