[Pixman] [PATCH 0/5] ARMv6: New fast path implementations that utilise prefetch
siarhei.siamashka at gmail.com
Thu Dec 27 04:05:32 PST 2012
On Fri, 21 Dec 2012 18:47:31 -0000
"Ben Avison" <bavison at riscosopen.org> wrote:
> Hello pixman mailing list. This is my first post, so I hope I'm
> following the list etiquette OK.
> I have been working on improving pixman's performance on ARMv6/ARM11.
> Specifically, I'm targeting the Raspberry Pi, which uses a BCM2835
> SoC, from the BCM2708 family. This uses an ARM1176JZF-S core, running
> at 700 MHz.
Hello. Welcome to the mailing list. It's great that you are working on
better ARMv6 support, because that's an area which definitely needs
more attention.
> General features of the ARM1176JZF-S are a 4-way set-associative L1
> data cache with cache line length of 8 words (128 bits) and a
> configurable size between 4KB and 64KB. The BCM2835 uses a L1 data
> cache size of 16KB, but also adds a Broadcom proprietary L2 cache of
> 128KB with cache lines of 16 words (256 bits) with flags to allow a
> cache line to be half valid.
> Empirical tests show that despite this, the write buffer operates at
> peak efficiency for 4-word aligned writes of 4 words. Even cacheline-
> aligned writes of 8 words are slower by over 30%; clearly there's some
> sort of shortcut happening in the 4-word case, because doing complete
> cache line fills takes several times longer per byte than 4-word writes
> do. The Raspberry Pi bootloader has an option to disable write-
> allocate for the L2 cache (disable_l2cache_witealloc [sic]), and I
> didn't find it had any noticeable effect on 4-word write speeds,
> giving further credibility to this. (In fact, if anything, timings
> were slightly worse when write-allocate was disabled due to an
> increased fraction of test runs falling into a secondary cluster with
> a longer test runtime.)
Are you sure you have really set this "disable_l2cache_writealloc=1"
option? It was initially introduced with a spelling error and only
corrected a bit later.
Any writes that are smaller than 16 bytes suffer from a severe
performance penalty without setting this option. The Raspberry Pi
people must have some really good reason for not disabling L2 cache
write-allocate by default. Because we can't expect the C compilers
to combine writes, this affects most of the compiled code.
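To make the write granularity concrete, here is a minimal C sketch
(illustrative only, not pixman code) of a fill loop that issues aligned
groups of four word stores, the pattern the measurements above show
running at peak efficiency. In hand-written ARMv6 assembly each group
would be a single STM of four registers:

```c
#include <stddef.h>
#include <stdint.h>

/* Fill a word buffer using 4-word (16-byte) groups of stores, the
 * granularity the BCM2835 write path handles at full speed. The
 * function name is a hypothetical example, not pixman API. */
static void fill32_by_16_bytes(uint32_t *dst, uint32_t value, size_t nwords)
{
    size_t i;
    for (i = 0; i + 4 <= nwords; i += 4) {
        /* four consecutive word stores; in ARMv6 assembly this
         * would be one STM of four registers */
        dst[i + 0] = value;
        dst[i + 1] = value;
        dst[i + 2] = value;
        dst[i + 3] = value;
    }
    for (; i < nwords; i++)   /* trailing words, if any */
        dst[i] = value;
}
```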
I have a benchmark program which tries different methods of
reading/copying/filling memory and gets the performance numbers.
It can discover various oddities in the memory subsystem. And I
have just created a wiki page with the results for some hardware.
The results for Raspberry Pi show that the "disable_l2cache_writealloc=1"
option clearly affects the results, especially for C code.
The code for the test program itself is here:
> I also saw no measurable difference between timings for the VFP
> register file compared to the main ARM register file: again, the
> optimum size was 4 32-bit registers (or 2 64-bit registers). Although
> the use of the VFP would ease register pressure on the ARM register
> file, in every case where we're short of registers, we
> actually want to do some integer manipulations of the pixel data so
> it's not of any benefit to use the VFP. It would also limit the
> usefulness of this implementation to ARM11s that have VFP fitted, so I
> have not pursued this avenue further.
> Additional testing of prefetching has identified marked differences in
> timings for different address patterns. In particular, there is a 50%
> speed penalty if the address is not in the first 2 words of each 8
> words: this has been tracked down to a fault in critical-word-first
> handling in the BCM2835 L2 cache. An even more extreme effect was
> observed if consecutive prefetches referenced the same address - this
> doubled the runtime (although I don't know if this is BCM2835 specific
> or not). Consequently, I have devised a prefetch scheme that is
> careful to prefetch only the addresses of the start of each cache
> line, and to only do so once per cache line.
> I am aware that some may question the targeting of BCM2835 specific
> cache behaviours in what is supposed to be a generic ARM11
> implementation.
There is nothing particularly wrong with optimizing for the specific
implementation. Especially considering that it is probably the only
relevant ARM11 based device in reasonably wide use at the moment.
> However, the cache line size is fixed at 8 words
> across ARM1136, ARM1156 and ARM1176, so this approach will not lead to
> any cache lines being omitted from prefetch on any ARM11, and the
> overhead of branching over an unwanted PLD instruction which would
> actually have completed in a trivial amount of time on an ARM11
> without the BCM2835's bugs should be minimal, so I think it's valid to
> propose this patch for all ARMv6 chips.
Another potential target is Tegra2, which uses ARM Cortex-A9 without
NEON. Cortex-A9 also uses a 32-byte cache line size, so the cache
line size assumptions should not cause any problems.
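The once-per-cache-line prefetch scheme described above can be sketched
in C as follows (a hedged illustration; `__builtin_prefetch` is the
GCC/Clang intrinsic that maps to PLD on ARM and a no-op elsewhere, and
the function name is made up for this example):

```c
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE 32  /* 8 words, fixed across ARM1136/1156/1176 */

/* Copy words while issuing at most one prefetch per source cache
 * line, always aimed at the start of a line, per the scheme above. */
static void copy_with_line_prefetch(uint32_t *dst, const uint32_t *src,
                                    size_t nwords)
{
    uintptr_t last_line = (uintptr_t)-1;
    for (size_t i = 0; i < nwords; i++) {
        uintptr_t line =
            (uintptr_t)&src[i] & ~(uintptr_t)(CACHE_LINE - 1);
        if (line != last_line) {
            /* first touch of this line: prefetch the start of the
             * next line, exactly once */
            __builtin_prefetch((const char *)line + CACHE_LINE);
            last_line = line;
        }
        dst[i] = src[i];
    }
}
```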
> My new ARMv6 fast paths are assembled using a hierarchy of assembly
> macros, in a method inspired by Siarhei's ARM NEON fast paths -
> although obviously the details are somewhat different. The majority of
> my time so far has been spent on optimising the memory reads and
> writes, since these dominate all but the more complex pixel processing
> steps. So far, I've only converted a handful of the most common
> operations into macro form: for the most part these correspond
> to blits and fills, plus the routines which had previously been
> included in pixman-arm-simd-asm.S as disassembled versions of C
> functions using inline assembler. However, I'm pleased to report that
> even in the L1 test where memory overheads are not an issue, these
> operations are seeing some improvements from processing more than one
> pixel at once, and by the use of the SEL instruction.
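For readers unfamiliar with SEL: it selects each result byte from one
of two operands according to the per-byte GE flags set by instructions
such as USUB8, so four 8-bit lanes are handled per instruction pair.
A plain-C model of one common USUB8 + SEL idiom, a per-byte maximum
(illustrative only, not code from the patches):

```c
#include <stdint.h>

/* C model of the ARMv6 USUB8 + SEL idiom: per-byte maximum of two
 * words each holding four packed 8-bit values. On ARMv6, USUB8 sets
 * one GE flag per byte lane and SEL then picks each result byte from
 * one operand or the other based on that flag. */
static uint32_t packed_max_u8x4(uint32_t a, uint32_t b)
{
    uint32_t r = 0;
    for (int lane = 0; lane < 4; lane++) {
        uint8_t ba = (a >> (lane * 8)) & 0xff;
        uint8_t bb = (b >> (lane * 8)) & 0xff;
        /* GE flag for this lane is set when ba - bb doesn't borrow,
         * so SEL picks the larger byte */
        uint8_t sel = (ba >= bb) ? ba : bb;
        r |= (uint32_t)sel << (lane * 8);
    }
    return r;
}
```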
> One minor change in functionality that I should note is that
> previously the top level function pixman_blt() was a no-op on ARMv6,
> because neither the armv6 nor the generic C fast path sources filled
> in the "blt" field in their pixman_implementation_t structure. I've
> fixed this, for what it's worth (though I'm assuming that not much
> software can have been using it if it escaped notice before).
The function pixman_blt() is used by the X server. If pixman returns
FALSE, it falls back to generic C loops or memcpy/memmove. So there is
nothing really wrong with it, just the performance is not great (mostly
for scrolling and moving windows). Improvements here are welcome.
> I'm distributing the code as it is now to give people a chance to play
> with it over the Christmas break.
Thanks. There is one problem though. It looks like the patches got
corrupted by the mailer. It's possible to see the idea, but they
can't be applied and tested. Please use "git send-email" next time.
> Some of the things I intend to tackle in the new year are:
> * no thought has been put into 24bpp formats yet
You can just provide optimized fetch/store iterators for it. I don't
think anybody seriously uses this format. In the case of NEON code, it
was implemented just because we could (and because NEON has some
convenient instructions for it), but not because it was strictly
needed for anything.
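In plain C, a fetch iterator for 24bpp just expands each 3-byte pixel
into the a8r8g8b8 temporary buffer. A sketch (the function name and
the r,g,b byte order are illustrative assumptions; pixman's actual
r8g8b8 layout depends on endianness):

```c
#include <stdint.h>

/* Expand one scanline of 3-byte r8g8b8 pixels into the a8r8g8b8
 * temporary buffer used by pixman's general pipeline. An optimized
 * ARMv6 version would replace this inner loop. */
static void fetch_r8g8b8(const uint8_t *src, uint32_t *dst, int width)
{
    for (int i = 0; i < width; i++) {
        uint8_t r = src[i * 3 + 0];
        uint8_t g = src[i * 3 + 1];
        uint8_t b = src[i * 3 + 2];
        /* force opaque alpha, since r8g8b8 has no alpha channel */
        dst[i] = 0xff000000u | (uint32_t)r << 16 | (uint32_t)g << 8 | b;
    }
}
```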
> * there is no support for scaled plots, either nearest-neighbour or
> bilinear-interpolation as yet; in fact the two scaled blit routines
> are the sole survivors of the old pixman-arm-simd-asm.S at present
The nearest scaled blit routines can benefit from the 16-byte store as
the comments imply. The processors newer than ARM11 have a properly
working write-combining buffer and can handle small writes just fine.
The bilinear code needs horizontal and vertical interpolation split
into separate loops for good performance, especially on non-SIMD
processors.
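The split can be illustrated for a single channel with 8-bit
fixed-point weights in 0..256 (a sketch of the idea, not the pixman
implementation):

```c
#include <stdint.h>

/* Bilinear interpolation split into a horizontal pass (one lerp per
 * source row) and a vertical pass (blend of the two horizontal
 * results), for one 8-bit channel. tl/tr/bl/br are the four
 * neighbouring samples; wx/wy are 0..256 fixed-point weights. */
static uint32_t bilinear_channel(uint32_t tl, uint32_t tr,
                                 uint32_t bl, uint32_t br,
                                 uint32_t wx, uint32_t wy)
{
    /* horizontal pass */
    uint32_t top = (tl * (256 - wx) + tr * wx) >> 8;
    uint32_t bot = (bl * (256 - wx) + br * wx) >> 8;
    /* vertical pass; in a real split-loop implementation the
     * horizontal results are computed once per scanline and reused
     * for every output row that blends the same two source rows */
    return (top * (256 - wy) + bot * wy) >> 8;
}
```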
> * the number of fast path operations is small compared to other
> implementations; I'm targeting eventual parity with the number of
> operations in the NEON case
I'm not sure if that's really a good idea. Nowadays pixman has better
fetch/writeback iterators, which avoid doing obviously redundant
operations and can, for example, perform compositing directly in the
destination buffer. They also do not try to fetch and convert data
from the destination buffer to a8r8g8b8 format in a temporary buffer
when this data is not needed for the compositing operation.
Søren did a good job on improving this code. Of course there are still
some cases which need special fast paths, like src_0565_0565 (so that
we don't do the conversion for the source from r5g6b5 to the a8r8g8b8
format and then back to r5g6b5). But the need for many fast paths has
been greatly reduced.
I would suggest first implementing OVER and ADD combiners and
fetch/store iterators for r5g6b5 and 24-bit format with ARMv6 assembly.
And then use this code as the baseline for benchmarking. If you can
significantly improve performance by adding extra fast paths, then
go for it.
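As a reference for what such r5g6b5 iterators do per pixel, here is a
plain-C sketch of both conversion directions (the high-bit replication
on expansion is the usual way to fill the wider channels; this is
illustrative, not pixman's actual code):

```c
#include <stdint.h>

/* Expand one r5g6b5 pixel to a8r8g8b8, replicating the high bits of
 * each channel into the low bits so 0x1f maps to 0xff, etc. */
static uint32_t expand_0565_to_8888(uint16_t p)
{
    uint32_t r = (p >> 11) & 0x1f;
    uint32_t g = (p >> 5) & 0x3f;
    uint32_t b = p & 0x1f;
    r = (r << 3) | (r >> 2);
    g = (g << 2) | (g >> 4);
    b = (b << 3) | (b >> 2);
    return 0xff000000u | (r << 16) | (g << 8) | b;
}

/* Pack one a8r8g8b8 pixel back to r5g6b5 by truncating each channel
 * to its top bits; the alpha channel is simply discarded. */
static uint16_t pack_8888_to_0565(uint32_t p)
{
    return (uint16_t)(((p >> 8) & 0xf800) |
                      ((p >> 5) & 0x07e0) |
                      ((p >> 3) & 0x001f));
}
```

An ARMv6 assembly version would process several pixels per load/store
pair, but the per-pixel arithmetic is the same.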
> To give you some idea of the improvements represented by this patch,
> please see the numbers below. These represent samples of 100 runs of
> lowlevel-blt-bench, and are comparing the head revision from git
> against the same with these patches applied. It seems that
> lowlevel-blt-bench is not very good at measuring the fastest
> operations, as a large proportional random error creeps in - I'm
> guessing it's to do with the way it tries to cancel out the function
> call overhead.
> All the results except those marked with (*) pass a statistical
> significance test (Student's independent two-sample t-test). I hope
> you'll agree that these are good results; the only fly in the ointment
> is the L1 test results for the three blit routines. In these cases,
> I'm competing against C fast path implementations that use memcpy() to
> do the blit, where memcpy() is already somewhat hand-tuned (although
> the tuning it does is obviously much less suited to memory-bound
> operations than the one I present here).
One problem with lowlevel-blt-bench is that it can't be used for
benchmarking the OVER compositing operation (because shortcuts are
possible depending on whether the pixel data is completely
opaque or transparent). But the format conversions and scaling
operations can be used with lowlevel-blt-bench just fine.
But a much more suitable tool is cairo-perf-trace, because it is based
on replays of traces of real applications and provides reasonably
realistic workloads.
Yes, the current traces from http://cgit.freedesktop.org/cairo-traces
can't be effectively used on really slow embedded systems such as ARM11
and MIPS32. It could take a day just to run the benchmark. That's why I
tried to take these traces and trim them to make the execution time
reasonable on Raspberry Pi and other slow systems.
It's quite possible that the relevant parts of the traces got thrown
away, but IMHO it's better to use such traces than nothing at all.
And I also have added some sample scripts for automatic fetching and
compiling cairo and pixman code.
And as an example, here are the results of running the cairo-perf-trace
benchmark on Raspberry Pi with PIXMAN_DISABLE="arm-simd arm-neon" in
the default configuration:
[ # ] backend test min(s) median(s) stddev. count
[ # ] image: pixman 0.29.1
[ 0] image t-firefox-planet-gnome 11.934 11.985 0.50% 15/15
[ 1] image t-chromium-tabs 4.646 4.668 0.31% 14/15
[ 2] image t-gvim 17.383 17.801 1.08% 13/15
[ 3] image t-gnome-system-monitor 35.758 35.901 0.15% 12/15
[ 4] image t-firefox-canvas-alpha 22.908 23.204 1.19% 15/15
[ 5] image t-firefox-scrolling 30.298 30.389 0.41% 14/15
[ 6] image t-firefox-chalkboard 39.178 39.208 0.41% 15/15
[ 7] image t-swfdec-youtube 10.412 10.499 0.56% 13/15
[ 8] image t-firefox-particles 25.069 25.203 0.49% 15/15
[ 9] image t-firefox-fishbowl 25.389 25.433 0.65% 14/15
[ 10] image t-firefox-fishtank 20.297 20.365 0.47% 14/15
[ 11] image t-xfce4-terminal-a1 33.657 33.815 0.52% 15/15
[ 12] image t-poppler-reseau 22.259 22.392 0.85% 15/15
[ 13] image t-grads-heat-map 3.973 3.988 0.20% 12/15
[ 14] image t-firefox-talos-gfx 25.519 25.952 0.81% 15/15
[ 15] image t-firefox-asteroids 16.815 16.848 0.43% 14/15
[ 16] image t-midori-zoomed 10.443 10.506 0.83% 14/15
[ 17] image t-swfdec-giant-steps 31.693 31.811 0.51% 15/15
[ 18] image t-firefox-talos-svg 19.272 19.390 0.49% 14/15
[ 19] image t-poppler 13.309 13.381 0.38% 14/15
[ 20] image t-firefox-canvas 18.571 18.758 0.64% 14/15
[ 21] image t-evolution 19.365 19.551 6.97% 15/15
[ 22] image t-firefox-paintball 24.763 25.010 0.99% 15/15
[ 23] image t-gnome-terminal-vim 20.473 20.588 0.45% 13/15
Now all the same, but with "disable_l2cache_writealloc=1" option set:
[ # ] backend test min(s) median(s) stddev. count
[ # ] image: pixman 0.29.1
[ 0] image t-firefox-planet-gnome 11.413 11.467 0.21% 14/15
[ 1] image t-chromium-tabs 4.446 4.462 0.20% 14/15
[ 2] image t-gvim 17.101 17.658 1.40% 13/15
[ 3] image t-gnome-system-monitor 28.664 29.051 0.80% 15/15
[ 4] image t-firefox-canvas-alpha 21.863 22.358 1.21% 15/15
[ 5] image t-firefox-scrolling 28.029 28.102 0.56% 14/15
[ 6] image t-firefox-chalkboard 38.478 38.517 0.34% 15/15
[ 7] image t-swfdec-youtube 10.172 10.363 1.72% 14/15
[ 8] image t-firefox-particles 24.491 24.551 0.54% 15/15
[ 9] image t-firefox-fishbowl 24.297 24.337 0.72% 14/15
[ 10] image t-firefox-fishtank 19.875 19.897 0.05% 13/15
[ 11] image t-xfce4-terminal-a1 25.915 26.233 1.02% 15/15
[ 12] image t-poppler-reseau 21.760 21.993 0.59% 14/15
[ 13] image t-grads-heat-map 3.998 4.025 0.39% 14/15
[ 14] image t-firefox-talos-gfx 24.760 25.272 1.10% 15/15
[ 15] image t-firefox-asteroids 12.846 12.952 0.94% 14/15
[ 16] image t-midori-zoomed 8.656 8.804 0.78% 14/15
[ 17] image t-swfdec-giant-steps 18.583 18.617 0.43% 14/15
[ 18] image t-firefox-talos-svg 18.819 18.888 0.17% 13/15
[ 19] image t-poppler 13.042 13.140 0.71% 14/15
[ 20] image t-firefox-canvas 18.050 18.093 0.23% 13/15
[ 21] image t-evolution 11.928 13.521 8.69% 14/15
[ 22] image t-firefox-paintball 22.239 22.539 0.70% 13/15
[ 23] image t-gnome-terminal-vim 17.469 17.833 1.22% 15/15
image t-swfdec-giant-steps (31810.97 0.51%) -> (18617.39 0.43%): 1.71x speedup
image t-evolution (19550.96 6.97%) -> (13521.37 8.69%): 1.62x speedup
image t-firefox-asteroids (16848.29 0.43%) -> (12951.88 0.94%): 1.31x speedup
image t-xfce4-terminal-a1 (33814.57 0.52%) -> (26232.78 1.02%): 1.30x speedup
image t-gnome-system-monitor (35901.36 0.15%) -> (29050.68 0.80%): 1.25x speedup
image t-midori-zoomed (10505.92 0.83%) -> (8804.26 0.78%): 1.21x speedup
image t-gnome-terminal-vim (20588.36 0.45%) -> (17832.86 1.22%): 1.17x speedup
image t-firefox-paintball (25010.15 0.99%) -> (22538.66 0.70%): 1.11x speedup
image t-firefox-scrolling (30389.01 0.41%) -> (28101.54 0.56%): 1.08x speedup
This is just an additional confirmation that the L2 cache
seems to be misconfigured by default in Raspberry Pi and
"disable_l2cache_writealloc=1" option makes a big difference.