[Pixman] [PATCH 0/5] ARMv6: New fast path implementations that utilise prefetch

Thu Dec 27 04:05:32 PST 2012

On Fri, 21 Dec 2012 18:47:31 -0000
"Ben Avison" <bavison at riscosopen.org> wrote:

> Hello pixman mailing list. This is my first post, so I hope I'm
> following the list etiquette OK.
> 
> I have been working on improving pixman's performance on ARMv6/ARM11.
> Specifically, I'm targeting the Raspberry Pi, which uses a BCM2835
> SoC, from the BCM2708 family. This uses an ARM1176JZF-S core, running
> at 700 MHz.

Hello. Welcome to the mailing list. It's great that you are working on
better ARMv6 support, because that's an area which definitely needs
improvement.

> General features of the ARM1176JZF-S are a 4-way set-associative L1
> data cache with cache line length of 8 words (256 bits) and a
> configurable size between 4KB and 64KB. The BCM2835 uses an L1 data
> cache size of 16KB, but also adds a Broadcom proprietary L2 cache of
> 128KB with cache lines of 16 words (512 bits) with flags to allow a
> cache line to be half valid.
> 
> Empirical tests show that despite this, the write buffer operates at
> peak efficiency for 4-word aligned writes of 4 words. Even cacheline-
> aligned writes of 8 words are slower by over 30%; clearly there's some
> sort of shortcut happening in the 4-word case, because doing complete
> cache line fills takes several times longer per byte than 4-word writes
> do. The Raspberry Pi bootloader has an option to disable write-
> allocate for the L2 cache (disable_l2cache_witealloc [sic]), and I
> didn't find it had any noticeable effect on 4-word write speeds,
> giving further credibility to this. (In fact, if anything, timings
> were slightly worse when write-allocate was disabled due to an
> increased fraction of test runs falling into a secondary cluster with
> a longer test runtime.)

Are you sure you have really set this "disable_l2cache_writealloc=1"
option? Because it was initially introduced with a spelling error:
    https://github.com/raspberrypi/firmware/commit/43b688a2e0a3afd6
and corrected a bit later:
    https://github.com/raspberrypi/firmware/commit/be61355f9e7e28a5

Any writes that are smaller than 16 bytes suffer from a severe
performance penalty without setting this option. The Raspberry Pi
people must have some really good reason for not disabling L2 cache
write-allocate by default. Because we can't expect the C compilers
to combine writes, this affects most of the compiled code.

I have a benchmark program which tries different methods of
reading/copying/filling memory and gets the performance numbers.
It can discover various oddities in the memory subsystem. And I
have just created a wiki page with the results for some hardware:
    https://github.com/ssvb/tinymembench/wiki

The results for Raspberry Pi show that the
"disable_l2cache_writealloc=1" option affects the results, especially
for C code.

The code for the test program itself is here:
    https://github.com/ssvb/tinymembench

> I also saw no measurable difference between timings for the VFP
> register file compared to the main ARM register file: again, the
> optimum size was 4 32-bit registers (or 2 64-bit registers). Although
> the use of the VFP would ease register pressure on the ARM register
> file, in every case where we're actually short of registers, we
> actually want to do some integer manipulations of the pixel data so
> it's not of any benefit to use the VFP. It would also limit the
> usefulness of this implementation to ARM11s that have VFP fitted, so I
> have not pursued this avenue further.
> 
> Additional testing of prefetching has identified marked differences in
> timings for different address patterns. In particular, there is a 50%
> speed penalty if the address is not in the first 2 words of each 8
> words: this has been tracked down to a fault in critical-word-first
> handling in the BCM2835 L2 cache. An even more extreme effect was
> observed if consecutive prefetches referenced the same address - this
> doubled the runtime (although I don't know if this is BCM2835 specific
> or not). Consequently, I have devised a prefetch scheme that is
> careful to prefetch only the addresses of the start of each cache
> line, and to only do so once per cache line.
> 
> I am aware that some may question the targeting of BCM2835 specific
> cache behaviours in what is supposed to be a generic ARM11
> implementation.

There is nothing particularly wrong with optimizing for a specific
implementation, especially considering that it is probably the only
relevant ARM11 based device in reasonably wide use at the moment.

> However, the cache line size is fixed at 8 words
> across ARM1136, ARM1156 and ARM1176, so this approach will not lead to
> any cache lines being omitted from prefetch on any ARM11, and the
> overhead of branching over an unwanted PLD instruction which would
> actually have completed in a trivial amount of time on an ARM11
> without the BCM2835's bugs should be minimal, so I think it's valid to
> propose this patch for all ARMv6 chips.

Another potential target is Tegra2, which uses ARM Cortex-A9 without
NEON. Cortex-A9 also uses a 32-byte cache line, so the cache line
size assumptions should not cause any problems.
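
As a rough illustration of the "one prefetch per cache line, always at
the start of the line" rule described above, here is a minimal Python
sketch. The function name and the example address are hypothetical; the
only thing taken from the discussion is the 32-byte (8-word) line size.
It models the address arithmetic only, not the actual PLD scheduling in
the assembly.

```python
CACHE_LINE_BYTES = 32  # 8 words of 4 bytes, the ARM11 line size quoted above


def prefetch_addresses(start, length):
    """Return one prefetch address per cache line touched by
    the byte range [start, start + length), each aligned to the
    start of its cache line (hypothetical helper, for illustration)."""
    if length <= 0:
        return []
    mask = ~(CACHE_LINE_BYTES - 1)
    first = start & mask                  # round down to the line start
    last = (start + length - 1) & mask    # line holding the final byte
    return list(range(first, last + CACHE_LINE_BYTES, CACHE_LINE_BYTES))


# A 100-byte span starting 4 bytes into a line touches 4 lines.
addrs = prefetch_addresses(0x1004, 100)
print([hex(a) for a in addrs])
```

Every address is line-aligned and appears exactly once, which avoids
both of the slow patterns mentioned in the quoted text (prefetching a
late word in the line, and prefetching the same address twice).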

> My new ARMv6 fast paths are assembled using a hierarchy of assembly
> macros, in a method inspired by Siarhei's ARM NEON fast paths -
> although obviously the details are somewhat different. The majority of
> my time so far has been spent on optimising the memory reads and
> writes, since these dominate all but the more complex pixel processing
> steps. So far, I've only converted a handful of the most common
> operations into macro form: for the most part these correspond
> to blits and fills, plus the routines which had previously been
> included in pixman-arm-simd-asm.S as disassembled versions of C
> functions using inline assembler. However, I'm pleased to report that
> even in the L1 test where memory overheads are not an issue, these
> operations are seeing some improvements from processing more than one
> pixel at once, and by the use of the SEL instruction.
>
> One minor change in functionality that I should note is that
> previously the top level function pixman_blt() was a no-op on ARMv6,
> because neither the armv6 nor the generic C fast path sources filled
> in the "blt" field in their pixman_implementation_t structure. I've
> fixed this, for what it's worth (though I'm assuming that not much
> software can have been using it if it escaped notice before).

The function pixman_blt() is used by the X server. If pixman returns
FALSE, it falls back to generic C loops or memcpy/memmove. So nothing
is actually broken; the performance is just not great (mostly for
scrolling and moving windows). Improvements here are definitely
welcome.

> I'm distributing the code as it is now to give people a chance to play
> with it over the Christmas break.

Thanks. There is one problem though. It looks like the patches got
corrupted by the mailer. It's possible to see the idea, but they
can't be applied and tested. Please use "git send-email" next time.

> Some of the things I intend to tackle in the new year are:
> * no thought has been put into 24bpp formats yet

You can just provide optimized fetch/store iterators for it. I don't
think anybody seriously uses this format. In the case of NEON code, it
was implemented just because we could (and because NEON has some
convenient instructions for it), but not because it was strictly
needed for anything.

> * there is no support for scaled plots, either nearest-neighbour or
>    bilinear-interpolation as yet; in fact the two scaled blit routines
>    are the sole survivors of the old pixman-arm-simd-asm.S at present

The nearest scaled blit routines can benefit from the 16-byte store as
the comments imply. The processors newer than ARM11 have a properly
working write-combining buffer and can handle small writes just fine.

The bilinear code needs horizontal and vertical interpolation split
into separate loops for good performance, especially on non-SIMD
platforms:
    http://lists.freedesktop.org/archives/pixman/2012-June/002140.html
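
To make the "separate loops" idea concrete, here is a small Python
sketch for a single colour channel. It is only an illustration under
assumptions: the helper names, the 8-bit weights and the order of the
passes (vertical first, then horizontal) are my own choices here, not
necessarily what the linked message or pixman does.

```python
def lerp(a, b, w, bits=8):
    """Blend a and b with an integer weight w in [0, 2**bits]."""
    one = 1 << bits
    return (a * (one - w) + b * w) >> bits


def scale_row_bilinear(top, bottom, wy, xs, bits=8):
    """Produce one output row from two source rows.

    top, bottom: source rows (lists of channel values)
    wy: vertical weight of the bottom row
    xs: fixed-point source x coordinate for each output pixel
    """
    # Pass 1 (vertical loop): blend the two rows once per source column.
    tmp = [lerp(t, b, wy, bits) for t, b in zip(top, bottom)]

    # Pass 2 (horizontal loop): blend neighbouring columns of tmp.
    frac_mask = (1 << bits) - 1
    out = []
    for x in xs:
        i, wx = x >> bits, x & frac_mask
        right = tmp[min(i + 1, len(tmp) - 1)]  # clamp at the right edge
        out.append(lerp(tmp[i], right, wx, bits))
    return out


row = scale_row_bilinear([0, 64, 128], [64, 128, 192], 128,
                         [0, 128, 256, 384, 512])
print(row)
```

The point of the split is that the vertical blend is done once per
source column and then reused by every output pixel that samples it,
instead of being recomputed inside a single combined loop.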

> * the number of fast path operations is small compared to other
>    implementations; I'm targeting eventual parity with the number of
>    operations in the NEON case

I'm not sure that's really a good idea. Nowadays pixman has better
fetch/write back iterators, which avoid doing obviously redundant
operations and can, for example, perform compositing directly in the
destination buffer. They also do not try to fetch and convert data
from the destination buffer to a8r8g8b8 format in the temporary buffer
if this data is not needed for the compositing operation.
Søren did a good job on improving this code. Of course there are still
some cases which need special fast paths, like src_0565_0565 (so that
we don't do the conversion for the source from r5g6b5 to the a8r8g8b8
format and then back to r5g6b5). But the need for many fast paths has
been greatly reduced.
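
To show why a dedicated src_0565_0565 path is worth having, here is a
short Python sketch of the round trip that the generic path would
perform. The function names are hypothetical and the bit-replication
expansion is a common convention that I am assuming here, not a quote
of pixman's code.

```python
def r5g6b5_to_a8r8g8b8(p):
    """Expand a 16-bit r5g6b5 pixel to opaque 32-bit a8r8g8b8."""
    r5, g6, b5 = (p >> 11) & 0x1F, (p >> 5) & 0x3F, p & 0x1F
    r8 = (r5 << 3) | (r5 >> 2)   # replicate top bits into the low bits
    g8 = (g6 << 2) | (g6 >> 4)
    b8 = (b5 << 3) | (b5 >> 2)
    return 0xFF000000 | (r8 << 16) | (g8 << 8) | b8


def a8r8g8b8_to_r5g6b5(p):
    """Pack a 32-bit a8r8g8b8 pixel back down to r5g6b5 by truncation."""
    r8, g8, b8 = (p >> 16) & 0xFF, (p >> 8) & 0xFF, p & 0xFF
    return ((r8 >> 3) << 11) | ((g8 >> 2) << 5) | (b8 >> 3)


# The round trip is the identity for every possible 16-bit pixel,
# so for a plain copy both conversions are wasted work.
unchanged = all(a8r8g8b8_to_r5g6b5(r5g6b5_to_a8r8g8b8(p)) == p
                for p in range(1 << 16))
print(unchanged)
```

Since the result is bit-for-bit the source pixel, a direct 16-bit copy
gives the same output without touching the intermediate format.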

I would suggest first implementing the OVER and ADD combiners and
fetch/store iterators for the r5g6b5 and 24-bit formats in ARMv6
assembly, and then using this code as the baseline for benchmarking.
If you can significantly improve performance by adding extra fast
paths, then go for it.

> To give you some idea of the improvements represented by this patch,
> please see the numbers below. These represent samples of 100 runs of
> lowlevel-blt-bench, and are comparing the head revision from git
> against the same with these patches applied. It seems that
> lowlevel-blt-bench is not very good at measuring the fastest
> operations, as a large proportional random error creeps in - I'm
> guessing it's to do with the way it tries to cancel out the function
> call overhead.
> 
> All the results except those marked with (*) pass a statistical
> significance test (Student's independent two-sample t-test). I hope
> you'll agree that these are good results; the only fly in the ointment
> is the L1 test results for the three blit routines. In these cases,
> I'm competing against C fast path implementations that use memcpy() to
> do the blit, where memcpy() is already somewhat hand-tuned (although
> the tuning it does is obviously much less suited to memory-bound
> operations than the one I present here).
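
For readers unfamiliar with the significance test mentioned above, this
is a minimal Python sketch of Student's independent two-sample t
statistic with pooled variance. The sample values are invented for
illustration and are not lowlevel-blt-bench results; the function name
is hypothetical as well.

```python
from math import sqrt
from statistics import mean, variance


def student_t(a, b):
    """Return (t, degrees_of_freedom) for two independent samples,
    assuming equal variances (the classic pooled-variance form)."""
    na, nb = len(a), len(b)
    df = na + nb - 2
    pooled = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / df
    t = (mean(a) - mean(b)) / sqrt(pooled * (1 / na + 1 / nb))
    return t, df


# Made-up timings (arbitrary units) for "before" and "after" runs.
before = [12.0, 12.1, 11.9, 12.2, 11.8]
after = [10.0, 10.2, 9.8, 10.1, 9.9]

t, df = student_t(before, after)
# With 8 degrees of freedom the two-sided 5% critical value is about
# 2.306, so a |t| far above that indicates a significant difference.
print(t, df)
```

With 100 runs per sample, as in the quoted measurements, the degrees of
freedom would be 198 and the critical value correspondingly smaller.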

One problem with lowlevel-blt-bench is that it can't be used for
benchmarking the OVER compositing operation (because some shortcuts
are possible, depending on the pixel data, if it is completely
opaque or transparent). But the format conversions and scaling
operations can be benchmarked with lowlevel-blt-bench just fine.

But a much more suitable tool is cairo-perf-trace, because it is based
on replays of the traces of real applications and provides reasonably
realistic results:
    http://cworth.org/intel/performance_measurement/

Admittedly, the current traces from http://cgit.freedesktop.org/cairo-traces
can't be used effectively on really slow embedded systems such as ARM11
and MIPS32. It could take a day just to run the benchmark. That's why I
tried to take these traces and trim them to make the execution time
reasonable on the Raspberry Pi and on other slow systems:
    http://cgit.freedesktop.org/~siamashka/trimmed-cairo-traces/

It's quite possible that the relevant parts of the traces got thrown
away, but IMHO it's better to use such traces than nothing at all.
I have also added some sample scripts for automatically fetching and
compiling the cairo and pixman code.

And as an example, running the cairo-perf-trace benchmark on
Raspberry Pi with PIXMAN_DISABLE="arm-simd arm-neon" in the default
configuration:

[ # ]  backend                         test   min(s) median(s) stddev. count
[ # ]    image: pixman 0.29.1
[  0]    image       t-firefox-planet-gnome   11.934   11.985   0.50%   15/15
[  1]    image              t-chromium-tabs    4.646    4.668   0.31%   14/15
[  2]    image                       t-gvim   17.383   17.801   1.08%   13/15
[  3]    image       t-gnome-system-monitor   35.758   35.901   0.15%   12/15
[  4]    image       t-firefox-canvas-alpha   22.908   23.204   1.19%   15/15
[  5]    image          t-firefox-scrolling   30.298   30.389   0.41%   14/15
[  6]    image         t-firefox-chalkboard   39.178   39.208   0.41%   15/15
[  7]    image             t-swfdec-youtube   10.412   10.499   0.56%   13/15
[  8]    image          t-firefox-particles   25.069   25.203   0.49%   15/15
[  9]    image           t-firefox-fishbowl   25.389   25.433   0.65%   14/15
[ 10]    image           t-firefox-fishtank   20.297   20.365   0.47%   14/15
[ 11]    image          t-xfce4-terminal-a1   33.657   33.815   0.52%   15/15
[ 12]    image             t-poppler-reseau   22.259   22.392   0.85%   15/15
[ 13]    image             t-grads-heat-map    3.973    3.988   0.20%   12/15
[ 14]    image          t-firefox-talos-gfx   25.519   25.952   0.81%   15/15
[ 15]    image          t-firefox-asteroids   16.815   16.848   0.43%   14/15
[ 16]    image              t-midori-zoomed   10.443   10.506   0.83%   14/15
[ 17]    image         t-swfdec-giant-steps   31.693   31.811   0.51%   15/15
[ 18]    image          t-firefox-talos-svg   19.272   19.390   0.49%   14/15
[ 19]    image                    t-poppler   13.309   13.381   0.38%   14/15
[ 20]    image             t-firefox-canvas   18.571   18.758   0.64%   14/15
[ 21]    image                  t-evolution   19.365   19.551   6.97%   15/15
[ 22]    image          t-firefox-paintball   24.763   25.010   0.99%   15/15
[ 23]    image         t-gnome-terminal-vim   20.473   20.588   0.45%   13/15

Now the same again, but with the "disable_l2cache_writealloc=1" option set:

[ # ]  backend                         test   min(s) median(s) stddev. count
[ # ]    image: pixman 0.29.1
[  0]    image       t-firefox-planet-gnome   11.413   11.467   0.21%   14/15
[  1]    image              t-chromium-tabs    4.446    4.462   0.20%   14/15
[  2]    image                       t-gvim   17.101   17.658   1.40%   13/15
[  3]    image       t-gnome-system-monitor   28.664   29.051   0.80%   15/15
[  4]    image       t-firefox-canvas-alpha   21.863   22.358   1.21%   15/15
[  5]    image          t-firefox-scrolling   28.029   28.102   0.56%   14/15
[  6]    image         t-firefox-chalkboard   38.478   38.517   0.34%   15/15
[  7]    image             t-swfdec-youtube   10.172   10.363   1.72%   14/15
[  8]    image          t-firefox-particles   24.491   24.551   0.54%   15/15
[  9]    image           t-firefox-fishbowl   24.297   24.337   0.72%   14/15
[ 10]    image           t-firefox-fishtank   19.875   19.897   0.05%   13/15
[ 11]    image          t-xfce4-terminal-a1   25.915   26.233   1.02%   15/15
[ 12]    image             t-poppler-reseau   21.760   21.993   0.59%   14/15
[ 13]    image             t-grads-heat-map    3.998    4.025   0.39%   14/15
[ 14]    image          t-firefox-talos-gfx   24.760   25.272   1.10%   15/15
[ 15]    image          t-firefox-asteroids   12.846   12.952   0.94%   14/15
[ 16]    image              t-midori-zoomed    8.656    8.804   0.78%   14/15
[ 17]    image         t-swfdec-giant-steps   18.583   18.617   0.43%   14/15
[ 18]    image          t-firefox-talos-svg   18.819   18.888   0.17%   13/15
[ 19]    image                    t-poppler   13.042   13.140   0.71%   14/15
[ 20]    image             t-firefox-canvas   18.050   18.093   0.23%   13/15
[ 21]    image                  t-evolution   11.928   13.521   8.69%   14/15
[ 22]    image          t-firefox-paintball   22.239   22.539   0.70%   13/15
[ 23]    image         t-gnome-terminal-vim   17.469   17.833   1.22%   15/15

Speedups
========
image   t-swfdec-giant-steps (31810.97 0.51%) -> (18617.39 0.43%): 1.71x speedup
image            t-evolution (19550.96 6.97%) -> (13521.37 8.69%): 1.62x speedup
image    t-firefox-asteroids (16848.29 0.43%) -> (12951.88 0.94%): 1.31x speedup
image    t-xfce4-terminal-a1 (33814.57 0.52%) -> (26232.78 1.02%): 1.30x speedup
image t-gnome-system-monitor (35901.36 0.15%) -> (29050.68 0.80%): 1.25x speedup
image        t-midori-zoomed (10505.92 0.83%) ->  (8804.26 0.78%): 1.21x speedup
image   t-gnome-terminal-vim (20588.36 0.45%) -> (17832.86 1.22%): 1.17x speedup
image    t-firefox-paintball (25010.15 0.99%) -> (22538.66 0.70%): 1.11x speedup
image    t-firefox-scrolling (30389.01 0.41%) -> (28101.54 0.56%): 1.08x speedup
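
The ratios in the list above appear to match the min(s) columns of the
two tables rather than the median values printed in parentheses. Here
is a small Python sketch that reproduces a few of them under that
assumption; the dictionary layout and function name are hypothetical,
while the timings are copied from the tables.

```python
# min(s) values copied from the two result tables above.
before = {
    "t-swfdec-giant-steps": 31.693,
    "t-evolution": 19.365,
    "t-firefox-asteroids": 16.815,
}
after = {
    "t-swfdec-giant-steps": 18.583,
    "t-evolution": 11.928,
    "t-firefox-asteroids": 12.846,
}


def speedups(before, after):
    """Return {test: old_time / new_time}, largest speedup first."""
    ratios = {name: before[name] / after[name] for name in before}
    return dict(sorted(ratios.items(), key=lambda kv: kv[1], reverse=True))


for name, ratio in speedups(before, after).items():
    print(f"{name:>22}: {ratio:.2f}x speedup")
```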

This is just an additional confirmation that the L2 cache seems to
be misconfigured by default on the Raspberry Pi and that the
"disable_l2cache_writealloc=1" option makes a big difference.

-- 
Best regards,
Siarhei Siamashka