[Pixman] [PATCH 2/2] ARMv6: Add fast path for in_reverse_8888_8888

Fri Apr 4 01:51:52 PDT 2014

On Mon, 31 Mar 2014 15:54:24 +0300
Pekka Paalanen <ppaalanen at gmail.com> wrote:

> From: Ben Avison <bavison at riscosopen.org>
> 
> Benchmark results, "before" is the patch
> - ARMv6: Add fast path for over_n_8888_8888_ca
> and "after" contains the additional patches:
> - ARMv6: Add fast path flag to force no preload of destination buffer
> - ARMv6: Add fast path for in_reverse_8888_8888 (this patch)
> 
> lowlevel-blt-bench, in_reverse_8888_8888, 100 iterations:
> 
>        Before          After
>       Mean StdDev     Mean StdDev   Confidence   Change
> L1    21.1    0.1     32.0    0.1    100.00%     +51.9%
> L2    11.7    0.3     18.4    0.5    100.00%     +56.9%
> M     10.5    0.0     16.3    0.0    100.00%     +54.8%
> HT     8.2    0.0     12.0    0.0    100.00%     +46.7%
> VT     8.1    0.0     11.8    0.0    100.00%     +45.4%
> R      8.0    0.0     11.2    0.0    100.00%     +40.0%
> RT     4.7    0.0      6.0    0.1    100.00%     +28.1%
> 
> At most 14 outliers rejected per case per set.
> 
> cairo-perf-trace with trimmed traces, 30 iterations:
> 
>                                     Before          After
>                                    Mean StdDev     Mean StdDev   Confidence   Change
> t-firefox-paintball.trace          17.9    0.0     14.0    0.0    100.00%     +27.8%
> t-firefox-chalkboard.trace         36.6    0.0     35.8    0.0    100.00%      +2.1%
> t-firefox-canvas-alpha.trace       20.7    0.3     20.3    0.3    100.00%      +1.7%
> t-firefox-particles.trace          27.5    0.1     27.1    0.1    100.00%      +1.3%
> t-chromium-tabs.trace               4.9    0.0      4.8    0.0    100.00%      +1.1%
> t-evolution.trace                  13.0    0.1     12.9    0.1    100.00%      +1.0%
> t-swfdec-youtube.trace              7.8    0.0      7.7    0.0    100.00%      +0.8%
> t-gvim.trace                       33.0    0.2     32.8    0.2    100.00%      +0.7%
> t-gnome-terminal-vim.trace         19.8    0.2     19.7    0.2     99.46%      +0.6%
> t-grads-heat-map.trace              4.4    0.0      4.4    0.0     99.32%      +0.6%
> t-firefox-fishbowl.trace           21.1    0.0     21.0    0.0    100.00%      +0.5%
> t-firefox-planet-gnome.trace       10.9    0.0     10.8    0.0    100.00%      +0.4%
> t-firefox-canvas-swscroll.trace    32.1    0.1     32.0    0.1    100.00%      +0.4%
> t-firefox-fishtank.trace           13.2    0.0     13.1    0.0    100.00%      +0.4%
> t-firefox-asteroids.trace          11.1    0.0     11.0    0.0    100.00%      +0.4%
> t-firefox-canvas.trace             17.9    0.0     17.9    0.0     99.99%      +0.3%
> t-poppler.trace                     9.7    0.1      9.7    0.1     79.51%      +0.2%  (insignificant)
> t-firefox-talos-svg.trace          20.4    0.0     20.4    0.0     97.25%      +0.1%  (insignificant)
> t-swfdec-giant-steps.trace         14.8    0.0     14.8    0.0     96.75%      +0.1%  (insignificant)
> t-firefox-scrolling.trace          24.6    0.1     24.6    0.1     31.24%      +0.1%  (insignificant)
> t-midori-zoomed.trace               8.0    0.0      8.0    0.0     50.76%      +0.0%  (insignificant)
> t-gnome-system-monitor.trace       17.1    0.0     17.1    0.0      4.49%      -0.0%  (insignificant)
> t-xfce4-terminal-a1.trace           4.8    0.0      4.8    0.0     98.08%      -0.2%  (insignificant)
> t-poppler-reseau.trace             22.1    0.1     22.2    0.1     93.89%      -0.3%  (insignificant)
> t-firefox-talos-gfx.trace          25.4    0.4     25.5    0.5     75.53%      -0.5%  (insignificant)
> 
> At most 4 outliers rejected per case per set.
> 
> cairo-perf-trace reports the running time, but the change is computed
> for operations per second instead (inverse of running time), so a
> shorter running time shows up as a positive change.
> 
> Confidence is based on Welch's t-test. Absolute changes less than 1%
> can be attributed to measurement error, even if statistically
> significant.
> 
> There was a question of why FLAG_NO_PRELOAD_DST exists. If a patch
> removing that flag from pixman-arm-simd-asm.S is added on top, the
> change will be the following.
> 
> Before: flag in use
> After: flag removed
> 
>        Before          After
>       Mean StdDev     Mean StdDev   Confidence   Change
> L1    32.0    0.1     31.8    0.1    100.00%      -0.6%
> L2    18.4    0.5     25.0    0.5    100.00%     +36.0%
> M     16.3    0.0     25.7    0.0    100.00%     +57.9%
> HT    12.0    0.0     13.9    0.0    100.00%     +16.4%
> VT    11.8    0.0     13.2    0.0    100.00%     +12.4%
> R     11.2    0.0     14.0    0.0    100.00%     +24.3%
> RT     6.0    0.1      7.0    0.1    100.00%     +15.1%
> 
>                                     Before          After
>                                    Mean StdDev     Mean StdDev   Confidence   Change
> t-chromium-tabs.trace               4.8    0.0      4.8    0.0    100.00%      +0.7%
> t-poppler-reseau.trace             22.2    0.1     22.1    0.1     99.98%      +0.6%
> t-poppler.trace                     9.7    0.1      9.6    0.1     99.70%      +0.5%
> t-firefox-talos-gfx.trace          25.5    0.5     25.4    0.3     72.06%      +0.5%  (insignificant)
> t-firefox-canvas-alpha.trace       20.3    0.3     20.2    0.2     80.88%      +0.4%  (insignificant)
> t-firefox-canvas.trace             17.9    0.0     17.8    0.0     99.36%      +0.2%
> t-firefox-canvas-swscroll.trace    32.0    0.1     31.9    0.1     84.83%      +0.1%  (insignificant)
> t-firefox-asteroids.trace          11.0    0.0     11.0    0.0    100.00%      +0.1%
> t-midori-zoomed.trace               8.0    0.0      8.0    0.0     99.90%      +0.1%
> t-firefox-planet-gnome.trace       10.8    0.0     10.8    0.0     91.34%      +0.1%  (insignificant)
> t-firefox-scrolling.trace          24.6    0.1     24.6    0.1      0.53%      +0.0%  (insignificant)
> t-gnome-terminal-vim.trace         19.7    0.2     19.7    0.1     11.42%      -0.0%  (insignificant)
> t-firefox-talos-svg.trace          20.4    0.0     20.4    0.0     54.68%      -0.0%  (insignificant)
> t-swfdec-giant-steps.trace         14.8    0.0     14.8    0.0     78.92%      -0.0%  (insignificant)
> t-firefox-fishtank.trace           13.1    0.0     13.1    0.0     97.09%      -0.0%  (insignificant)
> t-gnome-system-monitor.trace       17.1    0.0     17.1    0.0     65.13%      -0.0%  (insignificant)
> t-evolution.trace                  12.9    0.1     12.9    0.1     34.70%      -0.1%  (insignificant)
> t-grads-heat-map.trace              4.4    0.0      4.4    0.0     28.95%      -0.1%  (insignificant)
> t-firefox-fishbowl.trace           21.0    0.0     21.0    0.0     99.92%      -0.2%
> t-xfce4-terminal-a1.trace           4.8    0.0      4.8    0.0     98.78%      -0.2%  (insignificant)
> t-firefox-particles.trace          27.1    0.1     27.3    0.1     99.89%      -0.5%
> t-swfdec-youtube.trace              7.7    0.0      7.8    0.0    100.00%      -0.7%
> t-gvim.trace                       32.8    0.2     33.1    0.2    100.00%      -0.9%
> t-firefox-chalkboard.trace         35.8    0.0     37.1    0.0    100.00%      -3.3%
> t-firefox-paintball.trace          14.0    0.0     15.0    0.0    100.00%      -6.2%
> 
> IOW, the flag has adverse effects on lowlevel-blt-bench performance,
> but improves one or two Cairo traces slightly.
> 
> v4, Pekka Paalanen <pekka.paalanen at collabora.co.uk> :
> 	Rebased, re-benchmarked on Raspberry Pi, commit message.
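
As a rough illustration of how the "Change" column and the statistic
behind the "Confidence" column in the tables above can be derived, here
is a minimal sketch. The function names are hypothetical and this is not
the script that produced the tables. The sample size of 30 is an
assumption taken from the iteration count and ignores outlier rejection,
and because the inputs are the rounded table values, the results only
approximate the reported ones. The sketch stops at the t statistic and
degrees of freedom; turning those into a confidence percentage needs a
t-distribution CDF, which is left out.

```python
import math


def ops_per_second_change(time_before, time_after):
    """Relative change in operations per second, given running times.

    Operations per second is the inverse of running time, so the ratio
    of rates is time_before / time_after.
    """
    return time_before / time_after - 1.0


def welch_t(mean_a, sd_a, n_a, mean_b, sd_b, n_b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    va = sd_a ** 2 / n_a  # squared standard error of sample A's mean
    vb = sd_b ** 2 / n_b  # squared standard error of sample B's mean
    t = (mean_a - mean_b) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (n_a - 1) + vb ** 2 / (n_b - 1))
    return t, df


# t-firefox-paintball.trace, rounded running times from the first table
change = ops_per_second_change(17.9, 14.0)

# t-firefox-canvas-alpha.trace, assuming 30 samples per set
t, df = welch_t(20.7, 0.3, 30, 20.3, 0.3, 30)
```

Rows whose rounded StdDev is 0.0 in both sets cannot be fed to welch_t
as printed, since the denominator would be zero; the unrounded data
would be needed for those.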

FYI, this is what Ben explained to me when I asked about the
FLAG_NO_PRELOAD_DST flag:

On Thu, 03 Apr 2014 14:40:23 +0100
"Ben Avison" <bavison at riscosopen.org> wrote:

> The thing with the lowlevel-blt-bench benchmarks for the more
> sophisticated composite types (as a general rule, anything that involves
> branches at the per-pixel level) is that they are only profiling the case
> where you have mid-level alpha values in the source/mask/destination.
> Real-world images typically have a disproportionate number of fully
> opaque and fully transparent pixels, which is why when there's a
> discrepancy between which implementation performs best with
> cairo-perf-trace versus lowlevel-blt-bench, I usually favour the Cairo
> winner.
> 
> The results of removing FLAG_NO_PRELOAD_DST (in other words, adding
> preload of the destination buffer) are easy to explain in the
> lowlevel-blt-bench results. In the L1 case, the destination buffer is
> already in the L1 cache, so adding the preloads is simply adding extra
> instruction cycles that have no effect on memory operations. The "in"
> compositing operator depends upon the alpha of both source and
> destination, so if you use uniform mid-alpha, then you actually do need
> to read your destination pixels, so you benefit from preloading them. But
> for fully opaque or fully transparent source pixels, you don't need to
> read the corresponding destination pixel - it'll either be left alone or
> overwritten. Since the ARM11 doesn't use write-allocate caching, both of
> these cases save the time taken to load the extra cachelines and also
> leave more of the cache available for other data. If you
> examine the source images being used by the Cairo test, you'll probably
> find they mostly use transparent or opaque pixels.
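
To make the point about destination reads concrete, here is a small
model of the in_reverse operator (dst = dst * alpha(src)) on premultiplied
a8r8g8b8 pixels. It is only a sketch of the reasoning above, not the
ARMv6 assembly from the patch; the function names and the read counter
are hypothetical and exist purely for illustration.

```python
def mul_un8(a, b):
    """Multiply two 8-bit values as fractions of 255, with rounding."""
    t = a * b + 0x80
    return (t + (t >> 8)) >> 8


def in_reverse_scanline(src, dst):
    """Apply dst = dst * alpha(src) in place.

    Returns how many destination pixels had to be read.
    """
    reads = 0
    for i, s in enumerate(src):
        alpha = s >> 24
        if alpha == 0xFF:
            continue        # dst * 1: leave the destination untouched
        if alpha == 0x00:
            dst[i] = 0      # dst * 0: store zero without loading dst
            continue
        d = dst[i]          # only mid-alpha source pixels need the load
        reads += 1
        dst[i] = (
            (mul_un8(d >> 24, alpha) << 24)
            | (mul_un8((d >> 16) & 0xFF, alpha) << 16)
            | (mul_un8((d >> 8) & 0xFF, alpha) << 8)
            | mul_un8(d & 0xFF, alpha)
        )
    return reads


# One opaque, one transparent and one mid-alpha source pixel.
src = [0xFF000000, 0x00000000, 0x80000000]
dst = [0xFFFFFFFF, 0xFFFFFFFF, 0xFFFFFFFF]
reads = in_reverse_scanline(src, dst)
```

Only the mid-alpha pixel reads the destination here. A source made
entirely of mid-alpha pixels, as in lowlevel-blt-bench, would read every
destination pixel, which is the case where preloading helps.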

Thanks,
pq