[Pixman] [PATCH 2/2] ARMv6: Add fast path for in_reverse_8888_8888
Pekka Paalanen
ppaalanen at gmail.com
Fri Apr 4 01:51:52 PDT 2014
On Mon, 31 Mar 2014 15:54:24 +0300
Pekka Paalanen <ppaalanen at gmail.com> wrote:
> From: Ben Avison <bavison at riscosopen.org>
>
> Benchmark results, "before" is the patch
> - ARMv6: Add fast path for over_n_8888_8888_ca
> and "after" contains the additional patches:
> - ARMv6: Add fast path flag to force no preload of destination buffer
> - ARMv6: Add fast path for in_reverse_8888_8888 (this patch)
>
> lowlevel-blt-bench, in_reverse_8888_8888, 100 iterations:
>
>          Before          After
>       Mean  StdDev    Mean  StdDev  Confidence   Change
> L1    21.1     0.1    32.0     0.1     100.00%   +51.9%
> L2    11.7     0.3    18.4     0.5     100.00%   +56.9%
> M     10.5     0.0    16.3     0.0     100.00%   +54.8%
> HT     8.2     0.0    12.0     0.0     100.00%   +46.7%
> VT     8.1     0.0    11.8     0.0     100.00%   +45.4%
> R      8.0     0.0    11.2     0.0     100.00%   +40.0%
> RT     4.7     0.0     6.0     0.1     100.00%   +28.1%
>
> At most 14 outliers rejected per case per set.
>
> cairo-perf-trace with trimmed traces, 30 iterations:
>
>                                      Before          After
>                                   Mean  StdDev    Mean  StdDev  Confidence   Change
> t-firefox-paintball.trace         17.9     0.0    14.0     0.0     100.00%   +27.8%
> t-firefox-chalkboard.trace        36.6     0.0    35.8     0.0     100.00%    +2.1%
> t-firefox-canvas-alpha.trace      20.7     0.3    20.3     0.3     100.00%    +1.7%
> t-firefox-particles.trace         27.5     0.1    27.1     0.1     100.00%    +1.3%
> t-chromium-tabs.trace              4.9     0.0     4.8     0.0     100.00%    +1.1%
> t-evolution.trace                 13.0     0.1    12.9     0.1     100.00%    +1.0%
> t-swfdec-youtube.trace             7.8     0.0     7.7     0.0     100.00%    +0.8%
> t-gvim.trace                      33.0     0.2    32.8     0.2     100.00%    +0.7%
> t-gnome-terminal-vim.trace        19.8     0.2    19.7     0.2      99.46%    +0.6%
> t-grads-heat-map.trace             4.4     0.0     4.4     0.0      99.32%    +0.6%
> t-firefox-fishbowl.trace          21.1     0.0    21.0     0.0     100.00%    +0.5%
> t-firefox-planet-gnome.trace      10.9     0.0    10.8     0.0     100.00%    +0.4%
> t-firefox-canvas-swscroll.trace   32.1     0.1    32.0     0.1     100.00%    +0.4%
> t-firefox-fishtank.trace          13.2     0.0    13.1     0.0     100.00%    +0.4%
> t-firefox-asteroids.trace         11.1     0.0    11.0     0.0     100.00%    +0.4%
> t-firefox-canvas.trace            17.9     0.0    17.9     0.0      99.99%    +0.3%
> t-poppler.trace                    9.7     0.1     9.7     0.1      79.51%    +0.2% (insignificant)
> t-firefox-talos-svg.trace         20.4     0.0    20.4     0.0      97.25%    +0.1% (insignificant)
> t-swfdec-giant-steps.trace        14.8     0.0    14.8     0.0      96.75%    +0.1% (insignificant)
> t-firefox-scrolling.trace         24.6     0.1    24.6     0.1      31.24%    +0.1% (insignificant)
> t-midori-zoomed.trace              8.0     0.0     8.0     0.0      50.76%    +0.0% (insignificant)
> t-gnome-system-monitor.trace      17.1     0.0    17.1     0.0       4.49%    -0.0% (insignificant)
> t-xfce4-terminal-a1.trace          4.8     0.0     4.8     0.0      98.08%    -0.2% (insignificant)
> t-poppler-reseau.trace            22.1     0.1    22.2     0.1      93.89%    -0.3% (insignificant)
> t-firefox-talos-gfx.trace         25.4     0.4    25.5     0.5      75.53%    -0.5% (insignificant)
>
> At most 4 outliers rejected per case per set.
>
> Cairo perf reports the running time, but the change is computed for
> operations per second instead (inverse of running time).
>
> Confidence is based on Welch's t-test. Absolute changes of less than 1%
> can be attributed to measurement error, even if statistically
> significant.
>
> There was a question of why FLAG_NO_PRELOAD_DST exists. If a patch
> removing that flag from pixman-arm-simd-asm.S is added on top, the
> change will be the following.
>
> Before: flag in use
> After: flag removed
>
>          Before          After
>       Mean  StdDev    Mean  StdDev  Confidence   Change
> L1    32.0     0.1    31.8     0.1     100.00%    -0.6%
> L2    18.4     0.5    25.0     0.5     100.00%   +36.0%
> M     16.3     0.0    25.7     0.0     100.00%   +57.9%
> HT    12.0     0.0    13.9     0.0     100.00%   +16.4%
> VT    11.8     0.0    13.2     0.0     100.00%   +12.4%
> R     11.2     0.0    14.0     0.0     100.00%   +24.3%
> RT     6.0     0.1     7.0     0.1     100.00%   +15.1%
>
>                                      Before          After
>                                   Mean  StdDev    Mean  StdDev  Confidence   Change
> t-chromium-tabs.trace              4.8     0.0     4.8     0.0     100.00%    +0.7%
> t-poppler-reseau.trace            22.2     0.1    22.1     0.1      99.98%    +0.6%
> t-poppler.trace                    9.7     0.1     9.6     0.1      99.70%    +0.5%
> t-firefox-talos-gfx.trace         25.5     0.5    25.4     0.3      72.06%    +0.5% (insignificant)
> t-firefox-canvas-alpha.trace      20.3     0.3    20.2     0.2      80.88%    +0.4% (insignificant)
> t-firefox-canvas.trace            17.9     0.0    17.8     0.0      99.36%    +0.2%
> t-firefox-canvas-swscroll.trace   32.0     0.1    31.9     0.1      84.83%    +0.1% (insignificant)
> t-firefox-asteroids.trace         11.0     0.0    11.0     0.0     100.00%    +0.1%
> t-midori-zoomed.trace              8.0     0.0     8.0     0.0      99.90%    +0.1%
> t-firefox-planet-gnome.trace      10.8     0.0    10.8     0.0      91.34%    +0.1% (insignificant)
> t-firefox-scrolling.trace         24.6     0.1    24.6     0.1       0.53%    +0.0% (insignificant)
> t-gnome-terminal-vim.trace        19.7     0.2    19.7     0.1      11.42%    -0.0% (insignificant)
> t-firefox-talos-svg.trace         20.4     0.0    20.4     0.0      54.68%    -0.0% (insignificant)
> t-swfdec-giant-steps.trace        14.8     0.0    14.8     0.0      78.92%    -0.0% (insignificant)
> t-firefox-fishtank.trace          13.1     0.0    13.1     0.0      97.09%    -0.0% (insignificant)
> t-gnome-system-monitor.trace      17.1     0.0    17.1     0.0      65.13%    -0.0% (insignificant)
> t-evolution.trace                 12.9     0.1    12.9     0.1      34.70%    -0.1% (insignificant)
> t-grads-heat-map.trace             4.4     0.0     4.4     0.0      28.95%    -0.1% (insignificant)
> t-firefox-fishbowl.trace          21.0     0.0    21.0     0.0      99.92%    -0.2%
> t-xfce4-terminal-a1.trace          4.8     0.0     4.8     0.0      98.78%    -0.2% (insignificant)
> t-firefox-particles.trace         27.1     0.1    27.3     0.1      99.89%    -0.5%
> t-swfdec-youtube.trace             7.7     0.0     7.8     0.0     100.00%    -0.7%
> t-gvim.trace                      32.8     0.2    33.1     0.2     100.00%    -0.9%
> t-firefox-chalkboard.trace        35.8     0.0    37.1     0.0     100.00%    -3.3%
> t-firefox-paintball.trace         14.0     0.0    15.0     0.0     100.00%    -6.2%
>
> IOW, the flag has adverse effects on lowlevel-blt-bench performance,
> but improves one or two Cairo traces slightly.
>
> v4, Pekka Paalanen <pekka.paalanen at collabora.co.uk> :
> Rebased, re-benchmarked on Raspberry Pi, commit message.
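
For readers who want to check the "Change" and "Confidence" columns by hand, here is a minimal Python sketch of the two calculations the commit message mentions. It is not the script that produced the tables: the function names are made up for illustration, the Welch sample data is invented, and turning the t statistic into a confidence percentage (which needs the Student t distribution) is left out.

```python
from math import sqrt
from statistics import mean, variance


def ops_per_second_change(time_before, time_after):
    """Percent change in operations per second, given two running times."""
    # ops/s is 1/time, so (1/after) / (1/before) - 1 == before/after - 1.
    return (time_before / time_after - 1.0) * 100.0


def welch_t(sample_a, sample_b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    n_a, n_b = len(sample_a), len(sample_b)
    # Squared standard error of each mean, using the sample variance.
    se2_a = variance(sample_a) / n_a
    se2_b = variance(sample_b) / n_b
    t = (mean(sample_a) - mean(sample_b)) / sqrt(se2_a + se2_b)
    df = (se2_a + se2_b) ** 2 / (se2_a ** 2 / (n_a - 1) + se2_b ** 2 / (n_b - 1))
    return t, df


# Rounded means for t-firefox-paintball.trace from the first Cairo table.
paintball = ops_per_second_change(17.9, 14.0)
print(f"{paintball:+.1f}%")

# Invented timing samples, only to exercise welch_t().
t, df = welch_t([1.0, 2.0, 3.0, 4.0, 5.0], [2.0, 3.0, 4.0, 5.0, 6.0])
```

With the rounded means this gives roughly +27.9% for the paintball trace; the table says +27.8%, presumably because it was computed from the unrounded measurements. A shorter running time shows up as a positive change, which is why the sign is flipped relative to the raw times.
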

FYI, this is what Ben explained to me when I asked about the preload
dst flag:

On Thu, 03 Apr 2014 14:40:23 +0100
"Ben Avison" <bavison at riscosopen.org> wrote:
> The thing with the lowlevel-blt-bench benchmarks for the more
> sophisticated composite types (as a general rule, anything that involves
> branches at the per-pixel level) is that they are only profiling the case
> where you have mid-level alpha values in the source/mask/destination.
> Real-world images typically have a disproportionate number of fully
> opaque and fully transparent pixels, which is why when there's a
> discrepancy between which implementation performs best with
> cairo-perf-trace versus lowlevel-blt-bench, I usually favour the Cairo
> winner.
>
> The results of removing FLAG_NO_PRELOAD_DST (in other words, adding
> preload of the destination buffer) are easy to explain in the
> lowlevel-blt-bench results. In the L1 case, the destination buffer is
> already in the L1 cache, so adding the preloads is simply adding extra
> instruction cycles that have no effect on memory operations. The "in"
> compositing operator depends upon the alpha of both source and
> destination, so if you use uniform mid-alpha, then you actually do need
> to read your destination pixels, so you benefit from preloading them. But
> for fully opaque or fully transparent source pixels, you don't need to
> read the corresponding destination pixel - it'll either be left alone or
> overwritten. Since the ARM11 doesn't use write-allocate caching, both of
> these cases avoid the time taken to load the extra cachelines and also
> improve the efficiency of the cache for other data. If you
> examine the source images being used by the Cairo test, you'll probably
> find they mostly use transparent or opaque pixels.
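
To make the three cases concrete, here is a minimal Python sketch of the per-pixel logic described above. It assumes premultiplied a8r8g8b8 pixels and the usual IN_REVERSE definition (each destination channel scaled by the source alpha). The function names, the read counter and the rounding helper are illustrative only and are not taken from the ARMv6 assembly in the patch.

```python
def mul_un8(a, b):
    """Multiply two 8-bit values treated as fractions of 255, with rounding."""
    t = a * b + 0x80
    return ((t >> 8) + t) >> 8


def in_reverse_row(src, dst):
    """Apply dst = dst * alpha(src) in place; return how many dst pixels were read."""
    dst_reads = 0
    for i, s in enumerate(src):
        sa = s >> 24  # source alpha of an a8r8g8b8 pixel
        if sa == 0xFF:
            continue  # opaque source: destination is left alone
        if sa == 0x00:
            dst[i] = 0  # transparent source: overwrite without reading
            continue
        d = dst[i]  # mid alpha: the destination really must be read
        dst_reads += 1
        out = 0
        for shift in (24, 16, 8, 0):
            out |= mul_un8((d >> shift) & 0xFF, sa) << shift
        dst[i] = out
    return dst_reads


# One opaque, one transparent and one half-transparent source pixel.
src = [0xFF000000, 0x00000000, 0x80000000]
dst = [0xFFFFFFFF, 0xFFFFFFFF, 0xFFFFFFFF]
reads = in_reverse_row(src, dst)
```

Only the mid-alpha pixel touches the destination before writing it, which is the case lowlevel-blt-bench exercises exclusively. For the other two a preload of the destination fetches a cacheline that is never read.
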
Thanks,
pq