[Pixman] [PATCH 00/12] Implement more vmx fast paths

Mon Jul 13 03:42:15 PDT 2015

On Thu,  2 Jul 2015 13:04:05 +0300
Oded Gabbay <oded.gabbay at gmail.com> wrote:

> Hi,
> 
> This patch-set implements the most heavily used fast paths, according to
> profiling done by me using the cairo traces package.
> 
> The patch-set adds many helper functions, to ease the conversion of fast paths
> between the sse2 implementations (which I used as a base) and the vmx
> implementations. All the helper functions are added in a single patch.
> 
> For each fast path, a different commit was made. Inside the commit message,
> I wrote the improvement of the relevant low-level-blt test (if relevant), and
> the improvement in cairo trimmed benchmarks.
> 
> From my observations, the single most important fast path is vmx_fill, as it
> contributed the most improvement to cairo benchmarks.
> 
> This patchset is based on the previous patch-set I already sent to the mailing
> list, which contains implementation of three other fast paths.
> 
> Please review.
> 
> Thanks,
> 
>     Oded
> 
> Oded Gabbay (12):
>   vmx: add LOAD_VECTOR macro
>   vmx: add helper functions
>   vmx: implement fast path vmx_fill
>   vmx: implement fast path vmx_blt
>   vmx: implement fast path vmx_composite_copy_area
>   vmx: implement fast path vmx_composite_over_n_8888_8888_ca
>   vmx: implement fast path vmx_composite_over_n_8_8888
>   vmx: implement fast path vmx_composite_src_x888_8888
>   vmx: implement fast path scaled nearest vmx_8888_8888_OVER
>   vmx: implement fast path iterator vmx_fetch_x8r8g8b8
>   vmx: implement fast path iterator vmx_fetch_r5g6b5
>   vmx: implement fast path iterator vmx_fetch_a8
> 
>  pixman/pixman-vmx.c | 1441 +++++++++++++++++++++++++++++++++++++++++++++++++--
>  1 file changed, 1404 insertions(+), 37 deletions(-)

Now that I'm back home, I was able to run tests with this patch set
(together with your earlier over_8888_8888/add_8_8/add_8888_8888 patches)
on my Playstation3. It is a big endian system with a 32-bit rootfs (it can
also run a 64-bit rootfs, but this makes no sense because the RAM size is
very limited). The "make check" test does not detect any problems.
And here are the benchmark results with all patches applied, compared
to the current pixman master branch (GCC 4.8.4):

Speedups
========
image       t-swfdec-giant-steps  12543.83 (12566.88 0.23%) -> 10086.22 (10111.42 0.13%):  1.24x speedup
image                  t-poppler  4439.64 (4480.92 1.02%) -> 3635.83 (4471.68 12.78%):  1.22x speedup
image     t-gnome-system-monitor  13752.81 (13781.71 0.28%) -> 11392.02 (11480.04 0.68%):  1.21x speedup
image        t-firefox-talos-gfx  11813.09 (11877.13 0.33%) -> 9912.82 (9923.63 0.11%):  1.19x speedup
image        t-firefox-paintball  8479.81 (8484.85 0.14%) -> 7153.60 (7178.76 0.19%):  1.19x speedup
image            t-midori-zoomed  3274.41 (3289.13 4.91%) -> 2947.08 (3376.37 13.13%):  1.11x speedup
image         t-firefox-fishbowl  9148.86 (9182.67 0.55%) -> 8473.17 (8521.02 1.04%):  1.08x speedup
image        t-xfce4-terminal-a1  11579.41 (11680.05 0.50%) -> 10739.61 (10776.41 0.15%):  1.08x speedup
image        t-firefox-asteroids  8073.61 (8084.57 0.17%) -> 7546.02 (7557.02 0.08%):  1.07x speedup
image     t-firefox-planet-gnome  6451.82 (6500.78 0.58%) -> 6078.32 (6194.27 1.46%):  1.06x speedup
image            t-chromium-tabs  2001.25 (2007.24 0.43%) -> 1901.80 (1905.46 0.79%):  1.05x speedup

It looks like the performance improvements are a lot less impressive
on Playstation3 than on your POWER8 hardware. I'm not too surprised,
because the powerpc part of the Cell BE used in Playstation3 is
an in-order processor. Intrinsics compiled by GCC also show
much better performance improvements on out-of-order Intel processors
than on in-order Atoms. And NEON intrinsics used to be (and probably
still are) a performance disaster on in-order ARM processors too.

However, there is also a substantial difference in how unaligned
loads are handled on big and little endian powerpc processors. For
comparison, it would be interesting to see similar benchmark results
(the cumulative result of all patches applied) on little endian POWER8
and also on some out-of-order big endian powerpc processor.

Regarding the fill fast path contributing the most: we probably should
unroll the fill loop in the generic C code too.
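
To illustrate the kind of unrolling meant here, below is a minimal sketch
for the 32 bpp case. It is not the actual pixman code: the function names
(fill_row_32_unrolled, fill_rect_32_unrolled), the unroll factor of 8 and
the convention that the stride is counted in uint32_t units are all
assumptions made for the example.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not a pixman function): store `filler` into
 * `width` consecutive 32-bit pixels, 8 pixels per loop iteration. */
static void
fill_row_32_unrolled (uint32_t *dst, size_t width, uint32_t filler)
{
    /* Main unrolled body: 8 independent stores per iteration. */
    while (width >= 8)
    {
        dst[0] = filler;
        dst[1] = filler;
        dst[2] = filler;
        dst[3] = filler;
        dst[4] = filler;
        dst[5] = filler;
        dst[6] = filler;
        dst[7] = filler;
        dst += 8;
        width -= 8;
    }

    /* Tail: the remaining 0..7 pixels, one at a time. */
    while (width > 0)
    {
        *dst++ = filler;
        width--;
    }
}

/* Hypothetical wrapper: fill a width x height rectangle at (x, y).
 * ASSUMPTION: `stride` is the row pitch in uint32_t units. */
static void
fill_rect_32_unrolled (uint32_t *bits, int stride,
                       int x, int y, int width, int height,
                       uint32_t filler)
{
    uint32_t *row = bits + (size_t) y * (size_t) stride + (size_t) x;

    while (height-- > 0)
    {
        fill_row_32_unrolled (row, (size_t) width, filler);
        row += stride;
    }
}
```

A call such as fill_rect_32_unrolled (bits, stride, 1, 1, 11, 2, 0xff00ff00)
writes 11 pixels in each of rows 1 and 2 and leaves everything else
untouched. Whether manual unrolling actually helps depends on the compiler
and on the target, so it would need to be benchmarked rather than assumed.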

In general, all these patches look good to me. It's a major step in
catching up with the other architectures in pixman. The performance
improvements look very nice. I'll take a closer look at the individual
patches later today.

-- 
Best regards,
Siarhei Siamashka