[Pixman] [PATCH 00/12] Implement more vmx fast paths
oded.gabbay at gmail.com
Mon Jul 13 03:45:28 PDT 2015
On Mon, Jul 13, 2015 at 1:42 PM, Siarhei Siamashka
<siarhei.siamashka at gmail.com> wrote:
> On Thu, 2 Jul 2015 13:04:05 +0300
> Oded Gabbay <oded.gabbay at gmail.com> wrote:
>> This patch-set implements the most heavily used fast paths, according to
>> profiling done by me using the cairo traces package.
>> The patch-set adds many helper functions, to ease the conversion of fast paths
>> between the sse2 implementations (which I used as a base) and the vmx
>> implementations. All the helper functions are added in a single patch.
>> For each fast path, a different commit was made. Inside the commit message,
>> I wrote the improvement of the relevant low-level-blt test (if relevant), and
>> the improvement in cairo trimmed benchmarks.
>> From my observations, the single most important fast path is vmx_fill, as it
>> contributed the most improvement to cairo benchmarks.
>> This patchset is based on the previous patch-set I already sent to the mailing
>> list, which contains implementation of three other fast paths.
>> Please review.
>> Oded Gabbay (12):
>>   vmx: add LOAD_VECTOR macro
>>   vmx: add helper functions
>>   vmx: implement fast path vmx_fill
>>   vmx: implement fast path vmx_blt
>>   vmx: implement fast path vmx_composite_copy_area
>>   vmx: implement fast path vmx_composite_over_n_8888_8888_ca
>>   vmx: implement fast path vmx_composite_over_n_8_8888
>>   vmx: implement fast path vmx_composite_src_x888_8888
>>   vmx: implement fast path scaled nearest vmx_8888_8888_OVER
>>   vmx: implement fast path iterator vmx_fetch_x8r8g8b8
>>   vmx: implement fast path iterator vmx_fetch_r5g6b5
>>   vmx: implement fast path iterator vmx_fetch_a8
>> pixman/pixman-vmx.c | 1441 +++++++++++++++++++++++++++++++++++++++++++++++++--
>> 1 file changed, 1404 insertions(+), 37 deletions(-)
> Now I'm back home and could run tests with this patch set (together with
> your earlier over_8888_8888/add_8_8/add_8888_8888 patches) on my
> Playstation3. It is a big endian system with a 32-bit rootfs (it can
> also run 64-bit rootfs, but this makes no sense because the RAM size is
> very limited). The "make check" test does not detect any problems.
> And here are the benchmark results with all patches applied, compared
> to the current pixman master branch (GCC 4.8.4):
> image t-swfdec-giant-steps 12543.83 (12566.88 0.23%) -> 10086.22 (10111.42 0.13%): 1.24x speedup
> image t-poppler 4439.64 (4480.92 1.02%) -> 3635.83 (4471.68 12.78%): 1.22x speedup
> image t-gnome-system-monitor 13752.81 (13781.71 0.28%) -> 11392.02 (11480.04 0.68%): 1.21x speedup
> image t-firefox-talos-gfx 11813.09 (11877.13 0.33%) -> 9912.82 (9923.63 0.11%): 1.19x speedup
> image t-firefox-paintball 8479.81 (8484.85 0.14%) -> 7153.60 (7178.76 0.19%): 1.19x speedup
> image t-midori-zoomed 3274.41 (3289.13 4.91%) -> 2947.08 (3376.37 13.13%): 1.11x speedup
> image t-firefox-fishbowl 9148.86 (9182.67 0.55%) -> 8473.17 (8521.02 1.04%): 1.08x speedup
> image t-xfce4-terminal-a1 11579.41 (11680.05 0.50%) -> 10739.61 (10776.41 0.15%): 1.08x speedup
> image t-firefox-asteroids 8073.61 (8084.57 0.17%) -> 7546.02 (7557.02 0.08%): 1.07x speedup
> image t-firefox-planet-gnome 6451.82 (6500.78 0.58%) -> 6078.32 (6194.27 1.46%): 1.06x speedup
> image t-chromium-tabs 2001.25 (2007.24 0.43%) -> 1901.80 (1905.46 0.79%): 1.05x speedup
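A note on reading the table above: the reported speedup factors match the ratio of the first number on each side of the arrow (for example, 12543.83 / 10086.22 is roughly 1.24). A small C sketch of that arithmetic follows; `speedup` is a hypothetical helper name used only for illustration, not part of the cairo perf tools.

```c
#include <assert.h>

/* Hypothetical helper: ratio of the "before" time to the "after" time.
 * A value above 1.0 means the patched build is faster. */
static double
speedup (double before, double after)
{
    assert (after > 0.0);
    return before / after;
}
```

Calling `speedup (12543.83, 10086.22)` gives a value that rounds to 1.24, in line with the t-swfdec-giant-steps row.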
> It looks like the performance improvements are a lot less impressive
> on Playstation3 than on your POWER8 hardware. I'm not too surprised,
> because the powerpc part of the Cell BE used in Playstation3 is
> an in-order processor. The intrinsics compiled by GCC are also showing
> much better performance improvements on out-of-order Intel processors
> compared to in-order Atoms. And NEON intrinsics used to be (and probably
> still are) a performance disaster on in-order ARM processors too.
> However there is also a substantial difference in the unaligned
> loads handling between big and little endian powerpc processors. For
> comparison, it would be interesting to see similar benchmark results
> (the cumulative result of all patches applied) on little endian POWER8
> and also on some out-of-order big endian powerpc processor.
> Regarding the fill fast path contributing the most: we should probably
> unroll the fill loop in the generic C code too.
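To make the unrolling suggestion concrete, here is a minimal sketch of what an unrolled 32 bpp fill of one row could look like in plain C. It is an illustration under assumptions, not pixman's actual generic fill code: the function name `fill_row_32_unrolled`, its signature, and the unroll factor of four are all hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not pixman API): write `filler` into `width`
 * consecutive 32-bit pixels starting at `dst`. */
static void
fill_row_32_unrolled (uint32_t *dst, size_t width, uint32_t filler)
{
    assert (dst != NULL || width == 0);

    /* Main loop: four stores per iteration, so the loop counter and
     * branch are paid once per four pixels instead of once per pixel. */
    while (width >= 4)
    {
        dst[0] = filler;
        dst[1] = filler;
        dst[2] = filler;
        dst[3] = filler;
        dst += 4;
        width -= 4;
    }

    /* Tail: the remaining zero to three pixels. */
    while (width > 0)
    {
        *dst++ = filler;
        width--;
    }
}
```

Whether this helps in practice depends on the compiler and the target; an optimizing compiler may already unroll the simple loop, so any gain would need to be confirmed by benchmarking.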
> In general, all these patches look good to me. It's a major step in
> catching up with the other architectures in pixman. The performance
> improvements look very nice. I'll take a closer look at the individual
> patches later today.
> Best regards,
> Siarhei Siamashka