[Pixman] [PATCH 00/12] Implement more vmx fast paths

Mon Jul 13 03:45:28 PDT 2015

On Mon, Jul 13, 2015 at 1:42 PM, Siarhei Siamashka
<siarhei.siamashka at gmail.com> wrote:
> On Thu,  2 Jul 2015 13:04:05 +0300
> Oded Gabbay <oded.gabbay at gmail.com> wrote:
>
>> Hi,
>>
>> This patch-set implements the most heavily used fast paths, according to
>> my profiling with the cairo traces package.
>>
>> The patch-set adds many helper functions, to ease the conversion of fast paths
>> between the sse2 implementations (which I used as a base) and the vmx
>> implementations. All the helper functions are added in a single patch.
>>
>> Each fast path has its own commit. Each commit message reports the
>> improvement in the relevant low-level-blt test (where applicable) and in
>> the cairo trimmed benchmarks.
>>
>> From my observations, the single most important fast path is vmx_fill, as it
>> contributed the most improvement to cairo benchmarks.
>>
>> This patch-set is based on the previous patch-set I already sent to the mailing
>> list, which contains implementations of three other fast paths.
>>
>> Please review.
>>
>> Thanks,
>>
>>     Oded
>>
>> Oded Gabbay (12):
>>   vmx: add LOAD_VECTOR macro
>>   vmx: add helper functions
>>   vmx: implement fast path vmx_fill
>>   vmx: implement fast path vmx_blt
>>   vmx: implement fast path vmx_composite_copy_area
>>   vmx: implement fast path vmx_composite_over_n_8888_8888_ca
>>   vmx: implement fast path vmx_composite_over_n_8_8888
>>   vmx: implement fast path vmx_composite_src_x888_8888
>>   vmx: implement fast path scaled nearest vmx_8888_8888_OVER
>>   vmx: implement fast path iterator vmx_fetch_x8r8g8b8
>>   vmx: implement fast path iterator vmx_fetch_r5g6b5
>>   vmx: implement fast path iterator vmx_fetch_a8
>>
>>  pixman/pixman-vmx.c | 1441 +++++++++++++++++++++++++++++++++++++++++++++++++--
>>  1 file changed, 1404 insertions(+), 37 deletions(-)
>
> Now I'm back home and could run tests with this patch set (together with
> your earlier over_8888_8888/add_8_8/add_8888_8888 patches) on my
> Playstation3. It is a big endian system with a 32-bit rootfs (it can
> also run a 64-bit rootfs, but that makes no sense because the RAM size is
> very limited). The "make check" test does not detect any problems.
> And here are the benchmark results with all patches applied, compared
> to the current pixman master branch (GCC 4.8.4):
>
> Speedups
> ========
> image       t-swfdec-giant-steps  12543.83 (12566.88 0.23%) -> 10086.22 (10111.42 0.13%):  1.24x speedup
> image                  t-poppler  4439.64 (4480.92 1.02%) -> 3635.83 (4471.68 12.78%):  1.22x speedup
> image     t-gnome-system-monitor  13752.81 (13781.71 0.28%) -> 11392.02 (11480.04 0.68%):  1.21x speedup
> image        t-firefox-talos-gfx  11813.09 (11877.13 0.33%) -> 9912.82 (9923.63 0.11%):  1.19x speedup
> image        t-firefox-paintball  8479.81 (8484.85 0.14%) -> 7153.60 (7178.76 0.19%):  1.19x speedup
> image            t-midori-zoomed  3274.41 (3289.13 4.91%) -> 2947.08 (3376.37 13.13%):  1.11x speedup
> image         t-firefox-fishbowl  9148.86 (9182.67 0.55%) -> 8473.17 (8521.02 1.04%):  1.08x speedup
> image        t-xfce4-terminal-a1  11579.41 (11680.05 0.50%) -> 10739.61 (10776.41 0.15%):  1.08x speedup
> image        t-firefox-asteroids  8073.61 (8084.57 0.17%) -> 7546.02 (7557.02 0.08%):  1.07x speedup
> image     t-firefox-planet-gnome  6451.82 (6500.78 0.58%) -> 6078.32 (6194.27 1.46%):  1.06x speedup
> image            t-chromium-tabs  2001.25 (2007.24 0.43%) -> 1901.80 (1905.46 0.79%):  1.05x speedup
>
> It looks like the performance improvements are a lot less impressive
> on Playstation3 than on your POWER8 hardware. I'm not too surprised,
> because the powerpc part of the Cell BE used in Playstation3 is
> an in-order processor. Intrinsics compiled by GCC also show
> much better performance improvements on out-of-order Intel processors
> than on in-order Atoms. And NEON intrinsics used to be (and probably
> still are) a performance disaster on in-order ARM processors too.
>
> However, there is also a substantial difference in how unaligned
> loads are handled on big and little endian powerpc processors. For
> comparison, it would be interesting to see similar benchmark results
> (the cumulative result of all patches applied) on little endian POWER8
> and also on some out-of-order big endian powerpc processor.
>
> Regarding the fill fast path contributing the most: we should probably
> unroll the fill loop in the generic C code too.
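
For illustration, unrolling a generic C fill loop could look roughly like
the sketch below. This is not pixman's actual code: the function names
fill_row32_unrolled and fill_rect32 are hypothetical, the unroll factor of
4 is an arbitrary choice, and the sketch assumes a 32 bpp destination with
a non-negative stride counted in uint32_t units.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not part of pixman): write `filler` into `width`
 * consecutive 32-bit pixels, four stores per loop iteration. */
static void
fill_row32_unrolled (uint32_t *dst, size_t width, uint32_t filler)
{
    /* Main loop: handle four pixels at a time. */
    while (width >= 4)
    {
        dst[0] = filler;
        dst[1] = filler;
        dst[2] = filler;
        dst[3] = filler;
        dst += 4;
        width -= 4;
    }

    /* Tail: the remaining 0..3 pixels. */
    while (width--)
        *dst++ = filler;
}

/* Hypothetical wrapper: fill a width x height rectangle at (x, y).
 * `stride` is the distance between scanlines in uint32_t units and is
 * assumed to be non-negative here. */
static void
fill_rect32 (uint32_t *bits, int stride,
             int x, int y, int width, int height,
             uint32_t filler)
{
    uint32_t *row = bits + (ptrdiff_t) y * stride + x;

    while (height-- > 0)
    {
        fill_row32_unrolled (row, (size_t) width, filler);
        row += stride;
    }
}
```

Whether this helps depends on the compiler and the target. Some compilers
already unroll or vectorize such a loop at higher optimization levels, so
any gain would need to be measured on the hardware in question.
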
>
> In general, all these patches look good to me. It's a major step in
> catching up with the other architectures in pixman. The performance
> improvements look very nice. I'll take a closer look at the individual
> patches later today.
>
> --
> Best regards,
> Siarhei Siamashka

Thanks!

      Oded