[Pixman] [PATCH 05/12] vmx: implement fast path vmx_composite_copy_area
Siarhei Siamashka
siarhei.siamashka at gmail.com
Thu Jul 16 02:21:57 PDT 2015
On Wed, 15 Jul 2015 15:33:35 +0300
Oded Gabbay <oded.gabbay at gmail.com> wrote:
> On Tue, Jul 14, 2015 at 11:41 AM, Siarhei Siamashka
> <siarhei.siamashka at gmail.com> wrote:
> > It is a good idea to have at least one benchmark result in the commit
> > message. Or provide a convincing explanation why this particular code
> > is beneficial. The "no changes were observed" commit message does not
> > do a good job justifying the need for this patch.
> >
> > If none of the traces of cairo applications can show any improvements,
> > then we can at least use "lowlevel-blt-bench src_8888_8888". I get
> > the following results on my Playstation3:
> >
> > == before ==
> >
> > src_8888_8888 = L1: 437.68 L2: 277.55 M:159.34 (242.58%)
> > HT: 90.31 VT: 50.77 R: 50.67 RT: 17.64 ( 148Kops/s)
> >
> > == after ==
> >
> > src_8888_8888 = L1: 850.60 L2: 453.91 M:174.26 (265.31%)
> > HT:105.68 VT: 54.17 R: 54.88 RT: 18.72 ( 154Kops/s)
> >
> >
> > Assuming that the commit message gets updated,
> > Acked-by: Siarhei Siamashka <siarhei.siamashka at gmail.com>
>
> I went and tested the "lowlevel-blt-bench src_8888_8888" with this
> patch and to my surprise, the fast path made the results much worse
> (except for L1):
>
> reference memcpy speed = 25058.0MB/s (6264.5MP/s for 32bpp fills)
>
>            Before      After     Change
> L1        6475.97    7430.21    +14.74%
> L2        6019.82    4284.12    -28.83%
> M         3004.82    2901.99     -3.42%
> HT        1666.21    1278.17    -23.29%
> VT        1719.05    1481.61    -13.81%
> R          759.46     581.2     -23.47%
> RT         218.86     181.07    -17.27%
> Kops/s       1464       1292    -11.75%
>
> So I thought maybe I would see the improvement only on ppc, so I ran it
> on POWER7 ppc 64-bit:
>
> reference memcpy speed = 10651.0MB/s (2662.7MP/s for 32bpp fills)
>
>            Before      After     Change
> L1        4165.37    4228.5      +1.52%
> L2        4337.16    4241.26     -2.21%
> M         1678.75    1607.69     -4.23%
> HT         886.92     808.52     -8.84%
> VT         781.87     742.73     -5.01%
> R          483.37     445.87     -7.76%
> RT         175.08     165.24     -5.62%
> Kops/s       1193       1135     -4.86%
>
> Nope, I don't see an improvement there either.
>
> How did you check it on your machine? Did you check with only the
> blt+copy area patches, or with other patches applied as well?
It's the difference between having all your patches applied, and having
all your patches applied but with the blt+copy code commented out.
> Currently, because both these patches don't show improvement in
> lowlevel-blt-bench AND in cairo, I'm willing to drop them from this
> patch-set until further investigation.
Thanks. Yes, I also think that it's a good idea to drop this patch.
My point was that every performance-related commit needs to be justified
by benchmark results in the commit message. Now we have more than
enough benchmarks and I'm quite happy about this.
With this kind of optimization, we are competing with the memcpy
implementation in glibc, which is ideally expected to use hand-written
assembly because memcpy performance is rather important. But the
performance of memcpy in the standard C library is not guaranteed to be
good. For example, the pixman MIPS assembly code contains a
'pixman_mips_fast_memcpy' function, which is just a general purpose
memcpy implementation, and we would not need it in the first place if
the glibc code were good enough.
Essentially, this kind of pixman code is a workaround for glibc memcpy
performance problems, unless we are providing some extra tuning on top.
For example, we can assume 4-byte granularity for x8r8g8b8 pixel data
and reduce the setup overhead, while a standard memcpy has a bit more
work to do because it has to handle 1-byte granularity. Or, in the case
of the ARM NEON code, we do software prefetch, which can also span into
the next scanline.
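As a rough illustration (not the actual pixman code, just a hypothetical
sketch), a scanline copy that is allowed to assume 4-byte aligned,
4-byte granular pixel data can skip the byte-level head/tail handling
that a general-purpose memcpy has to perform:

#include <stdint.h>

/* Hypothetical sketch: copy one x8r8g8b8 scanline, assuming the usual
 * pixman guarantee of 4-byte aligned, 4-byte granular pixel data.
 * A general-purpose memcpy cannot assume this and needs extra setup
 * code to deal with arbitrary byte alignment and length.
 */
static void
copy_scanline_8888 (uint32_t *dst, const uint32_t *src, int width)
{
    int i;

    /* no head/tail byte loops needed: everything is whole pixels */
    for (i = 0; i < width; i++)
        dst[i] = src[i];
}
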
We already don't accept workarounds for GCC bugs without references to
reported issues in the GCC bugzilla. Perhaps we should extend this rule
to cover blt/memcpy workarounds in the same way?
More detailed benchmark/profiling results from my Playstation3
(the current pixman git e2d211ac491cd9884aae7ccaf18e5b3042469cf2
without vmx patches):
$ perf record ./lowlevel-blt-bench -m 500 src_8888_8888 && perf report
59.31% lowlevel-blt-be libc-2.15.so [.] _wordcopy_fwd_aligned
11.32% lowlevel-blt-be libc-2.15.so [.] __random
6.51% lowlevel-blt-be lowlevel-blt-bench [.] bench_L
5.68% lowlevel-blt-be libc-2.15.so [.] memcpy
3.03% lowlevel-blt-be [kernel.kallsyms] [k] .raw_local_irq_restore
1.80% lowlevel-blt-be lowlevel-blt-bench [.] pixman_image_composite32
1.35% lowlevel-blt-be libc-2.15.so [.] __random_r
1.10% lowlevel-blt-be lowlevel-blt-bench [.] analyze_extent
1.02% lowlevel-blt-be lowlevel-blt-bench [.] _pixman_image_validate
1.00% lowlevel-blt-be lowlevel-blt-bench [.] _pixman_compute_composite_region32
0.98% lowlevel-blt-be lowlevel-blt-bench [.] bench_RT
0.93% lowlevel-blt-be lowlevel-blt-bench [.] fast_composite_src_memcpy
And here is the disassembly of '_wordcopy_fwd_aligned' (glibc-2.15):
00094118 <_wordcopy_fwd_aligned>:
94118: 94 21 ff e0 stwu r1,-32(r1)
9411c: 7d 88 02 a6 mflr r12
94120: 93 c1 00 18 stw r30,24(r1)
94124: 42 9f 00 05 bcl- 20,4*cr7+so,94128 <_wordcopy_fwd_aligned+0x10>
94128: 54 a9 16 fa rlwinm r9,r5,2,27,29
9412c: 7f c8 02 a6 mflr r30
94130: 3f de 00 0f addis r30,r30,15
94134: 7d 88 03 a6 mtlr r12
94138: 3b de be cc addi r30,r30,-16692
9413c: 81 5e e7 28 lwz r10,-6360(r30)
94140: 7d 2a 48 2e lwzx r9,r10,r9
94144: 7d 49 52 14 add r10,r9,r10
94148: 7d 49 03 a6 mtctr r10
9414c: 4e 80 04 20 bctr
94150: 81 44 00 00 lwz r10,0(r4)
94154: 38 84 ff f8 addi r4,r4,-8
94158: 39 23 ff f4 addi r9,r3,-12
9415c: 38 a5 00 02 addi r5,r5,2
94160: 81 04 00 0c lwz r8,12(r4)
94164: 91 49 00 0c stw r10,12(r9)
94168: 81 44 00 10 lwz r10,16(r4)
9416c: 91 09 00 10 stw r8,16(r9)
94170: 81 04 00 14 lwz r8,20(r4)
94174: 91 49 00 14 stw r10,20(r9)
94178: 34 a5 ff f8 addic. r5,r5,-8
9417c: 81 44 00 18 lwz r10,24(r4)
94180: 91 09 00 18 stw r8,24(r9)
94184: 81 04 00 1c lwz r8,28(r4)
94188: 91 49 00 1c stw r10,28(r9)
9418c: 38 69 00 20 addi r3,r9,32
94190: 41 82 00 90 beq- 94220 <_wordcopy_fwd_aligned+0x108>
94194: 38 84 00 20 addi r4,r4,32
94198: 81 44 00 00 lwz r10,0(r4)
9419c: 91 03 00 00 stw r8,0(r3)
941a0: 7c 69 1b 78 mr r9,r3
941a4: 81 04 00 04 lwz r8,4(r4)
941a8: 91 43 00 04 stw r10,4(r3)
941ac: 81 44 00 08 lwz r10,8(r4)
941b0: 91 09 00 08 stw r8,8(r9)
941b4: 81 04 00 0c lwz r8,12(r4)
941b8: 91 49 00 0c stw r10,12(r9)
941bc: 4b ff ff ac b 94168 <_wordcopy_fwd_aligned+0x50>
941c0: 81 04 00 00 lwz r8,0(r4)
941c4: 39 23 ff f8 addi r9,r3,-8
941c8: 38 84 ff fc addi r4,r4,-4
941cc: 38 a5 00 01 addi r5,r5,1
941d0: 4b ff ff dc b 941ac <_wordcopy_fwd_aligned+0x94>
941d4: 60 00 00 00 nop
941d8: 81 44 00 00 lwz r10,0(r4)
941dc: 38 63 ff fc addi r3,r3,-4
941e0: 4b ff ff c0 b 941a0 <_wordcopy_fwd_aligned+0x88>
941e4: 60 00 00 00 nop
941e8: 81 04 00 00 lwz r8,0(r4)
941ec: 38 a5 ff ff addi r5,r5,-1
941f0: 38 84 00 04 addi r4,r4,4
941f4: 4b ff ff a4 b 94198 <_wordcopy_fwd_aligned+0x80>
941f8: 38 a5 00 06 addi r5,r5,6
941fc: 81 44 00 00 lwz r10,0(r4)
94200: 38 84 ff e8 addi r4,r4,-24
94204: 34 a5 ff f8 addic. r5,r5,-8
94208: 81 04 00 1c lwz r8,28(r4)
9420c: 39 23 ff e4 addi r9,r3,-28
94210: 91 49 00 1c stw r10,28(r9)
94214: 38 69 00 20 addi r3,r9,32
94218: 40 82 ff 7c bne+ 94194 <_wordcopy_fwd_aligned+0x7c>
9421c: 60 00 00 00 nop
94220: 91 09 00 20 stw r8,32(r9)
94224: 83 c1 00 18 lwz r30,24(r1)
94228: 38 21 00 20 addi r1,r1,32
9422c: 4e 80 00 20 blr
94230: 81 04 00 00 lwz r8,0(r4)
94234: 39 23 ff e8 addi r9,r3,-24
94238: 38 84 ff ec addi r4,r4,-20
9423c: 38 a5 00 05 addi r5,r5,5
94240: 4b ff ff 38 b 94178 <_wordcopy_fwd_aligned+0x60>
94244: 60 00 00 00 nop
94248: 81 44 00 00 lwz r10,0(r4)
9424c: 39 23 ff ec addi r9,r3,-20
94250: 38 84 ff f0 addi r4,r4,-16
94254: 38 a5 00 04 addi r5,r5,4
94258: 4b ff ff 18 b 94170 <_wordcopy_fwd_aligned+0x58>
9425c: 60 00 00 00 nop
94260: 81 04 00 00 lwz r8,0(r4)
94264: 39 23 ff f0 addi r9,r3,-16
94268: 38 84 ff f4 addi r4,r4,-12
9426c: 38 a5 00 03 addi r5,r5,3
94270: 4b ff fe f8 b 94168 <_wordcopy_fwd_aligned+0x50>
94274: 60 00 00 00 nop
The disassembly of the main loop in the vmx code was provided here:
http://lists.freedesktop.org/archives/pixman/2015-July/003798.html
b170: 39 69 00 10 addi r11,r9,16
b174: 7c 80 48 0c lvsl v4,r0,r9
b178: 38 a9 00 20 addi r5,r9,32
b17c: 7c a0 48 ce lvx v5,r0,r9
b180: 38 c9 00 30 addi r6,r9,48
b184: 7d 87 48 ce lvx v12,r7,r9
b188: 7c c0 58 0c lvsl v6,r0,r11
b18c: 39 29 00 40 addi r9,r9,64
b190: 7c e0 58 ce lvx v7,r0,r11
b194: 7d a7 58 ce lvx v13,r7,r11
b198: 39 6a 00 10 addi r11,r10,16
b19c: 7d 00 28 0c lvsl v8,r0,r5
b1a0: 7d 20 28 ce lvx v9,r0,r5
b1a4: 7c 27 28 ce lvx v1,r7,r5
b1a8: 38 aa 00 20 addi r5,r10,32
b1ac: 7d 40 30 0c lvsl v10,r0,r6
b1b0: 7d 60 30 ce lvx v11,r0,r6
b1b4: 7c 07 30 ce lvx v0,r7,r6
b1b8: 38 ca 00 30 addi r6,r10,48
b1bc: 11 85 61 2b vperm v12,v5,v12,v4
b1c0: 11 a7 69 ab vperm v13,v7,v13,v6
b1c4: 10 29 0a 2b vperm v1,v9,v1,v8
b1c8: 10 0b 02 ab vperm v0,v11,v0,v10
b1cc: 7d 80 51 ce stvx v12,r0,r10
b1d0: 39 4a 00 40 addi r10,r10,64
b1d4: 7d a0 59 ce stvx v13,r0,r11
b1d8: 7c 20 29 ce stvx v1,r0,r5
b1dc: 7c 00 31 ce stvx v0,r0,r6
b1e0: 42 00 ff 90 bdnz+ b170 <vmx_blt.part.4+0x260>
And I already mentioned that I was not particularly happy about how
unaligned loads are implemented on big endian powerpc systems. However,
even such non-perfect VMX code was still faster than the non-VMX code
in glibc-2.15 on my Playstation3. I'll try to update glibc to a more
recent version (2.20; this system has not been updated for years) and
check if it makes any difference.
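For reference, the lvsl/lvx/vperm sequence in the loop above is the
classic big endian AltiVec idiom for loading from a potentially
unaligned address. A minimal sketch with AltiVec intrinsics (hypothetical
names, not the actual vmx_blt code) looks roughly like this:

#include <altivec.h>

/* Fetch the two 16-byte aligned blocks that straddle 'p' and merge
 * them with the permute vector produced by vec_lvsl(). This is what
 * the lvsl/lvx/vperm instructions in the disassembly above do, at the
 * cost of two loads and a permute per unaligned vector.
 */
static vector unsigned char
load_128_unaligned (const unsigned char *p)
{
    vector unsigned char lo   = vec_ld (0, p);    /* aligned block containing p */
    vector unsigned char hi   = vec_ld (15, p);   /* next aligned block         */
    vector unsigned char perm = vec_lvsl (0, p);  /* shift amount from p & 15   */

    return vec_perm (lo, hi, perm);
}
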
Either way, Playstation3/Cell is a dead end and I doubt that there
are many of them in the whole world still capable of running Linux
(this functionality had been officially crippled by firmware updates
from Sony). Your results from POWER7/POWER8 are surely a lot more
relevant.
PS. We might want to revisit the SSE2 code and check whether having the
blt fast path there is justified (benchmark it against the memcpy from a
modern glibc).
--
Best regards,
Siarhei Siamashka