[Pixman] [PATCH 05/12] vmx: implement fast path vmx_composite_copy_area

Siarhei Siamashka siarhei.siamashka at gmail.com
Thu Jul 16 02:21:57 PDT 2015


On Wed, 15 Jul 2015 15:33:35 +0300
Oded Gabbay <oded.gabbay at gmail.com> wrote:

> On Tue, Jul 14, 2015 at 11:41 AM, Siarhei Siamashka
> <siarhei.siamashka at gmail.com> wrote:
> > It is a good idea to have at least one benchmark result in the commit
> > message. Or provide a convincing explanation why this particular code
> > is beneficial. The "no changes were observed" commit message does not
> > do a good job justifying the need for this patch.
> >
> > If none of the traces of cairo applications can show any improvements,
> > then we can at least use "lowlevel-blt-bench src_8888_8888". I get
> > the following results on my Playstation3:
> >
> > == before ==
> >
> >            src_8888_8888 =  L1: 437.68  L2: 277.55  M:159.34 (242.58%)
> >            HT: 90.31  VT: 50.77  R: 50.67  RT: 17.64 ( 148Kops/s)
> >
> > == after ==
> >
> >            src_8888_8888 =  L1: 850.60  L2: 453.91  M:174.26 (265.31%)
> >            HT:105.68  VT: 54.17  R: 54.88  RT: 18.72 ( 154Kops/s)
> >
> >
> > Assuming that the commit message gets updated,
> > Acked-by: Siarhei Siamashka <siarhei.siamashka at gmail.com>
> 
> I went and tested "lowlevel-blt-bench src_8888_8888" with this
> patch and, to my surprise, the fast path made the results much worse
> (except for L1):
> 
> reference memcpy speed = 25058.0MB/s (6264.5MP/s for 32bpp fills)
> 
>             Before         After           Change
> L1          6475.97       7430.21         +14.74%
> L2          6019.82       4284.12         -28.83%
> M           3004.82       2901.99         -3.42%
> HT          1666.21       1278.17         -23.29%
> VT          1719.05       1481.61         -13.81%
> R           759.46        581.20          -23.47%
> RT          218.86        181.07          -17.27%
> Kops/s      1464          1292            -11.75%
> 
> So I thought maybe I would see the improvement only on ppc, so I ran
> it on a POWER7 ppc 64-bit machine:
> 
> reference memcpy speed = 10651.0MB/s (2662.7MP/s for 32bpp fills)
> 
>                 Before          After           Change
> L1              4165.37         4228.5          +1.52%
> L2              4337.16         4241.26         -2.21%
> M               1678.75         1607.69         -4.23%
> HT              886.92          808.52          -8.84%
> VT              781.87          742.73          -5.01%
> R               483.37          445.87          -7.76%
> RT              175.08          165.24          -5.62%
> Kops/s          1193            1135            -4.86%
> 
> Nope, I don't see the improvement there either.
> 
> How did you check it on your machine ? Did you check with only
> blt+copy area patches, or with other patches applied as well ?

It's the difference between having all your patches applied and having
all your patches applied, but with the blt+copy code commented out.

> Currently, because both these patches don't show improvement in
> lowlevel-blt-bench AND in cairo, I'm willing to drop them from this
> patch-set until further investigation.

Thanks. Yes, I also think that it's a good idea to drop this patch.
My point was that every performance-oriented commit needs to be
justified by benchmark results in the commit message. Now we have more
than enough benchmarks and I'm quite happy about this.

With this kind of optimization, we are competing with the memcpy
implementation in glibc, which is expected to use hand-written
assembly because memcpy performance is rather important. But the
performance of memcpy in the standard C library is not guaranteed to
be good. For example, the pixman MIPS assembly code contains a
'pixman_mips_fast_memcpy' function, which is just a general purpose
memcpy implementation that we would not need in the first place if
the glibc code was good enough.

Essentially, this kind of pixman code is a workaround for glibc memcpy
performance problems, unless of course we are providing some extra
tuning. For example, we can assume 4 byte granularity for x8r8g8b8
pixel data and reduce setup overhead, while the standard memcpy has a
bit more work to do because it works with 1 byte granularity. Or, in
the case of the ARM NEON code, we do software prefetch which can span
to the next scanline too.
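
To illustrate the granularity point, here is a minimal hypothetical
sketch (not actual pixman code): a copy loop that may assume whole
32-bit pixels needs no byte-sized head/tail fixup at all, while a
general purpose memcpy has to handle arbitrary byte counts:

#include <stdint.h>

/* Hypothetical example: copy one x8r8g8b8 scanline assuming 4 byte
 * granularity. The setup overhead is zero because every iteration
 * moves exactly one whole pixel. */
static void
copy_scanline_8888 (uint32_t *dst, const uint32_t *src, int width)
{
    int i;

    for (i = 0; i < width; i++)
        dst[i] = src[i];
}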

We already don't accept workarounds for GCC bugs without references
to reported issues in the GCC bugzilla. Perhaps we should extend these
rules to treat blt/memcpy workarounds the same way?


More detailed benchmark/profiling results from my Playstation3
(the current pixman git e2d211ac491cd9884aae7ccaf18e5b3042469cf2
without vmx patches):

$ perf record ./lowlevel-blt-bench -m 500 src_8888_8888 && perf report

    59.31%  lowlevel-blt-be  libc-2.15.so         [.] _wordcopy_fwd_aligned
    11.32%  lowlevel-blt-be  libc-2.15.so         [.] __random
     6.51%  lowlevel-blt-be  lowlevel-blt-bench   [.] bench_L
     5.68%  lowlevel-blt-be  libc-2.15.so         [.] memcpy
     3.03%  lowlevel-blt-be  [kernel.kallsyms]    [k] .raw_local_irq_restore
     1.80%  lowlevel-blt-be  lowlevel-blt-bench   [.] pixman_image_composite32
     1.35%  lowlevel-blt-be  libc-2.15.so         [.] __random_r
     1.10%  lowlevel-blt-be  lowlevel-blt-bench   [.] analyze_extent
     1.02%  lowlevel-blt-be  lowlevel-blt-bench   [.] _pixman_image_validate
     1.00%  lowlevel-blt-be  lowlevel-blt-bench   [.] _pixman_compute_composite_region32
     0.98%  lowlevel-blt-be  lowlevel-blt-bench   [.] bench_RT
     0.93%  lowlevel-blt-be  lowlevel-blt-bench   [.] fast_composite_src_memcpy

And here is the disassembly of '_wordcopy_fwd_aligned' (glibc-2.15):

00094118 <_wordcopy_fwd_aligned>:
   94118:	94 21 ff e0 	stwu    r1,-32(r1)
   9411c:	7d 88 02 a6 	mflr    r12
   94120:	93 c1 00 18 	stw     r30,24(r1)
   94124:	42 9f 00 05 	bcl-    20,4*cr7+so,94128 <_wordcopy_fwd_aligned+0x10>
   94128:	54 a9 16 fa 	rlwinm  r9,r5,2,27,29
   9412c:	7f c8 02 a6 	mflr    r30
   94130:	3f de 00 0f 	addis   r30,r30,15
   94134:	7d 88 03 a6 	mtlr    r12
   94138:	3b de be cc 	addi    r30,r30,-16692
   9413c:	81 5e e7 28 	lwz     r10,-6360(r30)
   94140:	7d 2a 48 2e 	lwzx    r9,r10,r9
   94144:	7d 49 52 14 	add     r10,r9,r10
   94148:	7d 49 03 a6 	mtctr   r10
   9414c:	4e 80 04 20 	bctr
   94150:	81 44 00 00 	lwz     r10,0(r4)
   94154:	38 84 ff f8 	addi    r4,r4,-8
   94158:	39 23 ff f4 	addi    r9,r3,-12
   9415c:	38 a5 00 02 	addi    r5,r5,2
   94160:	81 04 00 0c 	lwz     r8,12(r4)
   94164:	91 49 00 0c 	stw     r10,12(r9)
   94168:	81 44 00 10 	lwz     r10,16(r4)
   9416c:	91 09 00 10 	stw     r8,16(r9)
   94170:	81 04 00 14 	lwz     r8,20(r4)
   94174:	91 49 00 14 	stw     r10,20(r9)
   94178:	34 a5 ff f8 	addic.  r5,r5,-8
   9417c:	81 44 00 18 	lwz     r10,24(r4)
   94180:	91 09 00 18 	stw     r8,24(r9)
   94184:	81 04 00 1c 	lwz     r8,28(r4)
   94188:	91 49 00 1c 	stw     r10,28(r9)
   9418c:	38 69 00 20 	addi    r3,r9,32
   94190:	41 82 00 90 	beq-    94220 <_wordcopy_fwd_aligned+0x108>
   94194:	38 84 00 20 	addi    r4,r4,32
   94198:	81 44 00 00 	lwz     r10,0(r4)
   9419c:	91 03 00 00 	stw     r8,0(r3)
   941a0:	7c 69 1b 78 	mr      r9,r3
   941a4:	81 04 00 04 	lwz     r8,4(r4)
   941a8:	91 43 00 04 	stw     r10,4(r3)
   941ac:	81 44 00 08 	lwz     r10,8(r4)
   941b0:	91 09 00 08 	stw     r8,8(r9)
   941b4:	81 04 00 0c 	lwz     r8,12(r4)
   941b8:	91 49 00 0c 	stw     r10,12(r9)
   941bc:	4b ff ff ac 	b       94168 <_wordcopy_fwd_aligned+0x50>
   941c0:	81 04 00 00 	lwz     r8,0(r4)
   941c4:	39 23 ff f8 	addi    r9,r3,-8
   941c8:	38 84 ff fc 	addi    r4,r4,-4
   941cc:	38 a5 00 01 	addi    r5,r5,1
   941d0:	4b ff ff dc 	b       941ac <_wordcopy_fwd_aligned+0x94>
   941d4:	60 00 00 00 	nop
   941d8:	81 44 00 00 	lwz     r10,0(r4)
   941dc:	38 63 ff fc 	addi    r3,r3,-4
   941e0:	4b ff ff c0 	b       941a0 <_wordcopy_fwd_aligned+0x88>
   941e4:	60 00 00 00 	nop
   941e8:	81 04 00 00 	lwz     r8,0(r4)
   941ec:	38 a5 ff ff 	addi    r5,r5,-1
   941f0:	38 84 00 04 	addi    r4,r4,4
   941f4:	4b ff ff a4 	b       94198 <_wordcopy_fwd_aligned+0x80>
   941f8:	38 a5 00 06 	addi    r5,r5,6
   941fc:	81 44 00 00 	lwz     r10,0(r4)
   94200:	38 84 ff e8 	addi    r4,r4,-24
   94204:	34 a5 ff f8 	addic.  r5,r5,-8
   94208:	81 04 00 1c 	lwz     r8,28(r4)
   9420c:	39 23 ff e4 	addi    r9,r3,-28
   94210:	91 49 00 1c 	stw     r10,28(r9)
   94214:	38 69 00 20 	addi    r3,r9,32
   94218:	40 82 ff 7c 	bne+    94194 <_wordcopy_fwd_aligned+0x7c>
   9421c:	60 00 00 00 	nop
   94220:	91 09 00 20 	stw     r8,32(r9)
   94224:	83 c1 00 18 	lwz     r30,24(r1)
   94228:	38 21 00 20 	addi    r1,r1,32
   9422c:	4e 80 00 20 	blr
   94230:	81 04 00 00 	lwz     r8,0(r4)
   94234:	39 23 ff e8 	addi    r9,r3,-24
   94238:	38 84 ff ec 	addi    r4,r4,-20
   9423c:	38 a5 00 05 	addi    r5,r5,5
   94240:	4b ff ff 38 	b       94178 <_wordcopy_fwd_aligned+0x60>
   94244:	60 00 00 00 	nop
   94248:	81 44 00 00 	lwz     r10,0(r4)
   9424c:	39 23 ff ec 	addi    r9,r3,-20
   94250:	38 84 ff f0 	addi    r4,r4,-16
   94254:	38 a5 00 04 	addi    r5,r5,4
   94258:	4b ff ff 18 	b       94170 <_wordcopy_fwd_aligned+0x58>
   9425c:	60 00 00 00 	nop
   94260:	81 04 00 00 	lwz     r8,0(r4)
   94264:	39 23 ff f0 	addi    r9,r3,-16
   94268:	38 84 ff f4 	addi    r4,r4,-12
   9426c:	38 a5 00 03 	addi    r5,r5,3
   94270:	4b ff fe f8 	b       94168 <_wordcopy_fwd_aligned+0x50>
   94274:	60 00 00 00 	nop
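
For reference, here is a rough C equivalent of what this routine does
(simplified from the generic wordcopy code in glibc; the real version
is 8-way unrolled with a computed-goto dispatch, which is what
produces the jump table at the top of the listing):

#include <stddef.h>

typedef unsigned long op_t;

/* Simplified sketch: both pointers are word aligned and 'len' is
 * counted in words. The data moves one 32-bit word per load/store
 * pair on this target - no vector registers are involved, which is
 * why it loses to VMX code on large copies. */
static void
wordcopy_fwd_aligned (long dstp, long srcp, size_t len)
{
    while (len-- > 0)
    {
        *(op_t *) dstp = *(op_t *) srcp;
        srcp += sizeof (op_t);
        dstp += sizeof (op_t);
    }
}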

The disassembly of the main loop in the vmx code was provided here:
    http://lists.freedesktop.org/archives/pixman/2015-July/003798.html

    b170:	39 69 00 10 	addi    r11,r9,16
    b174:	7c 80 48 0c 	lvsl    v4,r0,r9
    b178:	38 a9 00 20 	addi    r5,r9,32
    b17c:	7c a0 48 ce 	lvx     v5,r0,r9
    b180:	38 c9 00 30 	addi    r6,r9,48
    b184:	7d 87 48 ce 	lvx     v12,r7,r9
    b188:	7c c0 58 0c 	lvsl    v6,r0,r11
    b18c:	39 29 00 40 	addi    r9,r9,64
    b190:	7c e0 58 ce 	lvx     v7,r0,r11
    b194:	7d a7 58 ce 	lvx     v13,r7,r11
    b198:	39 6a 00 10 	addi    r11,r10,16
    b19c:	7d 00 28 0c 	lvsl    v8,r0,r5
    b1a0:	7d 20 28 ce 	lvx     v9,r0,r5
    b1a4:	7c 27 28 ce 	lvx     v1,r7,r5
    b1a8:	38 aa 00 20 	addi    r5,r10,32
    b1ac:	7d 40 30 0c 	lvsl    v10,r0,r6
    b1b0:	7d 60 30 ce 	lvx     v11,r0,r6
    b1b4:	7c 07 30 ce 	lvx     v0,r7,r6
    b1b8:	38 ca 00 30 	addi    r6,r10,48
    b1bc:	11 85 61 2b 	vperm   v12,v5,v12,v4
    b1c0:	11 a7 69 ab 	vperm   v13,v7,v13,v6
    b1c4:	10 29 0a 2b 	vperm   v1,v9,v1,v8
    b1c8:	10 0b 02 ab 	vperm   v0,v11,v0,v10
    b1cc:	7d 80 51 ce 	stvx    v12,r0,r10
    b1d0:	39 4a 00 40 	addi    r10,r10,64
    b1d4:	7d a0 59 ce 	stvx    v13,r0,r11
    b1d8:	7c 20 29 ce 	stvx    v1,r0,r5
    b1dc:	7c 00 31 ce 	stvx    v0,r0,r6
    b1e0:	42 00 ff 90 	bdnz+   b170 <vmx_blt.part.4+0x260>


And I already mentioned that I was not particularly happy about how
unaligned loads are implemented on big endian powerpc systems. However,
even such non-perfect VMX code was still faster than the non-VMX code
in glibc-2.15 on my Playstation3. I'll try to update glibc to the more
recent version 2.20 (this system has not been updated for years) and
check if it makes any difference.
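
For context, the lvsl/lvx/vperm pattern in that loop is the classic
big endian AltiVec unaligned load, roughly equivalent to the sketch
below: every unaligned 16 byte load costs two aligned loads plus a
permute, i.e. three vector ops instead of one.

#include <stdint.h>
#include <altivec.h>

/* Sketch of a big endian AltiVec unaligned load along the lines of
 * the vmx code above: lvsl computes a shift mask from the pointer
 * misalignment, and vperm merges the two aligned loads that straddle
 * the requested 16 bytes. */
static inline vector unsigned int
load_128_unaligned (const uint32_t *src)
{
    vector unsigned char perm = vec_lvsl (0, src);   /* shift mask    */
    vector unsigned int  lo   = vec_ld (0, src);     /* aligned low   */
    vector unsigned int  hi   = vec_ld (15, src);    /* aligned high  */

    return vec_perm (lo, hi, perm);                  /* merge halves  */
}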

Either way, Playstation3/Cell is a dead end and I doubt that there
are many of them in the whole world still capable of running Linux
(this functionality had been officially crippled by firmware updates
from Sony). Your results from POWER7/POWER8 are surely a lot more
relevant.

PS. We might want to revisit the SSE2 code and check whether having
the blt code there is justified (benchmark it against the memcpy from
a modern glibc).

-- 
Best regards,
Siarhei Siamashka

