[Pixman] [PATCH 00/12] Implement more vmx fast paths

Thu Jul 16 01:00:23 PDT 2015

On Wed, Jul 15, 2015 at 6:48 PM, Adam Jackson <ajax at redhat.com> wrote:
> On Thu, 2015-07-02 at 13:04 +0300, Oded Gabbay wrote:
>> Hi,
>>
>> This patch-set implements the most heavily used fast paths, according to
>> profiling done by me using the cairo traces package.
>
> I finally got a chance to try this series on a power7, and the results
> are... mixed.  A sampling of x11perf numbers (against Xvfb, just
> switching pixman before and after):
>
>       before          after                Operation
> ------------   -------------------------   -------------------------
>    6856255.6      5564651.7 (     0.812)   10x10 rectangle
>     125522.9       455209.1 (     3.627)   100x100 rectangle
>       5419.2        29705.8 (     5.482)   500x500 rectangle
>
> This one is telling, I think.  This should be the vmx_fill path, and it
> looks like a nice win for large ops but a hit for small ops.  Is the
> vmx setup cost that high, or is there something else going on?
>
Yes, the setup cost is that high for fill :(
I noticed this right when I started converting the functions from sse2
to vmx. The reason is that for every line in the image, you first
align to 16 bytes and only then start to use vmx. Aligning to 16
bytes is costly! If the image is narrow, you may not even reach any
vmx operations at all! In that case, the C fast-path is of
course faster.
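
To make the per-line cost concrete, here is a rough sketch in plain C.
It is not the actual pixman code: the names split_line and line_split_t
are made up for illustration, and it assumes 32bpp pixels and a
4-byte-aligned destination. It only counts how many pixels of one
scanline end up in the scalar head, the 16-byte vector body, and the
scalar tail.

```c
#include <stdint.h>

/* Hypothetical bookkeeping type: pixels handled by each phase of a line. */
typedef struct
{
    int head; /* single pixels written until dst is 16-byte aligned */
    int body; /* pixels written with 16-byte vector stores */
    int tail; /* leftover single pixels after the vector body */
} line_split_t;

/* Assumes 32bpp (4 bytes per pixel) and a 4-byte-aligned dst_addr. */
static line_split_t
split_line (uintptr_t dst_addr, int width)
{
    line_split_t s = { 0, 0, 0 };

    /* Scalar head: one pixel at a time until the address is 16-byte aligned. */
    while (width > 0 && (dst_addr & 15))
    {
        dst_addr += 4;
        width--;
        s.head++;
    }

    /* Vector body: whole groups of 4 pixels (16 bytes). */
    s.body = width & ~3;

    /* Scalar tail: whatever does not fill a whole vector. */
    s.tail = width & 3;

    return s;
}
```

For a 10-pixel line starting 4 bytes past a 16-byte boundary, this
gives 3 head pixels, 4 vector pixels and 3 tail pixels, so only one
vector store is issued per line. For a 3-pixel line at the same offset,
no vector store is issued at all.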

I think the next optimization is to split the implementation into
POWER8 and !POWER8 variants. On POWER8, you can do unaligned access with
almost no penalty, so it is better to drop the alignment requirement and
use vmx from the start. But for POWER7 and below, we need to keep the
current code.
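
As a rough illustration of what dropping the alignment step would buy,
the sketch below counts how many pixels of one scanline would go
through 16-byte vector stores in each case. It is a model only, not
pixman code: vector_pixels and the unaligned_ok flag are hypothetical,
and 32bpp pixels with a 4-byte-aligned destination are assumed.

```c
#include <stdint.h>

/* Hypothetical helper: number of 32bpp pixels in one scanline that
 * would be written with 16-byte vector stores.
 * unaligned_ok != 0 models a CPU where unaligned vector access is
 * cheap, so the scalar alignment head is skipped entirely. */
static int
vector_pixels (uintptr_t dst_addr, int width, int unaligned_ok)
{
    if (!unaligned_ok)
    {
        /* Aligned-only variant: burn pixels one at a time until the
         * destination reaches a 16-byte boundary. */
        while (width > 0 && (dst_addr & 15))
        {
            dst_addr += 4;
            width--;
        }
    }

    /* Whole groups of 4 pixels go through the vector path. */
    return width & ~3;
}
```

With a 10-pixel line starting 4 bytes past a 16-byte boundary, the
aligned-only variant vectorizes 4 pixels, while the unaligned variant
vectorizes 8. Whether that translates into a real speedup still has to
be measured on hardware.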

Another option, IMO, is to detect the image size & alignment before
starting, and if the image is small and unaligned, drop to the fallback
(C fast-path).
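
A minimal sketch of such a check, in plain C. Everything here is an
assumption for illustration: the function name use_c_fallback, its
parameters, and especially the width threshold, which is an arbitrary
placeholder that would have to be tuned by benchmarking.

```c
#include <stdint.h>

/* Hypothetical, untuned cutoff: lines narrower than this many pixels
 * count as "small". The real value would come from measurements. */
#define SMALL_WIDTH_THRESHOLD 16

/* Returns nonzero when the operation is both small and unaligned, in
 * which case the caller would fall back to the C fast-path.
 * dst_addr is the address of the first destination pixel and
 * dst_stride_bytes is the distance between scanlines in bytes. */
static int
use_c_fallback (uintptr_t dst_addr, int dst_stride_bytes, int width)
{
    /* Every line starts 16-byte aligned only if both the first address
     * and the stride are multiples of 16. */
    int unaligned = (dst_addr & 15) != 0 || (dst_stride_bytes & 15) != 0;

    return unaligned && width < SMALL_WIDTH_THRESHOLD;
}
```

The check runs once per operation rather than once per line, so its
own cost should be negligible next to the per-line alignment work it
avoids.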

>    1641838.0      1684290.9 (     1.026)   Char in 80-char aa line (Charter 10)
>     432916.1       466759.2 (     1.078)   Char in 30-char aa line (Charter 24)
>    1412008.5      1545401.0 (     1.094)   Char in 80-char aa line (Courier 12)
>    1440361.7      1947014.6 (     1.352)   Char in 80-char rgb line (Charter 10)
>     384600.6       576289.5 (     1.498)   Char in 30-char rgb line (Charter 24)
>    1258381.8      1811421.7 (     1.439)   Char in 80-char rgb line (Courier 12)
>
> Render text gets faster, nice.
>
>    1202555.7      1228256.6 (     1.021)   Scroll 10x10 pixels
>     162282.8       131857.7 (     0.813)   Scroll 100x100 pixels
>       6819.8         6256.2 (     0.917)   Scroll 500x500 pixels
>    1695720.5      1752339.8 (     1.033)   Copy 10x10 from pixmap to window
>     210222.2       165836.1 (     0.789)   Copy 100x100 from pixmap to window
>      14408.8        10600.1 (     0.736)   Copy 500x500 from pixmap to window
>
> This should be the vmx_blit path, and it gets quite a bit worse for
> large ops.  Eesh.
>
>    1021293.5      1060568.6 (     1.038)   PutImage 10x10 square
>      54803.7        56420.0 (     1.029)   PutImage 100x100 square
>       1933.5         1935.4 (     1.001)   PutImage 500x500 square
>    1418641.0      1432543.1 (     1.010)   ShmPutImage 10x10 square
>     194769.2       160047.5 (     0.822)   ShmPutImage 100x100 square
>      11951.2        10968.1 (     0.918)   ShmPutImage 500x500 square
>
> Again, blit path, and usually worse for large ops.
>
>     576975.4       573388.4 (     0.994)   Composite 10x10 from pixmap to window
>     156830.4       131246.8 (     0.837)   Composite 100x100 from pixmap to window
>      12172.5        10150.2 (     0.834)   Composite 500x500 from pixmap to window
>
> Not-quite-a-blit path, but no transformation, and the same kind of
> performance hit.
>
>     176570.2       176330.2 (     0.999)   Scale 5x5 from pixmap to 10x10 window
>       4598.0         4460.9 (     0.970)   Scale 50x50 from pixmap to 100x100 window
>        189.9          185.9 (     0.979)   Scale 250x250 from pixmap to 500x500 window
>     269540.6       269767.4 (     1.001)   Scale 10x10 from pixmap to 5x5 window
>     267201.2       268220.5 (     1.004)   Scale 100x100 from pixmap to 5x5 window
>        766.8          740.1 (     0.965)   Scale 500x500 from pixmap to 250x250 window
>
> All within the noise margin, so I suspect the series just doesn't hit
> these paths.  (Ignore the implausible numbers from "Scale 100x100",
> that's an x11perf bug I just pushed a fix for.)
>
> I'm a little hesitant to take a 10% to 20% hit to software blit
> performance.  It might be that vmx_blt is just a mistake to try, that
> the CPU and compiler are smarter than we are.
>
Almost the same story as for vmx_fill. Note that I removed this patch
from the v2 I sent yesterday. Your observation reinforces my decision
to remove it.

> - ajax

To sum it up, I think vmx_fill gives a big boost with some
drawbacks, and vmx_blt is the opposite. So I would like to keep
vmx_fill and drop vmx_blt for now.
And, as I said, the next step is to differentiate between POWER8 and
POWER7 (and older).

    Oded