[Pixman] [PATCH 00/12] Implement more vmx fast paths

Wed Jul 15 08:48:28 PDT 2015

On Thu, 2015-07-02 at 13:04 +0300, Oded Gabbay wrote:
> Hi,
> 
> This patch-set implements the most heavily used fast paths, according to
> profiling done by me using the cairo traces package.

I finally got a chance to try this series on a power7, and the results
are... mixed.  A sampling of x11perf numbers (against Xvfb, just
switching pixman before and after):

      before          after                Operation
------------   -------------------------   -------------------------
   6856255.6      5564651.7 (     0.812)   10x10 rectangle 
    125522.9       455209.1 (     3.627)   100x100 rectangle 
      5419.2        29705.8 (     5.482)   500x500 rectangle 

This one is telling, I think.  This should be the vmx_fill path, and it
looks like a nice win for large ops but a hit for small ops.  Is the
vmx setup cost that high, or is there something else going on?
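
One way to reason about the small-versus-large split is a fixed-overhead
model: a vector path that pays a one-time setup cost per call, but is
cheaper per pixel, loses below some break-even size. The sketch below is
illustrative only. The cost constants are hypothetical (not measured on
power7), the function names are made up for this example, and the model
does not show that setup cost is actually the cause here.

```python
def speedup(pixels, setup=0.0, scalar_per_px=1.0, vector_per_px=0.25):
    """Ratio of scalar time to vector time under a fixed-overhead model.

    All costs are in arbitrary, hypothetical units.
    """
    scalar_time = pixels * scalar_per_px
    vector_time = setup + pixels * vector_per_px
    return scalar_time / vector_time


def break_even_pixels(setup, scalar_per_px, vector_per_px):
    """Pixel count at which both paths take the same time."""
    return setup / (scalar_per_px - vector_per_px)


if __name__ == "__main__":
    # Hypothetical costs: 200 units of setup, 4x cheaper per pixel.
    for side in (10, 100, 500):
        n = side * side
        print(side, round(speedup(n, setup=200.0), 3))
```

With these made-up constants the 10x10 case comes out below 1.0 and the
larger cases above it, which is the same shape as the fill numbers. If
the real cause is per-call overhead, a size threshold that falls back to
the scalar path for small fills would be the obvious thing to try.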

   1641838.0      1684290.9 (     1.026)   Char in 80-char aa line (Charter 10) 
    432916.1       466759.2 (     1.078)   Char in 30-char aa line (Charter 24) 
   1412008.5      1545401.0 (     1.094)   Char in 80-char aa line (Courier 12) 
   1440361.7      1947014.6 (     1.352)   Char in 80-char rgb line (Charter 10) 
    384600.6       576289.5 (     1.498)   Char in 30-char rgb line (Charter 24) 
   1258381.8      1811421.7 (     1.439)   Char in 80-char rgb line (Courier 12) 

Render text gets faster, nice.

   1202555.7      1228256.6 (     1.021)   Scroll 10x10 pixels 
    162282.8       131857.7 (     0.813)   Scroll 100x100 pixels 
      6819.8         6256.2 (     0.917)   Scroll 500x500 pixels 
   1695720.5      1752339.8 (     1.033)   Copy 10x10 from pixmap to window 
    210222.2       165836.1 (     0.789)   Copy 100x100 from pixmap to window 
     14408.8        10600.1 (     0.736)   Copy 500x500 from pixmap to window

This should be the vmx_blt path, and it gets quite a bit worse for
large ops.  Eesh.

   1021293.5      1060568.6 (     1.038)   PutImage 10x10 square 
     54803.7        56420.0 (     1.029)   PutImage 100x100 square 
      1933.5         1935.4 (     1.001)   PutImage 500x500 square 
   1418641.0      1432543.1 (     1.010)   ShmPutImage 10x10 square 
    194769.2       160047.5 (     0.822)   ShmPutImage 100x100 square 
     11951.2        10968.1 (     0.918)   ShmPutImage 500x500 square 

Again, blit path, and usually worse for large ops.

    576975.4       573388.4 (     0.994)   Composite 10x10 from pixmap to window 
    156830.4       131246.8 (     0.837)   Composite 100x100 from pixmap to window 
     12172.5        10150.2 (     0.834)   Composite 500x500 from pixmap to window 

Not-quite-a-blit path, but no transformation, and the same kind of
performance hit.

    176570.2       176330.2 (     0.999)   Scale 5x5 from pixmap to 10x10 window 
      4598.0         4460.9 (     0.970)   Scale 50x50 from pixmap to 100x100 window 
       189.9          185.9 (     0.979)   Scale 250x250 from pixmap to 500x500 window 
    269540.6       269767.4 (     1.001)   Scale 10x10 from pixmap to 5x5 window 
    267201.2       268220.5 (     1.004)   Scale 100x100 from pixmap to 5x5 window 
       766.8          740.1 (     0.965)   Scale 500x500 from pixmap to 250x250 window

All within the noise margin, so I suspect the series just doesn't hit
these paths.  (Ignore the implausible numbers from "Scale 100x100",
that's an x11perf bug I just pushed a fix for.)
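
For reference, the parenthesized column in these tables is after/before,
so values below 1.0 are slowdowns. A minimal sketch of that calculation
with a noise band follows; the 5% band and the `classify` helper are
assumptions made for illustration, not something x11perf reports.

```python
def classify(before, after, noise=0.05):
    """Return (ratio, verdict) for one x11perf row.

    ratio is after/before; the 5% noise band is an assumed threshold.
    """
    ratio = after / before
    if ratio < 1.0 - noise:
        verdict = "regression"
    elif ratio > 1.0 + noise:
        verdict = "improvement"
    else:
        verdict = "noise"
    return round(ratio, 3), verdict


# Rows copied from the tables in this message.
samples = [
    ("500x500 rectangle", 5419.2, 29705.8),
    ("Copy 500x500 from pixmap to window", 14408.8, 10600.1),
    ("Scale 5x5 from pixmap to 10x10 window", 176570.2, 176330.2),
]

if __name__ == "__main__":
    for name, before, after in samples:
        print(name, *classify(before, after))
```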

I'm a little hesitant to take a hit of roughly 10% to 25% to software
blit performance.  It might be that vmx_blt is just a mistake to try,
and that the CPU and compiler are smarter than we are.

- ajax