x11perf putimagexy10 extremely slow for glamor on i965

Mon May 15 21:11:31 UTC 2017

On Sun, 2017-05-14 at 16:16 -0700, Keith Packard wrote:
> > Clemens Eisserer <linuxhippy at gmail.com> writes:
> 
> > I have observed extremely low x11perf-putimagexy10 results when using
> > glamor on top of the i965 driver - while shmput10 results are quite
> > ok.
> 
> Why do you care? xy format images are not something you should be using
> at all; they were designed for 1980s era Apollo workstations.

Indeed. But for the morbidly curious...

Glamor mostly doesn't accelerate xy image handling because GL mostly
doesn't believe in planar bitmaps as, like, a thing. The fallback path
creates a pbo, copies out from the drawable's fbo to the pbo,
glMapBuffer's the pbo, runs the op against the pbo with fb, and then
copies the data back to the fbo. That's bad enough on its own: every
operation now costs at least two extra blits.

Underneath, fbPutXYImage works by walking the bit planes in order and
merging them into the destination. This is just about pathologically
bad, because it means a read/modify/write cycle of the entire
destination _for each plane_. So you're doing many more operations per
pixel, and you're fighting the cache to do it.
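
To make the per-plane cost concrete, here is a minimal sketch of that access pattern. It is not the server's fbPutXYImage; merge_xy_planes is a hypothetical helper, and it assumes planes stored most significant first, MSB-first bit order within each byte, byte-padded plane rows, and a destination of one 32-bit word per pixel. The return value just counts destination read/modify/write operations.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical illustration, not the X server's code: merge an XY-format
 * image (one 1-bit plane per bit of depth, most significant plane first)
 * into a packed destination, one full pass over the destination per plane. */
size_t merge_xy_planes(uint32_t *dst, const uint8_t *planes,
                       int width, int height, int depth)
{
    assert(width > 0 && height > 0);
    assert(depth >= 1 && depth <= 32);

    size_t stride = ((size_t)width + 7) / 8;      /* bytes per plane row (assumed padding) */
    size_t plane_size = stride * (size_t)height;  /* bytes per plane */
    size_t rmw = 0;                               /* destination read/modify/write count */

    for (int p = 0; p < depth; p++) {
        const uint8_t *plane = planes + (size_t)p * plane_size;
        uint32_t bit = (uint32_t)1 << (depth - 1 - p);

        /* Every plane touches every destination pixel again. */
        for (int y = 0; y < height; y++) {
            for (int x = 0; x < width; x++) {
                int set = (plane[(size_t)y * stride + (size_t)x / 8] >> (7 - x % 8)) & 1;
                uint32_t *px = &dst[(size_t)y * (size_t)width + (size_t)x];

                *px = set ? (*px | bit) : (*px & ~bit);
                rmw++;
            }
        }
    }
    return rmw;
}
```

For a 10x10 image at depth 24 this returns 2400, against 100 pixel writes for a single packed copy of the same area.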

You're also not comparing apples to apples in your test. shmput10 is
a 10x10 ZPixmap; you mean to compare against shmputxy10. Interestingly,
at least on my glamor machine, shmputxy10 is _slower_ than non-shm:

 84000 trep @   0.1108 msec (  9030.0/sec): PutImage XY 10x10 square
 72000 trep @   0.1302 msec (  7680.0/sec): ShmPutImage XY 10x10 square

I think that's just a funny interaction with the MIT-SHM code, which
when faced with an xy image will blast it into a (presumably z image)
scratch pixmap first and then CopyArea from that to the destination. If
glamor creates that pixmap on the GPU we're still going to do the same
fallback logic for the xy putimage phase, and then yet another blit
from that to the real destination. Oops. Forcing that pixmap's usage to
be GLAMOR_CREATE_PIXMAP_CPU brings shmputxy10 to 117k ops/sec, which is
quite a bit nicer; probably we should formalize that usage for
ShmPutImage, and maybe do the equivalent trick for wire PutImage too.

But compare this all with leaving your pixels in a sensible format:

6000000 trep @   0.0019 msec (525000.0/sec): PutImage 10x10 square
4800000 trep @   0.0022 msec (454000.0/sec): ShmPutImage 10x10 square

XY images are just losers, don't bother. (ShmPutImage being slower is
curious, and it's slower for larger request sizes too, so there's
definitely something amiss there we should dig into.)

- ajax