[Pixman] [PATCH/RFC 0/2] Faster 90/180/270 degrees rotation

Mon Aug 2 06:57:59 PDT 2010

Siarhei Siamashka <siarhei.siamashka at gmail.com> writes:

> The following patches introduce a new fast path flag and initial
> C fast path code for faster 90/180/270 degrees rotation.
> 
> It is not intended to be committed as is (the flags related part will
> have to be updated after the pending patches from Soeren reach git master).
> 
> Also looks like it is a bit difficult to make a universal C implementation
> which would work optimally on all targets (due to varying cache line sizes,
> hardware prefetch algorithms, TLB properties, etc.). And what is better
> for one architecture, seems to sometimes degrade performance for the other.
> This particular code seems to work fine on Intel Core2 (almost reaching
> performance of a simple nonrotated copy), but does not look very good on
> Intel Atom and ARM Cortex-A8.

Do you have any more details on why this is? The cache line size on
all those CPUs is 64 bytes, so you would think that it should work
equally well on all of them.

In your patch you access the destination in wide columns, which is a
sensible choice, but it does mean that you get a bunch of TLB misses
for the destination. Is this what is going on?
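
To put a rough number on that, here is a small sketch that counts how
many distinct pages a single column strip of the destination touches.
The helper name, the page-aligned buffer and the 4096 byte page size
used in the examples are all assumptions for illustration, not
anything taken from the patch or measured on real hardware.

```c
#include <stddef.h>

/*
 * Hypothetical helper: count the distinct pages touched when writing a
 * vertical strip that is strip_bytes wide and rows tall.
 * Assumes the buffer starts on a page boundary and strip_bytes > 0.
 */
static size_t
pages_touched_by_strip (size_t stride_bytes,
                        size_t rows,
                        size_t strip_bytes,
                        size_t page_size)
{
    size_t count = 0;
    size_t next_uncounted = 0;  /* lowest page index not yet counted */
    size_t r;

    for (r = 0; r < rows; r++)
    {
        size_t first = (r * stride_bytes) / page_size;
        size_t last = (r * stride_bytes + strip_bytes - 1) / page_size;

        /* Row offsets only grow, so pages below next_uncounted are done */
        if (first < next_uncounted)
            first = next_uncounted;

        if (last >= first)
        {
            count += last - first + 1;
            next_uncounted = last + 1;
        }
    }

    return count;
}
```

With a 4096 byte stride and 4096 byte pages, every row of a 64 byte
wide strip lands on its own page, so a strip that is N rows tall needs
N TLB entries. With a 1024 byte stride, four rows share a page.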

If you wanted to fix that, I suppose the optimal access pattern would
be to access in TLBSIZE x TLBSIZE tiles, and then in CACHELINE x
CACHELINE tiles within those. I.e., for the AMD I'm using now, which
has a TLB of 1024 pages, the ideal access pattern would be

        for (each 1024x1024 tile accessed in whatever order)
                for (each (64/pixel_size)x(64/pixel_size) tile)
                        ...
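
Spelled out in C, the loop nest above might look something like the
following. This is only a sketch under several assumptions: 32 bit
pixels, a clockwise 90 degree rotation, strides given in pixels, and
the made-up names rotate_90_tiled, TLB_TILE and CACHE_TILE. It is not
the code from the patch and it has not been benchmarked.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed tile sizes in pixels: 64 byte cache line / 4 byte pixel = 16 */
#define CACHE_TILE 16
#define TLB_TILE   1024

static size_t
min_sz (size_t a, size_t b)
{
    return a < b ? a : b;
}

/*
 * Rotate a w x h source image 90 degrees clockwise into an h x w
 * destination. Source pixel (x, y) ends up at destination
 * (h - 1 - y, x). Strides are in pixels, not bytes.
 */
static void
rotate_90_tiled (uint32_t       *dst,
                 size_t          dst_stride,
                 const uint32_t *src,
                 size_t          src_stride,
                 size_t          w,
                 size_t          h)
{
    size_t ty, tx, cy, cx, x, y;

    /* Outer level: TLB sized tiles, visited in any order */
    for (ty = 0; ty < h; ty += TLB_TILE)
    for (tx = 0; tx < w; tx += TLB_TILE)
    {
        size_t ty_end = min_sz (ty + TLB_TILE, h);
        size_t tx_end = min_sz (tx + TLB_TILE, w);

        /* Inner level: cache line sized tiles within the TLB tile */
        for (cy = ty; cy < ty_end; cy += CACHE_TILE)
        for (cx = tx; cx < tx_end; cx += CACHE_TILE)
        {
            size_t cy_end = min_sz (cy + CACHE_TILE, ty_end);
            size_t cx_end = min_sz (cx + CACHE_TILE, tx_end);

            for (y = cy; y < cy_end; y++)
            for (x = cx; x < cx_end; x++)
                dst[x * dst_stride + (h - 1 - y)] = src[y * src_stride + x];
        }
    }
}
```

The edge tiles are simply clamped to the image size, so the function
works for any w and h, at the cost of a few extra comparisons.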

But 1024 x 1024 is big enough that we rarely see such images. Perhaps
the TLB is smaller on Cortex A8?

Is the performance of the current patch actually worse than a
non-tiled access, or is it just not as good as it could be?

Soren