[Pixman] [PATCH 3/5] ARMv6: New blit routines

Wed Jan 23 05:22:19 PST 2013

On Tue, 22 Jan 2013 14:26:50 -0000
"Ben Avison" <bavison at riscosopen.org> wrote:

> On Tue, 22 Jan 2013 13:10:54 -0000, Siarhei Siamashka <siarhei.siamashka at gmail.com> wrote:
> > Just one thing looks a bit odd.
> >
> >> src_8888_8888
> >>
> >>     Before          After
> >>     Mean   StdDev   Mean   StdDev  Confidence  Change
> >> M   57.0   0.2      89.2   0.5     100.0%      +56.4%
> >
> > 89.2 MPix/s * 32bpp = ~357 MB/s
> >
> >> src_0565_0565
> >>
> >>     Before          After
> >>     Mean   StdDev   Mean   StdDev  Confidence  Change
> >> M   90.7   0.4      133.5  0.7     100.0%      +47.1%
> >
> > 133.5 MPix/s * 16bpp = ~267 MB/s
> >
> > Seems to be a much less efficient use of memory bandwidth here
> > compared to src_8888_8888?
> 
> I think what you're seeing here is the speed difference between the word-
> aligned and misaligned code paths, because the M test cycles over many
> starting X positions for the source buffer, but always uses X=1 for the
> destination buffer. For 32bpp, all pixel positions are word-aligned, but
> for 16bpp this will result in half the runs being misaligned.
> 
> I tried tweaking the M test to force alignment or misalignment, to get
> some comparative timings for src_0565_0565:
> 
> Aligned: 169.8 Mpix/s (rather closer to 2* the 32bpp result)
> Unaligned: 108.6 Mpix/s
> 
> I'm open to suggestions as to how to improve the misaligned case. Early
> in development, I compared the speed of doing LDM followed by in-
> register shuffling with either ORR or PKH instructions against using
> lots of unaligned LDRs, and the LDRs came out fastest by a small margin,
> which is why that's what's used in my patch.

I get the following results when comparing against an LDM-based memcpy,
which performs the reshuffling with ORR instructions:

memcpy (PIXMAN_DISABLE=arm-simd):
    src_0565_0565 =  L1: 425.14  L2: 216.23  M:169.38 HT: 34.77  VT: 27.29  R: 26.47  RT:  9.45
    src_8888_8888 =  L1: 490.12  L2: 139.03  M: 91.23 HT: 26.22  VT: 20.27  R: 20.41  RT:  8.44

pixman-armv6:
    src_0565_0565 =  L1: 404.15  L2: 157.19  M:136.94 HT: 54.44  VT: 48.20  R: 42.83  RT: 14.30
    src_8888_8888 =  L1: 352.31  L2: 144.24  M: 94.56 HT: 40.35  VT: 36.26  R: 34.27  RT: 13.34

The L1/L2 numbers are bogus, because ARM11 (and Cortex-A8 in default
configuration) does not allocate data into cache on write misses.
There is some code here to take read-allocate cache into account:
    http://cgit.freedesktop.org/pixman/tree/test/lowlevel-blt-bench.c?id=pixman-0.28.2#n178
But it uses 64-byte steps (larger than the 32-byte cache line of
ARM11), and it performs this cache "prefetch" only for the first
scanline.
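
A minimal sketch of what a more thorough warm-up could look like,
assuming 32-byte cache lines. touch_rect and CACHE_LINE_SIZE are
hypothetical names for illustration, not code from
lowlevel-blt-bench.c:

```c
#include <stddef.h>
#include <stdint.h>

/* Assumption: 32-byte cache lines, as on ARM11. */
#define CACHE_LINE_SIZE 32

/*
 * Hypothetical helper: read from every scanline of a rectangle in
 * cache-line-sized steps, so that a read-allocate cache has a chance
 * to hold the whole rectangle before the timed blit writes to it.
 * Returns the number of reads performed.
 */
static size_t
touch_rect (const uint8_t *bits,
            size_t         stride_bytes,
            size_t         width_bytes,
            size_t         height)
{
    volatile uint8_t sink = 0; /* volatile so the reads are not optimized away */
    size_t touched = 0;
    size_t x, y;

    for (y = 0; y < height; y++)
    {
        const uint8_t *row = bits + y * stride_bytes;

        for (x = 0; x < width_bytes; x += CACHE_LINE_SIZE)
        {
            sink ^= row[x];
            touched++;
        }

        /* The row start may not be line-aligned, so the last byte can
         * sit in one more cache line than the loop above reached. */
        if (width_bytes > 0)
        {
            sink ^= row[width_bytes - 1];
            touched++;
        }
    }

    (void) sink;
    return touched;
}
```

Whether the rectangle actually stays resident still depends on the
cache size and replacement policy, so this only removes the step-size
and first-scanline limitations.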

The numbers for HT/VT/R/RT are much better for the armv6 code because
these are testing blits with very small rectangles and per-scanline call
overhead shows up for memcpy.

Still, the numbers for the M test are interesting (169.38 vs. 136.94
for src_0565_0565) and demonstrate that using LDM with ORR arithmetic
is better for unaligned reads. Actually, the LDM instruction takes only
1 cycle to issue on ARM11 regardless of the size of the register list,
and can continue to run in the background simultaneously with
independent ALU instructions.
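
To illustrate the idea in C (a sketch only, not the memcpy code that
was benchmarked): the source is read with aligned word loads, and each
destination word is assembled from two neighbouring source words with
a shift and an OR. copy_0565_misaligned is a hypothetical name, and
the sketch assumes a little-endian system with the first source pixel
sitting in the upper half of the first aligned word:

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: copy n_words 32-bit words (two r5g6b5 pixels
 * each) to a word-aligned destination from a source that starts
 * 2 bytes past a word boundary.
 *
 * src_aligned points at the aligned word whose upper half holds the
 * first source pixel (little-endian assumption). n_words + 1 aligned
 * words are read in total; no unaligned loads are issued.
 */
static void
copy_0565_misaligned (uint32_t       *dst,
                      const uint32_t *src_aligned,
                      size_t          n_words)
{
    uint32_t prev = *src_aligned++;

    while (n_words--)
    {
        uint32_t next = *src_aligned++;

        /* Upper pixel of prev becomes the lower pixel of the output,
         * lower pixel of next becomes the upper pixel. */
        *dst++ = (prev >> 16) | (next << 16);
        prev = next;
    }
}
```

In assembly the loads would be batched with LDM, and each output word
costs a shift plus an ORR with a shifted operand, which can overlap
with the loads still in flight.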

Another reason to avoid unaligned reads is uncached framebuffers. Some
people are trying to use them directly (not that it's a good idea):
    http://comments.gmane.org/gmane.comp.lib.cairo/23331
And unaligned uncached reads are going to be particularly slow.

Please note that I'm not demanding an immediate fix. But it would be
nice to add this to the TODO list for future ARMv6 improvements :-)

-- 
Best regards,
Siarhei Siamashka