[Pixman] [PATCH 0/2] Better instruction scheduling of solid source operator OVER

Sun Aug 21 22:01:53 PDT 2011

From: Taekyun Kim <tkq.kim at samsung.com>

I am sending a fixed version of the previous patches.
The previous version can be seen here:

http://lists.freedesktop.org/archives/pixman/2011-August/001390.html

Some registers were clobbered, and I made mistakes when running make check.
The patches now pass make check.

I measured performance again with a new Cortex-A9 device running at 1.2 GHz.
Here are the numbers.

<< over_n_8_8888 >>
- cortex a8 -
before : L1: 201.35  L2: 190.48  M:101.94 ( 54.85%)  HT: 78.41  VT: 63.83  R: 58.25  RT: 21.74 ( 191Kops/s)
after  : L1: 257.65  L2: 255.49  M:102.04 ( 55.33%)  HT: 79.19  VT: 65.46  R: 59.23  RT: 21.12 ( 189Kops/s)

- cortex a9 -
before : L1: 157.35  L2: 159.81  M:133.00 ( 60.94%)  HT: 82.44  VT: 63.64  R: 51.66  RT: 19.15 ( 179Kops/s)
after  : L1: 216.83  L2: 219.40  M:135.83 ( 61.80%)  HT: 85.60  VT: 64.80  R: 52.23  RT: 19.16 ( 179Kops/s) 

<< over_n_8888 >>
- cortex a8 -
before : L1: 375.39  L2: 391.93  M:114.39 ( 40.99%)  HT: 99.37  VT: 98.20  R: 90.24  RT: 32.87 ( 240Kops/s)
after  : L1: 481.90  L2: 483.46  M:114.29 ( 40.69%)  HT:106.91  VT: 93.38  R: 90.74  RT: 29.51 ( 236Kops/s)

- cortex a9 -
before : L1: 324.50  L2: 332.79  M:155.55 ( 47.51%)  HT:111.93  VT: 93.58  R: 71.92  RT: 28.21 ( 233Kops/s)
after  : L1: 355.87  L2: 364.49  M:156.90 ( 47.59%)  HT:111.52  VT: 91.76  R: 72.16  RT: 28.22 ( 234Kops/s)
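
For reference, the relative change in the L1 and L2 columns above can be
worked out with a few lines of Python. This is only an illustrative sketch:
the speedup_percent and summarize helpers and the dictionary layout are
hypothetical names, not part of pixman or lowlevel-blt-bench. The figures are
copied from the tables above.

```python
# Benchmark figures copied from the tables above, as (before, after) pairs.
# The dictionary layout and helper names are illustrative assumptions only.
RESULTS = {
    ("over_n_8_8888", "cortex-a8"): {"L1": (201.35, 257.65), "L2": (190.48, 255.49)},
    ("over_n_8_8888", "cortex-a9"): {"L1": (157.35, 216.83), "L2": (159.81, 219.40)},
    ("over_n_8888", "cortex-a8"): {"L1": (375.39, 481.90), "L2": (391.93, 483.46)},
    ("over_n_8888", "cortex-a9"): {"L1": (324.50, 355.87), "L2": (332.79, 364.49)},
}


def speedup_percent(before, after):
    """Relative change of `after` over `before`, in percent."""
    return (after - before) / before * 100.0


def summarize(results):
    """Flatten the results into (operation, cpu, column, percent) rows."""
    rows = []
    for (op, cpu), cols in results.items():
        for col, (before, after) in cols.items():
            rows.append((op, cpu, col, speedup_percent(before, after)))
    return rows


if __name__ == "__main__":
    for op, cpu, col, pct in summarize(RESULTS):
        print(f"{op:14s} {cpu:10s} {col}: {pct:+.1f}%")
```

It prints one line per operation, CPU, and cache level.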

Here are some replies to Siarhei's comments.

> Thanks, that's good. By the way, do you have some other a bit less
> synthetic benchmarks for geometry filling workload, which could show
> some measurable performance improvement?

I don't have a micro-benchmark for geometry filling. Instead, I usually run
some HTML5 canvas benchmarks which stroke and fill various kinds of paths.
The link below is one of the benchmarks that I use.
http://flashcanvas.net/examples/dl.dropbox.com/u/1865210/mindcat/canvas_perf.html

> You can also try to run benchmarks using with different prefetch types
> in 'pixman-arm-neon-asm.S' by changing:
>     .set PREFETCH_TYPE_DEFAULT, PREFETCH_TYPE_ADVANCED
> to either of
>     .set PREFETCH_TYPE_DEFAULT, PREFETCH_TYPE_NONE
>     .set PREFETCH_TYPE_DEFAULT, PREFETCH_TYPE_SIMPLE
>
> Cortex-A9 has automatic hardware prefetcher. And if it is enabled
> (there are some hardware bugs in early Cortex-A9 revisions, so it may
> be disabled for a good reason), then PREFETCH_TYPE_NONE may be a good
> choice.

When I simply removed the prefetching code, memory performance did not
change, but L2 performance slowed down. I will also check the NONE and
SIMPLE prefetch types for the other fast paths.

> In any case, now things are a little bit more interesting because with
> the latest ARM hardware we got a downgraded NEON unit in Cortex-A9
> (or speaking positively, should we say "balanced"?) and somewhat
> better memory bandwidth.

Thanks for all your information about the behaviour of ARM and NEON.
It was very helpful to me. Making things well optimized on both
Cortex-A8 and Cortex-A9 is definitely necessary, and I will contribute in
that direction.

--
Best Regards,
Taekyun Kim

Taekyun Kim (2):
  ARM: NEON better instruction scheduling of over_n_8_8888
  ARM: NEON better instruction scheduling of over_n_8888

 pixman/pixman-arm-neon-asm.S |  138 +++++++++++++++++++++++++++++++++---------
 test/lowlevel-blt-bench.c    |    9 ++-
 2 files changed, 113 insertions(+), 34 deletions(-)