[cairo] [PATCH/RFC] pixman ARM NEON optimizations (now about performance)

Siarhei Siamashka siarhei.siamashka at gmail.com
Mon Jul 27 19:30:26 PDT 2009


On Monday 27 July 2009, Siarhei Siamashka wrote:
[...]
> An interesting thing is that we have a heavily underused ARM
> pipeline while NEON unit is doing all the work. It is possible to implement
> quite a complex prefetch logic using ARM instructions, and it would have
> zero impact on the performance. I have some interesting stuff already and
> will post about the "quest for the ideal prefetcher" in the next e-mail.

The fine-grained prefetcher is now pushed to
http://cgit.freedesktop.org/~siamashka/pixman/log/?h=arm-neon-preview

The following code fragment implements a prefetcher that runs
ahead of the main code, gradually increases the prefetch distance up
to a certain limit (right now it is 10 steps by 8 pixels = 320 bytes),
and stops prefetching data right at the end of the image. Prefetching
beyond the right boundary of a line is also done, to ensure that all
data is prefetched. Upon going to the next line, an LDR instruction is
used because we may get a TLB miss there when the stride is large enough
(and a normal prefetch attempted with PLD is simply ignored on a TLB miss).

New variables:
  pf_ctl - high bits: limit for the number of lines to advance;
           lowest 4 bits: number of prefetch distance increase steps
  pf_x   - current pixel position in a line
  pf_src - pointer to the start of the source line
  pf_dst - pointer to the start of the destination line
  dummy  - target register for the LDR loads (the value is not used)

/* increment position */
"add %[pf_x], %[pf_x], #8\n"
"tst %[pf_ctl], #0xF\n"
"addne %[pf_x], %[pf_x], #4\n"
"subne %[pf_ctl], %[pf_ctl], #1\n"

/* prefetch */
"pld [%[pf_src], %[pf_x], lsl #2]\n"
"pld [%[pf_dst], %[pf_x], lsl #1]\n"

/* move to the next line when needed */
"cmp %[pf_x], %[orig_w]\n"
"subge %[pf_x], %[pf_x], %[orig_w]\n"
"subges %[pf_ctl], %[pf_ctl], #0x10\n"
"ldrgeb %[dummy], [%[pf_src], %[src_stride], lsl #2]!\n"
"ldrgeb %[dummy], [%[pf_dst], %[dst_stride], lsl #1]!\n"

This sequence can be mixed into the flow of NEON instructions in the main
loop. Ideally its overhead should be zero as long as NEON instructions
dominate.
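
To make the bookkeeping in the fragment above easier to follow, here is a
minimal plain-C model of it. This is only a sketch, not the pixman code: the
struct, the pf_line counter and the helper pf_make_ctl() are names invented
for illustration, the PLD/LDR memory accesses are reduced to comments, and
pf_ctl is assumed to be treated as a signed 32-bit value (which is what the
"ge" condition after "subs" implies).

```c
#include <stdint.h>

/* Hypothetical model state; only pf_ctl and pf_x exist in the asm fragment. */
struct pf_state {
    int32_t pf_ctl;   /* bits 4 and up: lines left to advance, low 4 bits: ramp-up steps left */
    int32_t pf_x;     /* current prefetch position in the line, in pixels */
    int32_t pf_line;  /* model only: how many times the line pointers were advanced */
};

/* Hypothetical helper: pack the two fields of pf_ctl. */
static int32_t pf_make_ctl(int32_t lines, int32_t steps)
{
    return (lines << 4) | (steps & 0xF);
}

/* One execution of the asm fragment. Returns 1 if it moved to the next line. */
static int pf_step(struct pf_state *s, int32_t orig_w)
{
    s->pf_x += 8;                 /* add: keep pace with the main loop */
    if (s->pf_ctl & 0xF) {        /* tst / addne / subne: still ramping up */
        s->pf_x += 4;             /* run 4 more pixels ahead on this step */
        s->pf_ctl -= 1;
    }

    /* pld [pf_src + pf_x * 4] and pld [pf_dst + pf_x * 2] would go here */

    if (s->pf_x >= orig_w) {      /* cmp / subge: ran past the end of the line */
        s->pf_x -= orig_w;
        s->pf_ctl -= 0x10;        /* subges: one line less to advance */
        if (s->pf_ctl >= 0) {     /* ldrgeb ...!: touch and advance both line pointers */
            s->pf_line += 1;
            return 1;
        }
    }
    return 0;
}
```

Calling pf_step() once per main loop iteration with pf_ctl = pf_make_ctl(1, 0)
and orig_w = 16 advances to the next line on the second call and never again
after that, which mirrors how the real fragment stops advancing the pointers
once the line limit is used up.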

Benchmark results are the following (today they seem to have drifted
a bit from yesterday's values for some reason, but are still
reproducible across multiple runs).

1. NEON code without any kind of prefetch:
over_8888_0565 - L1: 128.20 M: 30.27 HT: 19.49 VT: 19.23 R: 16.06
over_8888_0565 = L1:  96.25 M: 30.48 HT: 18.30 VT: 18.10 R: 15.26
over_8888_0888 - L1: 114.16 M: 23.01 HT: 16.20 VT: 15.94 R: 13.99
over_8888_0888 = L1:  86.90 M: 22.99 HT: 15.35 VT: 15.10 R: 13.35

2. Simple "128 bytes ahead" prefetch:
over_8888_0565 - L1: 128.24 M: 53.52 HT: 27.41 VT: 20.93 R: 16.78
over_8888_0565 = L1:  97.26 M: 53.46 HT: 25.21 VT: 19.59 R: 15.92

3. NEON code with fine-grained prefetcher:
over_8888_0565 - L1: 123.35 M: 56.03 HT: 30.22 VT: 25.10 R: 21.71
over_8888_0565 = L1:  93.03 M: 55.97 HT: 27.55 VT: 23.29 R: 20.30

Performance of the main loop for the perfectly cached case has
regressed a bit (by something like 2 cycles). This can probably
be improved with a different instruction order (I could get
even worse results with other orderings). The search for the optimal
instruction order can be automated by inserting the instructions randomly
into the code and applying some kind of genetic selection. I will try
it later. An interesting thing is that even putting the prefetcher
code into the 'process_pixblock_head' macro, without intermixing ARM
and NEON instructions, does not have any performance penalty at
all (probably just because that code is less efficient; the pipelined
variant of this NEON code is about 30% faster).
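
The automated search mentioned above could be sketched roughly as follows.
This is a simplified stand-in (random mutation with keep-the-best selection
rather than a full genetic algorithm), and everything in it is an assumption
made for illustration: a schedule is modelled as the NEON slot after which
each of the 11 ARM prefetch instructions is inserted, relative order of the
ARM instructions is preserved, and toy_cost() is a placeholder for actually
assembling and benchmarking the generated loop.

```c
#include <stdlib.h>
#include <string.h>

#define N_ARM 11  /* ARM instructions in the prefetch fragment */

/* Cost callback; a real one would build and time the loop (hypothetical). */
typedef double (*cost_fn)(const int pos[N_ARM], void *ctx);

/* Placeholder cost: counts ARM instructions bunched into the same slot. */
static double toy_cost(const int pos[N_ARM], void *ctx)
{
    (void)ctx;
    double cost = 0.0;
    for (int i = 1; i < N_ARM; i++)
        if (pos[i] == pos[i - 1])
            cost += 1.0;
    return cost;
}

/* Move one ARM instruction to a random slot between its neighbours,
 * so the original order of the ARM instructions is never violated. */
static void mutate(int pos[N_ARM], int n_neon)
{
    int i = rand() % N_ARM;
    int lo = (i == 0) ? 0 : pos[i - 1];
    int hi = (i == N_ARM - 1) ? n_neon : pos[i + 1];
    pos[i] = lo + rand() % (hi - lo + 1);
}

/* Keep-the-best random search. 'best' must start as a valid schedule:
 * non-decreasing slots in the range [0, n_neon]. Returns the best cost. */
static double search(int best[N_ARM], int n_neon, int iters,
                     cost_fn cost, void *ctx)
{
    double best_cost = cost(best, ctx);
    for (int it = 0; it < iters; it++) {
        int cand[N_ARM];
        memcpy(cand, best, sizeof cand);
        mutate(cand, n_neon);
        double c = cost(cand, ctx);
        if (c < best_cost) {      /* accept only strict improvements */
            best_cost = c;
            memcpy(best, cand, sizeof cand);
        }
    }
    return best_cost;
}
```

A driver would start from some valid placement (for example all ARM
instructions in slot 0), call search() with the number of NEON instructions
in the loop body, and emit the loop with the resulting interleaving. Since
benchmark timings are noisy, a real cost function would likely need several
runs per candidate.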

Test results involving memory accesses got quite a noticeable boost
from the fine-grained prefetcher.
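
For a quick look at the size of that boost, the relative change against the
no-prefetch baseline can be computed directly from the tables above. The
small helper below is only a convenience sketch; the arrays copy the
"over_8888_0565 -" rows from the three result sets, and the function and
array names are made up for this example.

```c
#include <stdio.h>

enum { N_COLS = 5 };

static const char *const col_name[N_COLS] = { "L1", "M", "HT", "VT", "R" };

/* "over_8888_0565 -" rows copied from the benchmark results above. */
static const double no_prefetch[N_COLS]  = { 128.20, 30.27, 19.49, 19.23, 16.06 };
static const double simple_pld[N_COLS]   = { 128.24, 53.52, 27.41, 20.93, 16.78 };
static const double fine_grained[N_COLS] = { 123.35, 56.03, 30.22, 25.10, 21.71 };

/* Relative change of 'after' over 'before', in percent. */
static double gain_percent(double before, double after)
{
    return (after / before - 1.0) * 100.0;
}

/* Print one line of per-column gains against the no-prefetch baseline. */
static void print_gains(const char *label, const double after[N_COLS])
{
    printf("%s:", label);
    for (int i = 0; i < N_COLS; i++)
        printf(" %s %+.1f%%", col_name[i],
               gain_percent(no_prefetch[i], after[i]));
    printf("\n");
}
```

Calling print_gains("simple", simple_pld) and
print_gains("fine-grained", fine_grained) shows that both variants help the
M column by a large margin, while the difference between them shows up mostly
in the VT and R columns, at the cost of a few percent in the L1 column for
the fine-grained one.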

But the prefetcher also has its price. It requires 5 additional registers,
increasing the total number to 13. This is not very nice, because
14 registers is a hard limit, and using frame pointers or PIC may reduce
it even more. It looks like either separate .S files or "naked" functions
may be needed to get full control over register allocation.

-- 
Best regards,
Siarhei Siamashka