[Pixman] [PATCH 1/2] ARM: Tiny improvement in over_n_8888_8888_ca_process_pixblock_head

Sun Apr 10 08:47:19 PDT 2011

On Thu, Apr 7, 2011 at 6:10 AM, Soeren Sandmann <sandmann at cs.au.dk> wrote:
> Siarhei Siamashka <siarhei.siamashka at gmail.com> writes:
>
>> On a second look, other than that, the whole function is actually fine
>> on Cortex-A8 and does not need any other performance tweaks because it
>> did not have many "bubbles" to fill in the first place. You can push
>> this patch with the additional "vmvn.8 d26, d26" instruction move (I
>> don't think a separate patch is needed for that).
>
> Alright, pushed to master.

Good, thanks.

> I'll follow up on Taekyun Kim's and your
> comments on the other patch later this week.

Well, as you wish. These standard non-scaled fast paths are
self-contained and can be added or improved any time, and the
underlying framework is already stable for such fast paths which means
no major disruptions. Your current implementation for
over_n_8888_0565_ca is already definitely helping a lot as proven by
cairo trace benchmark, a possible additional speedup might be even
hardly measurable. Providing good instructions scheduling from the
very start would be nice just to be done with it and have no need to
revisit this code later, but I would definitely like to see NEON
variant of over_n_8888_0565_ca added in pixman 0.22.0 either way.

If you still want to experiment with NEON instructions scheduling, I
recommend using something like the attached test program. It currently
contains the following test code:

    /* <<<<<<<< START OF THE NEON CODE TO BE BENCHMARKED >>>>>>> */

   vadd.s32 q0, q0, q0
   /* 1 cycle stall in NEON pipeline. The result almost never can be used
    * by the next instruction without penalty even for the basic arithmetic
    * instructions
    */
   vadd.s32 q0, q0, q0
   /* Note: 2 cycle stall here! Because VSUB instruction is a bit special
    * and needs the 3rd operand to be available earlier than the others, see:
    * http://infocenter.arm.com/help/topic/com.arm.doc.ddi0344k/ch16s06s02.html
    */
   vsub.s32 q1, q1, q0
   vzip.8   q3, q4 /* An interesting case: this VZIP instruction needs 3 cycles
                    * itself, but because it is permute instruction, it can
                    * dual-issue with other arithmetic instructions.
                    * And dual-issue is possible on both its first and last
                    * cycles, effectively overlapping with preceeding VSUB and
                    * succeeding VADD, causing all three of them to only
                    * require 3 cycles in NEON pipeline. ARM Cortex-A9 can't
                    * dual-issue NEON instructions and will need 5 cycles here.
                    */
   vadd.s32 q1, q1, q1

   /*
    * If we sum everything up, we should get 8 cycles on ARM Cortex-A8 and
    * 10 cycles on ARM Cortex-A9 for this code sequence.
    */

    /* <<<<<<<<< END OF THE NEON CODE TO BE BENCHMARKED >>>>>>>> */

And as a result, on 1GHz Cortex-A8 we get:

$ gcc gcc neon-bench-template.S
$ time ./a.out

real    0m8.026s
user    0m8.023s
sys     0m0.000s

This template is mostly ready for using it with the code from pixman
standard fast paths (taking the instructions from '*_tail_head' macro
for some fast path). The only additional thing which needs to be done
when taking code for benchmarking is to expand 'fetch_src_pixblock'
and 'fetch_mask_pixblock' macros to the right instruction. But this
should be easy.

And the pipelining generally works in the following way here. First we
split all the NEON register to 2 sets (let's say A and B). Then the
fast path can be implemented as 3 parts:
1. load input data and do some calculations using the registers from set A
2. calculations using the registers from set B as destination, and A as source
3. do some calculations using the registers from set B and store the result

The first two parts of code go to 'head' macro, and the last part goes
to 'tail'. After this is done, the instructions from 'tail' can be
freely mixed with the instructions from 'head' of the next iteration.
And this allows to hide a lot of pipeline stalls and also dual issue
load/store instructions with arithmetic instructions.

-- 
Best regards,
Siarhei Siamashka
-------------- next part --------------
A non-text attachment was scrubbed...
Name: neon-bench-template.S
Type: application/octet-stream
Size: 4633 bytes
Desc: not available
URL: <http://lists.freedesktop.org/archives/pixman/attachments/20110410/d6062de3/attachment.obj>