[Pixman] [PATCH 2/2] ARM: Add 'neon_composite_over_n_8888_0565' fast path

Wed Apr 6 11:34:36 PDT 2011

On Tue, Apr 5, 2011 at 7:46 AM, Taekyun Kim <podain77 at gmail.com> wrote:
> When we reorder some assembly lines, proper comments should be inserted for
> the people who tries to understand the code.
> It also make it easy later refactoring or something.

I could not come up with any better solution other than just using
different indentation for logically independent code blocks. Something
like converting:

/* do A */
A1
A2
/* 1 cycle stall */
A3
/* 2 cycles stall */
A4
/* done with A */
/* do B */
B1
B2
B3
/* done with B */

into

/* do A */
A1
A2
    /* do B */
    B1
A3
    B2
    B3
    /* done with B */
A4
/* done with A */

Of course, people reading the source code need to know about this
"convention". And it has its own disadvantages too. If anyone can
propose something more maintainable and easier to read, I'm all ears.

Maybe switching to a native code generator that compiles fast path
code at runtime could make this easier, provided we do a good job of
teaching it the instruction scheduling rules.

> Although I still have some questions about effectiveness of reordering(SW
> pipelining) on out-of-order core,

We don't have a Cortex-A15 in our hands yet, so we don't know for sure
:) Cortex-A9 has almost the same NEON unit as Cortex-A8, but does not
support even limited dual issue for NEON instructions anymore. There
are also other minor differences, mostly related to load/store
instructions, but they very rarely show up.

> it significantly affects the performance on mostly used in-order superscalar
> CPUs.

Yes, and the results are quite measurable (tens of percent in many cases).

> And pixman_composite_over_n_8888_0565_ca_process_pixblock_head in tail_head
> block increases code size causing i-cache miss.
> We can think of jumping to head and then return to next part of tail_head
> block.
> But it seems difficult to do that without breaking
> generate_composite_function macro template.

The whole point of using software pipelining here is being able to
overlap the last part of the previous iteration with the beginning of
the current iteration. Jumping to "head" does not make much sense
because the instructions from "head" and "tail" can be reordered quite
wildly, diffusing into each other, so there is no clear border anymore.

And yes, unfortunately pipelining doubles the size of the code, same
as unrolling. It may be possible to add a way to disable pipelining
for some of the fast paths via some new special flag, so that the code
size can be reduced for the fast paths where pipelining provides too
little or no performance gain.
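
A minimal C sketch of the structure described above, not pixman's
actual code: head() and tail() are hypothetical stand-ins for the two
halves of a pixel block operation, and the arithmetic inside them is
made up for illustration. It shows why the "head" and "tail" code each
end up present twice once the loop is pipelined.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical first half of the work: load and start computing. */
static uint32_t head(const uint8_t *src, size_t i)
{
    return (uint32_t)src[i] * 3u;
}

/* Hypothetical second half: finish the computation and store. */
static void tail(uint8_t *dst, size_t i, uint32_t t)
{
    dst[i] = (uint8_t)((t + 1u) >> 2);
}

/* Plain loop: head and tail of the same element run back to back. */
void process_plain(uint8_t *dst, const uint8_t *src, size_t n)
{
    for (size_t i = 0; i < n; i++)
        tail(dst, i, head(src, i));
}

/* Software-pipelined loop: the loop body ("tail_head") finishes
 * element i-1 while already starting element i. */
void process_pipelined(uint8_t *dst, const uint8_t *src, size_t n)
{
    assert(dst != NULL && src != NULL);
    if (n == 0)
        return;

    uint32_t t = head(src, 0);           /* prologue: head only */
    for (size_t i = 1; i < n; i++) {
        uint32_t next = head(src, i);    /* head of iteration i */
        tail(dst, i - 1, t);             /* tail of iteration i-1 */
        t = next;
    }
    tail(dst, n - 1, t);                 /* epilogue: tail only */
}
```

Both functions produce the same output. The pipelined one carries a
copy of head() in the prologue and in the loop body, and a copy of
tail() in the loop body and in the epilogue, which is where the extra
code size comes from when these are inlined blocks of instructions.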

> One last thing is about practical benefits of this fast paths.
> The most important customer of these fast path functions is glyph rendering.
> Cairo's image backend show_glyph() functions composite cached glyph masks
> with solid color pattern using operator OVER in most cases.
> When there're overlaps between glyph boxes (by kerning or something), cairo
> create a mask equal size with entire text extent,
> and then accumulate component alpha with operator ADD and then composite
> using entire mask.
> So for non-overlapped cases, small sized OVER composite will frequently
> happen and small sized ADD composite for overlapped cases.
> We need to focus at the performance of small sized image composition with
> both operator OVER and ADD.

There were some ideas about improving text rendering in general and
extending pixman API to handle this task more effectively.

> The overhead for small sized images can be approximately identified by
> comparing rendering times of drawing total n pixels in m times (n/m pixels
> per one drawing) where m increases start from 1.
> This can contain function calling overhead, cache overhead, code control
> overhead, etc.

I think the following might be interesting to you. I also have this
experimental branch:
http://cgit.freedesktop.org/~siamashka/pixman/log/?h=playground/slow-path-reporter

It collects statistics about what operations do not have optimized
fast paths, along with the number of uses of these operations, total
number of pixels processed, average number of pixels per operation and
average scanline length. The code is currently Linux-specific and
writes results to syslog. These results can be converted into a more
human-readable form by a script. I'm using it quite successfully and
it revealed some of the missing optimizations which would be hard to
identify in some other way.
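
A rough C sketch of the kind of per-operation bookkeeping described
above. This is not the code from that branch: the op_stats_t type, the
function names, and the choice to define average scanline length as
total pixels divided by total scanlines are all assumptions made for
illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical counters kept for one unoptimized operation. */
typedef struct {
    uint64_t uses;       /* number of times the operation was called */
    uint64_t pixels;     /* total number of pixels processed */
    uint64_t scanlines;  /* total number of scanlines processed */
} op_stats_t;

/* Record one call that composited a width x height rectangle. */
void op_stats_record(op_stats_t *s, int width, int height)
{
    assert(s != 0 && width > 0 && height > 0);
    s->uses += 1;
    s->pixels += (uint64_t)width * (uint64_t)height;
    s->scanlines += (uint64_t)height;
}

/* Average number of pixels per call (0 if never called). */
double op_stats_avg_pixels(const op_stats_t *s)
{
    return s->uses ? (double)s->pixels / (double)s->uses : 0.0;
}

/* Average scanline length in pixels (0 if never called). */
double op_stats_avg_scanline(const op_stats_t *s)
{
    return s->scanlines ? (double)s->pixels / (double)s->scanlines : 0.0;
}
```

A low average scanline length combined with a high use count would
point at the small-image case discussed earlier, where per-call
overhead matters more than raw per-pixel throughput.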

-- 
Best regards,
Siarhei Siamashka