[Pixman] [PATCH 2/2] ARM: Add 'neon_composite_over_n_8888_0565' fast path

Mon Apr 4 21:46:23 PDT 2011

Hi,

I'm also fairly new to NEON, but I hope this brief review is helpful to
you.
So let's start.

2011/4/5 Søren Sandmann Pedersen <sandmann at cs.au.dk>

> From: Søren Sandmann Pedersen <ssp at redhat.com>
>
> +    /*
> +     * 'combine_over_ca' replacement
> +     *
> +     * output: updated dest in d16 - blue, d17 - green, d18 - red
> +     */
> +    vmvn.8      q12, q12
> +    vmull.u8    q6,  d16, d24
> +    vmull.u8    q7,  d17, d25
> +    vmvn.8      d26, d26
> +    vmull.u8    q11, d18, d26
> +.endm
> +
>

There are some pipeline stalls here: the destination registers of vmvn are
not available until N3, so the vmull instructions that immediately consume
them have to wait.
I think we can reorder the above code like this:

vmvn.8      q12, q12
vshrn.u16   d16, q2,  #2 // last line of previous block
vmvn.8      d26, d26
vmull.u8    q6,  d16, d24
vmull.u8    q7,  d17, d25
vmull.u8    q11, d18, d26

> +    vshll.u8    q14, d18, #8
> +    vshll.u8    q10, d17, #8
> +    vshll.u8    q15, d16, #8
> +    vsri.u16    q14, q10, #5
> +    vsri.u16    q14, q15, #11
>

There seem to be somewhat more stalls here.
See this page.
http://infocenter.arm.com/help/topic/com.arm.doc.ddi0344k/ch16s06s04.html
It seems impossible to eliminate all the stalls by reordering within this
code block alone.
Reordering it together with some of the preceding lines might work:

vqadd.u8  d18, d2,  d18
vshll.u8    q10, d17, #8
vshll.u8    q14, d18, #8
vqadd.u8  q8,  q0,  q8
vsri.u16    q14, q10, #5
vshll.u8    q15, d16, #8
vsri.u16    q14, q15, #11

This reduces the stalls, but some still remain.
There may be a better way.
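
For anyone following along who has not met the vshll/vsri idiom quoted
above, here is a scalar C sketch of what I understand those five
instructions to compute for a single pixel. The names vsri_u16 and
pack_0565 are made up for illustration and are not part of pixman. It
assumes the register assignment from the patch comment (d16 blue, d17
green, d18 red).

```c
#include <stdint.h>

/* Scalar model of VSRI.U16 on one lane (hypothetical helper):
 * keep the top `shift` bits of dst, fill the rest with src >> shift. */
static uint16_t vsri_u16(uint16_t dst, uint16_t src, unsigned shift)
{
    uint16_t keep = (uint16_t)~(0xFFFFu >> shift);
    return (uint16_t)((dst & keep) | (src >> shift));
}

/* Scalar model of the vshll/vsri sequence for one pixel. */
static uint16_t pack_0565(uint8_t r, uint8_t g, uint8_t b)
{
    uint16_t r16 = (uint16_t)(r << 8);     /* vshll.u8 q14, d18, #8  */
    uint16_t g16 = (uint16_t)(g << 8);     /* vshll.u8 q10, d17, #8  */
    uint16_t b16 = (uint16_t)(b << 8);     /* vshll.u8 q15, d16, #8  */
    uint16_t out = vsri_u16(r16, g16, 5);  /* vsri.u16 q14, q10, #5  */
    return vsri_u16(out, b16, 11);         /* vsri.u16 q14, q15, #11 */
}
```

The point for scheduling is the dependency chain: both vsri instructions
write q14, so the second one cannot start before the first one finishes,
and each needs its vshll input to be ready.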

When we reorder assembly lines, proper comments should be added for
people who try to understand the code.
They also make later refactoring easier.
<http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0204ik/CIHDIGCC.html>
Although I still have some questions about the effectiveness of reordering
(SW pipelining) on out-of-order cores,
it significantly affects performance on the widely used in-order
superscalar CPUs.

Also, expanding pixman_composite_over_n_8888_0565_ca_process_pixblock_head
inside the tail_head block increases code size, which can cause i-cache
misses.
We could consider jumping to head and then returning to the next part of
the tail_head block,
but that seems difficult to do without breaking the
generate_composite_function macro template.

One last thing is about the practical benefit of these fast paths.
The most important customer of these fast path functions is glyph rendering.
The show_glyph() functions in cairo's image backend composite cached glyph
masks with a solid color pattern using operator OVER in most cases.
When glyph boxes overlap (because of kerning, for example), cairo
creates a mask the same size as the entire text extent,
accumulates the component alpha into it with operator ADD, and then
composites using the entire mask.
So small OVER composites happen frequently in the non-overlapping case,
and small ADD composites in the overlapping case.
We need to focus on the performance of small image composition with
both operators, OVER and ADD.
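
To make the two operators concrete, here is a minimal per-channel sketch
in C. It assumes a premultiplied solid source and the usual formulas
(OVER with component alpha: dest = src*mask + dest*(1 - src_alpha*mask);
ADD: saturating sum). The helper names are hypothetical, and the rounding
in mul_un8 is my assumption about the exact arithmetic, so please treat it
as an illustration rather than as pixman's code.

```c
#include <stdint.h>

/* Rounded x * y / 255 for 8-bit values (assumed rounding scheme). */
static uint8_t mul_un8(uint8_t x, uint8_t y)
{
    uint32_t t = (uint32_t)x * y + 0x80;
    return (uint8_t)((t + (t >> 8)) >> 8);
}

/* Saturating 8-bit add, the scalar equivalent of vqadd.u8. */
static uint8_t add_un8(uint8_t x, uint8_t y)
{
    uint32_t t = (uint32_t)x + y;
    return (uint8_t)(t > 255 ? 255 : t);
}

/* OVER with component alpha, for one color channel. */
static uint8_t over_ca_channel(uint8_t src, uint8_t src_alpha,
                               uint8_t mask, uint8_t dest)
{
    uint8_t s = mul_un8(src, mask);        /* src * mask             */
    uint8_t a = mul_un8(src_alpha, mask);  /* src_alpha * mask       */
    /* dest * (255 - a) is the vmvn + vmull step in the quoted patch */
    return add_un8(s, mul_un8(dest, (uint8_t)(255 - a)));
}

/* ADD, used when accumulating overlapping glyphs into one mask. */
static uint8_t add_channel(uint8_t src, uint8_t dest)
{
    return add_un8(src, dest);
}
```

With mask = 0 the destination is left unchanged, and with mask = 255 and
an opaque source the destination is replaced by the source, which is what
one expects for the outside and inside of a glyph.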

The overhead for small images can be estimated approximately by
comparing the rendering times for drawing a total of n pixels in m calls
(n/m pixels per call) as m increases from 1.
This includes function call overhead, cache overhead, control-flow
overhead, etc.
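
A rough sketch of such a measurement in C could look like the following.
composite_stub is a hypothetical stand-in for the real composite call, and
draw_in_chunks and time_chunks are names made up here; m must be non-zero,
and clock() is only a coarse timer, so many repeats are needed.

```c
#include <stddef.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical stand-in for the operation under test. */
static void composite_stub(uint8_t *dst, size_t count)
{
    for (size_t i = 0; i < count; i++)
        dst[i] = (uint8_t)(dst[i] + 1);
}

/* Process n pixels in m calls of n/m pixels each (m must be > 0).
 * Returns the number of pixels actually processed. */
static size_t draw_in_chunks(uint8_t *dst, size_t n, size_t m)
{
    size_t chunk = n / m;
    for (size_t i = 0; i < m; i++)
        composite_stub(dst + i * chunk, chunk);
    return chunk * m;
}

/* CPU time in seconds for `repeats` runs of the same n/m split. */
static double time_chunks(uint8_t *dst, size_t n, size_t m, int repeats)
{
    clock_t start = clock();
    for (int r = 0; r < repeats; r++)
        draw_in_chunks(dst, n, m);
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}
```

Comparing time_chunks for m = 1, 2, 4, ... with the same n gives an
estimate of the per-call overhead, since the total pixel count stays
(nearly) constant while the number of calls grows.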

-- 
Best Regards,
Taekyun Kim