[Pixman] [PATCH 2/2] ARM: Add 'neon_composite_over_n_8888_0565' fast path

Tue Apr 12 18:39:51 PDT 2011

Hi,

> I'm also a bit new to NEON, however I hope this brief review can be
> helpful to you.  So let's start.

Thanks for the comments. I'm including a new version shortly that
incorporates some of the changes, but unfortunately, I couldn't reliably
measure much of an improvement. On my system, the L1 value from
lowlevel-blt-bench seems to fluctuate between 85 and 95 pretty much no
matter what I do.

>     +    /*
>     +     * 'combine_over_ca' replacement
>     +     *
>     +     * output: updated dest in d16 - blue, d17 - green, d18 - red
>     +     */
>     +    vmvn.8      q12, q12
>     +    vmull.u8    q6,  d16, d24
>     +    vmull.u8    q7,  d17, d25
>     +    vmvn.8      d26, d26
>     +    vmull.u8    q11, d18, d26
>     +.endm
>  
> There are some pipeline stalls, destination registers of vmvn will be available
> at N3.
> I think we can reorder above code like this.
>  
> vmvn.8      q12, q12
> vshrn.u16   d16, q2,  #2 // last line of previous block
> vmvn.8      d26, d26
> vmull.u8    q6,  d16, d24
> vmull.u8    q7,  d17, d25
> vmull.u8    q11, d18, d26

The new code incorporates this change, and I *think* the L1 value was
somewhat higher on most runs, but to be honest this might just be
confirmation bias on my part ...

>     +    vshll.u8    q14, d18, #8
>     +    vshll.u8    q10, d17, #8
>     +    vshll.u8    q15, d16, #8
>     +    vsri.u16    q14, q10, #5
>     +    vsri.u16    q14, q15, #11
>
>  
> It seems there are some what more stalls here.
> See this page. http://infocenter.arm.com/help/topic/com.arm.doc.ddi0344k/
> ch16s06s04.html
> It seems impossible to eliminate all stalls by reordering only in this code
> block.
> Maybe reordering together with some above code lines would work for this.
>
> vqadd.u8  d18, d2,  d18
> vshll.u8    q10, d17, #8
> vshll.u8    q14, d18, #8
> vqadd.u8  q8,  q0,  q8
> vsri.u16    q14, q10, #5
> vshll.u8    q15, d16, #8
> vsri.u16    q14, q15, #11

That particular version won't work because the first vshll instruction
reads d17 before the vqadd.u8 q8, q0, q8 instruction has updated it
(q8 is the d16/d17 pair). However, it is possible to write it like this:

    vshll.u8    q14, d18, #8
    vqadd.u8    q8,  q0,  q8
    vshll.u8    q10, d17, #8
    vsri.u16    q14, q10, #5
    vshll.u8    q15, d16, #8
    vsri.u16    q14, q15, #11

but then various other stalls appear. In my testing, this seemed to be a
slight slowdown (with the same caveat as above).
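
For reference while reordering, here is a rough scalar C model of what
the vshll/vsri sequence computes for one 16-bit lane, assuming d18 is
red, d17 is green and d16 is blue as in the comment quoted earlier. The
name pack_0565_lane is made up for illustration; it is not a pixman
function.

```c
#include <stdint.h>

/* Scalar model of one 16-bit lane of the vshll/vsri packing sequence.
 * pack_0565_lane is a hypothetical name used only for this sketch. */
static uint16_t pack_0565_lane(uint8_t r, uint8_t g, uint8_t b)
{
    uint16_t r16 = (uint16_t)(r << 8);   /* vshll.u8 q14, d18, #8 */
    uint16_t g16 = (uint16_t)(g << 8);   /* vshll.u8 q10, d17, #8 */
    uint16_t b16 = (uint16_t)(b << 8);   /* vshll.u8 q15, d16, #8 */
    uint16_t out = r16;

    /* vsri.u16 q14, q10, #5: keep the top 5 bits, insert g16 >> 5 below */
    out = (uint16_t)((out & 0xF800) | (g16 >> 5));

    /* vsri.u16 q14, q15, #11: keep the top 11 bits, insert b16 >> 11 below */
    out = (uint16_t)((out & 0xFFE0) | (b16 >> 11));

    return out;
}
```

Each vsri depends on the vshll that produces its second operand and on
the previous write to q14, which is why the sequence is hard to
schedule without touching the surrounding code.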

I think the main lesson is that if we want to be serious about this
level of microoptimization, we need a more robust way of measuring
performance than lowlevel-blt-bench provides. In particular, more
iterations and tracking of standard deviations would be useful.
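
As a minimal sketch of the kind of bookkeeping I mean, the following C
helper takes the results of repeated runs and returns their mean and
sample variance (the standard deviation is the square root of the
variance). The names bench_stats_t and bench_stats are hypothetical;
nothing like this exists in lowlevel-blt-bench today.

```c
#include <stddef.h>

/* Hypothetical helper, not part of lowlevel-blt-bench. */
typedef struct
{
    double mean;      /* arithmetic mean of the samples */
    double variance;  /* sample variance (n - 1 in the denominator) */
} bench_stats_t;

static bench_stats_t bench_stats(const double *samples, size_t n)
{
    bench_stats_t s = { 0.0, 0.0 };
    double sum = 0.0, sq = 0.0;
    size_t i;

    if (n == 0)
        return s;

    for (i = 0; i < n; i++)
        sum += samples[i];
    s.mean = sum / (double)n;

    /* The variance is undefined for a single sample; leave it at 0. */
    if (n < 2)
        return s;

    for (i = 0; i < n; i++)
    {
        double d = samples[i] - s.mean;
        sq += d * d;
    }
    s.variance = sq / (double)(n - 1);

    return s;
}
```

With numbers like these, two versions of a fast path whose means differ
by less than a standard deviation or so could not be told apart, which
is roughly the situation I am in now.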

Also, I noticed that lowlevel-blt-bench copies the destination back into
the source in various places:

    memcpy (src, dst, BUFSIZE);
    memcpy (dst, src, BUFSIZE);

This could affect performance if dst ends up containing mostly zeros. In
the cases where the source is supposed to be solid, it could mean
operations like over_n_8_8888() will turn into complete no-ops. (This
happened to me in this case due to a bug in the over_n_8888_0565_ca()
code).
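
To illustrate why an all-zero source makes OVER a no-op, here is a
scalar C sketch of a masked OVER on a single 8-bit premultiplied
channel. It is a simplified model with made-up names (mul_un8,
over_channel), not the actual pixman code path.

```c
#include <stdint.h>

/* Rounded a * b / 255 for 8-bit values (hypothetical helper). */
static uint8_t mul_un8(uint8_t a, uint8_t b)
{
    uint32_t t = (uint32_t)a * b + 0x80;
    return (uint8_t)((t + (t >> 8)) >> 8);
}

/* One channel of (src IN mask) OVER dst, premultiplied alpha. */
static uint8_t over_channel(uint8_t src, uint8_t src_alpha,
                            uint8_t mask, uint8_t dst)
{
    uint8_t s = mul_un8(src, mask);        /* masked source channel */
    uint8_t a = mul_un8(src_alpha, mask);  /* masked source alpha   */
    uint32_t sum = (uint32_t)s + mul_un8(dst, (uint8_t)(255 - a));

    return (uint8_t)(sum > 255 ? 255 : sum);  /* saturate like vqadd */
}
```

When src and src_alpha are both 0, s and a are 0 whatever the mask is,
so the result is dst * 255 / 255 = dst and the destination is never
changed.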

I think the order of those memcpy() calls should at a minimum be
reversed, but I'm not completely clear on why they are there in the
first place. To put the sources and destinations into cache perhaps?

Soren