[cairo] Comments on NEON code

Sun Jul 5 06:10:27 PDT 2009

On Saturday 04 July 2009, Soeren Sandmann wrote:

I'll try to reply to some of the technical points.

> * Alignment
>
> Here is a summary of how the Cortex A8 works, as I read the
> documentation at
>
>    http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0344j/Cihhcici.html
>
> - There is a global "A bit", which if set causes the CPU to generate
>   an exception for all unaligned accesses.
>
> - When that bit is *not* set, unaligned accesses are allowed, but
>   slower than aligned access.

The A bit is stored in the "c1, Control Register", which is not accessible
from userspace, even for reading. From the Cortex-A8 TRM:
"Attempts to read or write the Control Register from secure or nonsecure User
modes result in an Undefined Instruction exception."

It is somewhat similar to the AC (alignment check) flag in EFLAGS on x86,
except that it cannot be freely accessed from userspace (much like the
registers holding CPU feature information, which are also restricted, so that
HWCAPS have to be used on Linux to get this information).
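
As a rough illustration of the HWCAPS route, here is a minimal C sketch that
scans an ELF auxiliary vector (on Linux it could be read from /proc/self/auxv)
for the AT_HWCAP entry and tests the NEON bit. The function name and the
SKETCH_* macros are made up for this example; the numeric values are the ones
I believe the 32-bit ARM Linux ABI uses, written out so that the code compiles
on any host.

```c
#include <stdbool.h>
#include <stddef.h>

/* Values from the 32-bit ARM Linux ABI, spelled out as an assumption so
 * the sketch builds anywhere. Real code would take them from <elf.h>
 * and <asm/hwcap.h>. */
#define SKETCH_AT_NULL    0UL
#define SKETCH_AT_HWCAP   16UL
#define SKETCH_HWCAP_NEON (1UL << 12)

/* Scan an auxiliary vector (pairs of type and value, terminated by an
 * AT_NULL type) and report whether the kernel advertises NEON. */
bool auxv_has_neon(const unsigned long *auxv)
{
    for (size_t i = 0; auxv[i] != SKETCH_AT_NULL; i += 2) {
        if (auxv[i] == SKETCH_AT_HWCAP)
            return (auxv[i + 1] & SKETCH_HWCAP_NEON) != 0;
    }
    return false; /* no AT_HWCAP entry found: assume no NEON */
}
```

Nothing here touches the restricted coprocessor registers; the kernel has
already done that and simply passes the result on.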

At least on Linux, recent kernels unconditionally disable the A bit
(arch/arm/mm/alignment.c), and hopefully it will stay this way forever. But
even if it does not (or if pixman runs on some other OS with a different
setup), this is not a problem for NEON either, see below.

> - NEON load instructions have optional alignment qualifiers: @16, @32,
>   @64 and so on.
>
>   Using such a qualifier will make the load faster, unless you lied
>   about it, in which case an exception is generated regardless of the
>   A bit.

Yes, it is similar to x86, which has the MOVAPS/MOVUPS instructions. Using a
qualifier helps to save 1 cycle on memory accesses. The Cortex-A8 CPU supports
a peak read or write rate of 128 bits per cycle from L1 cache when using
aligned memory accesses.

>   When not using a qualifier, the load behaves as a normal load:
>
>        if the A bit is set, unaligned access causes an exception
>
>        if the A bit is not set, unaligned access is allowed, but
>        slower than aligned access.
>
> Is the NEON code making assumptions about the state of this A bit? And
> if it is, should that not be tested for in pixman-cpu.c before
> enabling it?

Even if the A bit is set, NEON instructions have an "element size" (which can
be as small as 8 bits), and NEON memory accesses only have to be aligned to
the element boundary. So an instruction like

   vld1.8 {d0, d1, d2, d3}, [r0]

will load four 64-bit registers, treating the element size as 8 bits. This way
it is safe to use with any alignment.

But, for example,

   vld1.32 {d0, d1, d2, d3}, [r0]

should preferably be aligned at a 32-bit boundary, as it uses 32-bit elements.
A misaligned address here will trigger an exception only when the A bit is
set. ARM also supports changing endianness at runtime, so using the correct
element size in instructions is important in that case too, as it determines
what data is read.

For a typical little-endian system with the A bit disabled, the element size
makes little difference in practice.
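
To make the rules from this thread concrete, here is a small C model of when a
NEON load would raise an alignment exception. It is only a sketch of the rules
as described here, not of the real hardware, and the function name and
parameters are invented for the example. An alignment qualifier is given in
bits, with 0 meaning "no qualifier".

```c
#include <stdbool.h>
#include <stdint.h>

/* Model of the alignment rules discussed in this thread (an assumption,
 * not a hardware-accurate simulation):
 *  - with a qualifier (align_bits != 0), a misaligned address faults
 *    regardless of the A bit;
 *  - without one, and with the A bit set, the address must be a
 *    multiple of the element size;
 *  - otherwise the access is allowed (possibly slower).
 * elem_bits is 8, 16, 32 or 64; align_bits is 0 or a power of two >= 8. */
bool neon_load_faults(uintptr_t addr, unsigned elem_bits,
                      unsigned align_bits, bool a_bit)
{
    if (align_bits != 0 && (addr % (align_bits / 8)) != 0)
        return true;
    if (a_bit && (addr % (elem_bits / 8)) != 0)
        return true;
    return false;
}
```

In this model neon_load_faults(addr, 8, 0, true) is false for every address,
which is the vld1.8 case above: an 8-bit element size is always safe.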

Accesses to external memory are done for each element separately:
http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0344h/ch09s04s03.html

So basically ARM is quite good at dealing with unaligned data; it just needs
to be used in the right way.

> I could certainly be misreading the code, or not understanding the
> impact of unaligned access on ARM, but if I'm not, maybe it would be
> worthwhile doing what the SSE2 implementation does:
>
>      - Align the destination and always read it with aligned
>        instructions.
>
>      - Load sources with unaligned instructions.
>
> Alternatively, more complicated schemes could be devised where both
> source and destination are being loaded with aligned instructions.

ARM NEON is more flexible than SSEx. It can emulate practically every SSE
instruction easily, whereas emulating NEON with SSE is tricky. So there are no
problems here.

Aligning destination buffers on a 16-byte boundary is simply good for
performance, especially when accessing noncached memory (if it really needs to
be accessed from pixman).
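
A minimal sketch in C of what aligning the destination usually means in
practice: split a scanline into a short unaligned head, a body made of whole
16-byte blocks that start on a 16-byte boundary, and a tail. The struct and
function names are hypothetical and this is not taken from the pixman code.

```c
#include <stddef.h>
#include <stdint.h>

struct span_split {
    size_t head; /* bytes before the first 16-byte boundary */
    size_t body; /* bytes covered by whole, aligned 16-byte blocks */
    size_t tail; /* leftover bytes after the last whole block */
};

/* Split len bytes starting at address dst around 16-byte boundaries. */
struct span_split split_for_alignment(uintptr_t dst, size_t len)
{
    struct span_split s;
    size_t to_boundary = (size_t)((0 - dst) & 15u);
    size_t rest;

    s.head = to_boundary < len ? to_boundary : len;
    rest = len - s.head;
    s.body = rest & ~(size_t)15;
    s.tail = rest - s.body;
    return s;
}
```

The head and tail can be handled with 8-bit element accesses, which are safe
at any alignment, while the body can use the aligned forms.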

> - Naming: 24x16, not 0888x0565?
>
> Is there any reason to not follow the same naming convention as
> elsewhere? And if there is, why not do it consistently? Otherwise,
> it's just a gratuitous inconsistency that will leave the next person
> reading the code wondering what the significance of this change is.

By the way, I have always wondered about names like 'fbCompositeSrc_8888x0565'.
Why 'Src', when it is clearly an OVER operation?

-- 
Best regards,
Siarhei Siamashka