[cairo] [PATCH] Fixes for ARM features runtime autodetection support

Wed Jul 1 04:44:00 PDT 2009

On Mon, 2009-06-29 at 11:25 +0300, Siarhei Siamashka wrote:
> On Monday 29 June 2009, Jonathan Morton wrote:
> > > More serious is a problem with the newly added functions
> > > fbCompositeSolid_nx0565neon and fbCompositeOver_8888x0565neon.
> > > They are quite broken. And (not a) surprise - this breakage
> > > can be easily spotted by all the same 'scaling-test' program.
> >
> > Okay, I'll try to reproduce that and see what I can do about it.
> 
> Thanks.
> 
> Just running 'test/scaling-test' should survive and produce the same crc32
> value as on x86 (or as with neon optimizations disabled). To catch memory
> errors easier in gdb, malloc can be replaced by allocations using mmap,
> allocating buffers on the right "edge", so that accessing any data outside
> buffer would immediately crash.
> 
> Problems seem to be related to wrongly assuming that moving to the next pixel
> line would not change 16 byte alignment, but destination stride is only
> required to be 4 bytes aligned.

I think I see what you mean.  Our internal test cases used framebuffers
as the target, which limited the destination strides exercised.  In
these cases, it was sufficient to align the reads to 16 bytes on the
first scanline, and ensure that writes were precise.

The scaling test uses small, oddly-shaped pixmaps as targets.  This
means that 16-byte alignment of reads on the first scanline is not
sufficient, since the alignment may be different on subsequent (and
especially the last) scanlines.

The additional constraint of keeping the read strictly within the
scanline is required to fix this, and I'm working on that now.

> Is there really a strong need to copy data to the temporary buffer on stack?
> There are plenty of NEON registers still unused and they could serve as a
> temporary storage just fine.

The temporary buffer simplifies the inner loop (allowing every iteration
to work on 8 pixels at once without overrun or masking) and allows the
read phase (which may be from an uncached framebuffer) to be optimised.
I only use it for blitters which read-modify-write on the framebuffer.

Uncached framebuffers are a major performance problem.  We've determined
that write-to-read turnarounds are especially painful, since the
write-combining buffer has to be flushed (with RWT, RAS, CAS latencies
if it followed a read), then the full (WRT, RAS, CAS) latency of the
memory waited on, all before the blitter can even begin work on the next
group of pixels.  It's thus even more painful than consecutive small
reads, which only need to wait for the CAS latency each time.  (I'm
using SDRAM as an example, later technologies are probably even worse
relatively.)

To minimise the write-to-read turnarounds, the whole scanline (or, next
best, a large section of it) needs to be read at once.  There isn't
enough space in the NEON registers for that, even if they could be
accessed by index.  In general this is a classic example of needing to
optimise for throughput rather than latency when using SIMD.

All of these latencies are normally hidden by the cache when working on
an in-memory pixmap, with only cacheline misses incurring the relatively
small CAS latency, and writes normally deferred until the bus is idle.
Some sophisticated CPUs automatically preload for sequential accesses,
mostly hiding even this.

A potentially cleaner solution would be to abstract away the scanline
fetch and put in the implementations' overall composite functions, and
make all the inner-loop blitters work on one scanline at a time.  This
might allow a variety of efficiency improvements, but would be a lot of
effort to implement right now.

-- 
------
From: Jonathan Morton
      jonathan.morton at movial.com