[Pixman] [PATCH] sse2: faster bilinear scaling (pack 4 pixels to write with MOVDQA)
sandmann at cs.au.dk
Wed Sep 4 19:42:08 PDT 2013
Siarhei Siamashka <siarhei.siamashka at gmail.com> writes:
> The loops are already unrolled, so it was just a matter of packing
> 4 pixels into a single XMM register and doing aligned 128-bit
> writes to memory via MOVDQA instructions for the SRC compositing
> operator fast path. For the other fast paths, this XMM register
> is also directly routed to further processing instead of doing
> extra reshuffling. This replaces "8 PACKSSDW/PACKUSWB + 4 MOVD"
> instructions with "3 PACKSSDW/PACKUSWB + 1 MOVDQA" per 4 pixels,
> which results in a clear performance improvement.
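For readers following along, the packing step described above can be sketched roughly as follows. This is only an illustration, not the code from the patch: the helper name store_4_pixels is hypothetical, and it assumes each input register holds one interpolated pixel as four 32-bit channel values (lowest lane = lowest byte of the pixel) and that dst is 16-byte aligned.

```c
#include <emmintrin.h>
#include <stdint.h>

/* Hypothetical helper, not taken from the patch.
 * Assumption: p0..p3 each hold one pixel as four 32-bit channel values.
 * Assumption: dst is 16-byte aligned, as MOVDQA requires. */
static void
store_4_pixels (uint32_t *dst, __m128i p0, __m128i p1, __m128i p2, __m128i p3)
{
    /* PACKSSDW: 2 x (4 x 32-bit) -> 8 x 16-bit, signed saturation */
    __m128i lo = _mm_packs_epi32 (p0, p1);
    __m128i hi = _mm_packs_epi32 (p2, p3);

    /* PACKUSWB: 2 x (8 x 16-bit) -> 16 x 8-bit, unsigned saturation */
    __m128i px = _mm_packus_epi16 (lo, hi);

    /* MOVDQA: one aligned 128-bit store writes all four pixels */
    _mm_store_si128 ((__m128i *) dst, px);
}
```

That is three pack instructions and one store for four pixels, compared with packing each pixel on its own and writing it out with a separate MOVD.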
> There are also some other (less important) tweaks:
> 1. Convert 'pixman_fixed_t' to 'intptr_t' before using it as an
> index for addressing memory. The problem is that 'pixman_fixed_t'
> is a 32-bit data type and it has to be extended to 64-bit
> offsets, which needs extra instructions on 64-bit systems.
> 2. Dropped support for 8-bit interpolation precision to simplify
> the code.
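As an illustration of point 1, here is a minimal sketch of widening the fixed-point coordinate once, outside the loop. It is not the patch itself: fetch_scanline is a hypothetical nearest-neighbour loop, pixman_fixed_t is redeclared locally as the 32-bit 16.16 type, and the coordinates are assumed to be non-negative.

```c
#include <stdint.h>

/* Local stand-in for pixman's 32-bit 16.16 fixed-point type. */
typedef int32_t pixman_fixed_t;

/* Hypothetical example loop, not taken from the patch.
 * Assumption: vx_ and unit_x_ are non-negative. */
static void
fetch_scanline (uint32_t       *dst,
                const uint32_t *src,
                int             width,
                pixman_fixed_t  vx_,
                pixman_fixed_t  unit_x_)
{
    /* Widen to pointer size once, so the index used inside the loop
     * needs no per-iteration 32-bit to 64-bit extension. */
    intptr_t vx = vx_;
    intptr_t unit_x = unit_x_;

    while (width-- > 0)
    {
        *dst++ = src[vx >> 16];   /* integer part of the 16.16 value */
        vx += unit_x;
    }
}
```

Whether a given compiler actually emits extra extension instructions for the 32-bit variant depends on the compiler and optimization level, so the benefit should be checked in the generated assembly.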
If we are dropping support for 8-bit precision, let's drop it everywhere
(in a separate patch from this optimization). I'll send a patch as a
follow-up to this mail.
The other question I have is whether you have tested if this makes the
SSE2 fast paths competitive with the SSSE3 iterator. If it does, that would
allow us to postpone dealing with the iterators-vs-fastpaths problem.