[Pixman] [PATCH] sse2: faster bilinear scaling (pack 4 pixels to write with MOVDQA)

Wed Sep 4 19:42:08 PDT 2013

Siarhei Siamashka <siarhei.siamashka at gmail.com> writes:

> The loops are already unrolled, so it was just a matter of packing
> 4 pixels into a single XMM register and doing aligned 128-bit
> writes to memory via MOVDQA instructions for the SRC compositing
> operator fast path. For the other fast paths, this XMM register
> is also directly routed to further processing instead of doing
> extra reshuffling. This replaces "8 PACKSSDW/PACKUSWB + 4 MOVD"
> instructions with "3 PACKSSDW/PACKUSWB + 1 MOVDQA" per 4 pixels,
> which results in a clear performance improvement.
>
> There are also some other (less important) tweaks:
>
> 1. Convert 'pixman_fixed_t' to 'intptr_t' before using it as an
>    index for addressing memory. The problem is that 'pixman_fixed_t'
>    is a 32-bit data type and it has to be extended to 64-bit
>    offsets, which needs extra instructions on 64-bit systems.
>
> 2. Dropped support for 8-bit interpolation precision to simplify
>    the code.
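
For reference, here is a minimal sketch of the packing change described
in the first paragraph, written with SSE2 intrinsics rather than taken
from the patch. It assumes each interpolated pixel sits in its own XMM
register as four 32-bit channel values in the range 0..255, with blue
in the lowest lane. The function names are hypothetical and do not
appear in pixman.

```c
#include <assert.h>
#include <emmintrin.h>   /* SSE2 intrinsics */
#include <stdint.h>

/* Per-pixel variant: 2 packs + 1 MOVD for each pixel,
 * i.e. 8 packs + 4 MOVD for 4 pixels. */
static void
store_4_pixels_one_by_one (uint32_t *dst, const __m128i px[4])
{
    int i;

    for (i = 0; i < 4; i++)
    {
        __m128i t = _mm_packs_epi32 (px[i], px[i]);  /* PACKSSDW */
        t = _mm_packus_epi16 (t, t);                 /* PACKUSWB */
        dst[i] = (uint32_t) _mm_cvtsi128_si32 (t);   /* MOVD     */
    }
}

/* Packed variant: 3 packs + 1 MOVDQA for 4 pixels.
 * dst must be 16-byte aligned, as MOVDQA requires. */
static void
store_4_pixels_packed (uint32_t *dst, const __m128i px[4])
{
    __m128i lo, hi, out;

    assert (((uintptr_t) dst & 15) == 0);

    lo  = _mm_packs_epi32 (px[0], px[1]);    /* PACKSSDW: pixels 0, 1 */
    hi  = _mm_packs_epi32 (px[2], px[3]);    /* PACKSSDW: pixels 2, 3 */
    out = _mm_packus_epi16 (lo, hi);         /* PACKUSWB: all 4       */
    _mm_store_si128 ((__m128i *) dst, out);  /* MOVDQA                */
}
```

Both functions write the same four a8r8g8b8 values. For the non-SRC
operators, the 'out' register from the packed variant could be passed
on to the next stage directly instead of being stored.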

If we are dropping support for 8-bit precision, let's drop it everywhere
(in a separate patch from this optimization). I'll send a patch as a
follow-up to this mail.

My other question: have you tested whether this makes the SSE2 fast
paths competitive with the SSSE3 iterator? If it does, that would
allow us to postpone dealing with the iterators-vs-fastpaths problem.

Søren