[Pixman] Faster unorm_to_unorm for wide path processing

Sun Jun 10 09:27:46 PDT 2012

Attached is a simple patch that improves wide path processing by around 
20% in Mpix/s, thanks to a significant optimization of pixman_expand. 
On my i7 laptop, we go from:

> src_8888_2x10 =  L1:  62.08  L2:  60.73  M: 59.61 (  4.30%)  HT: 46.81  VT: 42.17  R: 43.18  RT: 26.01 ( 325Kops/s)

to

> src_8888_2x10 =  L1:  76.94  L2:  78.43  M: 75.87 (  5.59%)  HT: 56.73  VT: 52.39  R: 53.00  RT: 29.29 ( 363Kops/s)

The key to the patch is the observation that unorm_to_unorm's work can 
be done with a simple multiplication and shift, which is cheaper when 
the function is applied repeatedly and the parameters are not 
compile-time constants. For instance, converting 0xfe to 0xfefe 
(expanding from 8 bits to 16 bits) can be done by calculating

c = c * 0x101

However, the result is not always a whole number of copies of the input 
bits. For instance, going from 10 bits to 16 bits can be done by 
calculating

c = (c * 0x401UL) >> 4

where the intermediate result is a 20-bit-wide repetition of the 10-bit 
pattern, and the shift then discards the 4 unneeded lowest bits.
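
To make the two examples above concrete, here is a minimal standalone 
sketch in C. The function names expand_8_to_16 and expand_10_to_16 are 
made up for illustration and are not taken from the patch or from 
pixman itself.

```c
#include <stdint.h>

/* Hypothetical helper: expand an 8-bit unorm value to 16 bits.
 * Multiplying by 0x101 places a second copy of the byte above the
 * first one, e.g. 0xfe -> 0xfefe. */
static uint32_t expand_8_to_16(uint32_t c)
{
    return c * 0x101u;
}

/* Hypothetical helper: expand a 10-bit unorm value to 16 bits.
 * Multiplying by 0x401 gives a 20-bit value made of two copies of the
 * 10-bit pattern; shifting right by 4 drops the surplus low bits. */
static uint32_t expand_10_to_16(uint32_t c)
{
    return (c * 0x401u) >> 4;
}
```

Both helpers map the all-zeros input to 0 and the all-ones input to 
0xffff, which is the property a unorm expansion has to preserve.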

The patch adds an algorithm that calculates the factor and the shift, 
and converts the code to use them.
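
One possible way to derive the factor and the shift is sketched below. 
This is not the code from the attached patch: the type expand_params_t 
and the functions compute_expand_params and apply_expand are 
hypothetical names, and the sketch assumes 1 <= from_bits <= to_bits <= 
16 so that the product always fits in 32 bits.

```c
#include <stdint.h>

/* Hypothetical container for the precomputed multiplier and shift. */
typedef struct
{
    uint32_t factor;
    int      shift;
} expand_params_t;

/* Assumes 1 <= from_bits <= to_bits <= 16 (expansion only).
 * The factor gets a 1 bit at every multiple of from_bits until the
 * replicated pattern is at least to_bits wide; the shift is the number
 * of surplus low bits to drop afterwards. */
static expand_params_t compute_expand_params(int from_bits, int to_bits)
{
    expand_params_t p = { 0, 0 };
    int total = 0;

    while (total < to_bits)
    {
        p.factor |= 1u << total;
        total += from_bits;
    }
    p.shift = total - to_bits;
    return p;
}

/* Apply precomputed parameters to a single channel value. */
static uint32_t apply_expand(uint32_t c, expand_params_t p)
{
    return (c * p.factor) >> p.shift;
}
```

For 8 to 16 bits this yields factor 0x101 and shift 0, and for 10 to 16 
bits it yields factor 0x401 and shift 4, matching the two examples 
above. The parameters only need to be computed once per conversion and 
can then be reused for every pixel.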

-- 
Antti