[Pixman] [PATCH 3/3] Add an iterator that can fetch bilinearly scaled images

Siarhei Siamashka siarhei.siamashka at gmail.com
Wed Jul 31 18:06:17 PDT 2013


On Mon, 29 Jul 2013 05:03:29 -0400
"Søren Sandmann Pedersen" <soren.sandmann at gmail.com> wrote:

> This new iterator works in a separable way; that is, for a destination
> scanline, it scales the two involved source scanlines and then caches
> them so that they can be reused for the next destination scanlines.
> 
> This scheme saves a substantial amount of arithmetic for larger
> scalings; the time-per-pixel as reported by scaling-bench (running
> with PIXMAN_DISABLE="sse2 mmx") is graphed here:
> 
> 	http://people.freedesktop.org/~sandmann/scaling-graph.png
> 
> where the blue graph is the times before this patch, and the red graph
> is after.
> 
> As the graph shows, this patch is a slowdown for scaling ratios below
> 0.6, but not a big one, and large downscalings with the bilinear
> filter are not an important use case anyway since the results are ugly.
> 
> The graph is also available here:
> 
>     	http://i.imgur.com/Nibcyes.png
> 
> The data used to generate the graph is available here:
> 
> 	http://people.freedesktop.org/~sandmann/before
> 	http://people.freedesktop.org/~sandmann/after
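
If I understood the patch correctly, the separable scheme described
above boils down to something like the sketch below (single 8-bit
channel only for brevity; the names and the 16.16/8-bit weight layout
are mine and are not the actual iterator code):

#include <stdint.h>

/* One horizontally scaled source line; 'y' records which source row it
 * holds so that it can be reused for the next destination scanline.
 * The caller allocates 'pixels' (dst_width bytes) and sets y = -1. */
typedef struct
{
    int      y;
    uint8_t *pixels;
} line_t;

/* Scale one source row horizontally (16.16 fixed-point x coordinate,
 * 8-bit interpolation weights). */
static void
scale_line_x (line_t *line, const uint8_t *src_row, int src_width,
              int dst_width, int32_t x_step, int y)
{
    int32_t x = 0;

    for (int i = 0; i < dst_width; i++, x += x_step)
    {
        int x0 = x >> 16;
        int x1 = (x0 + 1 < src_width) ? x0 + 1 : x0;
        int fx = (x >> 8) & 0xff;

        line->pixels[i] =
            (src_row[x0] * (256 - fx) + src_row[x1] * fx) >> 8;
    }

    line->y = y;
}

/* Produce one destination scanline: make sure the two needed source
 * rows are in the cache, then blend them vertically.  When walking
 * down the destination image the rows are usually cached already,
 * which is where the arithmetic savings for upscaling come from. */
static void
fetch_dest_line (uint8_t *dst, line_t cache[2],
                 const uint8_t *src, int src_stride,
                 int src_width, int src_height, int dst_width,
                 int32_t x_step, int32_t y)
{
    int y0 = y >> 16;
    int y1 = (y0 + 1 < src_height) ? y0 + 1 : y0;
    int fy = (y >> 8) & 0xff;

    /* Reuse the old "bottom" line as the new "top" line when possible */
    if (cache[0].y != y0 && cache[1].y == y0)
    {
        line_t tmp = cache[0];
        cache[0] = cache[1];
        cache[1] = tmp;
    }

    if (cache[0].y != y0)
        scale_line_x (&cache[0], src + y0 * src_stride, src_width,
                      dst_width, x_step, y0);
    if (cache[1].y != y1)
        scale_line_x (&cache[1], src + y1 * src_stride, src_width,
                      dst_width, x_step, y1);

    for (int i = 0; i < dst_width; i++)
        dst[i] = (cache[0].pixels[i] * (256 - fy) +
                  cache[1].pixels[i] * fy) >> 8;
}

With this structure, upscaling does the horizontal interpolation only
once per source scanline instead of once per destination scanline,
which matches the shape of the graph.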

The results definitely look good (as expected). But what kind of
hardware and compiler were used for testing?

On a MIPS 74K at 480 MHz with gcc 4.5.3, the results look like this:

=== Old C ===

pixman: Disabled mips-dspr2 implementation
0.100000 : 320 240 => 32 24 : 0.001046 : 13.619040
0.300000 : 320 240 => 96 72 : 0.003226 : 4.667306
0.600000 : 320 240 => 192 144 : 0.010374 : 3.752109
0.900000 : 320 240 => 288 216 : 0.023395 : 3.760780
1.500000 : 320 240 => 480 360 : 0.062911 : 3.640671
2.000000 : 320 240 => 640 480 : 0.111207 : 3.620020
5.000000 : 320 240 => 1600 1200 : 0.693240 : 3.610625
10.000000 : 320 240 => 3200 2400 : 2.906125 : 3.784017

=== New C ===

pixman: Disabled mips-dspr2 implementation
0.100000 : 320 240 => 32 24 : 0.001290 : 16.797955
0.300000 : 320 240 => 96 72 : 0.005225 : 7.559235
0.600000 : 320 240 => 192 144 : 0.016348 : 5.912864
0.900000 : 320 240 => 288 216 : 0.034019 : 5.468588
1.500000 : 320 240 => 480 360 : 0.082273 : 4.761169
2.000000 : 320 240 => 640 480 : 0.140271 : 4.566112
5.000000 : 320 240 => 1600 1200 : 0.890944 : 4.640333
10.000000 : 320 240 => 3200 2400 : 3.416600 : 4.448698

=== DSPr2 assembly ===

0.100000 : 320 240 => 32 24 : 0.000610 : 7.941077
0.300000 : 320 240 => 96 72 : 0.002555 : 3.696661
0.600000 : 320 240 => 192 144 : 0.007881 : 2.850537
0.900000 : 320 240 => 288 216 : 0.019378 : 3.115063
1.500000 : 320 240 => 480 360 : 0.049227 : 2.848785
2.000000 : 320 240 => 640 480 : 0.096938 : 3.155530
5.000000 : 320 240 => 1600 1200 : 0.590205 : 3.073984
10.000000 : 320 240 => 3200 2400 : 1.929740 : 2.512682

Here the new C code appears to be slower for all scaling ratios.
I suspect that a possible reason is the use of 64-bit arithmetic
operations on a 32-bit processor.
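
For example (this is only an illustration of the kind of arithmetic
I mean, not code taken from the patch), blending two a8r8g8b8 pixels
by spreading their channels across a 64-bit word is cheap on a 64-bit
CPU, but on a 32-bit core every 64-bit multiply, shift and mask has to
be synthesized from several 32-bit instructions:

#include <stdint.h>

/* Blend two a8r8g8b8 pixels with an 8-bit weight (0..256) using 64-bit
 * "pseudo-SIMD": all four channels are handled by one multiplication.
 * On a 32-bit core such as the MIPS 74K each of these 64-bit
 * operations expands into multiple instructions. */
static inline uint32_t
lerp_a8r8g8b8_64bit (uint32_t p0, uint32_t p1, int w)
{
    uint64_t a = ((uint64_t) p0 | ((uint64_t) p0 << 24)) &
                 0x00ff00ff00ff00ffULL;
    uint64_t b = ((uint64_t) p1 | ((uint64_t) p1 << 24)) &
                 0x00ff00ff00ff00ffULL;
    uint64_t r = (a * (256 - w) + b * w) >> 8;

    r &= 0x00ff00ff00ff00ffULL;
    return (uint32_t) (r | (r >> 24));
}

If the new iterator relies on this kind of math in its inner loops,
that alone could explain a good part of the gap on MIPS32.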

The performance of the C code generally matters more on less common
CPU architectures, because they usually lack specially optimized code
paths. I'll try to run some additional benchmarks, including on the
64-bit Cell PPU in a PlayStation 3 and on other devices.

And a minor nitpick about the benchmark program: it would be a bit
nicer if it aligned the results vertically and produced more
human-readable output that can be interpreted unambiguously. For
example, without knowing that the last column lists the time spent
per pixel, people reading this e-mail cannot easily tell which
results are better.
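
To make the suggestion concrete, I am thinking of output along these
lines (the column names, widths and the per-destination-pixel unit
below are just my suggestion, not what the current program prints):

#include <stdio.h>

/* Print one benchmark result with a self-describing header and aligned
 * columns, so each field can be understood without reading the source
 * of the benchmark program. */
static void
print_result (double scale, int sw, int sh, int dw, int dh,
              double seconds)
{
    static int header_printed;

    if (!header_printed)
    {
        printf ("%8s  %11s  %11s  %12s  %16s\n",
                "scale", "src size", "dst size",
                "total (s)", "per dst px (ns)");
        header_printed = 1;
    }

    printf ("%8.3f  %5dx%-5d  %5dx%-5d  %12.6f  %16.3f\n",
            scale, sw, sh, dw, dh,
            seconds, seconds / ((double) dw * dh) * 1e9);
}

Even a single header line like that would remove the ambiguity.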

-- 
Best regards,
Siarhei Siamashka

