[Pixman] [ssse3]Optimization for fetch_scanline_x8r8g8b8

Mon Aug 30 13:31:35 PDT 2010

On Monday 30 August 2010 16:06:45 Siarhei Siamashka wrote:
> Also I tried to run my simple microbenchmarking program on Intel Atom N450
> netbook, x86_64 system. The results are the following:
> 
> --
> All results are presented in millions of pixels per second
> L1  - small Xx1 rectangle (fitting L1 cache), always blitted at the same
>       memory location with small drift in horizontal direction
> L2  - small XxY rectangle (fitting L2 cache), always blitted at the same
>       memory location with small drift in horizontal direction
> M   - large 1856x1080 rectangle, always blitted at the same
>       memory location with small drift in horizontal direction
> HT  - random rectangles with 32x32 average size are copied from
>       one 1920x1080 buffer to another, traversing from left to right
>       and from top to bottom
> VT  - random rectangles with 32x32 average size are copied from
>       one 1920x1080 buffer to another, traversing from top to bottom
>       and from left to right
> R   - random rectangles with 32x32 average size are copied from
>       random locations of one 1920x1080 buffer to another
> ---
> reference memcpy speed = 1007.2MB/s (251.8MP/s for 32bpp pixels)
> 
> --- C ---
>  src_x888_8888 = L1: 265.13 L2: 223.61 M:216.12 HT:109.69 VT: 80.23 R: 75.42
> 
> --- SSE2 ---
>  src_x888_8888 = L1: 611.50 L2: 494.79 M:120.17 HT: 83.57 VT: 79.48 R: 60.30
> 
> --- SSE2 (prefetch removed) ---
>  src_x888_8888 = L1: 683.39 L2: 539.57 M:260.07 HT:128.22 VT: 85.23 R: 81.40
> 
> --- SSSE3 ---
>  src_x888_8888 = L1:1559.35 L2: 798.95 M:254.31 HT:129.29 VT: 84.67 R: 75.00

And now a benchmark from an Intel Core i7 860 too (with the buffer sizes
increased to be sure that they are much larger than the 8MB cache):

---
reference memcpy speed = 6491.5MB/s (1622.9MP/s for 32bpp pixels)

--- C ---
 src_x888_8888 = L1:1145.72 L2:1124.48 M:902.11 HT:438.52 VT:351.58 R:316.74

--- SSE2 ---
 src_x888_8888 = L1:5628.96 L2:2766.65 M:1060.89 HT:576.04 VT:488.81 R:418.52

--- SSE2 (prefetch removed) ---
 src_x888_8888 = L1:6140.86 L2:2730.58 M:1066.52 HT:594.88 VT:484.12 R:423.24

--- SSSE3 ---
 src_x888_8888 = L1:5182.91 L2:2617.74 M:1064.30 HT:673.20 VT:568.15 R:441.97

So it looks like on Intel Core i7 the differences between the SSE2/SSSE3
implementations are mostly irrelevant. Standard memcpy shows better
memory bandwidth because it uses nontemporal writes for copying very large
buffers. If save_128_write_combining() is used instead of save_128_aligned()
in sse2_composite_src_x888_8888(), the src_x888_8888 operation becomes just
as fast as memcpy.
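
To illustrate the difference between the two store variants, here is a
minimal sketch and not the actual pixman code. It assumes that
save_128_aligned() corresponds to a regular aligned store
(_mm_store_si128) and save_128_write_combining() to a nontemporal store
(_mm_stream_si128). The function name copy_x888_8888() and its
"nontemporal" flag are hypothetical, and a scalar fallback is used when
SSE2 is not available:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Hypothetical helper (not part of pixman): copy n x8r8g8b8 pixels from
 * src to dst as a8r8g8b8 by forcing the alpha byte to 0xff.
 * If nontemporal is nonzero, the SSE2 path uses streaming stores that
 * bypass the cache; otherwise it uses regular aligned stores. */
static void
copy_x888_8888 (uint32_t *dst, const uint32_t *src, size_t n, int nontemporal)
{
    size_t i = 0;

    assert (dst != NULL && src != NULL);

#if defined(__SSE2__)
    /* Scalar head: advance until dst is 16-byte aligned. */
    while (i < n && (((uintptr_t) (dst + i)) & 15) != 0)
    {
        dst[i] = src[i] | 0xff000000u;
        i++;
    }

    {
        const __m128i alpha = _mm_set1_epi32 ((int) 0xff000000u);

        /* Main loop: 4 pixels per iteration, unaligned load, aligned store. */
        for (; i + 4 <= n; i += 4)
        {
            __m128i px = _mm_loadu_si128 ((const __m128i *) (src + i));

            px = _mm_or_si128 (px, alpha);

            if (nontemporal)
                _mm_stream_si128 ((__m128i *) (dst + i), px);
            else
                _mm_store_si128 ((__m128i *) (dst + i), px);
        }
    }

    /* Make the streaming stores globally visible before returning. */
    if (nontemporal)
        _mm_sfence ();
#else
    (void) nontemporal;
#endif

    /* Scalar tail (or the whole buffer when SSE2 is unavailable). */
    for (; i < n; i++)
        dst[i] = src[i] | 0xff000000u;
}
```

Both variants produce identical pixels; only the cache behaviour of the
destination writes differs, which is why the effect shows up in the M
(large buffer) column rather than in L1/L2.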

For Intel Atom, SSSE3 seems to improve performance a lot when working with
L1/L2 cache, but has no advantage when working with large memory buffers.
Still, I think it may be useful for hyperthreading (the other virtual thread
may have more resources to do something while src_x888_8888 is waiting for
memory). On the other hand, nontemporal writes can increase performance
noticeably if the buffer is known to be much bigger than L2 cache.
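
The "much bigger than L2 cache" condition could be expressed as a simple
size check. This is only a sketch: the helper name
should_use_nontemporal() is hypothetical, the L2 size is passed in by the
caller rather than detected, and the factor of 4 is an arbitrary
illustrative threshold, not a measured one:

```c
#include <stddef.h>

/* Hypothetical policy helper (not part of pixman): return nonzero when
 * the destination area is much larger than the L2 cache, in which case
 * nontemporal stores are likely to pay off.
 * "Much larger" is taken as at least 4x here, an arbitrary assumption. */
static int
should_use_nontemporal (size_t width,
                        size_t height,
                        size_t bytes_per_pixel,
                        size_t l2_cache_bytes)
{
    size_t bytes = width * height * bytes_per_pixel;

    return bytes >= 4 * l2_cache_bytes;
}
```

With an assumed 512 KiB L2 cache, the 1856x1080 32bpp rectangle from the
M test (about 8 MB) would pass this check, while a single 32x32 rectangle
(4 KiB) would not.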

And Intel Atom does not like software prefetch very much. This reminds me of
an older report:
http://lists.freedesktop.org/archives/pixman/2010-June/000218.html

I can try to run a full set of cairo-perf-trace benchmarks to get more relevant
numbers regarding the effect of software prefetch on Intel Atom.

-- 
Best regards,
Siarhei Siamashka