[Pixman] [ssse3]Optimization for fetch_scanline_x8r8g8b8

Fri Aug 20 09:36:07 PDT 2010

Hi, Soeren Sandmann and Siarhei Siamashka:
	This patch is still for #20709. 
	We are aware that some SSE2 intrinsics in a recent commit try to address a similar performance issue. However, that intrinsic optimization for x8r8g8b8 ops is not good enough on ATOM, and is sometimes even worse.
	In this patch we provide SSSE3-optimized code that co-exists with the current SSE2 intrinsics and takes effect only when CPUID reports SSSE3 support.
	Following Siarhei Siamashka's suggestions, here are the enhancements compared with the previous patch:
	1) 32-bit and 64-bit assembly code for a highly optimized memory move + logical AND, using SSSE3
	2) SSSE3 runtime detection
	3) The executable-stack issue is avoided
	4) The asm code is shorter (almost half the size of the previous patch)
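
For readers following the thread, here is a minimal sketch of how runtime detection and dispatch of this kind can be arranged. It is not the code from the patch. It assumes GCC or Clang (for <cpuid.h>), and it assumes the per-pixel operation is "copy and force the alpha byte to 0xff" via a bitwise OR, which is my understanding of the generic C path for x8r8g8b8; the mask operation in the actual assembly may differ. The names fetch_x888_fn, have_ssse3, fetch_x888_generic, fetch_x888_ssse3 and select_fetch_x888 are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__i386__) || defined(__x86_64__)
#include <cpuid.h>   /* GCC/Clang wrapper around the CPUID instruction */
#endif

/* Hypothetical signature for illustration; not pixman's real fetcher prototype. */
typedef void (*fetch_x888_fn)(uint32_t *dst, const uint32_t *src, size_t width);

/* Returns 1 if CPUID leaf 1 reports SSSE3 (ECX bit 9), otherwise 0. */
int have_ssse3(void)
{
#if defined(__i386__) || defined(__x86_64__)
    unsigned int eax, ebx, ecx, edx;

    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return 0;                 /* leaf 1 not available */
    return (int)((ecx >> 9) & 1u);
#else
    return 0;                     /* not an x86 CPU */
#endif
}

/* Portable fallback: copy each pixel and set the unused top byte to 0xff. */
void fetch_x888_generic(uint32_t *dst, const uint32_t *src, size_t width)
{
    for (size_t i = 0; i < width; i++)
        dst[i] = src[i] | 0xff000000u;
}

/* Stand-in for the hand-written SSSE3 routine. In this sketch it simply
 * forwards to the generic loop so that the example stays self-contained. */
void fetch_x888_ssse3(uint32_t *dst, const uint32_t *src, size_t width)
{
    fetch_x888_generic(dst, src, width);
}

/* Pick the implementation once, based on what the CPU reports. */
fetch_x888_fn select_fetch_x888(void)
{
    return have_ssse3() ? fetch_x888_ssse3 : fetch_x888_generic;
}
```

A caller would store the result of select_fetch_x888() at start-up and call through that pointer for every scanline, so a CPU without SSSE3 never reaches the SSSE3 path.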

We measured performance on ATOM and compared it with the original version with SSE2 intrinsics enabled (0.19.4). Using a 480p Flash H.264 video playback workload, we found the following:
	1) The cycles spent in sse2_composite_src_x888_8888() were reduced by 67%. This function's share of total system cycles dropped from 5.6% to 1.9%.
	2) The whole system's C0 percentage dropped from 68.0% to 62.6%.
Maybe it is not "dramatic", but we are glad to see these gains in both performance and power.
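
As a quick consistency check on these figures, the relative change can be computed with a small helper. This is only arithmetic on the numbers quoted in this mail; relative_reduction_pct is a hypothetical name, and no new measurement is implied.

```c
/* Relative reduction, in percent, when a metric drops from `before` to `after`.
 * `before` must be non-zero. */
double relative_reduction_pct(double before, double after)
{
    return (before - after) / before * 100.0;
}
```

relative_reduction_pct(5.6, 1.9) gives roughly 66, which is in line with the 67% reduction reported for the function itself, assuming the total system cycle count stayed about the same. For C0, the drop from 68.0% to 62.6% is 5.4 percentage points, or roughly 8% in relative terms.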

BTW, we built and ran make check on the following 3 systems:
	1: 32 bit ATOM system, with SSSE3
	2: 64 bit ATOM system, with SSSE3
	3: 64 bit system, without SSSE3

Thanks!
Samuel

-----Original Message-----
From: Liu, Xinyun [mailto:xinyunliu at gmail.com] 
Sent: Friday, August 20, 2010 11:40 PM
To: Siarhei Siamashka; pixman at lists.freedesktop.org
Cc: Ma, Ling; Xu, Samuel
Subject: Re: [Pixman] [ssse3]Optimization for fetch_scanline_x8r8g8b8

Hi Siarhei Siamashka,

Here is a new patch, can you review it? Thank you!

With this patch, oprofile shows that performance increases dramatically on Atom.

Samuel and Ling will provide detailed data.

Regards,
Liu, Xinyun