[Pixman] [ssse3]Optimization for fetch_scanline_x8r8g8b8
samuel.xu at intel.com
Fri Aug 27 05:00:49 PDT 2010
Hi, Siarhei Siamashka:
Thanks for quick response!
For 64 bit detect_cpu_features(), if ignore HAVE_GETISAX and _MSC_VER, it is ok for us to simplify it as your example in next update.
For pixman-ssse3.c, maybe we have 2 options:
1) duplicate 6562 lines from pixman-sse2.c to new pixman-ssse3.c in 1st patch (of course to replace 2 entries with newly added SSSE3 asm optimization), and then add "merge" mechanism in later patch.
2) firstly add "merge" mechanism patch, and the added new pixman-ssse3.c in later patch, which might be shorter (111 lines)
Does it mean 1) option is preferred?
We will response soon after your guide on how to split SSSE3 C file.
From: Siarhei Siamashka [mailto:siarhei.siamashka at gmail.com]
Sent: Friday, August 27, 2010 6:00 PM
To: Xu, Samuel
Cc: pixman at lists.freedesktop.org; Ma, Ling; Liu, Xinyun
Subject: Re: [Pixman] [ssse3]Optimization for fetch_scanline_x8r8g8b8
On Friday 27 August 2010 05:59:00 Xu, Samuel wrote:
> Hi Siarhei Siamashka,
> Here is a new patch, can you review it? Thank you!
> It address following suggestions:
> 1: SSSE3 file is split to a new file.
> Comparing with to duplicate every
> content from SSE2 file, I added a way to merge new fast path function
> pointer table runtimely by CPUID check, to avoid huge code expansion
> and a bug you pointed out.(Without runtime merging, I have to
> duplicate every line from SSE2 file to SSSE3 file, it is really not
> graceful and redundant.
Just implementing everything in the same way as done in SSE2 code, you would get a proper delegate chain trying SSSE3 fast paths first, then trying SSE2, MMX, C fast paths (compositing in one pass) and finally fallback to a slow general path code (fetch, combine and store steps done in separate passes).
The runtime merging is not strictly needed, even though it could probably improve performance somewhat. But this clearly belongs to a separate patch.
> 2: Copy right information added
> 3: Makefile fix
> 4: author name fix
> For detect_cpu_features(), we possibly can remove the TWO lines asm
> code to copy edx to "result", to avoid SSE2 and MMX checking, since
> most preparing code still need. If we keep this 2 lines, it still give
> us some possibility to check other edx information of CPUID in the
> future. It is only executed once when pixman is loaded.
Wouldn't it be just as simple as doing the following in the block of code under __GNUC__ and __amd64__ or __x86_64__ ifdefs?
int have_ssse3_support ()
"mov $1, %%eax\n"
"mov %%ecx, %0\n"
: "=r" (cpuid_ecx)
: "%rax", "%rbx", "%rcx", "%rdx"
return cpuid_ecx & (1 << 9);
And we already know that we have MMX and SSE2 on all x86_64 hardware even without checking.
MSVC and any compilers other than GNU C are probably not so interesting initially. They just should not fail when building pixman.
> For ASM code of 'bk_write' and 'table_48_bytes_bwd', here is the
> reason of "why we use backward copy mode in the patch ", from Ling:
> All read operations after allocation stage can run speculatively, all
> write operation will run in program order, and if addresses are
> different read may run before older write operation, otherwise wait until write commit.
> However CPU don't check each address bit, so read could fail to
> recognize different address even theyare in different page. For
> example if rsi is 0xf004, rdi is 0xe008,in following operation there will generate big
> performance latency. 1. movq (%rsi), %rax
> 2. movq %rax, (%rdi)
> 3. movq 8(%rsi), %rax
> 4. movq %rax, 8(%rdi)
> If %rsi and rdi were in really the same meory page, there are TRUE
> read-after-write dependence because instruction 2 write 0x008 and
> instruction 3 read 0x00c, the two address are overlap partially.
> Actually there are in different page and no any issues, but without
> checking each address bit CPU could think they are in the same page,
> and instruction 3 have to wait for instruction 2 to write data into
> cache from write buffer, then load data from cache, the cost time read
> spent is equal to mfence instruction. We may avoid it by tuning operation sequence as follow.
> 1. movq 8(%rsi), %rax
> 2. movq %rax, 8(%rdi)
> 3. movq (%rsi), %rax
> 4. movq %rax, (%rdi)
> Instruction 3 read 0x004, instruction 2 write address 0x010, no any
> dependence. In this patch we first handle small size(less 48bytes) by
> this way, and the approach help us improve up to 2X on Atom.
Thanks for the explanation. Some comments describing why it was done this way in the code for such non-obvious cases would be generally nice.
So it is basically a store forwarding aliasing problem which is described in
"188.8.131.52 Store Forwarding" section from "Intel(r) 64 and IA-32 Architectures Optimization Reference Manual", right?
Wouldn't just the use of MOVD/MOVSS instructions here also solve this problem?
Store forwarding does not seem to be used for SIMD according to the manual. I haven't benchmarked anything yet though.
Moreover, if you have plans to generalize and extend SSSE3 optimizations to accelerate more stuff later, then something like 'src_8888_8888' and 'add_8888_8888' operations would be the really easy ones. But it would be very nice to avoid dumb copy/paste and 3x increase of assembly sources size. For this to go smoothly, preferably you would need to separate the loop code itself from the pixel processing logic. And in order to make pixel processing more simple and consistent, it's better to do all the pixel processing using SIMD instructions (both inner loop and also leading/trailing pixels). Leading and trailing unaligned pixels may be probably handled by using MOVD/MOVSS/MOVHPS or other similar instructions to do partial SSE register load/store operations.
It's not like this needs to be done right now, but it is better to plan ahead.
Something similar is implemented for ARM NEON optimizations.
More information about the Pixman