[Mesa-dev] [PATCH 2/3][RFC v2] mesa/main/x86: Add sse2 streaming clamping
Juha-Pekka Heikkila
juhapekka.heikkila at gmail.com
Wed Nov 5 00:44:24 PST 2014
Hi,
I relied on gcc's optimization passes to move things around for me. This is
what _mesa_streaming_clamp_float_rgba really looks like when I compile it:
Dump of assembler code for function _mesa_streaming_clamp_float_rgba:
0x00007ffff401a0a0 <+0>: test %edi,%edi
0x00007ffff401a0a2 <+2>: je 0x7ffff401a0d7 <_mesa_streaming_clamp_float_rgba+55>
0x00007ffff401a0a4 <+4>: sub $0x1,%edi
0x00007ffff401a0a7 <+7>: shufps $0x0,%xmm0,%xmm0
0x00007ffff401a0ab <+11>: shufps $0x0,%xmm1,%xmm1
0x00007ffff401a0af <+15>: add $0x1,%rdi
0x00007ffff401a0b3 <+19>: shl $0x4,%rdi
0x00007ffff401a0b7 <+23>: xor %eax,%eax
0x00007ffff401a0b9 <+25>: nopl 0x0(%rax)
0x00007ffff401a0c0 <+32>: movups (%rsi,%rax,1),%xmm2
0x00007ffff401a0c4 <+36>: maxps %xmm0,%xmm2
0x00007ffff401a0c7 <+39>: minps %xmm1,%xmm2
0x00007ffff401a0ca <+42>: movups %xmm2,(%rdx,%rax,1)
0x00007ffff401a0ce <+46>: add $0x10,%rax
0x00007ffff401a0d2 <+50>: cmp %rdi,%rax
0x00007ffff401a0d5 <+53>: jne 0x7ffff401a0c0 <_mesa_streaming_clamp_float_rgba+32>
0x00007ffff401a0d7 <+55>: repz retq
End of assembler dump.
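
For reference, here is a minimal sketch of source that could plausibly compile
to a loop of this shape. It is not the code from the patch: the function
names, the flat float pointers and the argument order (count, source,
destination, min, max) are assumptions read off the register usage in the
dump (%edi, %rsi, %rdx, %xmm0, %xmm1). The two shufps $0x0 instructions before
the loop are what a hoisted _mm_set1_ps broadcast typically turns into.

```c
#include <stddef.h>
#include <xmmintrin.h>  /* SSE: _mm_set1_ps, _mm_loadu_ps, _mm_max_ps, _mm_min_ps, _mm_storeu_ps */

/* Hypothetical stand-in for the generic per-pixel helper: clamps one RGBA
 * quad (4 floats) to [vmin, vmax]. Unaligned load/store, like movups above. */
static inline void
clamp_float_rgba_sketch(const float *src, float *dst, __m128 vmin, __m128 vmax)
{
   __m128 v = _mm_loadu_ps(src);
   v = _mm_max_ps(v, vmin);   /* raise anything below min (maxps) */
   v = _mm_min_ps(v, vmax);   /* lower anything above max (minps) */
   _mm_storeu_ps(dst, v);
}

/* Hypothetical stand-in for the streaming function: n is the number of
 * RGBA quads, so src and dst each hold 4 * n floats. */
void
streaming_clamp_float_rgba_sketch(size_t n, const float *src, float *dst,
                                  float min, float max)
{
   /* Broadcast once, outside the loop. */
   const __m128 vmin = _mm_set1_ps(min);
   const __m128 vmax = _mm_set1_ps(max);

   for (size_t i = 0; i < n; i++)
      clamp_float_rgba_sketch(src + 4 * i, dst + 4 * i, vmin, vmax);
}
```

In this sketch the helper takes __m128 bounds directly, so the broadcast stays
out of the loop even if the compiler decides not to inline the helper.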
After inlining, gcc has moved everything unnecessary out of the loop, so I
can still keep the _mesa_clamp_float_rgba function available for generic use
at the source level. I also trusted gcc with the unrolling here. Looking at
the loop, unrolling would save three instructions per round, but I suspect
add/cmp/jne are not the expensive instructions here (I didn't check).
Out-of-order execution might be interesting to try here, though. I need to
check whether I can get gcc to behave properly; I have never attempted that
with intrinsics on gcc before :)
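
A rough sketch of what a hand-unrolled variant with prefetching could look
like, for experimenting. Everything here is an assumption rather than code
from the patch: the function name, the flat float pointers, the unroll factor
of two, the prefetch distance of 16 quads (256 bytes) and the _MM_HINT_NTA
hint are all untuned guesses, and none of it has been benchmarked.

```c
#include <stddef.h>
#include <xmmintrin.h>  /* SSE intrinsics, including _mm_prefetch */

/* Hypothetical unrolled clamp: n is the number of RGBA quads, so src and
 * dst each hold 4 * n floats. */
void
streaming_clamp_unrolled_sketch(size_t n, const float *src, float *dst,
                                float min, float max)
{
   const __m128 vmin = _mm_set1_ps(min);
   const __m128 vmax = _mm_set1_ps(max);
   size_t i = 0;

   /* Two independent load/clamp/store chains per iteration, so the
    * add/cmp/jne overhead is paid once per two quads. */
   for (; i + 2 <= n; i += 2) {
      /* Hint only; guarded so the address stays inside the source array. */
      if (i + 16 < n)
         _mm_prefetch((const char *)(src + 4 * (i + 16)), _MM_HINT_NTA);

      __m128 a = _mm_loadu_ps(src + 4 * i);
      __m128 b = _mm_loadu_ps(src + 4 * i + 4);
      a = _mm_min_ps(_mm_max_ps(a, vmin), vmax);
      b = _mm_min_ps(_mm_max_ps(b, vmin), vmax);
      _mm_storeu_ps(dst + 4 * i, a);
      _mm_storeu_ps(dst + 4 * i + 4, b);
   }

   /* Tail: at most one leftover quad when n is odd. */
   for (; i < n; i++) {
      __m128 v = _mm_loadu_ps(src + 4 * i);
      v = _mm_min_ps(_mm_max_ps(v, vmin), vmax);
      _mm_storeu_ps(dst + 4 * i, v);
   }
}
```

Whether this beats the simple loop would have to be measured; the loop is
probably limited by memory traffic rather than by the loop overhead.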
/Juha-Pekka
On 04.11.2014 19:35, Siavash Eliasi wrote:
> Hello. I'd get rid of "_mm_set1_ps" inside "_mesa_clamp_float_rgba" by
> passing the __m128 version of min/max directly, so "_mm_set1_ps" will be
> moved out of the for loop.
>
> I'd also unroll the "_mesa_streaming_clamp_float_rgba" loop to minimize
> the loop overhead (and utilize out of order execution as a bonus),
> because nothing compute intensive is happening there. You can also use
> prefetching (_mm_prefetch) there to improve performance by reading data
> ahead from memory.
>
> Best regards,
> Siavash Eliasi.