[Mesa-dev] [PATCH 2/3][RFC v2] mesa/main/x86: Add sse2 streaming clamping

Wed Nov 5 00:44:24 PST 2014

Hi,

I relied on gcc's optimization passes to move things around for me. What
_mesa_streaming_clamp_float_rgba really looks like when I compile it is this:

Dump of assembler code for function _mesa_streaming_clamp_float_rgba:
   0x00007ffff401a0a0 <+0>:	test   %edi,%edi
   0x00007ffff401a0a2 <+2>:	je     0x7ffff401a0d7 <_mesa_streaming_clamp_float_rgba+55>
   0x00007ffff401a0a4 <+4>:	sub    $0x1,%edi
   0x00007ffff401a0a7 <+7>:	shufps $0x0,%xmm0,%xmm0
   0x00007ffff401a0ab <+11>:	shufps $0x0,%xmm1,%xmm1
   0x00007ffff401a0af <+15>:	add    $0x1,%rdi
   0x00007ffff401a0b3 <+19>:	shl    $0x4,%rdi
   0x00007ffff401a0b7 <+23>:	xor    %eax,%eax
   0x00007ffff401a0b9 <+25>:	nopl   0x0(%rax)
   0x00007ffff401a0c0 <+32>:	movups (%rsi,%rax,1),%xmm2
   0x00007ffff401a0c4 <+36>:	maxps  %xmm0,%xmm2
   0x00007ffff401a0c7 <+39>:	minps  %xmm1,%xmm2
   0x00007ffff401a0ca <+42>:	movups %xmm2,(%rdx,%rax,1)
   0x00007ffff401a0ce <+46>:	add    $0x10,%rax
   0x00007ffff401a0d2 <+50>:	cmp    %rdi,%rax
   0x00007ffff401a0d5 <+53>:	jne    0x7ffff401a0c0 <_mesa_streaming_clamp_float_rgba+32>
   0x00007ffff401a0d7 <+55>:	repz retq
End of assembler dump.

After inlining, gcc has moved all the loop-invariant work outside the loop,
so I can still keep the _mesa_clamp_float_rgba function ready for generic
use at the source level. I also trusted gcc with the unrolling here. Looking
at the loop, unrolling would save three instructions per round, but I
suspect add/cmp/jne are not the expensive instructions here (I didn't check).
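
A minimal sketch of the source-level structure this relies on, not the
actual patch: the function names and the flat float-pointer signature below
are assumptions made for illustration. The per-pixel helper broadcasts
min/max itself, and the streaming loop simply calls it, leaving it to the
compiler to inline the helper and hoist the broadcasts.

```c
#include <stddef.h>
#include <xmmintrin.h>   /* _mm_set1_ps, _mm_max_ps, _mm_min_ps, ... */

/* Hypothetical generic helper: clamp one RGBA pixel (4 floats). */
static inline void
clamp_float_rgba_sketch(const float *src, float *dst, float min, float max)
{
   /* Loop-invariant once this is inlined into a loop. */
   const __m128 vmin = _mm_set1_ps(min);
   const __m128 vmax = _mm_set1_ps(max);

   __m128 v = _mm_loadu_ps(src);   /* unaligned load, like movups */
   v = _mm_max_ps(v, vmin);        /* raise values below min */
   v = _mm_min_ps(v, vmax);        /* lower values above max */
   _mm_storeu_ps(dst, v);
}

/* Hypothetical streaming variant: clamp n RGBA pixels, 4 floats each. */
static void
streaming_clamp_float_rgba_sketch(size_t n, const float *src, float *dst,
                                  float min, float max)
{
   for (size_t i = 0; i < n; i++)
      clamp_float_rgba_sketch(src + 4 * i, dst + 4 * i, min, max);
}
```

Building something like this with optimization enabled and disassembling it
is one way to check whether both broadcasts really end up ahead of the loop.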

Out-of-order execution might be interesting to try here, though. I need
to check whether I can get gcc to behave properly; I have never attempted
that with intrinsics on gcc before :)
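
For reference, a hand-unrolled variant could look roughly like the sketch
below. It is untested for performance; the function name, the flat
float-pointer signature and the unroll factor of two are all assumptions.
The two pixels are kept in separate variables so the two max/min chains do
not depend on each other.

```c
#include <stddef.h>
#include <xmmintrin.h>

/* Hypothetical unrolled variant: clamp n RGBA pixels, 4 floats each. */
static void
streaming_clamp_unrolled_sketch(size_t n, const float *src, float *dst,
                                float min, float max)
{
   const __m128 vmin = _mm_set1_ps(min);
   const __m128 vmax = _mm_set1_ps(max);
   size_t i = 0;

   /* Main loop: two independent pixels per iteration. */
   for (; i + 2 <= n; i += 2) {
      __m128 a = _mm_loadu_ps(src + 4 * i);
      __m128 b = _mm_loadu_ps(src + 4 * i + 4);

      a = _mm_min_ps(_mm_max_ps(a, vmin), vmax);
      b = _mm_min_ps(_mm_max_ps(b, vmin), vmax);

      _mm_storeu_ps(dst + 4 * i, a);
      _mm_storeu_ps(dst + 4 * i + 4, b);
   }

   /* Tail: at most one leftover pixel when n is odd. */
   for (; i < n; i++) {
      __m128 v = _mm_loadu_ps(src + 4 * i);
      v = _mm_min_ps(_mm_max_ps(v, vmin), vmax);
      _mm_storeu_ps(dst + 4 * i, v);
   }
}
```

Whether this beats the compiler-generated loop would have to be measured,
since the loads and stores are likely to dominate.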

/Juha-Pekka

On 04.11.2014 19:35, Siavash Eliasi wrote:
> Hello. I'd get rid of "_mm_set1_ps" inside "_mesa_clamp_float_rgba" by
> passing the __m128 version of min/max directly, so "_mm_set1_ps" will be
> moved out of the for loop.
> 
> I'd also unroll the "_mesa_streaming_clamp_float_rgba" loop to minimize
> the loop overhead (and utilize out of order execution as a bonus),
> because nothing compute intensive is happening there. You can also use
> prefetching (_mm_prefetch) there to improve performance by reading data
> ahead from memory.
> 
> Best regards,
> Siavash Eliasi.