[Mesa-dev] [PATCH 1/2] mesa: reimplement IROUND(), add F_TO_I()

Fri May 18 17:01:47 PDT 2012

On 19.05.2012 00:35, Brian Paul wrote:
> On 05/18/2012 03:54 PM, Roland Scheidegger wrote:
>> Looks ok though I wonder if we really need our own assembly here?
>> In particular if the compiler decides to use sse we really shouldn't use
>> the fp stack for converting floats to ints. fistp is just twice as slow
>> as sse conversion on newer cpus, and additionally it might potentially
>> involve moving values from xmm regs to fp.
>> I suspect something like lroundf() would generate better code than the
>> manual assembly (and far better than the c code) if things are compiled
>> to use sse2 at least (the same is of course true for the other functions
>> like ceil etc.). But I guess that's not available everywhere...
> 
> For now, I'm just trying to fix the issue at hand.  If anyone wants to
> look into using lroundf() and SSE code, that's great.  I'm really not up
> on what's the fastest solution on various CPUs.

Actually, lroundf() is useless here. gcc doesn't seem to have a builtin
for it, and the library function can't rely on the default rounding mode
(and it won't touch mxcsr either), so the resulting code is terrible.
In fact it looks worse than the C code to me.

This is what I got (with gcc 4.6.2, -O2, x86_64):

Dump of assembler code for function lroundf:
=> 0x00007ffff7baca60 <+0>:     movd   %xmm0,%edx
   0x00007ffff7baca64 <+4>:     mov    %edx,%ecx
   0x00007ffff7baca66 <+6>:     mov    %edx,%eax
   0x00007ffff7baca68 <+8>:     shr    $0x17,%ecx
   0x00007ffff7baca6b <+11>:    sar    $0x1f,%eax
   0x00007ffff7baca6e <+14>:    and    $0xff,%ecx
   0x00007ffff7baca74 <+20>:    or     $0x1,%eax
   0x00007ffff7baca77 <+23>:    lea    -0x7f(%rcx),%edi
   0x00007ffff7baca7a <+26>:    cmp    $0x3e,%edi
   0x00007ffff7baca7d <+29>:    jg     0x7ffff7bacaa8 <lroundf+72>
   0x00007ffff7baca7f <+31>:    test   %edi,%edi
   0x00007ffff7baca81 <+33>:    js     0x7ffff7bacad0 <lroundf+112>
   0x00007ffff7baca83 <+35>:    and    $0x7fffff,%edx
   0x00007ffff7baca89 <+41>:    or     $0x800000,%edx
   0x00007ffff7baca8f <+47>:    cmp    $0x16,%edi
   0x00007ffff7baca92 <+50>:    jle    0x7ffff7bacab0 <lroundf+80>
   0x00007ffff7baca94 <+52>:    sub    $0x96,%ecx
   0x00007ffff7baca9a <+58>:    cltq
   0x00007ffff7baca9c <+60>:    shl    %cl,%rdx
   0x00007ffff7baca9f <+63>:    imul   %rdx,%rax
   0x00007ffff7bacaa3 <+67>:    retq
   0x00007ffff7bacaa4 <+68>:    nopl   0x0(%rax)
   0x00007ffff7bacaa8 <+72>:    cvttss2si %xmm0,%rax
   0x00007ffff7bacaad <+77>:    retq
   0x00007ffff7bacaae <+78>:    xchg   %ax,%ax
   0x00007ffff7bacab0 <+80>:    mov    %edi,%ecx
   0x00007ffff7bacab2 <+82>:    mov    $0x400000,%esi
   0x00007ffff7bacab7 <+87>:    cltq
   0x00007ffff7bacab9 <+89>:    sar    %cl,%esi
   0x00007ffff7bacabb <+91>:    mov    $0x17,%ecx
   0x00007ffff7bacac0 <+96>:    add    %esi,%edx
   0x00007ffff7bacac2 <+98>:    sub    %edi,%ecx
   0x00007ffff7bacac4 <+100>:   shr    %cl,%edx
   0x00007ffff7bacac6 <+102>:   imul   %rdx,%rax
   0x00007ffff7bacaca <+106>:   retq
   0x00007ffff7bacacb <+107>:   nopl   0x0(%rax,%rax,1)
   0x00007ffff7bacad0 <+112>:   movslq %eax,%rdx
   0x00007ffff7bacad3 <+115>:   xor    %eax,%eax
   0x00007ffff7bacad5 <+117>:   cmp    $0xffffffff,%edi
   0x00007ffff7bacad8 <+120>:   cmove  %rdx,%rax
   0x00007ffff7bacadc <+124>:   retq
End of assembler dump.
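
For readability, here is roughly what that disassembly does when written
back as C. This is a sketch reconstructed from the instructions above, not
the glibc source; the name lroundf_sketch is made up, and it assumes a
64-bit result as on x86_64. The point is that all the rounding is done with
integer operations on the float's bits, so it does not depend on the
current rounding mode.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical name; a C reading of the disassembly, not the library source. */
int64_t lroundf_sketch(float x)
{
    uint32_t bits;
    memcpy(&bits, &x, sizeof bits);                  /* movd %xmm0,%edx */

    int exponent = (int)((bits >> 23) & 0xff) - 0x7f; /* unbiased exponent */
    int64_t sign = (bits & 0x80000000u) ? -1 : 1;     /* sar $0x1f ; or $0x1 */

    /* Exponent above 62: the dump falls back to cvttss2si, which returns
     * 0x8000000000000000 for every such input (-2^63, overflow, inf, NaN). */
    if (exponent > 62)
        return INT64_MIN;

    /* |x| < 1: the result is 0, or +/-1 once |x| reaches 0.5. */
    if (exponent < 0)
        return exponent == -1 ? sign : 0;

    uint32_t mant = (bits & 0x7fffffu) | 0x800000u;   /* restore implicit 1 */
    int64_t result;
    if (exponent >= 23) {
        /* No fractional bits left, just scale the mantissa up. */
        result = (int64_t)mant << (exponent - 23);
    } else {
        mant += 0x400000u >> exponent;                /* add 0.5 in fixed point */
        result = (int64_t)(mant >> (23 - exponent));  /* drop the fraction */
    }
    return sign * result;
}
```

lroundf_sketch(2.5f) gives 3 and lroundf_sketch(-2.5f) gives -3, i.e. halves
round away from zero, which is not what a conversion under the default
round-to-nearest-even mode produces.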

I think a single cvtss2si instruction (which btw is only sse, not sse2)
would be an order of magnitude faster than this mess (but of course it
would rely on the default rounding mode)...
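
A minimal sketch of the single-instruction variants, using the SSE
intrinsics from <xmmintrin.h>. It is x86-only (32-bit builds need -msse),
the function names are hypothetical, and the rounding variant assumes MXCSR
is still in its default round-to-nearest-even mode.

```c
#include <xmmintrin.h>  /* SSE intrinsics; x86 / x86_64 only */

/* Hypothetical helper: round using the current MXCSR rounding mode.
 * The conversion itself is one cvtss2si instruction. */
int iround_sse(float f)
{
    return _mm_cvtss_si32(_mm_set_ss(f));
}

/* Hypothetical helper: truncate toward zero regardless of rounding mode.
 * The conversion itself is one cvttss2si instruction. */
int itrunc_sse(float f)
{
    return _mm_cvttss_si32(_mm_set_ss(f));
}
```

With the default mode, iround_sse(2.5f) is 2 and iround_sse(3.5f) is 4
(ties go to even), so it does not match lroundf() on exact halves.
itrunc_sse(2.7f) is 2 and itrunc_sse(-2.7f) is -2.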

Anyway, I guess if we don't care about rounding, we should probably just
use C truncation on x86_64 (and maybe on all other cpus except x86, as
I'd guess they have some way to do this fast?). On x86, though, C
truncation doesn't produce very good code: it messes with the fpu control
word to adjust the rounding mode. It looks like gcc assumes the control
word is set to the default rounding mode, so I'm not sure why it was
actually causing the failure in the first place. The exception is when
-msse is specified (-mfpmath=sse isn't actually required for the
conversion to use an sse instruction), in which case it's just the same
single cvttss2si instruction as on x86_64.
Though surely in cases where a lot of floats (e.g. all values in a
texture image) are converted, adjusting the float control word to get
correct rounding isn't an issue.
x87 is such a mess...
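
The batch case could be sketched like this. Note the assumptions: it uses
the SSE control register (MXCSR) rather than the x87 control word, it is
x86-only, and the helper name is hypothetical. The idea is to save the
rounding mode, set round-to-nearest for the whole array, convert, and then
restore the mode, so the cost of touching the control register is paid per
batch and not per value.

```c
#include <stddef.h>
#include <xmmintrin.h>  /* SSE intrinsics; x86 / x86_64 only */

/* Hypothetical helper: convert n floats to ints with round-to-nearest-even,
 * switching the MXCSR rounding mode once for the whole batch. */
void floats_to_ints_nearest(const float *src, int *dst, size_t n)
{
    unsigned int saved = _MM_GET_ROUNDING_MODE();  /* remember caller's mode */
    _MM_SET_ROUNDING_MODE(_MM_ROUND_NEAREST);

    for (size_t i = 0; i < n; i++)
        dst[i] = _mm_cvtss_si32(_mm_load_ss(&src[i]));  /* cvtss2si */

    _MM_SET_ROUNDING_MODE(saved);                  /* restore caller's mode */
}
```

For the input {0.5f, 1.5f, 2.5f, -1.5f, 2.7f} this fills the output with
{0, 2, 2, -2, 3}.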

Roland