[Mesa-dev] [PATCH 5/7] translate_sse: Preserve low bit during unsigned -> float conversion.

Tue Apr 15 06:22:16 PDT 2014

On 15.04.2014 02:52, Andreas Hartmetz wrote:
>>> +               /* right shift & convert, losing the low bit - must clear
>>> +                * high bit because there is no unsigned convert
>>> +                * instruction */
>>>                 sse2_psrld_imm(p->func, dataXMM, 1);
>>>
>>> +               sse2_cvtdq2ps(p->func, dataXMM, dataXMM);
>>> +
>>> +               /* convert low bit to float */
>>> +               sse2_pslld_imm(p->func, dataXMM2, 31);
>>> +               sse2_psrld_imm(p->func, dataXMM2, 31);
>>> +               sse2_cvtdq2ps(p->func, dataXMM2, dataXMM2);
>>
>> Is this really ideal, wouldn't something like (in horrible pseudo-code
>> notation)
>> dataXMM2 = dataXMM2 & CONST(0x1 vec)
>> dataXMM2 = cvtdq2ps(dataXMM2)
>> be faster?
>> I guess though your method avoids the constant, so probably not worth
>> bothering (I am actually wondering what code llvm generates for its
>> UIToFP instruction or what in general the fastest way to do this is).
>>
> Well, this whole sequence is somewhat wasteful for a single bit, but
> it mattered in my application before I realized that it won't work
> anyway on too many drivers.
> 
> I was reluctant to add another register for the 0x1 constant(*);
> loading it from memory each time seemed like it would take a lot of
> bandwidth. Alternatively, loading an immediate into one 32 bit
> (or smaller?) "sub register" and then copying it into the others
> would amount to about the same instruction count after all.
> Maybe there are more execution units available for doing it that way,
> I don't know. Also I haven't found how to actually copy a value from
> the lowest subregister into all others in one (fast) instruction.
Don't forget that (most) cpus have full 128bit datapaths from the L1
cache to the registers, so loading such a constant isn't too bad and not
actually slower than loading a 32bit variable. A way to load a 32bit
constant and spread it across all lanes would still be nice, as it saves
memory (and ultimately memory bandwidth). Such an instruction does exist:
it's called vbroadcastss, but it is avx only... Even there, the
disadvantage is that you can't fold it into another instruction as a
memory source operand, of course.
If the code runs in a tight loop, ideally you'd just keep the constant
in a register, but you can't really do such optimizations easily within
that code.
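
For reference, a scalar C model of what a single 32bit lane goes through,
with both ways of isolating the low bit. This is only a sketch: the
function names are made up for illustration, and the last step (doubling
the halved value and adding the low bit back) is my assumption about how
the two parts get recombined, since those instructions aren't quoted
above.

```c
#include <stdint.h>

/* Low bit via a shift pair, like pslld 31 followed by psrld 31. */
static uint32_t low_bit_by_shifts(uint32_t u)
{
    return (u << 31) >> 31;
}

/* Low bit via a mask; the SIMD version needs a 0x1 constant somewhere. */
static uint32_t low_bit_by_mask(uint32_t u)
{
    return u & 1u;
}

/*
 * Model of one lane. cvtdq2ps only converts signed 32bit integers, so
 * the value is shifted right by one first to clear the high bit.
 */
static float u32_to_float_lane(uint32_t u)
{
    int32_t half = (int32_t)(u >> 1);            /* psrld 1 */
    int32_t low = (int32_t)low_bit_by_shifts(u); /* preserved low bit */
    float half_f = (float)half;                  /* cvtdq2ps */
    float low_f = (float)low;                    /* cvtdq2ps */
    float doubled = half_f + half_f;             /* assumed recombination */
    return doubled + low_f;
}
```

Both helpers return the same value for every input; the only difference
on the SIMD side is whether a constant has to be loaded or kept in a
register.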

> 
> (*) The code in this file generally uses rather few registers. If
> adding one to keep the 0x1 is not a problem, that's likely optimal.
> 
> FWIW, the best I could get compilers to produce was four times
> cvtsi2ss xmm0, rax - looks like the somewhat clever use of rax
> with a non maxed out value range is a hardcoded pattern for the
> conversion and the compilers have no means to be more "creative"
> there.
Yes, actually IIRC I saw llvm doing something similar (but using the fp
stack for no apparent reason). I think, though, that de-vectorizing
things is usually a terrible idea if the vectorized version is only
somewhat more complex.

> With x86 target I also saw a code sequence splitting the uint
> value into two 16 bit values, converting them and then adding
> them after multiplying the higher order bits by 0x10000. Repeated
> four times... not sure if there is a good reason for that or if it's
> just a compiler limitation that the code wasn't properly "SIMDed".
> Every one(!) of those four iterations seemed at least equally
> expensive to the whole sequence in this patch.
> Compilers tested were GCC 4.8 -O3 and Clang trunk -O3.
> 
> If anybody knows an optimal sequence I'd happily see that used
> instead.
Most likely a single optimal sequence doesn't even exist, since it
depends on the cpu used. For instance, some older cpus only have 64bit
physical simd units - there the scalar approach might be best, but this
may not be true for other cpus (on Bulldozer cpus, for instance,
transfers from the generic int domain to the simd domain, such as
cvtsi2ss requires, impose terrible latency penalties).
So your approach is fine by me; most likely this path isn't hit very
often anyway. The updated code looks good to me.
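
For comparison, the 16 bit split sequence described above can be modeled
in scalar C roughly like this. The compiler output itself wasn't quoted,
so this is just my reading of the description, and the function name is
made up.

```c
#include <stdint.h>

/*
 * Split the value into two 16bit halves, convert each one as a signed
 * integer, scale the high half by 0x10000 and add the two together.
 */
static float u32_to_float_split16(uint32_t u)
{
    float hi = (float)(int32_t)(u >> 16);     /* at most 65535, exact */
    float lo = (float)(int32_t)(u & 0xFFFFu); /* at most 65535, exact */
    float hi_scaled = hi * 65536.0f;          /* power of two, exact */
    return hi_scaled + lo;                    /* the only rounding step */
}
```

Both halves convert exactly and the scaling is exact, so rounding only
happens in the final add.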

btw I've got another minor nitpick: the comments are a bit off (in two
places):
> case 32:           /* we lose precision if value > 2^23 - 1 */
This is untrue: every integer up to and including 2^24 is exactly
representable as a float, so precision should only be lost beyond
that - well, at least for the signed path, but I think it should be true
now for the unsigned path too...
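
A quick way to check the 2^24 limit in plain C (a minimal sketch; the
helper name is made up):

```c
#include <stdint.h>

/* Returns 1 if v survives the conversion to float unchanged, else 0. */
static int is_exact_in_float(uint32_t v)
{
    float f = (float)v;
    /* Compare in double, which holds every 32bit integer exactly. */
    return (double)f == (double)v;
}
```

With this, 2^23 + 1 (8388609) and 2^24 (16777216) both come back exact,
while 2^24 + 1 (16777217) is the first value that does not.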

Roland