[Mesa-dev] [PATCH 5/7] translate_sse: Preserve low bit during unsigned -> float conversion.

Roland Scheidegger sroland at vmware.com
Tue Apr 15 12:33:36 PDT 2014


On 15.04.2014 02:52, Andreas Hartmetz wrote:
>>> +               /* right shift & convert, losing the low bit - must clear
>>> +                * high bit because there is no unsigned convert instruction */
>>>                 sse2_psrld_imm(p->func, dataXMM, 1);
>>>
>>> +               sse2_cvtdq2ps(p->func, dataXMM, dataXMM);
>>> +
>>> +               /* convert low bit to float */
>>> +               sse2_pslld_imm(p->func, dataXMM2, 31);
>>> +               sse2_psrld_imm(p->func, dataXMM2, 31);
>>> +               sse2_cvtdq2ps(p->func, dataXMM2, dataXMM2);
>>
>> Is this really ideal, wouldn't something like (in horrible pseudo-code
>> notation)
>> dataXMM2 = dataXMM2 & CONST(0x1 vec)
>> dataXMM2 = cvtdq2ps(dataXMM2)
>> be faster?
>> I guess though your method avoids the constant, so probably not worth
>> bothering (I am actually wondering what code llvm generates for its
>> UIToFP instruction or what in general the fastest way to do this is).
>>
> Well, this whole sequence is somewhat wasteful for a single bit, but
> it mattered in my application before I realized that it won't work
> anyway on too many drivers.
> 
> I was reluctant to add another register for the 0x1 constant(*);
> loading it from memory each time seemed like it would take a lot of
> bandwidth. Alternatively, loading an immediate into one 32 bit
> (or smaller?) "sub register" and then copying it into the others
> would amount to about the same instruction count after all.
> Maybe there are more execution units available for doing it that way,
> I don't know. Also I haven't found how to actually copy a value from
> the lowest subregister into all others in one (fast) instruction.
> 
> (*) The code in this file generally uses rather few registers. If
> adding one to keep the 0x1 is not a problem, that's likely optimal.
> 
> FWIW, the best I could get compilers to produce was four times
> cvtsi2ss xmm0, rax - looks like the somewhat clever use of rax
> with a non maxed out value range is a hardcoded pattern for the
> conversion and the compilers have no means to be more "creative"
> there. With x86 target I also saw a code sequence splitting the uint
> value into two 16 bit values, converting them and then adding
> them after multiplying the higher order bits by 0x10000. Repeated
> four times... not sure if there is a good reason for that or if it's
> just a compiler limitation that the code wasn't properly "SIMDed".
> Every one(!) of those four iterations seemed at least equally
> expensive to the whole sequence in this patch.
> Compilers tested were GCC 4.8 -O3 and Clang trunk -O3.
> 
> If anybody knows an optimal sequence I'd happily see that used
> instead.
I'm not sure what your code looked like, but compilers usually aren't
very good at auto-vectorization...
I suspect the sequence using two 16-bit values is probably the only
solution if you want to do this correctly and fully vectorized. I missed
this previously, but your code won't quite do the rounding correctly in
all cases (and the C compiler has to follow correct round-to-nearest for
int->float conversion). So a comment saying that this doesn't quite give
the most exact possible float value in all cases (for values > 2^25)
would be nice. It is really annoying that there are no uint->fp
conversion instructions :-(.
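
To illustrate the rounding difference, here is a minimal sketch using
SSE2 intrinsics rather than the translate_sse emitter calls. All
function names are hypothetical. The shift variant assumes the two
halves are recombined as 2 * float(x >> 1) + float(x & 1), which is not
shown in the quoted hunk, and both variants assume the default
round-to-nearest mode in MXCSR. Note that the 16-bit split needs two
constants, which the patch was trying to avoid.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Shift-based variant, modelled on the patch: convert (x >> 1) and
 * (x & 1) separately. ASSUMPTION: the results are recombined as
 * 2 * hi + lo. For large inputs both the conversion of (x >> 1) and
 * the final add can round, so the result may be rounded twice. */
static __m128 u32_to_f32_shift(__m128i x)
{
    __m128 hi = _mm_cvtepi32_ps(_mm_srli_epi32(x, 1));
    __m128i lowbit = _mm_srli_epi32(_mm_slli_epi32(x, 31), 31);
    __m128 lo = _mm_cvtepi32_ps(lowbit);
    return _mm_add_ps(_mm_add_ps(hi, hi), lo);
}

/* 16-bit split variant: both halves fit in 16 bits, so their
 * conversions and the multiply by 65536 are exact. Only the final add
 * rounds, which gives a correctly rounded result. */
static __m128 u32_to_f32_split16(__m128i x)
{
    __m128i hi16 = _mm_srli_epi32(x, 16);
    __m128i lo16 = _mm_and_si128(x, _mm_set1_epi32(0xffff));
    __m128 hi = _mm_mul_ps(_mm_cvtepi32_ps(hi16), _mm_set1_ps(65536.0f));
    __m128 lo = _mm_cvtepi32_ps(lo16);
    return _mm_add_ps(hi, lo);
}

/* Hypothetical helpers: broadcast one value and read back lane 0. */
static float lane0_shift(uint32_t v)
{
    return _mm_cvtss_f32(u32_to_f32_shift(_mm_set1_epi32((int)v)));
}

static float lane0_split16(uint32_t v)
{
    return _mm_cvtss_f32(u32_to_f32_split16(_mm_set1_epi32((int)v)));
}
```

For example, with x = 0x02000003 (2^25 + 3) the shift variant first
rounds 2^24 + 1 down to 2^24 and then rounds 2^25 + 1 down again,
giving 33554432.0f, whereas the correctly rounded value, which the
split variant also produces, is 33554436.0f.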

Roland

