[Mesa-dev] [PATCH 5/7] translate_sse: Preserve low bit during unsigned -> float conversion.

Mon Apr 14 17:52:49 PDT 2014

> > +               /* right shift & convert, losing the low bit - must clear
> > +                * high bit because there is no unsigned convert instruction */
> >                 sse2_psrld_imm(p->func, dataXMM, 1);
> > 
> > +               sse2_cvtdq2ps(p->func, dataXMM, dataXMM);
> > +
> > +               /* convert low bit to float */
> > +               sse2_pslld_imm(p->func, dataXMM2, 31);
> > +               sse2_psrld_imm(p->func, dataXMM2, 31);
> > +               sse2_cvtdq2ps(p->func, dataXMM2, dataXMM2);
> 
> Is this really ideal, wouldn't something like (in horrible pseudo-code
> notation)
> dataXMM2 = dataXMM2 & CONST(0x1 vec)
> dataXMM2 = cvtdq2ps(dataXMM2)
> be faster?
> I guess though your method avoids the constant, so probably not worth
> bothering (I am actually wondering what code llvm generates for its
> UIToFP instruction or what in general the fastest way to do this is).
> 
Well, this whole sequence is somewhat wasteful for a single bit, but
it mattered in my application before I realized that it won't work
anyway on too many drivers.
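
For reference, a scalar C model of what one 32-bit lane goes through. This is only a sketch: the quoted hunk does not show how the two converted parts are recombined, so the `2*high + low` step and the function names are assumptions. The second function models the masked variant suggested above.

```c
#include <stdint.h>

/* One lane of the shift-based sequence.
 * Assumption: the parts are recombined as 2*high + low. */
static float uint_to_float_shift(uint32_t u)
{
    /* psrld 1 clears the sign bit, so the signed convert is safe */
    float high = (float)(int32_t)(u >> 1);
    /* pslld 31 followed by psrld 31 isolates bit 0 */
    float low = (float)(int32_t)((u << 31) >> 31);
    return (high + high) + low;
}

/* Same idea, but isolating bit 0 with an AND against a 0x1 constant. */
static float uint_to_float_mask(uint32_t u)
{
    float high = (float)(int32_t)(u >> 1);
    float low = (float)(int32_t)(u & 1u);
    return (high + high) + low;
}
```

Both variants compute the same value per lane; they differ only in whether a constant register is needed.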

I was reluctant to add another register for the 0x1 constant(*);
loading it from memory each time seemed like it would take a lot of
bandwidth. Alternatively, loading an immediate into one 32-bit
(or smaller?) "subregister" and then copying it into the others
would amount to about the same instruction count after all.
Maybe there are more execution units available for doing it that way,
I don't know. Also, I haven't found a way to copy a value from
the lowest subregister into all others in one (fast) instruction.
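
Two possible ways to get 0x1 into all four lanes without a memory load, sketched with SSE2 intrinsics rather than this file's emitter calls. The function names are made up for illustration, and whether either is actually fast on a given CPU is untested here. The first uses pshufd with an immediate of 0, which copies lane 0 into every lane; the second builds the constant from an all-ones register.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* movd an immediate into lane 0, then broadcast it with pshufd imm 0. */
static __m128i ones_via_pshufd(void)
{
    __m128i v = _mm_cvtsi32_si128(1);   /* lane 0 = 1, lanes 1..3 = 0 */
    return _mm_shuffle_epi32(v, 0x00);  /* every lane = old lane 0 */
}

/* pcmpeqd of a register with itself gives all-ones; psrld 31 leaves 1. */
static __m128i ones_via_compare(void)
{
    __m128i z = _mm_setzero_si128();
    __m128i all = _mm_cmpeq_epi32(z, z);  /* 0xFFFFFFFF in each lane */
    return _mm_srli_epi32(all, 31);       /* 0x00000001 in each lane */
}

/* Helper for inspecting the lanes from scalar code. */
static void store_lanes(__m128i v, uint32_t out[4])
{
    _mm_storeu_si128((__m128i *)out, v);
}
```

Either way the constant still occupies a register while it is live, so this only avoids the memory load, not the register pressure.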

(*) The code in this file generally uses rather few registers. If
adding one to keep the 0x1 is not a problem, that's likely optimal.

FWIW, the best I could get compilers to produce was cvtsi2ss xmm0, rax
repeated four times. It looks like the somewhat clever use of rax
with a non-maxed-out value range is a hardcoded pattern for the
conversion and the compilers have no means to be more "creative"
there. With an x86 target I also saw a code sequence splitting the uint
value into two 16-bit values, converting them and then adding
them after multiplying the higher-order bits by 0x10000. Repeated
four times... not sure if there is a good reason for that or if it's
just a compiler limitation that the code wasn't properly "SIMDed".
Every one(!) of those four iterations seemed at least as
expensive as the whole sequence in this patch.
Compilers tested were GCC 4.8 -O3 and Clang trunk -O3.
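
The 16-bit split described above, written out as scalar C. This is a reconstruction from the description, not the compiler output itself, and the function name is made up. Both halves fit in a float mantissa and the scaling by 0x10000 is exact, so only the final add rounds.

```c
#include <stdint.h>

/* Convert an unsigned 32-bit value using only signed conversions. */
static float uint_to_float_halves(uint32_t u)
{
    float hi = (float)(int32_t)(u >> 16);      /* upper 16 bits, exact */
    float lo = (float)(int32_t)(u & 0xFFFFu);  /* lower 16 bits, exact */
    return hi * 65536.0f + lo;                 /* single rounding in the add */
}
```

With the default round-to-nearest mode this should give the same result as a direct (float)u cast.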

If anybody knows an optimal sequence I'd happily see that used
instead.

> Roland
> 

<snip>
A patch that fixes the accidentally scalar mov follows.