[Mesa-dev] RFC: tgsi opcodes for 32x32 muls with 64bit results

Thu May 2 21:58:56 PDT 2013

----- Original Message -----
> Currently, there's no way to get the high bits of a 32x32
> signed/unsigned integer multiplication with tgsi.
> However, all of d3d10, OpenGL, and OpenCL support that, so we need it as
> well.
> There are essentially two ways it could be done:
> - a 2-destination instruction returning both high and low bits (this is
> how it looks in d3d10 and glsl)
> - use the existing umul for the low bits and have another instruction
> for the high bits (this is how it looks in opencl)
> 
> Well, there are other possibilities, but these looked like they'd match both
> APIs and HW reasonably (with the exception of things like sse2, which
> would prefer 2x2 32-bit inputs and return 2x64-bit results as one reg...).
> 
> Actually it's two new instructions, because unlike for the low bits, it
> matters for the high bits whether the source operands are signed or unsigned.
> 
> Personally I'm favoring two separate instructions for low and high bits
> so as not to have to deal with multi-destination instructions, but if someone
> makes a strong case for one returning both low and high bits I could be
> convinced otherwise. I think, though, that two instructions match most hw
> very well (with the exception of software renderers and possibly intel
> graphics, but then a good backend could certainly recognize this).

Roland,

I don't know about GPU HW, but I think that what you propose will forever prevent decent SSE code generation with LLVM.

Using two separate opcodes for hi/low bits relies on common sub-expression elimination to merge the two multiplication operations back into one.  But I strongly doubt that even LLVM's optimization passes will be able to do that.
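To make the concern concrete, here is a minimal C sketch of how a backend might lower each design (all function names are hypothetical, not Mesa or TGSI code). With two opcodes, each one is translated on its own: the low half comes from a plain 32-bit multiply and the high half from a separate, truncated 64-bit multiply, so merging them requires the optimizer to recognize that both expressions are halves of the same product. A single two-destination opcode hands the backend one widening multiply from the start.

```c
#include <stdint.h>

/* Hypothetical lowering of the two-opcode proposal, low-bits opcode:
 * a 32-bit multiply where only the low 32 bits survive. */
static uint32_t umul_lo(uint32_t a, uint32_t b)
{
    return a * b;
}

/* Hypothetical lowering of the two-opcode proposal, high-bits opcode:
 * a separate 64-bit multiply, shifted down to keep the high 32 bits. */
static uint32_t umul_hi(uint32_t a, uint32_t b)
{
    return (uint32_t)(((uint64_t)a * b) >> 32);
}

/* Hypothetical lowering of a single two-destination opcode:
 * one widening multiply feeds both results. */
static void umul_wide(uint32_t a, uint32_t b, uint32_t *lo, uint32_t *hi)
{
    uint64_t p = (uint64_t)a * b;
    *lo = (uint32_t)p;
    *hi = (uint32_t)(p >> 32);
}
```

Both designs compute the same bits; the difference is only in how many multiplies the backend sees before optimization.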

Getting the 64-bit results with LLVM will require sign- or zero-extending the source arguments (http://llvm.org/docs/LangRef.html#mul-instruction) or using SSE intrinsics. Either way, the expressions for the low and high bits will be radically different, so we'll end up with two multiplies, which I think is simply inadmissible: TGSI should not stand in the way of backends generating good code.
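For reference, a minimal C sketch of the widening described above (function names are hypothetical): the high bits require extending both operands to 64 bits before multiplying, and the signed and unsigned variants differ only in how the operands are extended. The low 32 bits are the same either way, which is why only the high-bits case needs separate signed and unsigned opcodes. Results are returned as raw 32-bit patterns, as they would sit in an untyped register.

```c
#include <stdint.h>

/* Low 32 bits: identical for signed and unsigned operands. */
static uint32_t mul_lo32(uint32_t a, uint32_t b)
{
    return (uint32_t)((uint64_t)a * b);
}

/* Unsigned high bits: zero-extend both operands, then multiply. */
static uint32_t umul_hi32(uint32_t a, uint32_t b)
{
    return (uint32_t)(((uint64_t)a * (uint64_t)b) >> 32);
}

/* Signed high bits: sign-extend both operands, then multiply.
 * The product cannot overflow int64_t, since |a * b| <= 2^62. */
static uint32_t imul_hi32(int32_t a, int32_t b)
{
    int64_t p = (int64_t)a * (int64_t)b;
    /* Shift as unsigned to avoid implementation-defined behavior
     * when right-shifting a negative signed value. */
    return (uint32_t)((uint64_t)p >> 32);
}
```

For example, with the bit pattern 0xFFFFFFFF times 2, the low bits are 0xFFFFFFFE in both interpretations, but the unsigned high bits are 0x00000001 while the signed high bits (-1 * 2 = -2) are 0xFFFFFFFF.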

So I strongly think this is a bad idea. TGSI has support for multiple destinations, though we have never made much use of it. I see nothing special about using it here.
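As an illustration that a multi-destination opcode need not be special, here is a minimal sketch of how a software TGSI interpreter might handle a signed two-destination multiply. The `struct reg4` type and `exec_imul_wide` function are hypothetical stand-ins, not the actual tgsi_exec API: each channel does one widening multiply and writes the low and high halves to the two destination registers.

```c
#include <stdint.h>

/* Hypothetical 4-channel register holding raw 32-bit integers. */
struct reg4 {
    uint32_t c[4];
};

/* Hypothetical handler for a signed 32x32->64 multiply with two
 * destinations: dst_lo gets the low 32 bits and dst_hi the high
 * 32 bits of each per-channel product. Both sources of a channel are
 * read before that channel's destinations are written, so a
 * destination may alias a source. */
static void exec_imul_wide(struct reg4 *dst_lo, struct reg4 *dst_hi,
                           const struct reg4 *src0, const struct reg4 *src1)
{
    for (int i = 0; i < 4; i++) {
        /* Reinterpret the raw bits as signed and sign-extend; this
         * relies on the two's-complement conversion that mainstream
         * compilers implement. */
        int64_t a = (int32_t)src0->c[i];
        int64_t b = (int32_t)src1->c[i];
        uint64_t p = (uint64_t)(a * b);   /* one multiply per channel */
        dst_lo->c[i] = (uint32_t)p;
        dst_hi->c[i] = (uint32_t)(p >> 32);
    }
}
```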

If you can prove me wrong, i.e. show that LLVM can merge the multiplies, fine. But I do think we have bigger fish to fry, so I'd prefer we not spend too much time debating this.

Jose