[Mesa-dev] RFC: tgsi opcodes for 32x32 muls with 64bit results

Fri May 3 07:32:35 PDT 2013

----- Original Message -----
> Am 03.05.2013 06:58, schrieb Jose Fonseca:
> > 
> > 
> > ----- Original Message -----
> >> Currently, there's no way to get the high bits of a 32x32
> >> signed/unsigned integer multiplication with tgsi. However, all of
> >> d3d10, OpenGL, and OpenCL support that, so we need it as well.
> >> There are essentially two ways it could be done:
> >> - a 2-destination instruction returning both high and low bits
> >>   (this is how it looks in d3d10 and glsl)
> >> - use the existing umul for the low bits and have another
> >>   instruction for the high bits (this is how it looks in opencl)
> >> 
> >> Well there's other possibilities but these looked like they'd match
> >> both APIs and HW reasonably (well with the exception of things like
> >> sse2 which would prefer 2x2 32bit inputs and return 2x64bit as one
> >> reg...).
> >> 
> >> Actually it's two new instructions because unlike for the low bits
> >> it matters for the high bits if the source operands are signed or
> >> unsigned.
> >> 
> >> Personally I'm favoring two separate instructions for low and high
> >> bits to not have to deal with multi-destination instructions, but
> >> if someone makes a strong case for one returning both low and high
> >> bits I could be convinced otherwise. I think, though, that two
> >> instructions match most hw very well (with the exception of
> >> software renderers and possibly intel graphics but then a good
> >> backend could certainly recognize this).
> > 
> > Roland,
> > 
> > I don't know about GPU HW, but I think that what you propose will
> > forever prevent decent SSE code generation with LLVM.
> > 
> > Using two separate opcodes for hi/low bits relies on common
> > sub-expression elimination to merge the two multiplication operations
> > back into one.  But I strongly doubt that even LLVM's optimization
> > passes will be able to do that.
> > 
> > Getting the 64-bit result with LLVM will require sign-extending the
> > source arguments (http://llvm.org/docs/LangRef.html#mul-instruction )
> > or SSE intrinsics. Either way, the expressions for the low and high
> > bits will be radically different, so we'll end up with two
> > multiplies -- which I think is simply inadmissible -- TGSI should not
> > stand in the way of backends generating good code.

> You can't generate good code either way; this is a deficiency of the
> sse instruction set.
> As I've outlined in another email, I think the best you can do with
> sse41 is:
> - shuffle both src args (put 2nd/4th elements into 1st/3rd slot)
> - 2xpmuldq/pmuludq for doing the 32x32->64bit mul for both 1st/3rd and
> 2nd/4th element
> - shuffle the high bits into place (I think this needs 3 hw shuffle
> instructions)
> - shuffle the low bits into place (can benefit from shuffles for high
> bits, so just one more shuffle)
> 
> Maybe you can do better with more clever shuffles, but in any case the
> low bits will always require one (at least) additional shuffle.
>
> If you have separate opcodes, everything will be the same, except the
> last step you'll just ignore that shuffle and instead just use the
> pmulld instruction, which will do exactly what you need for the low
> bits. Sure multiplications are more effort for the hw, but hell it even
> has the same throughput on most cpus compared to a shuffle, just latency
> is worse. In any case it would be 8 vs 8 instructions, with just one
> instruction of them very slightly worse. We have much more optimization
> opportunities elsewhere than that (I agree that with sse2, which lacks
> pmulld, it would be worse, but we never particularly cared about that).

That's the thing -- if we have 32x32->64 opcodes we can fine-tune this later. If we stick with separate high-bit opcodes, that ability is lost (at least without coming back and changing TGSI again).
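
For reference, the vectorized sequence quoted above can be sketched roughly as follows for the unsigned case. This is only an illustration under assumptions, not a proposed implementation: the name umul_wide_x4 is made up, it uses pmuludq through the SSE2 intrinsic _mm_mul_epu32 (the signed case would need pmuldq from sse41), it moves the 2nd/4th elements with a 64-bit shift instead of a shuffle, and it falls back to plain scalar code when SSE2 is not available.

```c
#include <assert.h>
#include <stdint.h>

#if defined(__SSE2__)
#include <emmintrin.h>

/* Hypothetical helper: 4 x (32x32 -> 64) unsigned multiplies, returning
 * the low and high 32-bit halves of each product in separate vectors. */
void umul_wide_x4(const uint32_t a[4], const uint32_t b[4],
                  uint32_t lo[4], uint32_t hi[4])
{
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);

    /* Put the 2nd/4th elements into the 1st/3rd slots. */
    __m128i va_odd = _mm_srli_epi64(va, 32);
    __m128i vb_odd = _mm_srli_epi64(vb, 32);

    /* pmuludq multiplies the 1st/3rd 32-bit slots into two 64-bit products. */
    __m128i p02 = _mm_mul_epu32(va, vb);         /* [p0.lo p0.hi p2.lo p2.hi] */
    __m128i p13 = _mm_mul_epu32(va_odd, vb_odd); /* [p1.lo p1.hi p3.lo p3.hi] */

    /* Interleave the halves back into element order. */
    __m128i t0 = _mm_unpacklo_epi32(p02, p13);   /* [p0.lo p1.lo p0.hi p1.hi] */
    __m128i t1 = _mm_unpackhi_epi32(p02, p13);   /* [p2.lo p3.lo p2.hi p3.hi] */
    __m128i vlo = _mm_unpacklo_epi64(t0, t1);    /* all four low halves  */
    __m128i vhi = _mm_unpackhi_epi64(t0, t1);    /* all four high halves */

    _mm_storeu_si128((__m128i *)lo, vlo);
    _mm_storeu_si128((__m128i *)hi, vhi);
}
#else
/* Scalar fallback with the same results, for targets without SSE2. */
void umul_wide_x4(const uint32_t a[4], const uint32_t b[4],
                  uint32_t lo[4], uint32_t hi[4])
{
    for (int i = 0; i < 4; i++) {
        uint64_t p = (uint64_t)a[i] * b[i];
        lo[i] = (uint32_t)p;
        hi[i] = (uint32_t)(p >> 32);
    }
}
#endif

/* Usage example: checks a few products against hand-computed values. */
void umul_wide_x4_selfcheck(void)
{
    const uint32_t a[4] = { 0xFFFFFFFFu, 2u, 0x10000u, 0u };
    const uint32_t b[4] = { 0xFFFFFFFFu, 3u, 0x10000u, 5u };
    uint32_t lo[4], hi[4];

    umul_wide_x4(a, b, lo, hi);
    assert(lo[0] == 0x00000001u && hi[0] == 0xFFFFFFFEu);
    assert(lo[1] == 6u && hi[1] == 0u);
    assert(lo[2] == 0u && hi[2] == 1u);
    assert(lo[3] == 0u && hi[3] == 0u);
}
```

With a single 32x32->64 operation, how the low and high halves get shuffled out of the two wide products is left to the backend. With separate opcodes the low half would come from its own multiply (pmulld on sse41), as the quoted text describes.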

> > 
> > So I strongly think this is a bad idea. TGSI has support for multiple
> > destinations, though we never made much use of it. I see nothing
> > special about it.
> > 
> > If you can prove me wrong -- that LLVM can handle merging the
> > multiplies -- fine.  But I do think we have bigger fish to fry, so
> > I'd prefer we don't put too much time debating this.
> 
> No I doubt llvm can merge it (though in theory nothing would prevent it
> from recognizing the pattern). My guess is it will do scalar extraction,
> and use the imul/mul instructions (which can return 2x32bit numbers even
> on 32bit), then combine the vectors back together (most likely element
> by element). If it actually does it like that, a separate mul for the
> low bits would in fact be a win, because it would save the 4 reinsertions
> of the elements at the cost of just one vector mul (llvm uses pmulld
> just fine). But looking at this that way doesn't really make sense, we
> need instructions which make sense for everybody and aren't specified to
> suit one very peculiar implementation.
> But even if it generates optimal code, fact is that the multiply for
> getting the low bits is essentially noise in the whole instruction
> sequence. And who knows maybe intel will one day add some pmulhd/pmulhud
> instruction (which just plain makes more sense for vector instruction
> sets rather than the weird expanding muls).
> So I really don't see how that will prevent good code from being
> generated. Yes it will be one more multiplication (3 instead of 2 if
> doing everything vectorized) but multiplications are hardly expensive
> these days. We have much, much more important things to care about.
> 
> But I'd like to hear from other driver writers. It looked like for
> radeon and nouveau separate lo/hi instructions would be perfect, but I
> can't be sure. Intel IGPs OTOH always calculate a 64bit result for 32bit
> multiplies using the accumulator, so two instructions would indeed be
> suboptimal - but since it's the same calculation twice an optimizing
> backend should be able to get rid of the extra calc quite easily.

Not as easily as if we had the 32x32->64-bit opcodes.

I really think that having an abstraction where an arithmetic operation is broken into two operations is inherently bad. It unnecessarily imposes assumptions/restrictions on the backends.
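
To make the two shapes under discussion concrete, here is a small scalar sketch in C. All names (wide32, umul_wide, imul_wide, mul_lo, umul_hi, imul_hi) are hypothetical and chosen for illustration only; they are not proposed TGSI opcode names. It also shows the point from the original proposal that the low bits do not depend on signedness while the high bits do.

```c
#include <assert.h>
#include <stdint.h>

/* Shape 1: one operation, two results (low and high halves together). */
typedef struct { uint32_t lo, hi; } wide32;

wide32 umul_wide(uint32_t a, uint32_t b)
{
    uint64_t p = (uint64_t)a * b;          /* zero-extend, multiply once */
    wide32 r = { (uint32_t)p, (uint32_t)(p >> 32) };
    return r;
}

wide32 imul_wide(int32_t a, int32_t b)
{
    /* Sign-extend both operands, then multiply once in 64 bits. */
    uint64_t p = (uint64_t)((int64_t)a * (int64_t)b);
    wide32 r = { (uint32_t)p, (uint32_t)(p >> 32) };
    return r;
}

/* Shape 2: separate operations for the low and the high halves. */
uint32_t mul_lo(uint32_t a, uint32_t b)
{
    return a * b;                          /* same bits for signed/unsigned */
}

uint32_t umul_hi(uint32_t a, uint32_t b)
{
    return (uint32_t)(((uint64_t)a * b) >> 32);
}

uint32_t imul_hi(int32_t a, int32_t b)
{
    return (uint32_t)((uint64_t)((int64_t)a * (int64_t)b) >> 32);
}

/* Usage example: 0xFFFFFFFF is 4294967295 unsigned but -1 signed. */
void wide_mul_selfcheck(void)
{
    wide32 u = umul_wide(0xFFFFFFFFu, 2u);
    wide32 s = imul_wide(-1, 2);

    assert(u.lo == s.lo);                  /* low halves agree */
    assert(u.hi == 1u);                    /* unsigned high half */
    assert(s.hi == 0xFFFFFFFFu);           /* signed high half differs */
    assert(mul_lo(0xFFFFFFFFu, 2u) == u.lo);
    assert(umul_hi(0xFFFFFFFFu, 2u) == u.hi);
    assert(imul_hi(-1, 2) == s.hi);
}
```

In the second shape a consumer that wants both halves calls mul_lo and one of the *_hi functions, so the same product is computed twice unless something merges the two calls back into one.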

Jose