[Mesa-dev] RFC: tgsi opcodes for 32x32 muls with 64bit results

Fri May 3 06:42:02 PDT 2013

On 03.05.2013 06:58, Jose Fonseca wrote:
> 
> 
> ----- Original Message -----
>> Currently, there's no way to get the high bits of a 32x32 
>> signed/unsigned integer multiplication with tgsi. However, all of
>> d3d10, OpenGL, and OpenCL support that, so we need it as well. 
>> There are essentially two ways it could be done:
>> - a 2-destination instruction returning both high and low bits (this
>>   is how it looks in d3d10 and glsl)
>> - use the existing umul for the low bits and have another instruction
>>   for the high bits (this is how it looks in opencl)
>> 
>> Well, there are other possibilities, but these looked like they'd
>> match both APIs and HW reasonably well (with the exception of things
>> like sse2, which would prefer 2x2 32bit inputs and return 2x64bit as
>> one reg...).
>> 
>> Actually it's two new instructions because unlike for the low bits
>> it matters for the high bits if the source operands are signed or
>> unsigned.
>> 
>> Personally I'm favoring two separate instructions for low and high
>> bits to not have to deal with multi-destination instructions, but
>> if someone makes a strong case for one returning both low and high
>> bits I could be convinced otherwise. I think, though, that two
>> instructions match most hw very well (with the exception of
>> software renderers and possibly intel graphics, but then a good
>> backend could certainly recognize this).
> 
> Roland,
> 
> I don't know about GPU HW, but I think that what you propose will
> forever prevent decent SSE code generation with LLVM.
> 
> Using two separate opcodes for hi/low bits relies on common
> sub-expression elimination to merge the two multiplication operations
> back into one.  But I strongly doubt that even LLVM's optimization
> passes will be able to do that.
> 
> Getting the 64bit results with LLVM will require sign-extending the
> source arguments (http://llvm.org/docs/LangRef.html#mul-instruction )
> or SSE intrinsics. Either way, the expressions for the low and high
> bits will be radically different, so we'll end up with two multiplies
> in the end -- which I think is simply inadmissible -- TGSI should not
> stand in the way of backends generating good code.

You can't generate good code either way; this is a deficiency of the sse
instruction set.
As I've outlined in another email, I think the best you can do with
sse41 is:
- shuffle both src args (put 2nd/4th elements into 1st/3rd slot)
- 2x pmuldq/pmuludq for doing the 32x32->64bit mul for both 1st/3rd and
  2nd/4th element
- shuffle the high bits into place (I think this needs 3 hw shuffle
  instructions)
- shuffle the low bits into place (can benefit from the shuffles for the
  high bits, so just one more shuffle)

Maybe you can do better with more clever shuffles, but in any case the
low bits will always require at least one additional shuffle.
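
To make the data flow of the steps above concrete, here is a rough
sketch in portable C. It is only a lane-by-lane model of the unsigned
case, not real SSE code: vec4u, model_pmuludq, model_odd_to_even and
umul_lo_hi_model are names made up for this illustration, and the final
gathers stand in for the hw shuffles without claiming anything about how
many shuffles are actually needed.

```c
#include <stdint.h>

/* A 4 x 32bit vector register, modelled as a plain array (hypothetical type). */
typedef struct { uint32_t lane[4]; } vec4u;

/* Model of pmuludq: multiplies lanes 0 and 2 as unsigned 32bit values and
 * stores each 64bit product as a (low, high) pair in lanes 0/1 and 2/3. */
static vec4u model_pmuludq(vec4u a, vec4u b)
{
    vec4u r;
    uint64_t p0 = (uint64_t)a.lane[0] * b.lane[0];
    uint64_t p2 = (uint64_t)a.lane[2] * b.lane[2];
    r.lane[0] = (uint32_t)p0;
    r.lane[1] = (uint32_t)(p0 >> 32);
    r.lane[2] = (uint32_t)p2;
    r.lane[3] = (uint32_t)(p2 >> 32);
    return r;
}

/* Model of the first shuffle: put the 2nd/4th elements into the 1st/3rd slot. */
static vec4u model_odd_to_even(vec4u v)
{
    vec4u r = {{ v.lane[1], v.lane[1], v.lane[3], v.lane[3] }};
    return r;
}

/* Two expanding multiplies, then gather high and low halves per element. */
static void umul_lo_hi_model(vec4u a, vec4u b, vec4u *lo, vec4u *hi)
{
    vec4u even = model_pmuludq(a, b);                     /* elements 0 and 2 */
    vec4u odd  = model_pmuludq(model_odd_to_even(a),
                               model_odd_to_even(b));     /* elements 1 and 3 */

    /* These gathers stand in for the "shuffle the high bits into place" step. */
    hi->lane[0] = even.lane[1];
    hi->lane[1] = odd.lane[1];
    hi->lane[2] = even.lane[3];
    hi->lane[3] = odd.lane[3];

    /* ... and these for the "shuffle the low bits into place" step. */
    lo->lane[0] = even.lane[0];
    lo->lane[1] = odd.lane[0];
    lo->lane[2] = even.lane[2];
    lo->lane[3] = odd.lane[2];
}
```

The signed case would look the same with pmuldq in place of pmuludq.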

If you have separate opcodes, everything will be the same, except that in
the last step you'll just skip that shuffle and instead use the
pmulld instruction, which will do exactly what you need for the low
bits. Sure, multiplications are more effort for the hw, but hell, it even
has the same throughput as a shuffle on most cpus; just the latency
is worse. In any case it would be 8 vs 8 instructions, with just one
of them very slightly worse. We have many more optimization
opportunities elsewhere than that (I agree that with sse2, which lacks
pmulld, it would be worse, but we never particularly cared about that).
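
For reference, the per-element semantics of the separate-opcode variant
can be sketched in scalar C as below. The function names mul_lo, umul_hi
and imul_hi are invented for this sketch and are not the proposed TGSI
opcode names. It also shows why one low-bits opcode is enough while the
high bits need a signed and an unsigned version.

```c
#include <stdint.h>

/* Low 32 bits of the product. The result is the same bit pattern whether
 * the operands are interpreted as signed or unsigned, so one opcode is
 * enough (this is what a per-lane pmulld gives you). */
static uint32_t mul_lo(uint32_t a, uint32_t b)
{
    return a * b;   /* unsigned arithmetic wraps modulo 2^32 */
}

/* High 32 bits of the unsigned 32x32->64bit product. */
static uint32_t umul_hi(uint32_t a, uint32_t b)
{
    return (uint32_t)(((uint64_t)a * b) >> 32);
}

/* High 32 bits of the signed 32x32->64bit product.
 * Assumption: >> on a negative int64_t is an arithmetic shift, which is
 * implementation-defined in C but holds on the usual compilers. */
static int32_t imul_hi(int32_t a, int32_t b)
{
    return (int32_t)(((int64_t)a * b) >> 32);
}
```

For example, with the bit pattern 0xFFFFFFFF times 2, mul_lo gives
0xFFFFFFFE either way, but umul_hi gives 1 while imul_hi (seeing -1 * 2)
gives -1.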

> 
> So I strongly think this is a bad idea. TGSI has support for multiple
> destinations, though we never made much use of it. I see nothing
> special about it.
> 
> If you can prove me wrong -- that LLVM can handle merging the
> multiplies -- fine.  But I do think we have bigger fish to fry, so
> I'd prefer we don't put too much time into debating this.

No, I doubt llvm can merge it (though in theory nothing would prevent it
from recognizing the pattern). My guess is it will do scalar extraction
and use the imul/mul instructions (which can return 2x32bit numbers even
on 32bit), then combine the vectors back together (most likely element
by element). If it actually does it like that, a separate mul for the
low bits would in fact be a win, because it would save the 4 reinsertions
of the elements at the cost of just one vector mul (llvm uses pmulld
just fine). But looking at it that way doesn't really make sense; we
need instructions which make sense for everybody and aren't specified to
suit one very peculiar implementation.
But even if it generates optimal code, the fact is that the multiply for
getting the low bits is essentially noise in the whole instruction
sequence. And who knows, maybe intel will one day add some pmulhd/pmulhud
instruction (which just plain makes more sense for vector instruction
sets than the weird expanding muls).
So I really don't see how that will prevent good code from being
generated. Yes, it will be one more multiplication (3 instead of 2 if
doing everything vectorized), but multiplications are hardly expensive
these days. We have much, much more important things to care about.

But I'd like to hear from other driver writers. It looked like separate
lo/hi instructions would be perfect for radeon and nouveau, but I
can't be sure. Intel IGPs OTOH always calculate a 64bit result for 32bit
multiplies using the accumulator, so two instructions would indeed be
suboptimal - but since it's the same calculation twice, an optimizing
backend should be able to get rid of the extra calc quite easily.

Roland