[Mesa-dev] RFC: tgsi opcodes for 32x32 muls with 64bit results

Roland Scheidegger sroland at vmware.com
Fri May 3 08:27:55 PDT 2013


On 03.05.2013 16:45, Christoph Bumiller wrote:
> On 03.05.2013 16:32, Jose Fonseca wrote:
>>
>> ----- Original Message -----
>>> On 03.05.2013 06:58, Jose Fonseca wrote:
>>>>
>>>> ----- Original Message -----
>>>>> Currently, there's no way to get the high bits of a 32x32
>>>>> signed/unsigned integer multiplication with tgsi. However, all of
>>>>> d3d10, OpenGL, and OpenCL support that, so we need it as well.
>>>>> There are essentially two ways it could be done:
>>>>> - a 2-destination instruction returning both high and low bits
>>>>>   (this is how it looks in d3d10 and glsl)
>>>>> - use the existing umul for the low bits and have another
>>>>>   instruction for the high bits (this is how it looks in opencl)
>>>>>
>>>>> Well, there are other possibilities, but these looked like they'd
>>>>> match both APIs and HW reasonably (well, with the exception of
>>>>> things like sse2, which would prefer 2x2 32bit inputs and return
>>>>> 2x64bit as one reg...).
>>>>>
>>>>> Actually it's two new instructions because unlike for the low bits
>>>>> it matters for the high bits if the source operands are signed or
>>>>> unsigned.
>>>>>
>>>>> Personally I'm favoring two separate instructions for low and high
>>>>> bits to not have to deal with multi-destination instructions, but
>>>>> if someone makes a strong case for one returning both low and high
>>>>> bits I could be convinced otherwise. I think though two
>>>>> instructions matches most hw very well (with the exception of
>>>>> software renderers and possibly intel graphics but then a good
>>>>> backend could certainly recognize this).
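
For reference, a minimal per-channel sketch in C of what the separate
low/high variant above computes. The helper names mul_lo, umul_hi and
imul_hi are hypothetical and are not proposed TGSI opcode names. It only
illustrates why one opcode is enough for the low bits while the high bits
need a signed and an unsigned version:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helpers for illustration only, not TGSI opcode names. */

/* Low 32 bits: identical for signed and unsigned operands, because the
 * product is simply taken modulo 2^32. */
static uint32_t mul_lo(uint32_t a, uint32_t b)
{
    return (uint32_t)(a * b);
}

/* High 32 bits of the unsigned 32x32->64 product. */
static uint32_t umul_hi(uint32_t a, uint32_t b)
{
    return (uint32_t)(((uint64_t)a * (uint64_t)b) >> 32);
}

/* High 32 bits of the signed 32x32->64 product, returned as a raw bit
 * pattern. The operands are sign-extended before the multiply. */
static uint32_t imul_hi(int32_t a, int32_t b)
{
    int64_t p = (int64_t)a * (int64_t)b;
    return (uint32_t)((uint64_t)p >> 32);
}

/* Same bit patterns in, same low bits out, different high bits. */
static void check_mul_examples(void)
{
    assert(mul_lo(0xFFFFFFFEu, 3u) == 0xFFFFFFFAu); /* also the low bits of -2 * 3 = -6 */
    assert(umul_hi(0xFFFFFFFEu, 3u) == 2u);         /* 0xFFFFFFFE * 3 = 0x2FFFFFFFA */
    assert(imul_hi(-2, 3) == 0xFFFFFFFFu);          /* -6 sign-extended to 64 bits */
}
```

A single 2-destination instruction would produce the low and high values
from one 64bit product instead of two separate expressions.
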
>>>> Roland,
>>>>
>>>> I don't know about GPU HW, but I think that what you propose will
>>>> forever prevent decent SSE code generation with LLVM.
>>>>
>>>> Using two separate opcodes for hi/low bits relies on common
>>>> sub-expression elimination to merge the two multiplication operations
>>>> back into one.  But I strongly doubt that even LLVM's optimization
>>>> passes will be able to do that.
>>>>
>>>> Getting the 64bit results with LLVM will require sign-extending the
>>>> source arguments (http://llvm.org/docs/LangRef.html#mul-instruction )
>>>> or SSE intrinsics. Either way, the expressions for the low and high
>>>> bits will be radically different, so we'll end up with two multiplies
>>>> in the end -- which I think is simply inadmissible -- TGSI should not
>>>> stand in the way of backends generating good code.
>>> You can't generate good code either way, this is a deficiency of sse
>>> instruction set.
>>> As I've outlined in another email, I think the best you can do with
>>> sse41 is:
>>> - shuffle both src args (put 2nd/4th elements into 1st/3rd slot)
>>> - 2xpmuldq/pmuludq for doing the 32x32->64bit mul for both 1st/3rd and
>>> 2nd/4th element
>>> - shuffle the high bits into place (I think this needs 3 hw shuffle
>>> instructions)
>>> - shuffle the low bits into place (can benefit from shuffles for high
>>> bits, so just one more shuffle)
>>>
>>> Maybe you can do better with more clever shuffles, but in any case the
>>> low bits will always require one (at least) additional shuffle.
>>>
>>> If you have separate opcodes, everything will be the same, except the
>>> last step you'll just ignore that shuffle and instead just use the
>>> pmulld instruction, which will do exactly what you need for the low
>>> bits. Sure multiplications are more effort for the hw, but hell it even
>>> has the same throughput on most cpus compared to a shuffle, just latency
>>> is worse. In any case it would be 8 vs 8 instructions, with just one
>>> instruction of them very slightly worse. We have much more optimization
>>> opportunities elsewhere than that (I agree that with sse2, which lacks
>>> pmulld, it would be worse, but we never particularly cared about that).
>> That's the thing -- if we have 32x32->64 opcodes we can fine tune this later. If we stick with separate high bit opcodes then that ability is lost (at least without coming back and changing TGSI again).
>>
>>>> So I strongly think this is a bad idea. TGSI has support for multiple
>>>> destinations, though we never made much use of it. I see nothing
>>>> special about it.
>>>>
>>>> If you can prove me wrong -- that LLVM can handle merging the
>>>> multiplies -- fine.  But I do think we have bigger fish to fry, so
>>>> I'd prefer we don't put too much time into debating this.
>>> No I doubt llvm can merge it (though in theory nothing would prevent it
>>> from recognizing the pattern). My guess is it will do scalar extraction,
>>> and use the imul/mul instructions (which can return 2x32bit numbers even
>>> on 32bit), then combine the vectors back together (most likely element
>>> by element). If it actually does it like that, a separate mul for the
>>> low bits would in fact be a win, because it would save the 4 reinsertions
>>> of the elements at the cost of just one vector mul (llvm uses pmulld
>>> just fine). But looking at this that way doesn't really make sense, we
>>> need instructions which make sense for everybody and aren't specified to
>>> suit one very peculiar implementation.
>>> But even if it generates optimal code, fact is that the multiply for
>>> getting the low bits is essentially noise in the whole instruction
>>> sequence. And who knows, maybe intel will one day add some pmulhd/pmulhud
>>> instruction (which just plain makes more sense for vector instruction
>>> sets than the weird expanding muls).
>>> So I really don't see how that will prevent good code from being
>>> generated. Yes it will be one more multiplication (3 instead of 2 if
>>> doing everything vectorized) but multiplications are hardly expensive
>>> these days. We have much, much more important things to care about.
>>>
>>> But I'd like to hear from other driver writers. It looked like for
>>> radeon and nouveau separate lo/hi instructions would be perfect, but I
>>> can't be sure. Intel IGPs OTOH always calculate a 64bit result for 32bit
>>> multiplies using the accumulator, so two instructions would indeed be
>>> suboptimal - but since it's the same calculation twice an optimizing
>>> backend should be able to get rid of the extra calc quite easily.
>> Not as easy as if we have the 32x32->64bits.
>>
>>
>> I really think that having an abstraction where an arithmetic operation is broken into two operations is inherently bad.  It is unnecessarily imposing assumptions/restrictions on the backends.
> 
> I think I'd rather have 2 destination registers on 1 instruction for
> this reason. Splitting into 2 instructions at the driver backend level
> is much simpler than reassembling a 64 bit integer from 2 separate
> instructions later.
> 
> The question is how to distribute the result. Low parts to DST[0] and
> high parts to DST[1] or low parts to DST[0,1].x,z and high parts to
> DST[0,1].y,w. The latter would match how we treat other 64 bit values
> right now (doubles/float64).

Such an output order would indeed match sse2 much better too, but it
would be useless, since all apis I know of (d3d10, glsl, opencl) require
separate low and high results anyway.
(The difference with doubles being that you have 2x64bit inputs there
too, so this would correspond more with returning low parts of a
64x64bit multiply. But we don't have 64bit integer operations.)
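
To make the two candidate layouts concrete, here is a rough C sketch for
one 4-channel unsigned multiply. The vec4u type and both function names
are made up for illustration, and the mapping of source channels to
destination slots in the interleaved variant is my assumption of what
"low parts to DST[0,1].x,z and high parts to DST[0,1].y,w" would mean:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical 4-channel register: c[0..3] = x, y, z, w. */
typedef struct { uint32_t c[4]; } vec4u;

/* Layout 1: low parts to DST[0], high parts to DST[1]. */
static void umul64_split(const vec4u *a, const vec4u *b,
                         vec4u *dst0, vec4u *dst1)
{
    for (int i = 0; i < 4; i++) {
        uint64_t p = (uint64_t)a->c[i] * (uint64_t)b->c[i];
        dst0->c[i] = (uint32_t)p;
        dst1->c[i] = (uint32_t)(p >> 32);
    }
}

/* Layout 2 (assumed mapping): each 64bit product occupies two adjacent
 * channels, low part in x or z, high part in y or w. Products 0 and 1
 * go to DST[0], products 2 and 3 go to DST[1]. */
static void umul64_interleaved(const vec4u *a, const vec4u *b,
                               vec4u *dst0, vec4u *dst1)
{
    vec4u *dst[2] = { dst0, dst1 };
    for (int i = 0; i < 4; i++) {
        uint64_t p = (uint64_t)a->c[i] * (uint64_t)b->c[i];
        dst[i / 2]->c[(i % 2) * 2]     = (uint32_t)p;
        dst[i / 2]->c[(i % 2) * 2 + 1] = (uint32_t)(p >> 32);
    }
}

static void check_layout_examples(void)
{
    vec4u a = { { 0xFFFFFFFFu, 2u, 0x80000000u, 0x10000u } };
    vec4u b = { { 0xFFFFFFFFu, 3u, 2u, 0x10000u } };
    vec4u lo, hi, d0, d1;

    umul64_split(&a, &b, &lo, &hi);
    umul64_interleaved(&a, &b, &d0, &d1);

    /* The interleaved result holds the same values in different slots. */
    for (int i = 0; i < 4; i++) {
        const vec4u *d = (i < 2) ? &d0 : &d1;
        assert(d->c[(i % 2) * 2] == lo.c[i]);
        assert(d->c[(i % 2) * 2 + 1] == hi.c[i]);
    }
}
```

With the interleaved layout, a state tracker that wants separate low and
high vectors would have to shuffle the result apart again afterwards.
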

Roland
