[Mesa-dev] RFC: tgsi opcodes for 32x32 muls with 64bit results

Christoph Bumiller e0425955 at student.tuwien.ac.at
Fri May 3 07:45:27 PDT 2013


On 03.05.2013 16:32, Jose Fonseca wrote:
>
> ----- Original Message -----
>> On 03.05.2013 06:58, Jose Fonseca wrote:
>>>
>>> ----- Original Message -----
>>>> Currently, there's no way to get the high bits of a 32x32
>>>> signed/unsigned integer multiplication with tgsi. However, all of
>>>> d3d10, OpenGL, and OpenCL support that, so we need it as well.
>>>> There are essentially two ways it could be done:
>>>> - a 2-destination instruction returning both high and low bits (this
>>>>   is how it looks in d3d10 and glsl)
>>>> - use the existing umul for the low bits and have another instruction
>>>>   for the high bits (this is how it looks in opencl)
>>>>
>>>> Well, there are other possibilities, but these looked like they'd
>>>> match both APIs and HW reasonably (with the exception of things like
>>>> sse2, which would prefer 2x2 32bit inputs and return 2x64bit as one
>>>> reg...).
>>>>
>>>> Actually it's two new instructions because unlike for the low bits
>>>> it matters for the high bits if the source operands are signed or
>>>> unsigned.
>>>>
>>>> Personally I'm favoring two separate instructions for low and high
>>>> bits to not have to deal with multi-destination instructions, but
>>>> if someone makes a strong case for one returning both low and high
>>>> bits I could be convinced otherwise. I think though two
>>>> instructions matches most hw very well (with the exception of
>>>> software renderers and possibly intel graphics but then a good
>>>> backend could certainly recognize this).
>>> Roland,
>>>
>>> I don't know about GPU HW, but I think that what you propose will
>>> forever prevent decent SSE code generation with LLVM.
>>>
>>> Using two separate opcodes for hi/low bits relies on common
>>> sub-expression elimination to merge the two multiplication operations
>>> back into one.  But I strongly doubt that even LLVM's optimization
>>> passes will be able to do that.
>>>
>>> Getting the 64bit results with LLVM will require sign-extending the
>>> source arguments (http://llvm.org/docs/LangRef.html#mul-instruction )
>>> or SSE intrinsics. Either way, the expressions for the low and high
>>> bits will be radically different, so we'll end up with two multiplies
>>> -- which I think is simply inadmissible -- TGSI should not
>>> stand in the way of backends generating good code.
>> You can't generate good code either way, this is a deficiency of sse
>> instruction set.
>> As I've outlined in another email, I think the best you can do with
>> sse41 is:
>> - shuffle both src args (put 2nd/4th elements into 1st/3rd slot)
>> - 2xpmuldq/pmuludq for doing the 32x32->64bit mul for both 1st/3rd and
>> 2nd/4th element
>> - shuffle the high bits into place (I think this needs 3 hw shuffle
>> instructions)
>> - shuffle the low bits into place (can benefit from shuffles for high
>> bits, so just one more shuffle)
>>
>> Maybe you can do better with more clever shuffles, but in any case the
>> low bits will always require one (at least) additional shuffle.
>>
>> If you have separate opcodes, everything will be the same, except the
>> last step you'll just ignore that shuffle and instead just use the
>> pmulld instruction, which will do exactly what you need for the low
>> bits. Sure multiplications are more effort for the hw, but hell it even
>> has the same throughput on most cpus compared to a shuffle, just latency
>> is worse. In any case it would be 8 vs 8 instructions, with just one
>> instruction of them very slightly worse. We have much more optimization
>> opportunities elsewhere than that (I agree that with sse2, which lacks
>> pmulld, it would be worse, but we never particularly cared about that).
> That's the thing -- if we have 32x32->64 opcodes we can fine tune this later. If we stick with separate high bit opcodes then that ability is lost (at least without coming back and changing TGSI again).
>
>>> So I strongly think this is a bad idea. TGSI has support for multiple
>>> destinations, though we never made much use of it. I see nothing
>>> special about it.
>>>
>>> If you can prove me wrong -- that LLVM can handle merging the
>>> multiplies -- fine.  But I do think we have bigger fish to fry, so
>>> I'd prefer we don't put too much time debating this.
>> No I doubt llvm can merge it (though in theory nothing would prevent it
>> from recognizing the pattern). My guess is it will do scalar extraction,
>> and use the imul/mul instructions (which can return 2x32bit numbers even
>> on 32bit), then combine the vectors back together (most likely element
>> by element). If it actually does it like that, a separate mul for the
>> low bits would in fact be a win, because it would save the 4 reinsertions
>> of the elements at the cost of just one vector mul (llvm uses pmulld
>> just fine). But looking at this that way doesn't really make sense, we
>> need instructions which make sense for everybody and aren't specified to
>> suit one very peculiar implementation.
>> But even if it generates optimal code, fact is that the multiply for
>> getting the low bits is essentially noise in the whole instruction
>> sequence. And who knows maybe intel will one day add some pmulhd/pmulhud
>> instruction (which just plain makes more sense for vector instruction
>> sets rather than the weird expanding muls).
>> So I really don't see how that will prevent good code from being
>> generated. Yes it will be one more multiplication (3 instead of 2 if
>> doing everything vectorized) but multiplications are hardly expensive
>> these days. We have much, much more important things to care about.
>>
>> But I'd like to hear from other driver writers. It looked like for
>> radeon and nouveau separate lo/hi instructions would be perfect, but I
>> can't be sure. Intel IGPs OTOH always calculate a 64bit result for 32bit
>> multiplies using the accumulator, so two instructions would indeed be
>> suboptimal - but since it's the same calculation twice an optimizing
>> backend should be able to get rid of the extra calc quite easily.
> Not as easy as if we have the 32x32->64bits.
>
>
> I really think that having an abstraction where an arithmetic operation is broken into two operations is inherently bad.  It is unnecessarily imposing assumptions/restrictions on the backends.

I think I'd rather have 2 destination registers on 1 instruction for
this reason. Splitting into 2 instructions at the driver backend level
is much simpler than reassembling a 64 bit integer from 2 separate
instructions later.
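
To illustrate why the split direction is the easy one, here is a rough
sketch in plain C. The function names are made up for illustration only
(they are not proposed opcode names), and scalar C stands in for whatever
a backend would really emit. One wide multiply gives both halves, so a
backend with separate lo/hi instructions just takes one half each. The
low word is the same for signed and unsigned sources; only the high word
differs.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper names, for illustration only. */

/* Unsigned 32x32 -> 64: one multiply, two 32-bit destinations. */
static void umul_wide(uint32_t a, uint32_t b, uint32_t *lo, uint32_t *hi)
{
    uint64_t p = (uint64_t)a * b;
    *lo = (uint32_t)p;
    *hi = (uint32_t)(p >> 32);
}

/* Signed variant: sign-extend the sources, then split the same way. */
static void imul_wide(int32_t a, int32_t b, uint32_t *lo, uint32_t *hi)
{
    /* Convert to unsigned before shifting to avoid shifting a negative value. */
    uint64_t p = (uint64_t)((int64_t)a * (int64_t)b);
    *lo = (uint32_t)p;
    *hi = (uint32_t)(p >> 32);
}

/* What a backend with separate instructions would emit instead. */
static uint32_t mul_lo(uint32_t a, uint32_t b)
{
    return a * b; /* wraps modulo 2^32, identical for signed and unsigned */
}

static uint32_t umul_hi(uint32_t a, uint32_t b)
{
    return (uint32_t)(((uint64_t)a * b) >> 32);
}

static uint32_t imul_hi(int32_t a, int32_t b)
{
    return (uint32_t)((uint64_t)((int64_t)a * (int64_t)b) >> 32);
}

/* Usage check: same bit patterns, low words agree, high words differ. */
static void mul_wide_example(void)
{
    uint32_t lo, hi;

    umul_wide(0xFFFFFFFFu, 2u, &lo, &hi);  /* 4294967295 * 2 */
    assert(lo == 0xFFFFFFFEu && hi == 1u);

    imul_wide(-1, 2, &lo, &hi);            /* -1 * 2 = -2 */
    assert(lo == 0xFFFFFFFEu && hi == 0xFFFFFFFFu);
}
```

Going the other way (recognizing that a mul_lo and a umul_hi somewhere
else in the shader share their sources and can be fused) needs pattern
matching across instructions, which is the part I'd rather not push onto
every backend.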

The question is how to distribute the result: either low parts to DST[0]
and high parts to DST[1], or low parts to DST[0,1].x,z and high parts to
DST[0,1].y,w. The latter would match how we treat other 64 bit values
right now (doubles/float64).
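
A small sketch of the two layouts, again in plain C with made-up names.
The vec4u type and both function names are hypothetical. For the second
layout I'm assuming that source channel i ends up in the i-th 64 bit
slot (x -> DST[0].xy, y -> DST[0].zw, z -> DST[1].xy, w -> DST[1].zw);
that ordering is my assumption and not something already specified.

```c
#include <assert.h>
#include <stdint.h>

/* One 4-channel register of 32-bit values: c[0..3] = x, y, z, w. */
typedef struct { uint32_t c[4]; } vec4u;

/* Layout 1: DST[0] holds the four low words, DST[1] the four high words. */
static void umul_wide_split(const vec4u *a, const vec4u *b, vec4u dst[2])
{
    for (int i = 0; i < 4; i++) {
        uint64_t p = (uint64_t)a->c[i] * b->c[i];
        dst[0].c[i] = (uint32_t)p;
        dst[1].c[i] = (uint32_t)(p >> 32);
    }
}

/*
 * Layout 2: each 64-bit result fills a channel pair, low word in x/z and
 * high word in y/w, like doubles. Assumption: source channel i goes to
 * 64-bit slot i, so channels x,y land in DST[0] and z,w in DST[1].
 */
static void umul_wide_paired(const vec4u *a, const vec4u *b, vec4u dst[2])
{
    for (int i = 0; i < 4; i++) {
        uint64_t p = (uint64_t)a->c[i] * b->c[i];
        dst[i / 2].c[(i % 2) * 2]     = (uint32_t)p;
        dst[i / 2].c[(i % 2) * 2 + 1] = (uint32_t)(p >> 32);
    }
}

/* Usage check on one input pair, comparing where the words end up. */
static void layout_example(void)
{
    const vec4u a = {{0xFFFFFFFFu, 2u, 0x80000000u, 0x10000u}};
    const vec4u b = {{2u, 3u, 4u, 0x10000u}};
    vec4u d[2];

    umul_wide_split(&a, &b, d);
    assert(d[0].c[0] == 0xFFFFFFFEu && d[1].c[0] == 1u); /* x: lo, hi */
    assert(d[0].c[2] == 0u && d[1].c[2] == 2u);          /* z: lo, hi */

    umul_wide_paired(&a, &b, d);
    assert(d[0].c[0] == 0xFFFFFFFEu && d[0].c[1] == 1u); /* x -> DST[0].xy */
    assert(d[1].c[0] == 0u && d[1].c[1] == 2u);          /* z -> DST[1].xy */
}
```

With the first layout a driver that only wants the high words can simply
ignore DST[0]; with the second it has to pick every other channel, but
the result can be consumed directly by anything that already expects the
double-style packing.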

>
> Jose
