[Mesa-dev] ARB_gs5 new instruction support in gallium

Mon Apr 21 13:36:36 PDT 2014

On 21.04.2014 21:10, Ilia Mirkin wrote:
> On Mon, Apr 21, 2014 at 2:52 PM, Roland Scheidegger <sroland at vmware.com> wrote:
>> On 21.04.2014 17:54, Ilia Mirkin wrote:
>>> Hello,
>>>
>>> I've been giving some thought to catching up with core mesa on ARB_gs5
>>> support. One of the things that ARB_gs5 introduces is a set of new
>>> operations:
>>>
>>>       genType frexp(genType x, out genIType exp);
>>>       genType ldexp(genType x, in genIType exp);
>>>
>>>       genIType bitfieldExtract(genIType value, int offset, int bits);
>>>       genUType bitfieldExtract(genUType value, int offset, int bits);
>>>
>>>       genIType bitfieldInsert(genIType base, genIType insert, int offset,
>>>                               int bits);
>>>       genUType bitfieldInsert(genUType base, genUType insert, int offset,
>>>                               int bits);
>>>
>>>       genIType bitfieldReverse(genIType value);
>>>       genUType bitfieldReverse(genUType value);
>>>
>>>       genIType bitCount(genIType value);
>>>       genIType bitCount(genUType value);
>>>
>>>       genIType findLSB(genIType value);
>>>       genIType findLSB(genUType value);
>>>
>>>       genIType findMSB(genIType value);
>>>       genIType findMSB(genUType value);
>>>
>>>       genUType uaddCarry(genUType x, genUType y, out genUType carry);
>>>       genUType usubBorrow(genUType x, genUType y, out genUType borrow);
>>>
>>>       void umulExtended(genUType x, genUType y, out genUType msb,
>>>                         out genUType lsb);
>>>       void imulExtended(genIType x, genIType y, out genIType msb,
>>>                         out genIType lsb);
>>>
>>> (I've skipped the packing stuff since that seems to already be
>>> supported/lowered elsewhere, i2f/f2i which is already handled, and the
>>> texture gather stuff, for which support already exists. And the
>>> interpolateAt* stuff which isn't supported by core mesa yet, and when
>>> it is, will require a very different kind of handling than the above.)
>>>
>>> I guess the only drivers one really needs to worry about here are
>>> r600/radeonsi and nouveau. svga is largely a passthrough afaik, and
>>> llvmpipe/softpipe is software and can thus implement it however it
>>> wants.
>>>
>>> Looking at the nvc0+ shader ISA, there are instructions to directly
>>> handle all the bitfield stuff (bitfieldExtract, bitfieldInsert,
>>> bitfieldReverse, bitCount, findLSB, findMSB). There is also a "mul
>>> high", which is what the *mulExtended stuff gets translated into.
>>>
>>> There are no instructions to handle frexp/ldexp, or the add carry/sub
>>> borrow stuff. (Looking at the code the blob generates, they just do
>>> all that "by hand". Even though there is a "set cc" flag on those
>>> instructions which one might assume has the carry. But the blob didn't
>>> use it.)
>>>
>>> So I was thinking that we could just take the relevant SM5
>>> instructions and lower the rest. Specifically, these would be the new
>>> opcodes:
>>>
>>> IBFE
>>> UBFE
>>> BFI
>>> BREV (not BFREV since most instructions appear to be 3/4 letters)
>>> POPC (shorter than "countbits")
>>> LSB
>>> UMSB
>>> IMSB
>>> IMULHI
>> We already have imul_hi.
> 
> Yeah, I noticed that after I sent it out. Only llvmpipe (and perhaps
> softpipe) supports it though, based on a quick grep. And nothing emits
> it (although presumably the vmware d3d10 st makes use of it).
> 
>>
>>>
>>> I just took a look at the Radeon SI ISA, and it does seem like it has
>>> ldexp/frexp instructions, as well as setting the carry flag for
>>> addc/subb. Although since TGSI doesn't have flags or multiple
>>> destinations, not sure how the latter 2 could be easily encoded in the
>>> glsl->tgsi translation.
>> It is not entirely true that tgsi doesn't support multiple destinations.
>> The token format allows 0-3 destinations. But so far instructions with
>> more than one destination do not exist. There was some discussion about
>> it when we needed umul_hi/imul_hi (since these are also multiple
>> destination sm4 instructions) but deemed it not worth it, partly also
>> because it didn't look like (most) gpus could actually benefit from this
>> being just 1 instruction instead of two (that is, it would emit the same
>> 2 instructions for the low and high part of the mul anyway). Mostly
>> because gpus (and cpus) usually follow the model of multiple 32bit
>> sources in, one 32bit dst out. Obviously the accumulator of intel gpus
>> is an exception there.
>> So, you could follow that same model with subb/addc - use the existing
>> sub/add and just use a new instruction for the borrow/carry part (though
>> it looks like if you do it with two instructions anyway, you could just
>> use an existing instruction for the carry/borrow part). But if gpus
>> actually can set two regs simultaneously (or otherwise benefit from this
>> being one instruction without having to "reassemble" it, for instance
>> with special carry flags), then it might be better to actually use
>> multi-dest instructions. Most likely, because this hasn't been used at
>> all until now, it will break in some places, but there should not be
>> anything major preventing this from working.
> 
> You're still going to have to reassemble it one way or another --
> either detecting UADD/ADDC combinations, or UADD/USLT combinations.
> Might as well use the more general one, no? (And a similar combo can
> be used for SUBB, I think.)
Yes, if you use two instructions.
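
As a rough sketch of the two-instruction approach discussed above, written
in plain C rather than TGSI: the sum comes from an ordinary unsigned add,
and the carry from an unsigned less-than compare on the result (the
UADD/USLT pairing). The helper names are hypothetical and are not Mesa or
TGSI API; the borrow case is assumed to work the same way with a compare
on the inputs.

```c
#include <stdint.h>

/* Hypothetical helpers for illustration only; not part of Mesa or TGSI. */

/* uaddCarry as two operations: an unsigned add (wraps modulo 2^32),
 * then an unsigned less-than compare. The add overflowed exactly when
 * the wrapped sum is smaller than one of the inputs. */
static uint32_t uadd_carry(uint32_t x, uint32_t y, uint32_t *carry)
{
    uint32_t sum = x + y;   /* the "UADD" part */
    *carry = sum < x;       /* the "USLT" part: 1 on overflow, else 0 */
    return sum;
}

/* usubBorrow as two operations: an unsigned subtract, then a compare.
 * A borrow is needed exactly when x is smaller than y. */
static uint32_t usub_borrow(uint32_t x, uint32_t y, uint32_t *borrow)
{
    *borrow = x < y;        /* 1 if the subtraction wraps, else 0 */
    return x - y;           /* wraps modulo 2^32 */
}
```

A backend that does have a hardware carry flag would need to recognize
this add-plus-compare pattern to make use of it.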
> 
> Having real multiple outputs will be useful if anyone wants to pipe
> FREXP all the way through -- that'll be a bit awkward to do as 2
> opcodes. Since nvc0 doesn't support it, I won't be losing sleep over
> it :)
Well in theory it doesn't look awkward at all to me as 2 instructions -
one just returns the mantissa, the other the exponent. As far as I can
tell, this is exactly what radeonsi would do.
(Older radeons it seems do not support frexp, or rather they only
support it for doubles - there indeed it is one instruction returning 2
results, as it is using 4 slots out of the 4 (or 5) vliw slots.)
I guess though the things which can't be lowered reasonably would be
more important to implement.
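
To illustrate the "2 instructions" view of frexp, here is a minimal C
sketch that splits the standard frexpf() into one function returning only
the mantissa and one returning only the exponent. The two function names
are hypothetical, chosen for this example; they are not TGSI opcodes or
radeonsi instructions.

```c
#include <math.h>

/* Hypothetical split of frexp into two single-result operations. */

/* Returns only the mantissa, in [0.5, 1) for finite non-zero x. */
static float frexp_mantissa(float x)
{
    int exp_unused;
    return frexpf(x, &exp_unused);
}

/* Returns only the exponent, such that x == mantissa * 2^exponent. */
static int frexp_exponent(float x)
{
    int exp;
    (void)frexpf(x, &exp);
    return exp;
}
```

Feeding both results back into ldexpf() reconstructs the original value,
so nothing is lost by having one destination per operation.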


> 
>>
>>
>>>
>>> Thoughts/opinions before I go and implement the above? Is someone else
>>> already working on this?
>> I think this looks good overall. We're getting close to the max number
>> of different instructions though (256), but if that should become a
>> problem we can easily ditch some (or double the max number by killing a bit
>> from max number of sources - 0-15 sources is not useful, 0-7 would still
>> be more than enough).
> 
> I didn't realize there was a max instruction quantity, but these will
> have to be added one way or another if gallium is to support GL4.0 :)
> There's also the Double ISA which appears to be documented but not
> actually in p_shader_tokens.h, which will take up a whole bunch of
> opcodes as well.
Yes indeed.
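
The arithmetic behind the opcode limit can be sketched as follows. The
widths used are the ones implied above (256 opcodes means an 8-bit field,
0-15 sources means a 4-bit field); the helper names are hypothetical, and
the real token layout lives in p_shader_tokens.h.

```c
#include <stdint.h>

/* Field widths implied by the discussion; names are illustrative only. */
enum {
    OPCODE_BITS_NOW     = 8,  /* 256 distinct opcodes */
    NUM_SRC_BITS_NOW    = 4,  /* source count 0-15 */
    OPCODE_BITS_WIDER   = 9,  /* after moving one bit to the opcode */
    NUM_SRC_BITS_NARROW = 3   /* source count 0-7 */
};

/* Number of distinct values an unsigned field of 'bits' bits can hold. */
static uint32_t field_values(unsigned bits)
{
    return 1u << bits;
}

/* Largest value an unsigned field of 'bits' bits can hold. */
static uint32_t field_max(unsigned bits)
{
    return (1u << bits) - 1u;
}
```

Moving one bit from the source count to the opcode keeps the token the
same size while doubling the opcode space.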

> 
> In any case, I'm going to take a stab at implementing these and piping
> them through to nvc0 after I finish up ARB_sample_shading (coming soon
> to a patch near you).
> 
>   -ilia
> 

Roland