[Mesa-dev] prep work for 64-bit integer support

Ilia Mirkin imirkin at alum.mit.edu
Thu Jun 9 20:28:30 UTC 2016


On Thu, Jun 9, 2016 at 4:11 PM, Ian Romanick <idr at freedesktop.org> wrote:
> On 06/09/2016 11:26 AM, Ilia Mirkin wrote:
>> On Thu, Jun 9, 2016 at 2:07 PM, Ian Romanick <idr at freedesktop.org> wrote:
>>> On 06/08/2016 02:15 PM, Dave Airlie wrote:
>>>> While writing ARB_gpu_shader_int64 I realised I needed to change
>>>> a lot of existing checks for doubles to 64bit, so I decided to
>>>> do that as much in advance as possible.
>>>
>>> I didn't know you were working on that.  I just started poking at more
>>> general sized integer support too.  I wanted to add support for 8, 16,
>>> and 64-bit types.
>>
>> Might be worth noting that NVIDIA has some support for "SIMD"
>> operations on 16- and 8-bit sized values packed in a 32-bit integer.
>> You can see what operations are supported by looking up "video
>> instructions" in the PTX ISA - those roughly map 1:1 with the
>> hardware. However, I've never seen the NVIDIA blob actually generate
>> them, even with NV_gpu_shader5's u8vec4 and such. I don't know how this
>> changes on Pascal, which is rumored to support fp16 ALU natively.
>
> Have you tried feeding it PTX directly?  It could just be a limitation
> of the GLSL compiler.

I haven't, although I suspect that if I hand-write a particular
instruction in PTX, it will get converted to the proper ISA encoding
and emitted, since they really did map 1:1 last I looked. I was more
surprised that u8vec4 + u8vec4 didn't end up using it, and instead did
the adds as 4x32-bit and then re-extracted the low 8 bits. Perhaps
NVIDIA knows something I don't, or perhaps like you say, their GLSL
compiler is just not smart enough to do it. Or perhaps something about
that specific case made them decide against it, and a different case
would have used it (probably down to instruction latencies, dual-issue
capabilities, etc.).
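
For reference, the lowered sequence is roughly equivalent to this GLSL --
a sketch of the per-byte fallback rather than the blob's exact output,
and the helper name is made up:

    // Add two "u8vec4"s packed into 32-bit uints, one byte at a time:
    // each lane is widened to 32 bits, added, then masked back to 8 bits,
    // i.e. four adds plus shifting/masking instead of one packed op.
    uint add_packed_u8x4(uint a, uint b)
    {
        uint r = 0u;
        for (int i = 0; i < 4; i++) {
            uint shift = uint(i) * 8u;
            r |= (((a >> shift) + (b >> shift)) & 0xFFu) << shift;
        }
        return r;
    }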

I had originally suggested this feature to the Dolphin team, who have
a ton of u8s in their shaders that they constantly bit-mask and clamp,
but when I saw what the blob was going to do with those, I withdrew
the suggestion.

>
>>> What's your hardware support plan?  I think that any hardware that can
>>> do uaddCarry, usubBorrow, [ui]mulExtended, and findMSB can implement
>>> everything in a relatively efficient manner.  I've coded almost all of
>>> the possible 64-bit operations in GLSL using ivec2 or uvec2 and these
>>> primitives as a proof of concept.  Less efficient implementations of
>>> everything are possible if any of those primitives are missing.
>>> Technically speaking, it ought to be possible to expose 64-bit integer
>>> support on *any* hardware that has true integers.
>>>
>>> I'm currently leaning towards implementing these as a NIR lowering pass,
>>> but there are other possibilities.  There are advantages to doing the
>>> lowering after most or all of the device independent optimizations.  In
>>> addition, doing it completely in NIR means that we can get 64-bit
>>> integer support for SPIR-V nearly for free.  I've also considered GLSL
>>> IR lowering or lowering while translating GLSL IR to NIR.
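
For reference, the uvec2-based emulation described above looks roughly
like this for addition -- a minimal sketch assuming GLSL 4.00 /
ARB_gpu_shader5's uaddCarry(); the helper name and the .x = low /
.y = high layout are made up:

    // 64-bit unsigned add on a uvec2 pair (.x = low 32 bits, .y = high 32).
    uvec2 uadd64(uvec2 a, uvec2 b)
    {
        uint carry;
        uint lo = uaddCarry(a.x, b.x, carry);  // carry is 0 or 1
        uint hi = a.y + b.y + carry;           // wraps on overflow past bit 63
        return uvec2(lo, hi);
    }
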
>>
>> While I can't speak for AMD hw, NVIDIA has some limited support for 64-bit ints:
>>
>> (a) atomics
>> (b) shifts (so you don't have to use a temp + bitfield manipulation to
>> shift from one 32-bit val to another)
>> (c) conversion between float/double and 64-bit ints
>
> Yeah, some Intel hardware is similar.  I suspect we'd want to have a
> bitfield to select which specific operations or groups of operations
> actually need to be lowered.  Jason and Ken reminded me that we already
> do basically the same thing for fp64.
>
>> And things like addition can be done using things like carry bits. We
>> have a pass to auto-lower 64-bit integer ops at the "end" so that
>> splitting them up doesn't affect things like constant propagation and
>> other optimizations. [I'm sure it'll need adjusting for a full 64-bit
>> int implementation; it mostly ends up getting used for address
>> calculations.] So I'd be highly in favor of (a) letting the backend
>> deal with it and (b) having the requisite TGSI opcodes to express it
>> all cleanly [which is what Dave has done].
>
> We'll definitely need support in the lower-level IRs.  Current and
> future GPUs have various levels of native support.  We really want to
> take advantage of that.  Some drivers will also want to implement their
> own lowering for some things.  For example, before Gen7, Intel GPUs
> didn't have a 32x32->64 multiplier.  They have a 16x32->48 multiplier
> (I'm not kidding) that can be used to simulate a 32x32->64 multiplier.
> I think we can use that in a clever way to generate 64x64->64 results
> more efficiently than would come from a generic lowering pass that uses
> 32x32->64 multiplications.
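
For concreteness, the generic construction from 32x32->64 multiplies
looks roughly like this -- a GLSL sketch with a uvec2 standing in for a
64-bit value; the helper name is made up:

    // 64x64 -> 64 (low 64 bits of the result) built from 32x32 -> 64 pieces.
    // Operands are uvec2 (.x = low 32 bits, .y = high 32 bits).
    uvec2 umul64(uvec2 a, uvec2 b)
    {
        uint hi, lo;
        umulExtended(a.x, b.x, hi, lo);   // full 64-bit product of the low words
        hi += a.x * b.y + a.y * b.x;      // cross terms; bits above 63 are dropped
        return uvec2(lo, hi);
    }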

nv50 has 24x24 -> 32 [and 16x16 -> 32] multiplies. Loads of fun to
implement imulExtended() on that - you still have to compute the low
bits to get the carry information right. nvc0 and later all have the
regular 32x32 -> low/high 32 logic, with optional carry
addition/generation, so it's no trouble there.
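
For illustration, assembling a 32x32 -> 64 result from 16-bit partial
products (roughly the nv50 situation) looks something like this -- a
GLSL sketch only, with made-up names; the real lowering happens on the
backend IR, and uaddCarry() here just stands in for the hardware's
carry handling:

    // 32x32 -> 64 built from four 16x16 -> 32 partial products.
    void umul_wide_16(uint a, uint b, out uint msb, out uint lsb)
    {
        uint al = a & 0xFFFFu, ah = a >> 16;
        uint bl = b & 0xFFFFu, bh = b >> 16;
        uint ll = al * bl;                            // bits  0..31
        uint hh = ah * bh;                            // bits 32..63
        uint c1, c2;
        uint mid = uaddCarry(al * bh, ah * bl, c1);   // bits 16..47, carry into bit 48
        lsb = uaddCarry(ll, mid << 16, c2);
        msb = hh + (mid >> 16) + (c1 << 16) + c2;
    }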

>
> At the same time, if implementing lowering once at a higher level means
> that we can enable a feature in more places more quickly, that seems
> like winning.  I think blending the two approaches will lead to the best
> overall result.  I doubt Marek will spend any effort implementing 64-bit
> integer support for r600.  If the real work of adding that support
> happened at higher levels of Mesa, I bet he'd accept patches. :)

I'm in no way opposed to having shareable "fudging" logic, so that it
can be used by drivers with less sophisticated backends, or ones that
are getting less development interest. I just want to make sure that
we keep a way for the backend to deal with it all itself.

Cheers,

  -ilia

