[Mesa-dev] prep work for 64-bit integer support

Thu Jun 9 21:14:03 UTC 2016

On Thu, Jun 9, 2016 at 10:28 PM, Ilia Mirkin <imirkin at alum.mit.edu> wrote:
> On Thu, Jun 9, 2016 at 4:11 PM, Ian Romanick <idr at freedesktop.org> wrote:
>> On 06/09/2016 11:26 AM, Ilia Mirkin wrote:
>>> On Thu, Jun 9, 2016 at 2:07 PM, Ian Romanick <idr at freedesktop.org> wrote:
>>>> On 06/08/2016 02:15 PM, Dave Airlie wrote:
>>>>> While writing ARB_gpu_shader_int64 I realised I needed to change
>>>>> a lot of existing checks for doubles to 64bit, so I decided to
>>>>> do that as much in advance as possible.
>>>>
>>>> I didn't know you were working on that.  I just started poking at more
>>>> general sized integer support too.  I wanted to add support for 8, 16,
>>>> and 64-bit types.
>>>
>>> Might be worth noting that NVIDIA has some support for "SIMD"
>>> operations on 16- and 8-bit sized values packed in a 32-bit integer.
>>> You can see what operations are supported by looking up "video
>>> instructions" in the PTX ISA - those roughly map 1:1 with the
>>> hardware. However, I've never seen the NVIDIA blob actually generate them,
>>> even with NV_gpu_shader5's u8vec4 and such. I don't know how this
>>> changes on Pascal, which is rumored to support fp16 ALU natively.
>>
>> Have you tried feeding it PTX directly?  It could just be a limitation
>> of the GLSL compiler.
>
> I haven't. Although I suspect that if I tell PTX to emit a particular
> instruction, then it will convert it to the proper ISA encoding and
> emit it, since they really do map 1:1 last I looked. I was more
> surprised that u8vec4 + u8vec4 didn't end up using it, and instead did
> the adds as 4x32-bit and then re-extracted the low 8 bits. Perhaps
> NVIDIA knows something I don't, or perhaps like you say, their GLSL
> compiler is just not smart enough to do it. Or perhaps that specific
> case caused them to decide not to do it, but a different case would
> have used it (probably various issues with instruction latencies, dual
> issue capabilities, etc).
>
> I had originally proposed using this feature to the Dolphin team, who
> have a ton of u8s in their shaders that they constantly bit-mask and
> clamp, but when I saw what the blob was going to do with those, I
> withdrew that suggestion.
>
>>
>>>> What's your hardware support plan?  I think that any hardware that can
>>>> do uaddCarry, usubBorrow, [ui]mulExtended, and findMSB can implement
>>>> everything in a relatively efficient manner.  I've coded almost all of
>>>> the possible 64-bit operations in GLSL using ivec2 or uvec2 and these
>>>> primitives as a proof of concept.  Less efficient implementations of
>>>> everything are possible if any of those primitives are missing.
>>>> Technically speaking, it ought to be possible to expose 64-bit integer
>>>> support on *any* hardware that has true integers.
>>>>
>>>> I'm currently leaning towards implementing these as a NIR lowering pass,
>>>> but there are other possibilities.  There are advantages to doing the
>>>> lowering after most or all of the device independent optimizations.  In
>>>> addition, doing it completely in NIR means that we can get 64-bit
>>>> integer support for SPIR-V nearly for free.  I've also considered GLSL
>>>> IR lowering or lowering while translating GLSL IR to NIR.
>>>
>>> While I can't speak for AMD hw, NVIDIA has some limited support for 64-bit ints:
>>>
>>> (a) atomics
>>> (b) shifts (so you don't have to use a temp + bitfield manipulation to
>>> shift from one 32-bit val to another)
>>> (c) conversion between float/double and 64-bit ints
>>
>> Yeah, some Intel hardware is similar.  I suspect we'd want to have a
>> bitfield to select which specific operations or groups of operations
>> actually need to be lowered.  Jason and Ken reminded me that we already
>> do basically the same thing for fp64.
>>
>>> And things like addition can be done using things like carry bits. We
>>> have a pass to auto-lower 64-bit integer ops at the "end" so that
>>> splitting them up doesn't affect things like constant propagation and
>>> other optimizations. [I'm sure it'll need adjusting for a full 64-bit
>>> int implementation, it mostly ends up getting used with address
>>> calculations.] So I'd be highly in favor of (a) letting the backend
>>> deal with it and (b) having the requisite TGSI opcodes to express it
>>> all cleanly [which is what Dave has done].
>>
>> We'll definitely need support in the lower-level IRs.  Current and
>> future GPUs have various levels of native support.  We really want to
>> take advantage of that.  Some drivers will also want to implement their
>> own lowering for some things.  For example, before Gen7, Intel GPUs
>> didn't have a 32x32->64 multiplier.  They have a 16x32->48 multiplier
>> (I'm not kidding) that can be used to simulate a 32x32->64 multiplier.
>> I think we can use that in a clever way to generate a 64x64->64 result
>> more efficiently than would come from a generic lowering pass that uses
>> 32x32->64 multiplications.
>
> nv50 has 24x24 -> 32 [and 16x16 -> 32]. Loads of fun to implement
> imulExtended() on that - you still have to compute the low bits for
> the carry information. All of nvc0 has the regular 32x32 -> low/high 32
> logic, with optional carry addition/generation, so it's no trouble.
>
>>
>> At the same time, if implementing lowering once at a higher level means
>> that we can enable a feature in more places more quickly, that seems
>> like winning.  I think blending the two approaches will lead to the best
>> overall result.  I doubt Marek will spend any effort implementing 64-bit
>> integer support for r600.  If the real work of adding that support
>> happened at higher levels of Mesa, I bet he'd accept patches. :)
>
> I'm in no way opposed to having shareable "fudging" logic, so that it
> can be used by drivers with less sophisticated backends, or ones that
> are getting less development interest. Just want to make sure that a
> way to let the backend just deal with it remains.

For r600, the lowering to int32 should take place in a common place
(e.g. GLSL IR). For radeonsi, we'd like to get all int64 opcodes
because we already have full int64 support in the LLVM backend.

Marek
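
To make the "lowering to int32" idea concrete, here is a minimal sketch in
plain C of how a 64-bit add and a 64x64->64 multiply can be built from the
32-bit primitives mentioned earlier in the thread (uaddCarry and
umulExtended). This is not code from Mesa or from any of the patches
discussed. The type u64pair and the helpers uadd_carry, umul_extended,
add64 and mul64 are hypothetical names chosen for illustration, and
umul_extended uses a native 64-bit multiply only to model what a
32x32->64 hardware instruction would return.

```c
#include <stdint.h>

/* Hypothetical 64-bit value held as two 32-bit halves, like a GLSL uvec2. */
typedef struct {
    uint32_t lo;
    uint32_t hi;
} u64pair;

/* Stand-in for GLSL uaddCarry(): returns the 32-bit sum, writes the carry. */
static uint32_t uadd_carry(uint32_t a, uint32_t b, uint32_t *carry)
{
    uint32_t sum = a + b;      /* wraps modulo 2^32 */
    *carry = (sum < a) ? 1u : 0u;
    return sum;
}

/* Stand-in for GLSL umulExtended(): 32x32 -> high and low 32 bits.
 * The native 64-bit multiply here only models the hardware primitive. */
static void umul_extended(uint32_t a, uint32_t b, uint32_t *hi, uint32_t *lo)
{
    uint64_t p = (uint64_t)a * (uint64_t)b;
    *lo = (uint32_t)p;
    *hi = (uint32_t)(p >> 32);
}

/* 64-bit add: add the low halves, then feed the carry into the high halves. */
static u64pair add64(u64pair a, u64pair b)
{
    u64pair r;
    uint32_t carry;
    r.lo = uadd_carry(a.lo, b.lo, &carry);
    r.hi = a.hi + b.hi + carry;
    return r;
}

/* 64x64->64 multiply: one full 32x32->64 product for the low halves,
 * plus two 32-bit cross products that only affect the high half.
 * The a.hi * b.hi term lands entirely above bit 63 and is dropped. */
static u64pair mul64(u64pair a, u64pair b)
{
    u64pair r;
    umul_extended(a.lo, b.lo, &r.hi, &r.lo);
    r.hi += a.lo * b.hi;
    r.hi += a.hi * b.lo;
    return r;
}
```

A backend with native int64 support would skip all of this and consume the
64-bit opcodes directly. A shared lowering pass would emit something
equivalent to add64 and mul64 for drivers that cannot.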