[Mesa-dev] nir: find_msb vs clz

Thu Apr 2 12:23:32 UTC 2020

W.r.t AMD hardware, it seems to map quite exactly the DX behavior,
so given that I would, of course favor a NIR instruction which maps that :)

Small overview (scalar and vector):
S_FLBIT_I32_{B32,B64} - D = the number of zeros before the first one 
starting from the MSB.Returns -1 if none.
S_FLBIT_I32{_I64} - Count how many bits in a row (from MSB to LSB) are 
the same as the sign bit. Return -1 if the input is zero or all 1's (-1).
V_FFBH_U32 - D.u = position of first 1 in S0 from MSB; D=0xFFFFFFFF if 
S0==0.
V_FFBH_I32 - D.u = position of first bit different from sign bit in S0 
from MSB; D=0xFFFFFFFF if S0==0 or 0xFFFFFFFF.

So, currently we emit a subtraction and bcsel every time we encounter 
*find_msb. For DX->VK, I didn't check yet how they map,
but we could probably also add the reverse optimization, like we do for 
bfm/bfe and such.

@Jason, take the above as naming suggestions :P

Daniel

On 1/4/20 22:52, Jason Ekstrand <jason at jlekstrand.net> wrote:

> On Wed, Apr 1, 2020 at 1:52 PM Eric Anholt <eric at anholt.net> wrote:
>> On Wed, Apr 1, 2020 at 11:39 AM Erik Faye-Lund
>> <erik.faye-lund at collabora.com> wrote:
>>> While working on the NIR to DXIL conversion code for D3D12, I've
>>> noticed that we're not exactly doing the best we could here.
>>>
>>> First some background:
>>>
>>> NIR currently has a few instructions that does kinda the same:
>>>
>>> 1. nir_op_ufind_msb: Finds the index of the most significant bit,
>>> counting from the least significant bit. It returns -1 on zero-input.
>>>
>>> 2. nir_op_ifind_msb: A signed version of ufind_msb; looks for the first
>>> non sign-bit. It's not terribly interesting in this context, as it can
>>> be trivially lowered if missing, and it doesn't seem like any hardware
>>> supports this natively. I'm just mentioning it for completeness.
>>>
>>> 3. nir_op_uclz: Counts the amount of leading zeroes, counding from the
>>> most significant bit. It returns 32 on zero-input, and only exist in an
>>> unsigned 32-bit variation.
>>>
>>> ufind_msb is kinda the O.G here, uclz was recently added, and is as far
>>> as I can see only used in an intel-specific SPIR-V instruction.
>>>
>>> Additionally, there's the OpenCLstd_Clz SPIR-V instruction, which we
>>> lower to ufind_msb using nir_clz_u(), regardless if the backend
>>> supports nir_op_uclz or not.
>>>
>>> It seems only the nouveau's NV50 backend actually wants ufind_msb,
>>> everything else seems to convert ufind_msb to some clz-variant while
>>> emitting code. Some have to special-case on zero-input, and some
>>> not...
>>>
>>> All of this is not really awesome in my eyes.
>>>
>>> So, while adding support for DXIL, I need to figure out how to map
>>> these (well, ufind_msb at least) onto the DXIL intrinsics. DXIL doesn't
>>> have a ufind_msb, but it has a firstbit_hi that is identical to
>>> nir_op_uclz... except that it returns -1 on zero-input :(
>>>
>>> For now, I'm lowering ufind_msb to something ufind_msb while emitting
>>> code, like everyone else. But this feels a bit dirty, *especially*
>>> since we have a clz-instruction that *almost* fits. And since we're
>>> targetting OpenCL, which use clz as it's primitive, we end up doing 32
>>> - (32 - x), and since that inner isub happens while emitting, we can't
>>> easily optimize it away without introducing an optimizing backend...
>>>
>>> The solution seems obvious; use nir_op_uclz instead.
>>>
>>> But that's also a bit annoying, for a few reasons:
>>>
>>> 1. Only *one* backend actually implements support for it. So this
>>> either means a lot of work, or making it an opt-in feature somehow.
> That's likely fairly easily fixed.  That said, making it an optional
> feature is also easy.  Add lowering in nir_opt_algebraic.py hidden
> behind a lower_ufind_msb_to_clz flag.  If setting that flag on Intel
> doesn't hurt shader-db (I think our back-end lowering may be slightly
> more efficient), we'll set it and delete a pile of code.
>
>>> 2. We would probably have to support lowering in either direction to
>>> support what all hardware prefers.
> I suspect that virtually everyone who has an instruction for this in
> hardware has one that supports returning the bit-width for 0.  There's
> an interesting wikipedia page on this:
>
> https://en.wikipedia.org/wiki/Find_first_set
>
> According to the table there, virtually all CPUs that implement this
> return the bit-width for 0 except for the old way to do it on Intel.
> Since this is also what's defined for OpenCL, that's what we're likely
> to see on mobile.  Intel has instructions for both and I would guess
> AMD and Nvidia do as well since they care a lot about D3D.
>
>>> 3. That zero-case still needs special treatment in several backends, it
>>> seems. We could alternatively declare that nir_op_uclz is undefined for
>>> zero-input, and handle this when lowering...?
>>>
>>> 4. It seems some (Intel?) hardware only supports 32-bit clz, so we
>>> would have to lower to something else for other bit-sizes. That's not
>>> too hard, though.
> On Intel, we have two instructions for this: FBH which returns -1 for
> 0 and LZD which returns 32 for 0.  Both count leading zeros from the
> MSB side.  We don't have native hardware support for computing from
> the LSB side.  And, yeah, we can only do it on 32-bit types so that
> sucks a bit.
>
>>> So yeah...
>>>
>>> I guess the first step would be to add a switch to use nir_uclz()
>>> instead of nir_clz_u() when handling OpenCLstd_Clz in vtn.
> I would say just implement clz in nouveau and then make it always emit
> the clz instruction.  Not that many consumers for OpenCL NIR right
> now.
>
>>> Next, I guess I would add a lower_ufind_msb flag to
>>> nir_shader_compiler_options, and make nir_opt_algebraic.py lower
>>> ufind_msb to uclz.
> I would add the lowering first, set the flag for nouveau (those are
> the people hacking on OpenCL NIR stuff right now), and then make
> OpenCL use the new nir_op_uclz instruction.
>
>>> Finally, we can start implementing support for this in more drivers,
>>> and flip on some switches.
>>>
>>> I'm still not really sold on what to do about the special-case for
>>> zero... By making it undefined, I think we're just punishing all
>>> backends, just in the name of making the compiler backends a bit
>>> simpler, so that doesn't seem too good of an idea either.
> IMO, have both and lowering from one to the other.  The biggest
> problem with that approach is figuring out what to name the -1 for
> zero variant. :-)  As a straw-man, I suggest we call it nir_op_ufbh
> because that matches the Intel instruction name as well as the HLSL
> firstbithigh() intrinsic name.
>
>>> Does anyone have a better idea? I would kinda love to optimize away the
>>> zero-case if it's obvious that it's impossible, e.g cases like "clz(x |
>>> 1)"...
>> FWIW, as a datapoint: broadcom's v3d has a clz that returns 32 for clz(0).
>>
>> I would generally be of the opinion that we should have NIR opcodes
>> that match any common hardware instructions, and lowering in algebraic
>> to help turn input patterns into clean sequences of hardware
>> instructions.
> I tend to agree unless the back-end can do significantly better
> somehow in which case it should just implement both.  For instance,
> Intel hardware can set flag values as part of the CLZ which we can
> then use to do any fixup ops more efficiently.
>
> --Jason
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://lists.freedesktop.org/archives/mesa-dev/attachments/20200402/372cfb44/attachment.htm>