[Mesa-dev] nir: find_msb vs clz

Thu Apr 2 18:06:16 UTC 2020

On 4/1/20 11:39 AM, Erik Faye-Lund wrote:
> While working on the NIR to DXIL conversion code for D3D12, I've
> noticed that we're not exactly doing the best we could here.
> 
> First some background:
> 
> NIR currently has a few instructions that does kinda the same:
> 
> 1. nir_op_ufind_msb: Finds the index of the most significant bit,
> counting from the least significant bit. It returns -1 on zero-input.
> 
> 2. nir_op_ifind_msb: A signed version of ufind_msb; looks for the first
> non sign-bit. It's not terribly interesting in this context, as it can
> be trivially lowered if missing, and it doesn't seem like any hardware
> supports this natively. I'm just mentioning it for completeness.

These instructions map almost directly to GLSL findMSB().

> 3. nir_op_uclz: Counts the amount of leading zeroes, counding from the
> most significant bit. It returns 32 on zero-input, and only exist in an
> unsigned 32-bit variation.
> 
> ufind_msb is kinda the O.G here, uclz was recently added, and is as far
> as I can see only used in an intel-specific SPIR-V instruction.
> 
> Additionally, there's the OpenCLstd_Clz SPIR-V instruction, which we
> lower to ufind_msb using nir_clz_u(), regardless if the backend
> supports nir_op_uclz or not.

That extension (mostly) brings an handful of OpenCL instructions to
graphics.  The only outlier is the instruction for 32-bit × 16-bit
multiplication.

> It seems only the nouveau's NV50 backend actually wants ufind_msb,
> everything else seems to convert ufind_msb to some clz-variant while
> emitting code. Some have to special-case on zero-input, and some
> not... 
> 
> All of this is not really awesome in my eyes.
> 
> So, while adding support for DXIL, I need to figure out how to map
> these (well, ufind_msb at least) onto the DXIL intrinsics. DXIL doesn't
> have a ufind_msb, but it has a firstbit_hi that is identical to
> nir_op_uclz... except that it returns -1 on zero-input :(

Here's the first question you should be asking: how often does this
occur in real shaders?  Is it worth caring about generating the optimal
thing?  As Jason pointed out in a different message, the GLSL findMSB
functions seem to occur within epsilon of never.  If the same is true
for the DXIL instructions, is it worth spending any effort?

Honestly, until there is a known user, I would just do the thing that
has the least impact and move on.  Queue the old quote about premature
optimization...

> For now, I'm lowering ufind_msb to something ufind_msb while emitting
> code, like everyone else. But this feels a bit dirty, *especially*
> since we have a clz-instruction that *almost* fits. And since we're
> targetting OpenCL, which use clz as it's primitive, we end up doing 32
> - (32 - x), and since that inner isub happens while emitting, we can't
> easily optimize it away without introducing an optimizing backend...

If 12+ years of graphics compiler experience has taught us anything, it
has taught us that you need an optimizing backend.  Maybe not as job #1,
but probably very, very soon.  Being able to recognize common code
patterns like 32 - fbh(x) and generating the right thing is not that
hard.  If you have a code-generator generator (!2680), it is trivial.

> The solution seems obvious; use nir_op_uclz instead.
> 
> But that's also a bit annoying, for a few reasons:
> 
> 1. Only *one* backend actually implements support for it. So this
> either means a lot of work, or making it an opt-in feature somehow.
> 
> 2. We would probably have to support lowering in either direction to
> support what all hardware prefers.
> 
> 3. That zero-case still needs special treatment in several backends, it
> seems. We could alternatively declare that nir_op_uclz is undefined for
> zero-input, and handle this when lowering...?
> 
> 4. It seems some (Intel?) hardware only supports 32-bit clz, so we
> would have to lower to something else for other bit-sizes. That's not
> too hard, though.
> 
> So yeah...
> 
> I guess the first step would be to add a switch to use nir_uclz()
> instead of nir_clz_u() when handling OpenCLstd_Clz in vtn.
> 
> Next, I guess I would add a lower_ufind_msb flag to
> nir_shader_compiler_options, and make nir_opt_algebraic.py lower
> ufind_msb to uclz.
> 
> Finally, we can start implementing support for this in more drivers,
> and flip on some switches.
> 
> I'm still not really sold on what to do about the special-case for
> zero... By making it undefined, I think we're just punishing all
> backends, just in the name of making the compiler backends a bit
> simpler, so that doesn't seem too good of an idea either.
> 
> Does anyone have a better idea? I would kinda love to optimize away the
> zero-case if it's obvious that it's impossible, e.g cases like "clz(x |
> 1)"... 
> 
> 
> _______________________________________________
> mesa-dev mailing list
> mesa-dev at lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/mesa-dev