[Mesa-dev] [PATCH 07/12] nir: Allow [iu]mul_high on non-32-bit types

Tue Oct 9 16:32:54 UTC 2018

On 10/08/2018 02:14 PM, Jason Ekstrand wrote:
> On Mon, Oct 8, 2018 at 3:46 PM Ian Romanick <idr at freedesktop.org
> <mailto:idr at freedesktop.org>> wrote:
> 
>     On 10/05/2018 09:10 PM, Jason Ekstrand wrote:
>     > ---
>     >  src/compiler/nir/nir_constant_expressions.py |  1 +
>     >  src/compiler/nir/nir_opcodes.py              | 43
>     ++++++++++++++++++--
>     >  2 files changed, 40 insertions(+), 4 deletions(-)
>     >
>     > diff --git a/src/compiler/nir/nir_constant_expressions.py
>     b/src/compiler/nir/nir_constant_expressions.py
>     > index 118af9f7818..afc0739e8b2 100644
>     > --- a/src/compiler/nir/nir_constant_expressions.py
>     > +++ b/src/compiler/nir/nir_constant_expressions.py
>     > @@ -79,6 +79,7 @@ template = """\
>     >  #include <math.h>
>     >  #include "util/rounding.h" /* for _mesa_roundeven */
>     >  #include "util/half_float.h"
>     > +#include "util/bigmath.h"
>     >  #include "nir_constant_expressions.h"
>     > 
>     >  /**
>     > diff --git a/src/compiler/nir/nir_opcodes.py
>     b/src/compiler/nir/nir_opcodes.py
>     > index 4ef4ecc6f22..209f0c5509b 100644
>     > --- a/src/compiler/nir/nir_opcodes.py
>     > +++ b/src/compiler/nir/nir_opcodes.py
>     > @@ -443,12 +443,47 @@ binop("isub", tint, "", "src0 - src1")
>     >  binop("fmul", tfloat, commutative + associative, "src0 * src1")
>     >  # low 32-bits of signed/unsigned integer multiply
>     >  binop("imul", tint, commutative + associative, "src0 * src1")
>     > +
>     >  # high 32-bits of signed integer multiply
>     > -binop("imul_high", tint32, commutative,
>     > -      "(int32_t)(((int64_t) src0 * (int64_t) src1) >> 32)")
>     > +binop("imul_high", tint, commutative, """
> 
>     This will enable imul_high for all integer types (ditto for umul_high
>     below).  A later patch adds lowering for 64-bit integer type.  Will the
>     backend do the right thing for [iu]mul_high of 16- or 8-bit types?
> 
> 
> That's a good question.  Looks like lower_integer_multiplication in the
> back-end will do nothing whatsoever, and we'll emit an illegal opcode
> which will probably hang the GPU.  For 8 and 16, it's easy enough to
> lower to a couple of conversions, a N*2-bit multiply, and a shift.  It's
> also not obvious where the cut-off point for the optimization is. 
> Certainly, it's better in 64-bits than doing the division algorithm in
> the shader and I think it's better for 32 but maybe not in 8 and 16? 
> I'm not sure.  I'm pretty sure my 32-bit benchmark gave positive results
> (about 40-50% faster) but it was very noisy.
> 
> I don't think anything allows 8 and 16-bit arithmetic right now.  Still,
> should probably fix it...

Hm... if an extension adds GL or Vulkan support for 16-bit arithmetic, I
doubt it would add [iu]mul_high (e.g., GL_AMD_gpu_shader_int16).  I'd
expect every GPU would support a 16x16=32 multiplier.  Would it be
better to restrict this instruction to 32- and 64-bit and implement the
integer division optimization differently for 8- and 16-bit sources?

> --Jason
>  
> 
>     > +if (bit_size == 64) {
>     > +   /* We need to do a full 128-bit x 128-bit multiply in order
>     for the sign
>     > +    * extension to work properly.  The casts are kind-of annoying
>     but needed
>     > +    * to prevent compiler warnings.
>     > +    */
>     > +   uint32_t src0_u32[4] = {
>     > +      src0,
>     > +      (int64_t)src0 >> 32,
>     > +      (int64_t)src0 >> 63,
>     > +      (int64_t)src0 >> 63,
>     > +   };
>     > +   uint32_t src1_u32[4] = {
>     > +      src1,
>     > +      (int64_t)src1 >> 32,
>     > +      (int64_t)src1 >> 63,
>     > +      (int64_t)src1 >> 63,
>     > +   };
>     > +   uint32_t prod_u32[4];
>     > +   ubm_mul_u32arr(prod_u32, src0_u32, src1_u32);
>     > +   dst = (uint64_t)prod_u32[2] | ((uint64_t)prod_u32[3] << 32);
>     > +} else {
>     > +   dst = ((int64_t)src0 * (int64_t)src1) >> bit_size;
>     > +}
>     > +""")
>     > +
>     >  # high 32-bits of unsigned integer multiply
>     > -binop("umul_high", tuint32, commutative,
>     > -      "(uint32_t)(((uint64_t) src0 * (uint64_t) src1) >> 32)")
>     > +binop("umul_high", tuint, commutative, """
>     > +if (bit_size == 64) {
>     > +   /* The casts are kind-of annoying but needed to prevent
>     compiler warnings. */
>     > +   uint32_t src0_u32[2] = { src0, (uint64_t)src0 >> 32 };
>     > +   uint32_t src1_u32[2] = { src1, (uint64_t)src1 >> 32 };
>     > +   uint32_t prod_u32[4];
>     > +   ubm_mul_u32arr(prod_u32, src0_u32, src1_u32);
>     > +   dst = (uint64_t)prod_u32[2] | ((uint64_t)prod_u32[3] << 32);
>     > +} else {
>     > +   dst = ((uint64_t)src0 * (uint64_t)src1) >> bit_size;
>     > +}
>     > +""")
>     > 
>     >  binop("fdiv", tfloat, "", "src0 / src1")
>     >  binop("idiv", tint, "", "src1 == 0 ? 0 : (src0 / src1)")
>