<div dir="ltr"><div class="gmail_quote"><div dir="ltr">On Mon, Oct 8, 2018 at 3:46 PM Ian Romanick <<a href="mailto:idr@freedesktop.org">idr@freedesktop.org</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">On 10/05/2018 09:10 PM, Jason Ekstrand wrote:<br>
> ---<br>
> src/compiler/nir/nir_constant_expressions.py | 1 +<br>
> src/compiler/nir/nir_opcodes.py | 43 ++++++++++++++++++--<br>
> 2 files changed, 40 insertions(+), 4 deletions(-)<br>
> <br>
> diff --git a/src/compiler/nir/nir_constant_expressions.py b/src/compiler/nir/nir_constant_expressions.py<br>
> index 118af9f7818..afc0739e8b2 100644<br>
> --- a/src/compiler/nir/nir_constant_expressions.py<br>
> +++ b/src/compiler/nir/nir_constant_expressions.py<br>
> @@ -79,6 +79,7 @@ template = """\<br>
> #include <math.h><br>
> #include "util/rounding.h" /* for _mesa_roundeven */<br>
> #include "util/half_float.h"<br>
> +#include "util/bigmath.h"<br>
> #include "nir_constant_expressions.h"<br>
> <br>
> /**<br>
> diff --git a/src/compiler/nir/nir_opcodes.py b/src/compiler/nir/nir_opcodes.py<br>
> index 4ef4ecc6f22..209f0c5509b 100644<br>
> --- a/src/compiler/nir/nir_opcodes.py<br>
> +++ b/src/compiler/nir/nir_opcodes.py<br>
> @@ -443,12 +443,47 @@ binop("isub", tint, "", "src0 - src1")<br>
> binop("fmul", tfloat, commutative + associative, "src0 * src1")<br>
> # low 32-bits of signed/unsigned integer multiply<br>
> binop("imul", tint, commutative + associative, "src0 * src1")<br>
> +<br>
> # high 32-bits of signed integer multiply<br>
> -binop("imul_high", tint32, commutative,<br>
> - "(int32_t)(((int64_t) src0 * (int64_t) src1) >> 32)")<br>
> +binop("imul_high", tint, commutative, """<br>
<br>
This will enable imul_high for all integer types (ditto for umul_high<br>
below). A later patch adds lowering for 64-bit integer type. Will the<br>
backend do the right thing for [iu]mul_high of 16- or 8-bit types?<br></blockquote><div><br></div><div>That's a good question. It looks like lower_integer_multiplication in the back-end will do nothing whatsoever, so we'll emit an illegal opcode, which will probably hang the GPU. For 8 and 16 bits, it's easy enough to lower to a couple of conversions, a 2N-bit multiply, and a shift. It's also not obvious where the cut-off point for the optimization is. Certainly, at 64 bits it's better than doing the division algorithm in the shader, and I think it's better at 32 bits, but maybe not at 8 and 16? I'm not sure. I'm pretty sure my 32-bit benchmark gave positive results (about 40-50% faster), but it was very noisy.</div><div><br></div><div>I don't think anything allows 8- and 16-bit arithmetic right now. Still, I should probably fix it...</div><div><br></div><div>--Jason<br></div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
> +if (bit_size == 64) {<br>
> + /* We need to do a full 128-bit x 128-bit multiply in order for the sign<br>
> + * extension to work properly. The casts are kind-of annoying but needed<br>
> + * to prevent compiler warnings.<br>
> + */<br>
> + uint32_t src0_u32[4] = {<br>
> + src0,<br>
> + (int64_t)src0 >> 32,<br>
> + (int64_t)src0 >> 63,<br>
> + (int64_t)src0 >> 63,<br>
> + };<br>
> + uint32_t src1_u32[4] = {<br>
> + src1,<br>
> + (int64_t)src1 >> 32,<br>
> + (int64_t)src1 >> 63,<br>
> + (int64_t)src1 >> 63,<br>
> + };<br>
> + uint32_t prod_u32[4];<br>
> + ubm_mul_u32arr(prod_u32, src0_u32, src1_u32);<br>
> + dst = (uint64_t)prod_u32[2] | ((uint64_t)prod_u32[3] << 32);<br>
> +} else {<br>
> + dst = ((int64_t)src0 * (int64_t)src1) >> bit_size;<br>
> +}<br>
> +""")<br>
> +<br>
> # high 32-bits of unsigned integer multiply<br>
> -binop("umul_high", tuint32, commutative,<br>
> - "(uint32_t)(((uint64_t) src0 * (uint64_t) src1) >> 32)")<br>
> +binop("umul_high", tuint, commutative, """<br>
> +if (bit_size == 64) {<br>
> + /* The casts are kind-of annoying but needed to prevent compiler warnings. */<br>
> + uint32_t src0_u32[2] = { src0, (uint64_t)src0 >> 32 };<br>
> + uint32_t src1_u32[2] = { src1, (uint64_t)src1 >> 32 };<br>
> + uint32_t prod_u32[4];<br>
> + ubm_mul_u32arr(prod_u32, src0_u32, src1_u32);<br>
> + dst = (uint64_t)prod_u32[2] | ((uint64_t)prod_u32[3] << 32);<br>
> +} else {<br>
> + dst = ((uint64_t)src0 * (uint64_t)src1) >> bit_size;<br>
> +}<br>
> +""")<br>
> <br>
> binop("fdiv", tfloat, "", "src0 / src1")<br>
> binop("idiv", tint, "", "src1 == 0 ? 0 : (src0 / src1)")<br>
</blockquote></div></div>
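For reference, the 8- and 16-bit lowering mentioned in the reply (a couple of conversions, a 2N-bit multiply, and a shift) can be sketched in plain C roughly as follows. This is only an illustration of the arithmetic, not the back-end lowering itself: the helper names are hypothetical and do not exist in Mesa, and the signed variants assume that right-shifting a negative value is an arithmetic shift, which C leaves implementation-defined.

```c
#include <stdint.h>

/* Hypothetical helpers for illustration only; not part of Mesa or NIR. */

/* High 16 bits of a signed 16x16 multiply: sign-extend both sources to
 * 32 bits, multiply, then shift the product down by the source width.
 * Assumes >> on a negative int32_t is an arithmetic shift.
 */
static int16_t imul_high16(int16_t a, int16_t b)
{
   int32_t wide = (int32_t)a * (int32_t)b;
   return (int16_t)(wide >> 16);
}

/* High 16 bits of an unsigned 16x16 multiply. The casts to uint32_t
 * matter: without them both operands promote to int and the product
 * could overflow.
 */
static uint16_t umul_high16(uint16_t a, uint16_t b)
{
   uint32_t wide = (uint32_t)a * (uint32_t)b;
   return (uint16_t)(wide >> 16);
}

/* Same pattern for 8-bit sources, using a 16-bit product. */
static int8_t imul_high8(int8_t a, int8_t b)
{
   int16_t wide = (int16_t)((int16_t)a * (int16_t)b);
   return (int8_t)(wide >> 8);
}

static uint8_t umul_high8(uint8_t a, uint8_t b)
{
   uint16_t wide = (uint16_t)((uint16_t)a * (uint16_t)b);
   return (uint8_t)(wide >> 8);
}
```

The same shape (widen, multiply, shift by the source bit size) is what the non-64-bit branch of the constant-folding expressions in the patch computes.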