[Mesa-dev] [PATCH 1/2] gallivm: introduce 32x32->64bit lp_build_mul_32_lohi function
Roland Scheidegger
sroland at vmware.com
Mon Nov 7 19:09:56 UTC 2016
Am 06.11.2016 um 16:50 schrieb Jose Fonseca:
> On 04/11/16 04:14, sroland at vmware.com wrote:
>> From: Roland Scheidegger <sroland at vmware.com>
>>
>> This is used by shader umul_hi/imul_hi functions (and soon by draw).
>> It's actually useful separating this out on its own, however the real
>> reason for doing it is because we're using an optimized sse2 version,
>> since the code llvm generates is atrocious (since there's no widening
>> mul in llvm, and it does not recognize the widening mul pattern, so
>> it generates code for real 64x64->64bit mul, which the cpu can't do
>> natively, in contrast to 32x32->64bit mul which it could do).
>> ---
>> src/gallium/auxiliary/gallivm/lp_bld_arit.c | 150
>> +++++++++++++++++++++
>> src/gallium/auxiliary/gallivm/lp_bld_arit.h | 6 +
>> src/gallium/auxiliary/gallivm/lp_bld_tgsi_action.c | 54 +++-----
>> 3 files changed, 172 insertions(+), 38 deletions(-)
>>
>> diff --git a/src/gallium/auxiliary/gallivm/lp_bld_arit.c
>> b/src/gallium/auxiliary/gallivm/lp_bld_arit.c
>> index 3ea0734..3de4628 100644
>> --- a/src/gallium/auxiliary/gallivm/lp_bld_arit.c
>> +++ b/src/gallium/auxiliary/gallivm/lp_bld_arit.c
>> @@ -1091,6 +1091,156 @@ lp_build_mul(struct lp_build_context *bld,
>> return res;
>> }
>>
>> +/*
>> + * Widening mul, valid for 32x32 bit -> 64bit only.
>> + * Result is low 32bits, high bits returned in res_hi.
>> + */
>> +LLVMValueRef
>> +lp_build_mul_32_lohi(struct lp_build_context *bld,
>> + LLVMValueRef a,
>> + LLVMValueRef b,
>> + LLVMValueRef *res_hi)
>> +{
>> + struct gallivm_state *gallivm = bld->gallivm;
>> + LLVMBuilderRef builder = gallivm->builder;
>> +
>> + assert(bld->type.width == 32);
>> + assert(bld->type.floating == 0);
>> + assert(bld->type.fixed == 0);
>> + assert(bld->type.norm == 0);
>> +
>> + /*
>> + * XXX: for some reason, with zext/zext/mul/trunc the code llvm
>> produces
>> + * for x86 simd is atrocious (even if the high bits weren't
>> required),
>> + * trying to handle real 64bit inputs (which of course can't
>> happen due
>> + * to using 64bit umul with 32bit numbers zero-extended to 64bit, but
>> + * apparently llvm does not recognize this widening mul). This
>> includes 6
>> + * (instead of 2) pmuludq plus extra adds and shifts
>> + * The same story applies to signed mul, albeit fixing this
>> requires sse41.
>> + * https://llvm.org/bugs/show_bug.cgi?id=30845
>> + * So, whip up our own code, albeit only for length 4 and 8 (which
>> + * should be good enough)...
>> + */
>> + if ((bld->type.length == 4 || bld->type.length == 8) &&
>> + ((util_cpu_caps.has_sse2 && (bld->type.sign == 0)) ||
>> + util_cpu_caps.has_sse4_1)) {
>> + const char *intrinsic = NULL;
>> + LLVMValueRef aeven, aodd, beven, bodd, muleven, mulodd;
>> + LLVMValueRef shuf[LP_MAX_VECTOR_WIDTH / 32], shuf_vec;
>> + struct lp_type type_wide = lp_wider_type(bld->type);
>> + LLVMTypeRef wider_type = lp_build_vec_type(gallivm, type_wide);
>> + unsigned i;
>> + for (i = 0; i < bld->type.length; i += 2) {
>> + shuf[i] = lp_build_const_int32(gallivm, i+1);
>> + shuf[i+1] =
>> LLVMGetUndef(LLVMInt32TypeInContext(gallivm->context));
>> + }
>> + shuf_vec = LLVMConstVector(shuf, bld->type.length);
>> + aeven = a;
>> + beven = b;
>> + aodd = LLVMBuildShuffleVector(builder, aeven, bld->undef,
>> shuf_vec, "");
>> + bodd = LLVMBuildShuffleVector(builder, beven, bld->undef,
>> shuf_vec, "");
>> +
>> + if (util_cpu_caps.has_avx2 && bld->type.length == 8) {
>> + if (bld->type.sign) {
>> + intrinsic = "llvm.x86.avx2.pmul.dq";
>> + } else {
>> + intrinsic = "llvm.x86.avx2.pmulu.dq";
>> + }
>> + muleven = lp_build_intrinsic_binary(builder, intrinsic,
>> + wider_type, aeven, beven);
>> + mulodd = lp_build_intrinsic_binary(builder, intrinsic,
>> + wider_type, aodd, bodd);
>> + }
>> + else {
>> + /* for consistent naming look elsewhere... */
>> + if (bld->type.sign) {
>> + intrinsic = "llvm.x86.sse41.pmuldq";
>> + } else {
>> + intrinsic = "llvm.x86.sse2.pmulu.dq";
>> + }
>> + /*
>> + * XXX If we only have AVX but not AVX2 this is a pain.
>> + * lp_build_intrinsic_binary_anylength() can't handle it
>> + * (due to src and dst type not being identical).
>> + */
>> + if (bld->type.length == 8) {
>> + LLVMValueRef aevenlo, aevenhi, bevenlo, bevenhi;
>> + LLVMValueRef aoddlo, aoddhi, boddlo, boddhi;
>> + LLVMValueRef muleven2[2], mulodd2[2];
>> + struct lp_type type_wide_half = type_wide;
>> + LLVMTypeRef wtype_half;
>> + type_wide_half.length = 2;
>> + wtype_half = lp_build_vec_type(gallivm, type_wide_half);
>> + aevenlo = lp_build_extract_range(gallivm, aeven, 0, 4);
>> + aevenhi = lp_build_extract_range(gallivm, aeven, 4, 4);
>> + bevenlo = lp_build_extract_range(gallivm, beven, 0, 4);
>> + bevenhi = lp_build_extract_range(gallivm, beven, 4, 4);
>> + aoddlo = lp_build_extract_range(gallivm, aodd, 0, 4);
>> + aoddhi = lp_build_extract_range(gallivm, aodd, 4, 4);
>> + boddlo = lp_build_extract_range(gallivm, bodd, 0, 4);
>> + boddhi = lp_build_extract_range(gallivm, bodd, 4, 4);
>> + muleven2[0] = lp_build_intrinsic_binary(builder, intrinsic,
>> + wtype_half,
>> aevenlo, bevenlo);
>> + mulodd2[0] = lp_build_intrinsic_binary(builder, intrinsic,
>> + wtype_half,
>> aoddlo, boddlo);
>> + muleven2[1] = lp_build_intrinsic_binary(builder, intrinsic,
>> + wtype_half,
>> aevenhi, bevenhi);
>> + mulodd2[1] = lp_build_intrinsic_binary(builder, intrinsic,
>> + wtype_half,
>> aoddhi, boddhi);
>> + muleven = lp_build_concat(gallivm, muleven2,
>> type_wide_half, 2);
>> + mulodd = lp_build_concat(gallivm, mulodd2,
>> type_wide_half, 2);
>> +
>> + }
>> + else {
>> + muleven = lp_build_intrinsic_binary(builder, intrinsic,
>> + wider_type, aeven,
>> beven);
>> + mulodd = lp_build_intrinsic_binary(builder, intrinsic,
>> + wider_type, aodd, bodd);
>> + }
>> + }
>> + muleven = LLVMBuildBitCast(builder, muleven, bld->vec_type, "");
>> + mulodd = LLVMBuildBitCast(builder, mulodd, bld->vec_type, "");
>> +
>> + for (i = 0; i < bld->type.length; i += 2) {
>> + shuf[i] = lp_build_const_int32(gallivm, i + 1);
>> + shuf[i+1] = lp_build_const_int32(gallivm, i + 1 +
>> bld->type.length);
>> + }
>> + shuf_vec = LLVMConstVector(shuf, bld->type.length);
>> + *res_hi = LLVMBuildShuffleVector(builder, muleven, mulodd,
>> shuf_vec, "");
>> +
>> + for (i = 0; i < bld->type.length; i += 2) {
>> + shuf[i] = lp_build_const_int32(gallivm, i);
>> + shuf[i+1] = lp_build_const_int32(gallivm, i +
>> bld->type.length);
>> + }
>> + shuf_vec = LLVMConstVector(shuf, bld->type.length);
>> + return LLVMBuildShuffleVector(builder, muleven, mulodd,
>> shuf_vec, "");
>> + }
>> + else {
>> + LLVMValueRef tmp;
>> + struct lp_type type_tmp;
>> + LLVMTypeRef wide_type, cast_type;
>> +
>> + type_tmp = bld->type;
>> + type_tmp.width *= 2;
>> + wide_type = lp_build_vec_type(gallivm, type_tmp);
>> + type_tmp = bld->type;
>> + type_tmp.length *= 2;
>> + cast_type = lp_build_vec_type(gallivm, type_tmp);
>> +
>> + if (bld->type.sign) {
>> + a = LLVMBuildSExt(builder, a, wide_type, "");
>> + b = LLVMBuildSExt(builder, b, wide_type, "");
>> + } else {
>> + a = LLVMBuildZExt(builder, a, wide_type, "");
>> + b = LLVMBuildZExt(builder, b, wide_type, "");
>> + }
>> + tmp = LLVMBuildMul(builder, a, b, "");
>> + tmp = LLVMBuildBitCast(builder, tmp, cast_type, "");
>> + *res_hi = lp_build_uninterleave1(gallivm, bld->type.length * 2,
>> tmp, 1);
>> + return lp_build_uninterleave1(gallivm, bld->type.length * 2,
>> tmp, 0);
>> + }
>> +}
>> +
>>
>> /* a * b + c */
>> LLVMValueRef
>> diff --git a/src/gallium/auxiliary/gallivm/lp_bld_arit.h
>> b/src/gallium/auxiliary/gallivm/lp_bld_arit.h
>> index 622b930..5d48b1c 100644
>> --- a/src/gallium/auxiliary/gallivm/lp_bld_arit.h
>> +++ b/src/gallium/auxiliary/gallivm/lp_bld_arit.h
>> @@ -77,6 +77,12 @@ lp_build_mul(struct lp_build_context *bld,
>> LLVMValueRef b);
>>
>> LLVMValueRef
>> +lp_build_mul_32_lohi(struct lp_build_context *bld,
>> + LLVMValueRef a,
>> + LLVMValueRef b,
>> + LLVMValueRef *res_hi);
>> +
>> +LLVMValueRef
>> lp_build_mul_imm(struct lp_build_context *bld,
>> LLVMValueRef a,
>> int b);
>> diff --git a/src/gallium/auxiliary/gallivm/lp_bld_tgsi_action.c
>> b/src/gallium/auxiliary/gallivm/lp_bld_tgsi_action.c
>> index 2e837af..72d4579 100644
>> --- a/src/gallium/auxiliary/gallivm/lp_bld_tgsi_action.c
>> +++ b/src/gallium/auxiliary/gallivm/lp_bld_tgsi_action.c
>> @@ -842,26 +842,15 @@ imul_hi_emit(
>> struct lp_build_tgsi_context * bld_base,
>> struct lp_build_emit_data * emit_data)
>> {
>> - LLVMBuilderRef builder = bld_base->base.gallivm->builder;
>> struct lp_build_context *int_bld = &bld_base->int_bld;
>> - struct lp_type type = int_bld->type;
>> - LLVMValueRef src0, src1;
>> - LLVMValueRef dst64;
>> - LLVMTypeRef typeRef;
>> -
>> - assert(type.width == 32);
>> - type.width = 64;
>> - typeRef = lp_build_vec_type(bld_base->base.gallivm, type);
>> - src0 = LLVMBuildSExt(builder, emit_data->args[0], typeRef, "");
>> - src1 = LLVMBuildSExt(builder, emit_data->args[1], typeRef, "");
>> - dst64 = LLVMBuildMul(builder, src0, src1, "");
>> - dst64 = LLVMBuildAShr(
>> - builder, dst64,
>> - lp_build_const_vec(bld_base->base.gallivm, type, 32), "");
>> - type.width = 32;
>> - typeRef = lp_build_vec_type(bld_base->base.gallivm, type);
>> - emit_data->output[emit_data->chan] =
>> - LLVMBuildTrunc(builder, dst64, typeRef, "");
>> + LLVMValueRef hi_bits;
>> +
>> + assert(int_bld->type.width == 32);
>> +
>> + /* low result bits are tossed away */
>> + lp_build_mul_32_lohi(int_bld, emit_data->args[0],
>> + emit_data->args[0], &hi_bits);
>> + emit_data->output[emit_data->chan] = hi_bits;
>> }
>>
>> /* TGSI_OPCODE_UMUL_HI */
>> @@ -871,26 +860,15 @@ umul_hi_emit(
>> struct lp_build_tgsi_context * bld_base,
>> struct lp_build_emit_data * emit_data)
>> {
>> - LLVMBuilderRef builder = bld_base->base.gallivm->builder;
>> struct lp_build_context *uint_bld = &bld_base->uint_bld;
>> - struct lp_type type = uint_bld->type;
>> - LLVMValueRef src0, src1;
>> - LLVMValueRef dst64;
>> - LLVMTypeRef typeRef;
>> -
>> - assert(type.width == 32);
>> - type.width = 64;
>> - typeRef = lp_build_vec_type(bld_base->base.gallivm, type);
>> - src0 = LLVMBuildZExt(builder, emit_data->args[0], typeRef, "");
>> - src1 = LLVMBuildZExt(builder, emit_data->args[1], typeRef, "");
>> - dst64 = LLVMBuildMul(builder, src0, src1, "");
>> - dst64 = LLVMBuildLShr(
>> - builder, dst64,
>> - lp_build_const_vec(bld_base->base.gallivm, type, 32), "");
>> - type.width = 32;
>> - typeRef = lp_build_vec_type(bld_base->base.gallivm, type);
>> - emit_data->output[emit_data->chan] =
>> - LLVMBuildTrunc(builder, dst64, typeRef, "");
>> + LLVMValueRef hi_bits;
>> +
>> + assert(uint_bld->type.width == 32);
>> +
>> + /* low result bits are tossed away */
>> + lp_build_mul_32_lohi(uint_bld, emit_data->args[0],
>> + emit_data->args[0], &hi_bits);
>> + emit_data->output[emit_data->chan] = hi_bits;
>> }
>>
>> /* TGSI_OPCODE_MAX */
>>
>
> Series looks great to me.
Thanks for the review!
>
> It's worth testing with both AVX and SSE2 if you haven't already.
Yep, piglit didn't show anything (though I didn't do a full run with
sse2, just forced 128bit vectors).
FWIW I could not actually determine a runtime difference with the
generated vs. The reason is mostly that it looks impossible to make
vertex fetch a bottleneck - with something like mesa's trispd, even if I
force rasterizer discard, the 3 most prominent functions are
do_triangle_ccw, lp_setup_bin_triangle and triangle_both, which alone
use over 50% of all cpu cycles... Our method of handling tris with
rasterizer discard is, to put it mildly, not the most efficient (we
still bin everything in llvmpipe, even though those tris don't even need
to reach the draw pipeline stages...). Though it has to be said that the
calculations needed for binning certainly look slow too. Next up in the
list are vbo_Vertex2f and vbo_Color3fv (ok I guess a benchmark using
buffer objects would fare better there), leaving the entire vs itself in
the end with only something like 3%...
I think though where the new vectorized code fares worst compared to the
old one is the 256bit avx (not avx2) path. Since the calculations need
to be split. But it's basically just the mul plus a couple comparisons,
so the generated code still looks better than the old one to me.
Roland
>
> Reviewed-by: Jose Fonseca <jfonseca at vmware.com>
More information about the mesa-dev
mailing list