[Mesa-dev] [PATCH 1/2] gallivm: better support for fast rsqrt

Thu Jul 11 08:28:19 PDT 2013

Am 11.07.2013 16:13, schrieb Brian Paul:
> Just minor things below.
Ok, I'll fix this (and the things Jose mentioned).

> 
> Reviewed-by: Brian Paul <brianp at vmware.com>
> 
> On 07/10/2013 06:21 PM, sroland at vmware.com wrote:
>> From: Roland Scheidegger <sroland at vmware.com>
>>
>> We had to disable fast rsqrt before because it wasn't precise enough etc.
>> However in situations when we know we're not going to need more precision
>> we can still use a fast rsqrt (which can be several times faster than
>> the quite expensive sqrt). Hence introduce a new helper which does
>> exactly
>> that - it is probably not useful calling it in some situations if there's
>> no fast rsqrt available so make it queryable if it's available too.
>> ---
>>   src/gallium/auxiliary/gallivm/lp_bld_arit.c |   55
>> +++++++++++++++++++++++++++
>>   src/gallium/auxiliary/gallivm/lp_bld_arit.h |    7 ++++
>>   2 files changed, 62 insertions(+)
>>
>> diff --git a/src/gallium/auxiliary/gallivm/lp_bld_arit.c
>> b/src/gallium/auxiliary/gallivm/lp_bld_arit.c
>> index 08aec79..2e84543 100644
>> --- a/src/gallium/auxiliary/gallivm/lp_bld_arit.c
>> +++ b/src/gallium/auxiliary/gallivm/lp_bld_arit.c
>> @@ -2361,6 +2361,61 @@ lp_build_rsqrt(struct lp_build_context *bld,
>>      return lp_build_rcp(bld, lp_build_sqrt(bld, a));
>>   }
>>
>> +/**
>> + * If there's a fast (inaccurate) rsqrt instruction available
>> + * (caller may want to avoid to call rsqrt_fast if it's not available,
>> + * i.e. for calculating x^0.5 it may do rsqrt_fast(x) * x but if
>> + * unavailable it would result in sqrt/div/mul so obviously
>> + * much better to just call sqrt, skipping both div and mul).
>> + */
>> +boolean
>> +lp_build_fast_rsqrt_available(struct lp_type type) {
> 
> Opening brace on next line.
> 
> 
>> +
>> +   assert(type.floating);
>> +
>> +   if ((util_cpu_caps.has_sse && type.width == 32 && type.length ==
>> 4) ||
>> +       (util_cpu_caps.has_avx && type.width == 32 && type.length ==
>> 8)) {
>> +      return true;
>> +   }
>> +   return false;
>> +}
>> +
>> +
>> +/**
>> + * Generate 1/sqrt(a).
>> + * Result is undefined for values < 0, infinity for +0.
>> + * Precision is limited, only ~10 bits guaranteed
>> + * (rsqrt 1.0 may not be 1.0, denorms may be flushed to 0).
>> + */
>> +LLVMValueRef
>> +lp_build_rsqrt_fast(struct lp_build_context *bld,
>> +                    LLVMValueRef a)
>> +{
>> +   LLVMBuilderRef builder = bld->gallivm->builder;
>> +   const struct lp_type type = bld->type;
>> +
>> +   assert(lp_check_value(type, a));
>> +
>> +   assert(type.floating);
>> +
>> +   if ((util_cpu_caps.has_sse && type.width == 32 && type.length ==
>> 4) ||
>> +       (util_cpu_caps.has_avx && type.width == 32 && type.length ==
>> 8)) {
> 
> This looks like it could be "if (lp_build_fast_rsqrt_available(type)"
> 
> 
>> +      const char *intrinsic = NULL;
>> +
>> +      if (type.length == 4) {
>> +         intrinsic = "llvm.x86.sse.rsqrt.ps";
>> +      }
>> +      else {
>> +         intrinsic = "llvm.x86.avx.rsqrt.ps.256";
>> +      }
>> +      return lp_build_intrinsic_unary(builder, intrinsic,
>> bld->vec_type, a);
>> +   }
>> +   else {
>> +      debug_printf("%s: emulating fast rsqrt with rcp/sqrt\n",
>> __FUNCTION__);
> 
> Are you sure you want to leave the debug_printf() in?
I left it in on purpose, I guess it's arguable - might want to revisit
if there's really some caller which absolutely needs fast reciprocal
square root. The problem is that we can't use it for the shader rsqrt -
while these don't need full precision it's just too crappy and our
attempts at fixing it up were not really all that succesful.
So this leaves only internal code, and I suspect most of the time you
could possibly avoid needing it. The emulation really is slow with
sqrt/div - both operations are (on pretty much all cpus I believe)
non-pipelined instructions taking quite some time (both about the same
usually), both executed by divide unit. In that light it makes sense if
callers try to avoid using it. But it can easily be changed if there's
some caller really needing fast rsqrt (normalization is the obvious use
case), so far all callers (other than linear_to_srgb) need the "normal"
rsqrt.
I actually initially wanted to change the interface of lp_build_rsqrt()
instead to use a precision argument (or flag) instead. But that way it
couldn't be part of easy unary operation testing in lp_test code, plus
at least the linear_to_srgb code really wants to avoid calling it if
it's not available so it seemed cleaner to just have a separate fast
rsqrt (along with the capability to query if it's available).

Roland


> 
> 
>> +   }
>> +   return lp_build_rcp(bld, lp_build_sqrt(bld, a));
>> +}
>> +
>>
>>   /**
>>    * Generate sin(a) using SSE2
>> diff --git a/src/gallium/auxiliary/gallivm/lp_bld_arit.h
>> b/src/gallium/auxiliary/gallivm/lp_bld_arit.h
>> index 966796c..d53e471 100644
>> --- a/src/gallium/auxiliary/gallivm/lp_bld_arit.h
>> +++ b/src/gallium/auxiliary/gallivm/lp_bld_arit.h
>> @@ -231,6 +231,13 @@ LLVMValueRef
>>   lp_build_rsqrt(struct lp_build_context *bld,
>>                  LLVMValueRef a);
>>
>> +boolean
>> +lp_build_fast_rsqrt_available(struct lp_type type);
>> +
>> +LLVMValueRef
>> +lp_build_rsqrt_fast(struct lp_build_context *bld,
>> +                    LLVMValueRef a);
>> +
>>   LLVMValueRef
>>   lp_build_cos(struct lp_build_context *bld,
>>                LLVMValueRef a);
>>