[Mesa-dev] [PATCH] glsl: improve accuracy of atan()

Wed Sep 24 04:35:34 PDT 2014

On Wed, Sep 24, 2014 at 4:10 AM, Ian Romanick <idr at freedesktop.org> wrote:
> On 09/23/2014 03:39 PM, Erik Faye-Lund wrote:
>> Our current atan()-approximation is pretty inaccurate at 1.0, so
>> let's try to improve the situation by doing a direct approximation
>> without going through atan.
>>
>> This new implementation uses an 11th degree polynomial to approximate
>> atan in the [-1..1] range, and the following identitiy to reduce the
>> entire range to [-1..1]:
>>
>> atan(x) = 0.5 * pi * sign(x) - atan(1.0 / x)
>>
>> This range-reduction idea is taken from the paper "Fast computation
>> of Arctangent Functions for Embedded Applications: A Comparative
>> Analysis" (Ukil et al. 2011).
>>
>> The polynomial that approximates atan(x) is:
>>
>> x   * 0.9999793128310355 - x^3  * 0.3326756418091246 +
>> x^5 * 0.1938924977115610 - x^7  * 0.1173503194786851 +
>> x^9 * 0.0536813784310406 - x^11 * 0.0121323213173444
>>
>> This polynomial was found with the following GNU Octave script:
>>
>> x = linspace(0, 1);
>> y = atan(x);
>> n = [1, 3, 5, 7, 9, 11];
>> format long;
>> polyfitc(x, y, n)
>>
>> The polyfitc function is not built-in, but too long to include here.
>> It can be downloaded from the following URL:
>>
>> http://www.mathworks.com/matlabcentral/fileexchange/47851-constraint-polynomial-fit/content/polyfitc.m
>>
>> This fixes the following piglit test:
>> shaders/glsl-const-folding-01
>>
>> Signed-off-by: Erik Faye-Lund <kusmabite at gmail.com>
>
> Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=49713
>

Ah! Thanks a lot for that link, very useful.

>> We could also go down that path, by introducing an ir_unop_atan that
>> gets lowered to a polynomial before code-generation. That would benefit
>> drivers for hardware with atan-support (at least Mali supports this,
>> according to http://limadriver.org/Lima+ISA/), but it worries me a bit to
>> do constant folding with different precision than execution. But perhaps
>> we already have that problem, only a bit more subtle?
>
> We used to implement constant folding for many built-in functions this
> way.  Built-in functions like atan were detected in the constant folding
> process, and C library functions were used.  Commit 363c14ae0 changed
> this to simplify the constant folding code.  This test case began
> failing at that time, and Vinson submitted bug #49713.

Aha, thanks for clearing that up. However, just to be clear, my
alternative suggestion isn't to go back in that direction. It's to
make atan a middle-class citizen, like ir_unop_[exp|log]. These always
gets lowered to ir_unop_[exp|log]2, but in this case the lowering
would go to arithmetic.

Something like this (also no piglit regressions):

--8<--

diff --git a/src/glsl/builtin_functions.cpp b/src/glsl/builtin_functions.cpp
index 9be7f6d..cc6c8b3 100644
--- a/src/glsl/builtin_functions.cpp
+++ b/src/glsl/builtin_functions.cpp
@@ -442,6 +442,7 @@ private:
    ir_swizzle *matrix_elt(ir_variable *var, int col, int row);

    ir_expression *asin_expr(ir_variable *x);
+   void do_atan(ir_factory &body, const glsl_type *type, ir_variable
*res, operand y_over_x);

    /**
     * Call function \param f with parameters specified as the linked
@@ -2596,6 +2597,7 @@ builtin_builder::_degrees(const glsl_type *type)

 UNOP(sin, ir_unop_sin, always_available)
 UNOP(cos, ir_unop_cos, always_available)
+UNOP(atan, ir_unop_atan, always_available)

 ir_function_signature *
 builtin_builder::_tan(const glsl_type *type)
@@ -2684,11 +2686,7 @@ builtin_builder::_atan2(const glsl_type *type)
       ir_factory outer_then(&outer_if->then_instructions, mem_ctx);

       /* Then...call atan(y/x) */
-      ir_variable *y_over_x =
outer_then.make_temp(glsl_type::float_type, "y_over_x");
-      outer_then.emit(assign(y_over_x, div(y, x)));
-      outer_then.emit(assign(r, mul(y_over_x, rsq(add(mul(y_over_x, y_over_x),
-                                                      imm(1.0f))))));
-      outer_then.emit(assign(r, asin_expr(r)));
+      body.emit(assign(r, expr(ir_unop_atan, div(y, x))));

       /*     ...and fix it up: */
       ir_if *inner_if = new(mem_ctx) ir_if(less(x, imm(0.0f)));
@@ -2712,21 +2710,6 @@ builtin_builder::_atan2(const glsl_type *type)
 }

 ir_function_signature *
-builtin_builder::_atan(const glsl_type *type)
-{
-   ir_variable *y_over_x = in_var(type, "y_over_x");
-   MAKE_SIG(type, always_available, 1, y_over_x);
-
-   ir_variable *t = body.make_temp(type, "t");
-   body.emit(assign(t, mul(y_over_x, rsq(add(mul(y_over_x, y_over_x),
-                                             imm(1.0f))))));
-
-   body.emit(ret(asin_expr(t)));
-
-   return sig;
-}
-
-ir_function_signature *
 builtin_builder::_sinh(const glsl_type *type)
 {
    ir_variable *x = in_var(type, "x");
diff --git a/src/glsl/ir.cpp b/src/glsl/ir.cpp
index 739a9f4..1ec7789 100644
--- a/src/glsl/ir.cpp
+++ b/src/glsl/ir.cpp
@@ -247,6 +247,7 @@ ir_expression::ir_expression(int op, ir_rvalue *op0)
    case ir_unop_cos:
    case ir_unop_sin_reduced:
    case ir_unop_cos_reduced:
+   case ir_unop_atan:
    case ir_unop_dFdx:
    case ir_unop_dFdx_coarse:
    case ir_unop_dFdx_fine:
@@ -513,6 +514,7 @@ static const char *const operator_strs[] = {
    "cos",
    "sin_reduced",
    "cos_reduced",
+   "atan",
    "dFdx",
    "dFdxCoarse",
    "dFdxFine",
diff --git a/src/glsl/ir.h b/src/glsl/ir.h
index 8003f88..c715203 100644
--- a/src/glsl/ir.h
+++ b/src/glsl/ir.h
@@ -1200,6 +1200,7 @@ enum ir_expression_operation {
    ir_unop_cos,
    ir_unop_sin_reduced,    /**< Reduced range sin. [-pi, pi] */
    ir_unop_cos_reduced,    /**< Reduced range cos. [-pi, pi] */
+   ir_unop_atan,
    /*@}*/

    /**
diff --git a/src/glsl/ir_constant_expression.cpp
b/src/glsl/ir_constant_expression.cpp
index 1e8b3a3..ca3d4d7 100644
--- a/src/glsl/ir_constant_expression.cpp
+++ b/src/glsl/ir_constant_expression.cpp
@@ -730,6 +730,12 @@ ir_expression::constant_expression_value(struct
hash_table *variable_context)
       }
       break;

+   case ir_unop_atan:
+      for (unsigned c = 0; c < op[0]->type->components(); c++) {
+        data.f[c] = atanf(op[0]->value.f[c]);
+      }
+      break;
+
    case ir_unop_neg:
       for (unsigned c = 0; c < op[0]->type->components(); c++) {
         switch (this->type->base_type) {
diff --git a/src/glsl/ir_validate.cpp b/src/glsl/ir_validate.cpp
index 97a581d..9cd04d4 100644
--- a/src/glsl/ir_validate.cpp
+++ b/src/glsl/ir_validate.cpp
@@ -317,6 +317,7 @@ ir_validate::visit_leave(ir_expression *ir)
    case ir_unop_cos:
    case ir_unop_sin_reduced:
    case ir_unop_cos_reduced:
+   case ir_unop_atan:
    case ir_unop_dFdx:
    case ir_unop_dFdx_coarse:
    case ir_unop_dFdx_fine:
diff --git a/src/glsl/lower_instructions.cpp b/src/glsl/lower_instructions.cpp
index 6842853..4879424 100644
--- a/src/glsl/lower_instructions.cpp
+++ b/src/glsl/lower_instructions.cpp
@@ -112,6 +112,7 @@
  */

 #include "main/core.h" /* for M_LOG2E */
+#include "program/prog_instruction.h" /* for SWIZZLE_XYZW */
 #include "glsl_types.h"
 #include "ir.h"
 #include "ir_builder.h"
@@ -145,6 +146,7 @@ private:
    void carry_to_arith(ir_expression *);
    void borrow_to_arith(ir_expression *);
    void sat_to_clamp(ir_expression *);
+   void atan_to_arith(ir_expression *);
 };

 } /* anonymous namespace */
@@ -508,6 +510,75 @@ lower_instructions_visitor::sat_to_clamp(ir_expression *ir)
    this->progress = true;
 }

+void
+lower_instructions_visitor::atan_to_arith(ir_expression *ir)
+{
+   ir_variable *y_over_x = new(ir) ir_variable(ir->type, "y_over_x",
ir_var_temporary);
+   ir_variable *x = new(ir) ir_variable(ir->type, "x", ir_var_temporary);
+   ir_variable *xx = new(ir) ir_variable(ir->type, "xx", ir_var_temporary);
+   ir_variable *tmp = new(ir) ir_variable(ir->type, "tmp", ir_var_temporary);
+
+   ir_instruction &i = *base_ir;
+
+   /* Copy <y_over_x> argument */
+   i.insert_before(y_over_x);
+   i.insert_before(assign(y_over_x, ir->operands[0]));
+
+   /*
+    * range-reduction, first step:
+    *
+    *      / y_over_x         if |y_over_x| <= 1.0;
+    * x = <
+    *      \ 1.0 / y_over_x   otherwise
+    */
+   i.insert_before(x);
+   i.insert_before(assign(x, div(min2(abs(y_over_x),
+                                      new(ir) ir_constant(1.0f)),
+                                 max2(abs(y_over_x),
+                                      new(ir) ir_constant(1.0f)))));
+
+   /*
+    * approximate atan by evaluating polynomial:
+    *
+    * x   * 0.9999793128310355 - x^3  * 0.3326756418091246 +
+    * x^5 * 0.1938924977115610 - x^7  * 0.1173503194786851 +
+    * x^9 * 0.0536813784310406 - x^11 * 0.0121323213173444
+    */
+   i.insert_before(xx);
+   i.insert_before(assign(xx, mul(x, x)));
+   ir_constant *term1 = new(ir) ir_constant(+0.9999793128310355f);
+   ir_constant *term3 = new(ir) ir_constant(-0.3326756418091246f);
+   ir_constant *term5 = new(ir) ir_constant(+0.1938924977115610f);
+   ir_constant *term7 = new(ir) ir_constant(-0.1173503194786851f);
+   ir_constant *term9 = new(ir) ir_constant(+0.0536813784310406f);
+   ir_constant *term11 = new(ir) ir_constant(-0.0121323213173444f);
+   i.insert_before(tmp);
+   i.insert_before(assign(tmp,
mul(add(mul(add(mul(add(mul(add(mul(add(mul(term11,
+                                                                           xx),
+                                                                       term9),
+                                                                   xx),
+                                                               term7),
+                                                           xx),
+                                                       term5),
+                                                   xx),
+                                               term3),
+                                           xx),
+                                       term1),
+                                   x)));
+
+   /* range-reduction and sign fixup */
+   ir->operation = ir_binop_mul;
+   ir->operands[0] = add(tmp,
+                         csel(greater(abs(y_over_x),
+                                      new(ir) ir_constant(1.0f,
ir->type->components())),
+                              add(mul(tmp,
+                                      new(ir) ir_constant(-2.0f)),
+                                  new(ir) ir_constant((float) M_PI_2)),
+                              new(ir) ir_constant(0.0f,
ir->type->components())));
+   ir->operands[1] = sign(y_over_x);
+   this->progress = true;
+}
+
 ir_visitor_status
 lower_instructions_visitor::visit_leave(ir_expression *ir)
 {
@@ -569,6 +640,10 @@ lower_instructions_visitor::visit_leave(ir_expression *ir)
          sat_to_clamp(ir);
       break;

+   case ir_unop_atan:
+      // TODO: make a flag for this also
+      atan_to_arith(ir);
+      break;
+
    default:
       return visit_continue;
    }
diff --git a/src/mesa/drivers/dri/i965/brw_fs_channel_expressions.cpp
b/src/mesa/drivers/dri/i965/brw_fs_channel_expressions.cpp
index cb0a079..5b56f9a 100644
--- a/src/mesa/drivers/dri/i965/brw_fs_channel_expressions.cpp
+++ b/src/mesa/drivers/dri/i965/brw_fs_channel_expressions.cpp
@@ -431,6 +431,7 @@
ir_channel_expressions_visitor::visit_leave(ir_assignment *ir)
    case ir_unop_unpack_unorm_2x16:
    case ir_unop_unpack_unorm_4x8:
    case ir_unop_unpack_half_2x16:
+   case ir_unop_atan:
    case ir_binop_ldexp:
    case ir_binop_vector_extract:
    case ir_triop_vector_insert:
diff --git a/src/mesa/drivers/dri/i965/brw_fs_visitor.cpp
b/src/mesa/drivers/dri/i965/brw_fs_visitor.cpp
index 2d5318a..2a8a6fb 100644
--- a/src/mesa/drivers/dri/i965/brw_fs_visitor.cpp
+++ b/src/mesa/drivers/dri/i965/brw_fs_visitor.cpp
@@ -585,6 +586,8 @@ fs_visitor::visit(ir_expression *ir)
    case ir_unop_cos_reduced:
       emit_math(SHADER_OPCODE_COS, this->result, op[0]);
       break;
+   case ir_unop_atan:
+      unreachable("not reached: should be handled by atan_to_arith");

    case ir_unop_dFdx:
       emit(FS_OPCODE_DDX, this->result, op[0], fs_reg(BRW_DERIVATIVE_BY_HINT));
diff --git a/src/mesa/drivers/dri/i965/brw_vec4_visitor.cpp
b/src/mesa/drivers/dri/i965/brw_vec4_visitor.cpp
index 473b3c5..8ed09b8 100644
--- a/src/mesa/drivers/dri/i965/brw_vec4_visitor.cpp
+++ b/src/mesa/drivers/dri/i965/brw_vec4_visitor.cpp
@@ -1365,6 +1366,8 @@ vec4_visitor::visit(ir_expression *ir)
    case ir_unop_cos_reduced:
       emit_math(SHADER_OPCODE_COS, result_dst, op[0]);
       break;
+   case ir_unop_atan:
+      unreachable("not reached: should be handled by atan_to_arith");

    case ir_unop_dFdx:
    case ir_unop_dFdx_coarse:
diff --git a/src/mesa/program/ir_to_mesa.cpp b/src/mesa/program/ir_to_mesa.cpp
index 83cd1ae..991127d 100644
--- a/src/mesa/program/ir_to_mesa.cpp
+++ b/src/mesa/program/ir_to_mesa.cpp
@@ -1115,6 +1117,9 @@ ir_to_mesa_visitor::visit(ir_expression *ir)
    case ir_unop_cos_reduced:
       emit_scs(ir, OPCODE_COS, result_dst, op[0]);
       break;
+   case ir_unop_atan:
+      assert(!"not reached: should be handled by atan_to_arith");
+      break;

    case ir_unop_dFdx:
       emit(ir, OPCODE_DDX, result_dst, op[0]);
diff --git a/src/mesa/state_tracker/st_glsl_to_tgsi.cpp
b/src/mesa/state_tracker/st_glsl_to_tgsi.cpp
index 7c2f0fc..ce3744b 100644
--- a/src/mesa/state_tracker/st_glsl_to_tgsi.cpp
+++ b/src/mesa/state_tracker/st_glsl_to_tgsi.cpp
@@ -1401,6 +1403,9 @@ glsl_to_tgsi_visitor::visit(ir_expression *ir)
    case ir_unop_cos_reduced:
       emit_scs(ir, TGSI_OPCODE_COS, result_dst, op[0]);
       break;
+   case ir_unop_atan:
+      assert(!"not reached: should be handled by atan_to_arith");
+      break;
    case ir_unop_saturate: {
       glsl_to_tgsi_instruction *inst;
       inst = emit(ir, TGSI_OPCODE_MOV, result_dst, op[0]);

-->8--

Note that this version also uses the new polynomial, rather than the
old approximation. But we get even better constant folding. Dunno if
it's worth it -- but it does get rid of that do_atan()-ugliness ;)

> I'm not worried about any potential performance impact because I've
> never seen any application use atan.   Of course, that's also why that
> bug has sat open for long. :)

Yeah, atan is not very common, and when it's needed people seem to
tend to use 1d textures instead. So I think accuracy trumps
performance in this case.

> Reviewed-by: Ian Romanick <ian.d.romanick at intel.com>
>
>> +   /* range-reduction fixup */
>> +   body.emit(assign(tmp, add(tmp,
>> +                             csel(greater(abs(y_over_x),
>> +                                          swizzle(imm(1.0f),
>> +                                                  SWIZZLE_XXXX,
>> +                                                  type->components())),

I guess this  should work just as well (sorry for my lack of ir_builder-fu):

-                                          swizzle(imm(1.0f),
-                                                  SWIZZLE_XXXX,
-                                                  type->components())),
+                                          imm(1.0f, type->components())),

>> +                                  add(mul(tmp,
>> +                                          imm(-2.0f)),
>> +                                      imm(M_PI_2f)),
>> +                                  imm(0.0f)))));

Hm. Don't I need to expand this last immediate to a vector of
type->components() size as well?

If so, this patch should go on top:

diff --git a/src/glsl/builtin_functions.cpp b/src/glsl/builtin_functions.cpp
index 1820dd5..bfa46eb 100644
--- a/src/glsl/builtin_functions.cpp
+++ b/src/glsl/builtin_functions.cpp
@@ -2749,13 +2749,11 @@ builtin_builder::do_atan(ir_factory &body,
const glsl_type *type, ir_variable *r
    /* range-reduction fixup */
    body.emit(assign(tmp, add(tmp,
                              csel(greater(abs(y_over_x),
-                                          swizzle(imm(1.0f),
-                                                  SWIZZLE_XXXX,
-                                                  type->components())),
+                                          imm(1.0f, type->components())),
                                   add(mul(tmp,
                                           imm(-2.0f)),
                                       imm(M_PI_2f)),
-                                  imm(0.0f)))));
+                                  imm(0.0f, type->components())))));

    /* sign fixup */
    body.emit(assign(res, mul(tmp, sign(y_over_x))));