[cairo] Brand new _cairo_lround implementation

Mon Dec 4 17:40:09 PST 2006

So, I've got a totally different _cairo_lround implementation. It was
originally meant to solve some FP burn in cairo (text rendering only)
on non-FPU devices, but as pointed out here:

https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=217819

we also need a new approach that solves the banker's rounding problem.
Lucky, we can kill two birds with one stone. This new approach uses
away-from-zero rounding and doesn't perform any floating-point
operations. A minor disadvantage is that it only works on the input
range (INT_MIN, INT_MAX] instead of [INT_MIN, INT_MAX], but I figure
that if you're that close to the edge, you shouldn't be surprised if
you fall off a little early. Plus, I think we're only using this
function with device coordinates, so negative numbers aren't that
interesting.

I don't get any performance regressions on x86, but I do get a minor
performance boost in text_solid and text_image on the 770 (~1.05).
Nothing too exciting. At least we don't have to use the new
DISABLE_SOME_FLOATING_POINT (yet).

I actually spent a lot of time tweaking this function to really cook
on both ARM and x86 (with two different code paths), but it became
painfully obvious that it really didn't make a difference in the
overall performance. What we have now is plenty good enough on both
architectures.

You can get an additional 1.05x speedup (text cases, again) on the 770
if you inline _cairo_lround, for a grand total 1.10x speedup with the
two patches combined. Not sure if it's worth it, but might as well
bring it up. Plus, I wasn't sure if I inlined it the right way, as
currently there are no cairo-wide inline functions to imitate. So,
this gives us a chance to discuss what is the right way to inline for
future reference.

Dan
-------------- next part --------------
From nobody Mon Sep 17 00:00:00 2001
From: Dan Amelang <dan at amelang.net>
Date: Mon Dec 4 01:24:32 2006 -0800
Subject: [PATCH] Change _cairo_lround to use away-from-zero rounding

This fixes the text rendering bug reported here:

    https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=217819

No performance impact on x86. On the 770, I see minor speedups in text_solid
and text_image (~1.05x).

---

 src/cairo.c |  116 +++++++++++++++++++++++++++++++++++++++++++++++++++++------
 1 files changed, 105 insertions(+), 11 deletions(-)

b42979f1d7781cc8ae55f313397b83d33f347757

diff --git a/src/cairo.c b/src/cairo.c
index ce6a728..55cb004 100644
--- a/src/cairo.c
+++ b/src/cairo.c
@@ -3197,27 +3197,121 @@ _cairo_restrict_value (double *value, do
 	*value = max;
 }
 
-/* This function is identical to the C99 function lround, except that it
- * uses banker's rounding instead of arithmetic rounding. This implementation
- * is much faster (on the platforms we care about) than lround, round, rint,
- * lrint or float (d + 0.5).
+/* This function is identical to the C99 function lround(), except that its
+ * valid input range is (INT_MIN, INT_MAX] instead of [INT_MIN, INT_MAX]. It
+ * is much faster on both x86 and FPU-less systems than any other commonly used
+ * method for rounding (lround, round, rint, lrint or float (d + 0.5)). This
+ * function performs away-from-zero rounding, just like its C99 equivalent.
  *
- * For an explanation of the inner workings of this implemenation, see the
- * documentation for _cairo_fixed_from_double.
+ * The reason why this function is much faster on x86 than other
+ * methods is due to the fact that it avoids the fldcw instruction.
+ * This instruction incurs a large performance penalty on modern Intel
+ * processors due to how it prevents efficient instruction pipelining.
+ *
+ * The reason why this function is much faster on FPU-less systems is for
+ * a different reason. All common rounding methods involve multiple
+ * floating-point operations. Each one of these operations has to be
+ * emulated in software, which adds up to be a large performance penalty.
+ * This function doesn't perform any floating-point calculations, and thus
+ * avoids this penalty.
  */
-#define CAIRO_MAGIC_NUMBER_INT (6755399441055744.0)
 int
 _cairo_lround (double d)
 {
     union {
+        uint32_t ui32[2];
         double d;
-        int32_t i[2];
     } u;
+    uint32_t exponent, integer_result,
+             most_significant_word, least_significant_word;
+    u.d = d;
 
-    u.d = d + CAIRO_MAGIC_NUMBER_INT;
 #ifdef FLOAT_WORDS_BIGENDIAN
-    return u.i[1];
+    most_significant_word  = u.ui32[0];
+    least_significant_word = u.ui32[1];
 #else
-    return u.i[0];
+    most_significant_word  = u.ui32[1];
+    least_significant_word = u.ui32[0];
 #endif
+
+    /* First, we extract and calculate the exponent. We extract it from the
+     * most significant word of the double by shifting off the top of the
+     * mantissa, then masking out the sign bit. Next, the actual exponent value
+     * is calculated by subtracting the extracted value from the bias. Notice
+     * that the correct bias for 64-bit doubles is actually 1075, but we use
+     * 1053 instead for two reasons:
+     *
+     *  1) To perform rounding later on, we will first need the target
+     *     value in a 31.1 fixed-point format. Thus, the exponent value needs
+     *     to be one unit less: (1075 - 1: 1074).
+     *
+     *  2) To avoid shifting the mantissa as a full 64-bit integer (which is
+     *     costly on certain architectures), we break the shift into two parts.
+     *     First, the upper and lower parts of the mantissa are shifted
+     *     individually by a constant amount that all valid inputs will
+     *     require at the very least. This amount is chosen to be 21,
+     *     because this will allow the two parts of the mantissa to
+     *     later be combined into a single 32-bit representation, on which the
+     *     remainder of the shift will be performed. Thus, we decrease the
+     *     exponent value by an additional 21: (1074 - 21: 1053).
+     */
+    exponent = 1053 - ((most_significant_word >> 20) & 0x7FF);
+
+    /* Next, we take the upper part of the mantissa from the most significant
+     * word of the input double, OR in the implicit 1, and perform the shift
+     * described above. Notice that by shifting the 32-bit integer left 11
+     * places, we get the same result as we would had we shifted a 64-bit
+     * integer right 21 places and extracted the lower 32 bits.
+     */
+    integer_result = (most_significant_word | 0x100000) << 11;
+
+    /* Then, we shift the lower part of the mantissa by the constant value as
+     * outlined above, and OR the result with the top part of the mantissa.
+     * Both parts of the mantissa are now packed into a single 32-bit integer.
+     */
+    integer_result |= (least_significant_word >> 21);
+
+    /* Here, we perform the shift that converts the X.Y fixed-point number
+     * currently found in integer_result to the desired 31.1 fixed-point format
+     * needed for the rounding step. It is important to consider all possible
+     * values for exponent at this point:
+     *
+     * - {exponent < 0} Since exponent is an unsigned integer, it really
+     *   can't have a value less than zero. But, if the exponent calculation
+     *   above caused underflow (which would happen with input > INT_MAX or
+     *   input <= INT_MIN) then exponent will now be a very large number, and
+     *   so this shift will result in complete garbage. But that's OK, as
+     *   the input was out of our range, so our output is expected to be
+     *   undefined.
+     *
+     * - {exponent > 31} If the magnitude of the input was very small
+     *   (i.e. |input| << 1.0), the exponent will have a value greater than
+     *   31. Thus, this shift will also result in garbage. So, after performing
+     *   the shift, we will zero-out integer_result if this is the case.
+     *
+     * - {0 <= exponent < 32} In this case, the shift will properly convert
+     *   the mantissa into a 31.1 fixed-point number.
+     */
+    integer_result >>= exponent;
+
+    /* We zero-out the result if the magnitude of the input was very small
+     * (as explained in the section above).
+     */
+    if (exponent > 31)
+        integer_result = 0;
+
+    /* This is where we perform rounding with the 31.1 fixed-point number.
+     * Since what we're after is arithmetic rounding, we simply add the
+     * single fractional bit into the integer part of integer_result, and
+     * just keep the integer part.
+     */
+    integer_result = (integer_result >> 1) + (integer_result & 0x1);
+
+    /* If the input double was a negative number, then we have to negate our
+     * output.
+     */
+    if (most_significant_word & 0x80000000)
+        integer_result = -integer_result;
+
+    return integer_result;
 }
-- 
1.2.6
-------------- next part --------------
From nobody Mon Sep 17 00:00:00 2001
From: Dan Amelang <dan at amelang.net>
Date: Mon Dec 4 15:45:27 2006 -0800
Subject: [PATCH] Make _cairo_lround an inline function

This speeds up the text_solid and text_image perf tests on the 770 by another
~1.05x. No performance change on x86.

---

 src/cairo.c    |  119 --------------------------------------------------------
 src/cairoint.h |  120 +++++++++++++++++++++++++++++++++++++++++++++++++++++++-
 2 files changed, 118 insertions(+), 121 deletions(-)

c8107e5c16e883822c130c27f2ef38a51e85e485
diff --git a/src/cairo.c b/src/cairo.c
index 55cb004..15230ed 100644
--- a/src/cairo.c
+++ b/src/cairo.c
@@ -3196,122 +3196,3 @@ _cairo_restrict_value (double *value, do
     else if (*value > max)
 	*value = max;
 }
-
-/* This function is identical to the C99 function lround(), except that its
- * valid input range is (INT_MIN, INT_MAX] instead of [INT_MIN, INT_MAX]. It
- * is much faster on both x86 and FPU-less systems than any other commonly used
- * method for rounding (lround, round, rint, lrint or float (d + 0.5)). This
- * function performs away-from-zero rounding, just like its C99 equivalent.
- *
- * The reason why this function is much faster on x86 than other
- * methods is due to the fact that it avoids the fldcw instruction.
- * This instruction incurs a large performance penalty on modern Intel
- * processors due to how it prevents efficient instruction pipelining.
- *
- * The reason why this function is much faster on FPU-less systems is for
- * a different reason. All common rounding methods involve multiple
- * floating-point operations. Each one of these operations has to be
- * emulated in software, which adds up to be a large performance penalty.
- * This function doesn't perform any floating-point calculations, and thus
- * avoids this penalty.
- */
-int
-_cairo_lround (double d)
-{
-    union {
-        uint32_t ui32[2];
-        double d;
-    } u;
-    uint32_t exponent, integer_result,
-             most_significant_word, least_significant_word;
-    u.d = d;
-
-#ifdef FLOAT_WORDS_BIGENDIAN
-    most_significant_word  = u.ui32[0];
-    least_significant_word = u.ui32[1];
-#else
-    most_significant_word  = u.ui32[1];
-    least_significant_word = u.ui32[0];
-#endif
-
-    /* First, we extract and calculate the exponent. We extract it from the
-     * most significant word of the double by shifting off the top of the
-     * mantissa, then masking out the sign bit. Next, the actual exponent value
-     * is calculated by subtracting the extracted value from the bias. Notice
-     * that the correct bias for 64-bit doubles is actually 1075, but we use
-     * 1053 instead for two reasons:
-     *
-     *  1) To perform rounding later on, we will first need the target
-     *     value in a 31.1 fixed-point format. Thus, the exponent value needs
-     *     to be one unit less: (1075 - 1: 1074).
-     *
-     *  2) To avoid shifting the mantissa as a full 64-bit integer (which is
-     *     costly on certain architectures), we break the shift into two parts.
-     *     First, the upper and lower parts of the mantissa are shifted
-     *     individually by a constant amount that all valid inputs will
-     *     require at the very least. This amount is chosen to be 21,
-     *     because this will allow the two parts of the mantissa to
-     *     later be combined into a single 32-bit representation, on which the
-     *     remainder of the shift will be performed. Thus, we decrease the
-     *     exponent value by an additional 21: (1074 - 21: 1053).
-     */
-    exponent = 1053 - ((most_significant_word >> 20) & 0x7FF);
-
-    /* Next, we take the upper part of the mantissa from the most significant
-     * word of the input double, OR in the implicit 1, and perform the shift
-     * described above. Notice that by shifting the 32-bit integer left 11
-     * places, we get the same result as we would had we shifted a 64-bit
-     * integer right 21 places and extracted the lower 32 bits.
-     */
-    integer_result = (most_significant_word | 0x100000) << 11;
-
-    /* Then, we shift the lower part of the mantissa by the constant value as
-     * outlined above, and OR the result with the top part of the mantissa.
-     * Both parts of the mantissa are now packed into a single 32-bit integer.
-     */
-    integer_result |= (least_significant_word >> 21);
-
-    /* Here, we perform the shift that converts the X.Y fixed-point number
-     * currently found in integer_result to the desired 31.1 fixed-point format
-     * needed for the rounding step. It is important to consider all possible
-     * values for exponent at this point:
-     *
-     * - {exponent < 0} Since exponent is an unsigned integer, it really
-     *   can't have a value less than zero. But, if the exponent calculation
-     *   above caused underflow (which would happen with input > INT_MAX or
-     *   input <= INT_MIN) then exponent will now be a very large number, and
-     *   so this shift will result in complete garbage. But that's OK, as
-     *   the input was out of our range, so our output is expected to be
-     *   undefined.
-     *
-     * - {exponent > 31} If the magnitude of the input was very small
-     *   (i.e. |input| << 1.0), the exponent will have a value greater than
-     *   31. Thus, this shift will also result in garbage. So, after performing
-     *   the shift, we will zero-out integer_result if this is the case.
-     *
-     * - {0 <= exponent < 32} In this case, the shift will properly convert
-     *   the mantissa into a 31.1 fixed-point number.
-     */
-    integer_result >>= exponent;
-
-    /* We zero-out the result if the magnitude of the input was very small
-     * (as explained in the section above).
-     */
-    if (exponent > 31)
-        integer_result = 0;
-
-    /* This is where we perform rounding with the 31.1 fixed-point number.
-     * Since what we're after is arithmetic rounding, we simply add the
-     * single fractional bit into the integer part of integer_result, and
-     * just keep the integer part.
-     */
-    integer_result = (integer_result >> 1) + (integer_result & 0x1);
-
-    /* If the input double was a negative number, then we have to negate our
-     * output.
-     */
-    if (most_significant_word & 0x80000000)
-        integer_result = -integer_result;
-
-    return integer_result;
-}
diff --git a/src/cairoint.h b/src/cairoint.h
index 1f74d62..69cab0c 100755
--- a/src/cairoint.h
+++ b/src/cairoint.h
@@ -1155,8 +1155,124 @@ typedef struct _cairo_stroke_face {
 cairo_private void
 _cairo_restrict_value (double *value, double min, double max);
 
-cairo_private int
-_cairo_lround (double d);
+/* This function is identical to the C99 function lround(), except that its
+ * valid input range is (INT_MIN, INT_MAX] instead of [INT_MIN, INT_MAX]. It
+ * is much faster on both x86 and FPU-less systems than any other commonly used
+ * method for rounding (lround, round, rint, lrint or float (d + 0.5)). This
+ * function performs away-from-zero rounding, just like its C99 equivalent.
+ *
+ * The reason why this function is much faster on x86 than other
+ * methods is due to the fact that it avoids the fldcw instruction.
+ * This instruction incurs a large performance penalty on modern Intel
+ * processors due to how it prevents efficient instruction pipelining.
+ *
+ * The reason why this function is much faster on FPU-less systems is for
+ * a different reason. All common rounding methods involve multiple
+ * floating-point operations. Each one of these operations has to be
+ * emulated in software, which adds up to be a large performance penalty.
+ * This function doesn't perform any floating-point calculations, and thus
+ * avoids this penalty.
+ */
+static inline int
+_cairo_lround (double d)
+{
+    union {
+        uint32_t ui32[2];
+        double d;
+    } u;
+    uint32_t exponent, integer_result,
+             most_significant_word, least_significant_word;
+    u.d = d;
+
+#ifdef FLOAT_WORDS_BIGENDIAN
+    most_significant_word  = u.ui32[0];
+    least_significant_word = u.ui32[1];
+#else
+    most_significant_word  = u.ui32[1];
+    least_significant_word = u.ui32[0];
+#endif
+
+    /* First, we extract and calculate the exponent. We extract it from the
+     * most significant word of the double by shifting off the top of the
+     * mantissa, then masking out the sign bit. Next, the actual exponent value
+     * is calculated by subtracting the extracted value from the bias. Notice
+     * that the correct bias for 64-bit doubles is actually 1075, but we use
+     * 1053 instead for two reasons:
+     *
+     *  1) To perform rounding later on, we will first need the target
+     *     value in a 31.1 fixed-point format. Thus, the exponent value needs
+     *     to be one unit less: (1075 - 1: 1074).
+     *
+     *  2) To avoid shifting the mantissa as a full 64-bit integer (which is
+     *     costly on certain architectures), we break the shift into two parts.
+     *     First, the upper and lower parts of the mantissa are shifted
+     *     individually by a constant amount that all valid inputs will
+     *     require at the very least. This amount is chosen to be 21,
+     *     because this will allow the two parts of the mantissa to
+     *     later be combined into a single 32-bit representation, on which the
+     *     remainder of the shift will be performed. Thus, we decrease the
+     *     exponent value by an additional 21: (1074 - 21: 1053).
+     */
+    exponent = 1053 - ((most_significant_word >> 20) & 0x7FF);
+
+    /* Next, we take the upper part of the mantissa from the most significant
+     * word of the input double, OR in the implicit 1, and perform the shift
+     * described above. Notice that by shifting the 32-bit integer left 11
+     * places, we get the same result as we would had we shifted a 64-bit
+     * integer right 21 places and extracted the lower 32 bits.
+     */
+    integer_result = (most_significant_word | 0x100000) << 11;
+
+    /* Then, we shift the lower part of the mantissa by the constant value as
+     * outlined above, and OR the result with the top part of the mantissa.
+     * Both parts of the mantissa are now packed into a single 32-bit integer.
+     */
+    integer_result |= (least_significant_word >> 21);
+
+    /* Here, we perform the shift that converts the X.Y fixed-point number
+     * currently found in integer_result to the desired 31.1 fixed-point format
+     * needed for the rounding step. It is important to consider all possible
+     * values for exponent at this point:
+     *
+     * - {exponent < 0} Since exponent is an unsigned integer, it really
+     *   can't have a value less than zero. But, if the exponent calculation
+     *   above caused underflow (which would happen with input > INT_MAX or
+     *   input <= INT_MIN) then exponent will now be a very large number, and
+     *   so this shift will result in complete garbage. But that's OK, as
+     *   the input was out of our range, so our output is expected to be
+     *   undefined.
+     *
+     * - {exponent > 31} If the magnitude of the input was very small
+     *   (i.e. |input| << 1.0), the exponent will have a value greater than
+     *   31. Thus, this shift will also result in garbage. So, after performing
+     *   the shift, we will zero-out integer_result if this is the case.
+     *
+     * - {0 <= exponent < 32} In this case, the shift will properly convert
+     *   the mantissa into a 31.1 fixed-point number.
+     */
+    integer_result >>= exponent;
+
+    /* We zero-out the result if the magnitude of the input was very small
+     * (as explained in the section above).
+     */
+    if (exponent > 31)
+        integer_result = 0;
+
+    /* This is where we perform rounding with the 31.1 fixed-point number.
+     * Since what we're after is arithmetic rounding, we simply add the
+     * single fractional bit into the integer part of integer_result, and
+     * just keep the integer part.
+     */
+    integer_result = (integer_result >> 1) + (integer_result & 0x1);
+
+    /* If the input double was a negative number, then we have to negate our
+     * output.
+     */
+    if (most_significant_word & 0x80000000)
+        integer_result = -integer_result;
+
+    return integer_result;
+}
 
 /* cairo_fixed.c */
 cairo_private cairo_fixed_t
-- 
1.2.6