[cairo] Brand new _cairo_lround implementation
Daniel Amelang
daniel.amelang at gmail.com
Mon Dec 4 17:40:09 PST 2006
So, I've got a totally different _cairo_lround implementation. It was
originally meant to solve some FP burn in cairo (text rendering only)
on nonFPU devices, but as pointed out here:
https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=217819
we also need a new approach that solves the banker's rounding problem.
Lucky, we can kill two birds with one stone. This new approach uses
awayfromzero rounding and doesn't perform any floatingpoint
operations. A minor disadvantage is that it only works on the input
range (INT_MIN, INT_MAX] instead of [INT_MIN, INT_MAX], but I figure
that if you're that close to the edge, you shouldn't be surprised if
you fall off a little early. Plus, I think we're only using this
function with device coordinates, so negative numbers aren't that
interesting.
I don't get any performance regressions on x86, but I do get a minor
performance boost in text_solid and text_image on the 770 (~1.05).
Nothing too exciting. At least we don't have to use the new
DISABLE_SOME_FLOATING_POINT (yet).
I actually spent a lot of time tweaking this function to really cook
on both ARM and x86 (with two different code paths), but it became
painfully obvious that it really didn't make a difference in the
overall performance. What we have now is plenty good enough on both
architectures.
You can get an additional 1.05x speedup (text cases, again) on the 770
if you inline _cairo_lround, for a grand total 1.10x speedup with the
two patches combined. Not sure if it's worth it, but might as well
bring it up. Plus, I wasn't sure if I inlined it the right way, as
currently there are no cairowide inline functions to imitate. So,
this gives us a chance to discuss what is the right way to inline for
future reference.
Dan
 next part 
From nobody Mon Sep 17 00:00:00 2001
From: Dan Amelang <dan at amelang.net>
Date: Mon Dec 4 01:24:32 2006 0800
Subject: [PATCH] Change _cairo_lround to use awayfromzero rounding
This fixes the text rendering bug reported here:
https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=217819
No performance impact on x86. On the 770, I see minor speedups in text_solid
and text_image (~1.05x).

src/cairo.c  116 +++++++++++++++++++++++++++++++++++++++++++++++++++++
1 files changed, 105 insertions(+), 11 deletions()
b42979f1d7781cc8ae55f313397b83d33f347757
diff git a/src/cairo.c b/src/cairo.c
index ce6a728..55cb004 100644
 a/src/cairo.c
+++ b/src/cairo.c
@@ 3197,27 +3197,121 @@ _cairo_restrict_value (double *value, do
*value = max;
}
/* This function is identical to the C99 function lround, except that it
 * uses banker's rounding instead of arithmetic rounding. This implementation
 * is much faster (on the platforms we care about) than lround, round, rint,
 * lrint or float (d + 0.5).
+/* This function is identical to the C99 function lround(), except that its
+ * valid input range is (INT_MIN, INT_MAX] instead of [INT_MIN, INT_MAX]. It
+ * is much faster on both x86 and FPUless systems than any other commonly used
+ * method for rounding (lround, round, rint, lrint or float (d + 0.5)). This
+ * function performs awayfromzero rounding, just like its C99 equivalent.
*
 * For an explanation of the inner workings of this implemenation, see the
 * documentation for _cairo_fixed_from_double.
+ * The reason why this function is much faster on x86 than other
+ * methods is due to the fact that it avoids the fldcw instruction.
+ * This instruction incurs a large performance penalty on modern Intel
+ * processors due to how it prevents efficient instruction pipelining.
+ *
+ * The reason why this function is much faster on FPUless systems is for
+ * a different reason. All common rounding methods involve multiple
+ * floatingpoint operations. Each one of these operations has to be
+ * emulated in software, which adds up to be a large performance penalty.
+ * This function doesn't perform any floatingpoint calculations, and thus
+ * avoids this penalty.
*/
#define CAIRO_MAGIC_NUMBER_INT (6755399441055744.0)
int
_cairo_lround (double d)
{
union {
+ uint32_t ui32[2];
double d;
 int32_t i[2];
} u;
+ uint32_t exponent, integer_result,
+ most_significant_word, least_significant_word;
+ u.d = d;
 u.d = d + CAIRO_MAGIC_NUMBER_INT;
#ifdef FLOAT_WORDS_BIGENDIAN
 return u.i[1];
+ most_significant_word = u.ui32[0];
+ least_significant_word = u.ui32[1];
#else
 return u.i[0];
+ most_significant_word = u.ui32[1];
+ least_significant_word = u.ui32[0];
#endif
+
+ /* First, we extract and calculate the exponent. We extract it from the
+ * most significant word of the double by shifting off the top of the
+ * mantissa, then masking out the sign bit. Next, the actual exponent value
+ * is calculated by subtracting the extracted value from the bias. Notice
+ * that the correct bias for 64bit doubles is actually 1075, but we use
+ * 1053 instead for two reasons:
+ *
+ * 1) To perform rounding later on, we will first need the target
+ * value in a 31.1 fixedpoint format. Thus, the exponent value needs
+ * to be one unit less: (1075  1: 1074).
+ *
+ * 2) To avoid shifting the mantissa as a full 64bit integer (which is
+ * costly on certain architectures), we break the shift into two parts.
+ * First, the upper and lower parts of the mantissa are shifted
+ * individually by a constant amount that all valid inputs will
+ * require at the very least. This amount is chosen to be 21,
+ * because this will allow the two parts of the mantissa to
+ * later be combined into a single 32bit representation, on which the
+ * remainder of the shift will be performed. Thus, we decrease the
+ * exponent value by an additional 21: (1074  21: 1053).
+ */
+ exponent = 1053  ((most_significant_word >> 20) & 0x7FF);
+
+ /* Next, we take the upper part of the mantissa from the most significant
+ * word of the input double, OR in the implicit 1, and perform the shift
+ * described above. Notice that by shifting the 32bit integer left 11
+ * places, we get the same result as we would had we shifted a 64bit
+ * integer right 21 places and extracted the lower 32 bits.
+ */
+ integer_result = (most_significant_word  0x100000) << 11;
+
+ /* Then, we shift the lower part of the mantissa by the constant value as
+ * outlined above, and OR the result with the top part of the mantissa.
+ * Both parts of the mantissa are now packed into a single 32bit integer.
+ */
+ integer_result = (least_significant_word >> 21);
+
+ /* Here, we perform the shift that converts the X.Y fixedpoint number
+ * currently found in integer_result to the desired 31.1 fixedpoint format
+ * needed for the rounding step. It is important to consider all possible
+ * values for exponent at this point:
+ *
+ *  {exponent < 0} Since exponent is an unsigned integer, it really
+ * can't have a value less than zero. But, if the exponent calculation
+ * above caused underflow (which would happen with input > INT_MAX or
+ * input <= INT_MIN) then exponent will now be a very large number, and
+ * so this shift will result in complete garbage. But that's OK, as
+ * the input was out of our range, so our output is expected to be
+ * undefined.
+ *
+ *  {exponent > 31} If the magnitude of the input was very small
+ * (i.e. input << 1.0), the exponent will have a value greater than
+ * 31. Thus, this shift will also result in garbage. So, after performing
+ * the shift, we will zeroout integer_result if this is the case.
+ *
+ *  {0 <= exponent < 32} In this case, the shift will properly convert
+ * the mantissa into a 31.1 fixedpoint number.
+ */
+ integer_result >>= exponent;
+
+ /* We zeroout the result if the magnitude of the input was very small
+ * (as explained in the section above).
+ */
+ if (exponent > 31)
+ integer_result = 0;
+
+ /* This is where we perform rounding with the 31.1 fixedpoint number.
+ * Since what we're after is arithmetic rounding, we simply add the
+ * single fractional bit into the integer part of integer_result, and
+ * just keep the integer part.
+ */
+ integer_result = (integer_result >> 1) + (integer_result & 0x1);
+
+ /* If the input double was a negative number, then we have to negate our
+ * output.
+ */
+ if (most_significant_word & 0x80000000)
+ integer_result = integer_result;
+
+ return integer_result;
}

1.2.6
 next part 
From nobody Mon Sep 17 00:00:00 2001
From: Dan Amelang <dan at amelang.net>
Date: Mon Dec 4 15:45:27 2006 0800
Subject: [PATCH] Make _cairo_lround an inline function
This speeds up the text_solid and text_image perf tests on the 770 by another
~1.05x. No performance change on x86.

src/cairo.c  119 
src/cairoint.h  120 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 118 insertions(+), 121 deletions()
c8107e5c16e883822c130c27f2ef38a51e85e485
diff git a/src/cairo.c b/src/cairo.c
index 55cb004..15230ed 100644
 a/src/cairo.c
+++ b/src/cairo.c
@@ 3196,122 +3196,3 @@ _cairo_restrict_value (double *value, do
else if (*value > max)
*value = max;
}

/* This function is identical to the C99 function lround(), except that its
 * valid input range is (INT_MIN, INT_MAX] instead of [INT_MIN, INT_MAX]. It
 * is much faster on both x86 and FPUless systems than any other commonly used
 * method for rounding (lround, round, rint, lrint or float (d + 0.5)). This
 * function performs awayfromzero rounding, just like its C99 equivalent.
 *
 * The reason why this function is much faster on x86 than other
 * methods is due to the fact that it avoids the fldcw instruction.
 * This instruction incurs a large performance penalty on modern Intel
 * processors due to how it prevents efficient instruction pipelining.
 *
 * The reason why this function is much faster on FPUless systems is for
 * a different reason. All common rounding methods involve multiple
 * floatingpoint operations. Each one of these operations has to be
 * emulated in software, which adds up to be a large performance penalty.
 * This function doesn't perform any floatingpoint calculations, and thus
 * avoids this penalty.
 */
int
_cairo_lround (double d)
{
 union {
 uint32_t ui32[2];
 double d;
 } u;
 uint32_t exponent, integer_result,
 most_significant_word, least_significant_word;
 u.d = d;

#ifdef FLOAT_WORDS_BIGENDIAN
 most_significant_word = u.ui32[0];
 least_significant_word = u.ui32[1];
#else
 most_significant_word = u.ui32[1];
 least_significant_word = u.ui32[0];
#endif

 /* First, we extract and calculate the exponent. We extract it from the
 * most significant word of the double by shifting off the top of the
 * mantissa, then masking out the sign bit. Next, the actual exponent value
 * is calculated by subtracting the extracted value from the bias. Notice
 * that the correct bias for 64bit doubles is actually 1075, but we use
 * 1053 instead for two reasons:
 *
 * 1) To perform rounding later on, we will first need the target
 * value in a 31.1 fixedpoint format. Thus, the exponent value needs
 * to be one unit less: (1075  1: 1074).
 *
 * 2) To avoid shifting the mantissa as a full 64bit integer (which is
 * costly on certain architectures), we break the shift into two parts.
 * First, the upper and lower parts of the mantissa are shifted
 * individually by a constant amount that all valid inputs will
 * require at the very least. This amount is chosen to be 21,
 * because this will allow the two parts of the mantissa to
 * later be combined into a single 32bit representation, on which the
 * remainder of the shift will be performed. Thus, we decrease the
 * exponent value by an additional 21: (1074  21: 1053).
 */
 exponent = 1053  ((most_significant_word >> 20) & 0x7FF);

 /* Next, we take the upper part of the mantissa from the most significant
 * word of the input double, OR in the implicit 1, and perform the shift
 * described above. Notice that by shifting the 32bit integer left 11
 * places, we get the same result as we would had we shifted a 64bit
 * integer right 21 places and extracted the lower 32 bits.
 */
 integer_result = (most_significant_word  0x100000) << 11;

 /* Then, we shift the lower part of the mantissa by the constant value as
 * outlined above, and OR the result with the top part of the mantissa.
 * Both parts of the mantissa are now packed into a single 32bit integer.
 */
 integer_result = (least_significant_word >> 21);

 /* Here, we perform the shift that converts the X.Y fixedpoint number
 * currently found in integer_result to the desired 31.1 fixedpoint format
 * needed for the rounding step. It is important to consider all possible
 * values for exponent at this point:
 *
 *  {exponent < 0} Since exponent is an unsigned integer, it really
 * can't have a value less than zero. But, if the exponent calculation
 * above caused underflow (which would happen with input > INT_MAX or
 * input <= INT_MIN) then exponent will now be a very large number, and
 * so this shift will result in complete garbage. But that's OK, as
 * the input was out of our range, so our output is expected to be
 * undefined.
 *
 *  {exponent > 31} If the magnitude of the input was very small
 * (i.e. input << 1.0), the exponent will have a value greater than
 * 31. Thus, this shift will also result in garbage. So, after performing
 * the shift, we will zeroout integer_result if this is the case.
 *
 *  {0 <= exponent < 32} In this case, the shift will properly convert
 * the mantissa into a 31.1 fixedpoint number.
 */
 integer_result >>= exponent;

 /* We zeroout the result if the magnitude of the input was very small
 * (as explained in the section above).
 */
 if (exponent > 31)
 integer_result = 0;

 /* This is where we perform rounding with the 31.1 fixedpoint number.
 * Since what we're after is arithmetic rounding, we simply add the
 * single fractional bit into the integer part of integer_result, and
 * just keep the integer part.
 */
 integer_result = (integer_result >> 1) + (integer_result & 0x1);

 /* If the input double was a negative number, then we have to negate our
 * output.
 */
 if (most_significant_word & 0x80000000)
 integer_result = integer_result;

 return integer_result;
}
diff git a/src/cairoint.h b/src/cairoint.h
index 1f74d62..69cab0c 100755
 a/src/cairoint.h
+++ b/src/cairoint.h
@@ 1155,8 +1155,124 @@ typedef struct _cairo_stroke_face {
cairo_private void
_cairo_restrict_value (double *value, double min, double max);
cairo_private int
_cairo_lround (double d);
+/* This function is identical to the C99 function lround(), except that its
+ * valid input range is (INT_MIN, INT_MAX] instead of [INT_MIN, INT_MAX]. It
+ * is much faster on both x86 and FPUless systems than any other commonly used
+ * method for rounding (lround, round, rint, lrint or float (d + 0.5)). This
+ * function performs awayfromzero rounding, just like its C99 equivalent.
+ *
+ * The reason why this function is much faster on x86 than other
+ * methods is due to the fact that it avoids the fldcw instruction.
+ * This instruction incurs a large performance penalty on modern Intel
+ * processors due to how it prevents efficient instruction pipelining.
+ *
+ * The reason why this function is much faster on FPUless systems is for
+ * a different reason. All common rounding methods involve multiple
+ * floatingpoint operations. Each one of these operations has to be
+ * emulated in software, which adds up to be a large performance penalty.
+ * This function doesn't perform any floatingpoint calculations, and thus
+ * avoids this penalty.
+ */
+static inline int
+_cairo_lround (double d)
+{
+ union {
+ uint32_t ui32[2];
+ double d;
+ } u;
+ uint32_t exponent, integer_result,
+ most_significant_word, least_significant_word;
+ u.d = d;
+
+#ifdef FLOAT_WORDS_BIGENDIAN
+ most_significant_word = u.ui32[0];
+ least_significant_word = u.ui32[1];
+#else
+ most_significant_word = u.ui32[1];
+ least_significant_word = u.ui32[0];
+#endif
+
+ /* First, we extract and calculate the exponent. We extract it from the
+ * most significant word of the double by shifting off the top of the
+ * mantissa, then masking out the sign bit. Next, the actual exponent value
+ * is calculated by subtracting the extracted value from the bias. Notice
+ * that the correct bias for 64bit doubles is actually 1075, but we use
+ * 1053 instead for two reasons:
+ *
+ * 1) To perform rounding later on, we will first need the target
+ * value in a 31.1 fixedpoint format. Thus, the exponent value needs
+ * to be one unit less: (1075  1: 1074).
+ *
+ * 2) To avoid shifting the mantissa as a full 64bit integer (which is
+ * costly on certain architectures), we break the shift into two parts.
+ * First, the upper and lower parts of the mantissa are shifted
+ * individually by a constant amount that all valid inputs will
+ * require at the very least. This amount is chosen to be 21,
+ * because this will allow the two parts of the mantissa to
+ * later be combined into a single 32bit representation, on which the
+ * remainder of the shift will be performed. Thus, we decrease the
+ * exponent value by an additional 21: (1074  21: 1053).
+ */
+ exponent = 1053  ((most_significant_word >> 20) & 0x7FF);
+
+ /* Next, we take the upper part of the mantissa from the most significant
+ * word of the input double, OR in the implicit 1, and perform the shift
+ * described above. Notice that by shifting the 32bit integer left 11
+ * places, we get the same result as we would had we shifted a 64bit
+ * integer right 21 places and extracted the lower 32 bits.
+ */
+ integer_result = (most_significant_word  0x100000) << 11;
+
+ /* Then, we shift the lower part of the mantissa by the constant value as
+ * outlined above, and OR the result with the top part of the mantissa.
+ * Both parts of the mantissa are now packed into a single 32bit integer.
+ */
+ integer_result = (least_significant_word >> 21);
+
+ /* Here, we perform the shift that converts the X.Y fixedpoint number
+ * currently found in integer_result to the desired 31.1 fixedpoint format
+ * needed for the rounding step. It is important to consider all possible
+ * values for exponent at this point:
+ *
+ *  {exponent < 0} Since exponent is an unsigned integer, it really
+ * can't have a value less than zero. But, if the exponent calculation
+ * above caused underflow (which would happen with input > INT_MAX or
+ * input <= INT_MIN) then exponent will now be a very large number, and
+ * so this shift will result in complete garbage. But that's OK, as
+ * the input was out of our range, so our output is expected to be
+ * undefined.
+ *
+ *  {exponent > 31} If the magnitude of the input was very small
+ * (i.e. input << 1.0), the exponent will have a value greater than
+ * 31. Thus, this shift will also result in garbage. So, after performing
+ * the shift, we will zeroout integer_result if this is the case.
+ *
+ *  {0 <= exponent < 32} In this case, the shift will properly convert
+ * the mantissa into a 31.1 fixedpoint number.
+ */
+ integer_result >>= exponent;
+
+ /* We zeroout the result if the magnitude of the input was very small
+ * (as explained in the section above).
+ */
+ if (exponent > 31)
+ integer_result = 0;
+
+ /* This is where we perform rounding with the 31.1 fixedpoint number.
+ * Since what we're after is arithmetic rounding, we simply add the
+ * single fractional bit into the integer part of integer_result, and
+ * just keep the integer part.
+ */
+ integer_result = (integer_result >> 1) + (integer_result & 0x1);
+
+ /* If the input double was a negative number, then we have to negate our
+ * output.
+ */
+ if (most_significant_word & 0x80000000)
+ integer_result = integer_result;
+
+ return integer_result;
+}
/* cairo_fixed.c */
cairo_private cairo_fixed_t

1.2.6
More information about the cairo
mailing list