[Mesa-dev] [PATCH 1/5] util: import public domain code for integer division by a constant

Wed Sep 26 21:07:04 UTC 2018

On Tue, Sep 25, 2018 at 5:33 AM Jason Ekstrand <jason at jlekstrand.net> wrote:
>
> On Mon, Sep 24, 2018 at 7:15 PM Marek Olšák <maraeo at gmail.com> wrote:
>>
>> This patch also handles all types, just differently. If you change the
>> typedefs in the header, you'll get a different type and the code is
>> exactly the same for all types, but that's not important (to me
>> anyway).
>>
>> It also supports signed division. (not important to me either, but may
>> be important to you)
>
>
> Mine supported signed division as well though what's here might be a bit more clever.  I'll have to give it some thought.
>
>>
>> Did you figure out the algorithm by yourself or did you copy it from
>> somewhere? The reason I'm asking is that yours only seems to implement
>> the Round-Up algorithm and you said:
>>
>> "In particular, we want to have m < 2^N so that we don't have any
>> overflow problems.  Unfortunately, this isn't always achievable."
>
>
> Yes, mine is based on the round up algorithm.  However, it's not the blind round-up algorithm; it's a bit smarter than that.
>
>>
>> Let me tell you what. This patch achieves it ALWAYS.
>
>
> I don't think that's true.  You still have an N+1 bit multiplier, you just call it the increment bit.  The saturated add, however, is a neat trick that probably lets you avoid the weirness around adding in the increment factor.  I'll need to look at the web-site you linked and think about this stuff again before I can verify it.

If your hw doesn't have a saturated add, you can still achieve the result with:

addcarry (32-bit add doing "+1")
subborrow (saturation)

It takes advantage of the fact that increment <= 1. The idea is that
addcarry returns carry=1 if it overflowed, and subborrow subtracts 1
if carry=1. It avoids all 64-bit math and you only need one 32-bit
intermediate register for the division.

>
>>
>> This patch implements 2 algorithms for unsigned division: Round-Up and
>> Round-Down. The paper I linked shows that the Round Down algorithm
>> generates better code for some divisors than the Round Up algorithm,
>> because the multiplier always fits into 32 bits. The most operations
>> you'll ever need are: 2 shifts, 32-bit saturated ADD and UMUL_HI.
>>
>> Marek
>>
>> On Mon, Sep 24, 2018 at 7:41 PM, Marek Olšák <maraeo at gmail.com> wrote:
>> > Did you copy the code from the same author?
>> >
>> > Does your version also have an interface for dividing by a uniform
>> > instead of a compile time constant?
>> >
>> > Note that this algorithm was originally only written for
>> > non-power-of-two divisors and I extended it to support 1 and
>> > power-of-two divisors in order to support dividing by a uniform in a
>> > generic way. The other two generic variants that I added are also
>> > important. One of them assumes no unsigned wraparounds and the other
>> > one assumes operands have 31 bits and the divisor is >= 2.
>> >
>> > Marek
>> >
>> > On Mon, Sep 24, 2018 at 10:00 AM, Jason Ekstrand <jason at jlekstrand.net> wrote:
>> >> Very similar.... And mine handles 8, 16, and 64-bit types. :-D
>> >>
>> >> --Jason
>> >>
>> >> On Mon, Sep 24, 2018 at 8:53 AM Ian Romanick <idr at freedesktop.org> wrote:
>> >>>
>> >>> I didn't look really closely at either set, but this seems really
>> >>> similar to something Jason sent out a week or two.  Perhaps you guys
>> >>> could unify these?
>> >>>
>> >>> On 09/23/2018 09:57 AM, Marek Olšák wrote:
>> >>> > From: Marek Olšák <marek.olsak at amd.com>
>> >>> >
>> >>> > Compilers can use this to generate optimal code for integer division
>> >>> > by a constant.
>> >>> >
>> >>> > Additionally, an unsigned division by a uniform that is constant but not
>> >>> > known at compile time can still be optimized by passing 2-4 division
>> >>> > factors to the shader as uniforms and executing one of the fast_udiv*
>> >>> > variants. The signed division algorithm doesn't have this capability.
>> >>> > ---
>> >>> >  src/util/Makefile.sources     |   2 +
>> >>> >  src/util/fast_idiv_by_const.c | 245
>> >>> > ++++++++++++++++++++++++++++++++++++++++++
>> >>> >  src/util/fast_idiv_by_const.h | 173 +++++++++++++++++++++++++++++
>> >>> >  src/util/meson.build          |   2 +
>> >>> >  4 files changed, 422 insertions(+)
>> >>> >  create mode 100644 src/util/fast_idiv_by_const.c
>> >>> >  create mode 100644 src/util/fast_idiv_by_const.h
>> >>> >
>> >>> > diff --git a/src/util/Makefile.sources b/src/util/Makefile.sources
>> >>> > index b562d6c..f741b2a 100644
>> >>> > --- a/src/util/Makefile.sources
>> >>> > +++ b/src/util/Makefile.sources
>> >>> > @@ -3,20 +3,22 @@ MESA_UTIL_FILES := \
>> >>> >       bitscan.h \
>> >>> >       bitset.h \
>> >>> >       build_id.c \
>> >>> >       build_id.h \
>> >>> >       crc32.c \
>> >>> >       crc32.h \
>> >>> >       debug.c \
>> >>> >       debug.h \
>> >>> >       disk_cache.c \
>> >>> >       disk_cache.h \
>> >>> > +     fast_idiv_by_const.c \
>> >>> > +     fast_idiv_by_const.h \
>> >>> >       format_r11g11b10f.h \
>> >>> >       format_rgb9e5.h \
>> >>> >       format_srgb.h \
>> >>> >       futex.h \
>> >>> >       half_float.c \
>> >>> >       half_float.h \
>> >>> >       hash_table.c \
>> >>> >       hash_table.h \
>> >>> >       list.h \
>> >>> >       macros.h \
>> >>> > diff --git a/src/util/fast_idiv_by_const.c
>> >>> > b/src/util/fast_idiv_by_const.c
>> >>> > new file mode 100644
>> >>> > index 0000000..f247b66
>> >>> > --- /dev/null
>> >>> > +++ b/src/util/fast_idiv_by_const.c
>> >>> > @@ -0,0 +1,245 @@
>> >>> > +/*
>> >>> > + * Copyright © 2018 Advanced Micro Devices, Inc.
>> >>> > + *
>> >>> > + * Permission is hereby granted, free of charge, to any person
>> >>> > obtaining a
>> >>> > + * copy of this software and associated documentation files (the
>> >>> > "Software"),
>> >>> > + * to deal in the Software without restriction, including without
>> >>> > limitation
>> >>> > + * the rights to use, copy, modify, merge, publish, distribute,
>> >>> > sublicense,
>> >>> > + * and/or sell copies of the Software, and to permit persons to whom
>> >>> > the
>> >>> > + * Software is furnished to do so, subject to the following conditions:
>> >>> > + *
>> >>> > + * The above copyright notice and this permission notice (including the
>> >>> > next
>> >>> > + * paragraph) shall be included in all copies or substantial portions
>> >>> > of the
>> >>> > + * Software.
>> >>> > + *
>> >>> > + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
>> >>> > EXPRESS OR
>> >>> > + * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
>> >>> > MERCHANTABILITY,
>> >>> > + * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.  IN NO EVENT
>> >>> > SHALL
>> >>> > + * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR
>> >>> > OTHER
>> >>> > + * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE,
>> >>> > ARISING
>> >>> > + * FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
>> >>> > DEALINGS
>> >>> > + * IN THE SOFTWARE.
>> >>> > + */
>> >>> > +
>> >>> > +/* Imported from:
>> >>> > + *
>> >>> > https://raw.githubusercontent.com/ridiculousfish/libdivide/master/divide_by_constants_codegen_reference.c
>> >>> > + * Paper:
>> >>> > + *
>> >>> > http://ridiculousfish.com/files/faster_unsigned_division_by_constants.pdf
>> >>> > + *
>> >>> > + * The author, ridiculous_fish, wrote:
>> >>> > + *
>> >>> > + *  ''Reference implementations of computing and using the "magic
>> >>> > number"
>> >>> > + *    approach to dividing by constants, including codegen
>> >>> > instructions.
>> >>> > + *    The unsigned division incorporates the "round down" optimization
>> >>> > per
>> >>> > + *    ridiculous_fish.
>> >>> > + *
>> >>> > + *    This is free and unencumbered software. Any copyright is
>> >>> > dedicated
>> >>> > + *    to the Public Domain.''
>> >>> > + */
>> >>> > +
>> >>> > +#include "fast_idiv_by_const.h"
>> >>> > +#include "u_math.h"
>> >>> > +#include <limits.h>
>> >>> > +#include <assert.h>
>> >>> > +
>> >>> > +/* uint_t and sint_t can be replaced by different integer types and the
>> >>> > code
>> >>> > + * will work as-is. The only requirement is that sizeof(uintN) ==
>> >>> > sizeof(intN).
>> >>> > + */
>> >>> > +
>> >>> > +struct util_fast_udiv_info
>> >>> > +util_compute_fast_udiv_info(uint_t D, unsigned num_bits)
>> >>> > +{
>> >>> > +   /* The numerator must fit in a uint_t */
>> >>> > +   assert(num_bits > 0 && num_bits <= sizeof(uint_t) * CHAR_BIT);
>> >>> > +   assert(D != 0);
>> >>> > +
>> >>> > +   /* The eventual result */
>> >>> > +   struct util_fast_udiv_info result;
>> >>> > +
>> >>> > +   if (util_is_power_of_two_nonzero(D)) {
>> >>> > +      unsigned div_shift = util_logbase2(D);
>> >>> > +
>> >>> > +      if (div_shift) {
>> >>> > +         /* Dividing by a power of two. */
>> >>> > +         result.multiplier = 1 << 31;
>
>
> This is wrong for non-32-bit

Indeed. Do we need other bit sizes for NIR?

>
>>
>> >>> > +         result.pre_shift = 0;
>> >>> > +         result.post_shift = div_shift - 1;
>> >>> > +         result.increment = 0;
>> >>> > +         return result;
>> >>> > +      } else {
>> >>> > +         /* Dividing by 1. */
>> >>> > +         /* Assuming: floor((num + 1) * (2^32 - 1) / 2^32) = num */
>> >>> > +         result.multiplier = UINT_MAX;
>
>
> So is this.
>
> Can we at the very least pull in the unit tests from my series?

Yes.

Marek