[Mesa-dev] [PATCH 2/2] llvmpipe: avoid most 64 bit math in rasterization

Wed Jan 6 13:50:40 PST 2016

On 06/01/16 18:18, Roland Scheidegger wrote:
> Am 06.01.2016 um 17:31 schrieb Jose Fonseca:
>> On 06/01/16 16:26, Jose Fonseca wrote:
>>> On 06/01/16 00:06, sroland at vmware.com wrote:
>>>> From: Roland Scheidegger <sroland at vmware.com>
>>>>
>>>> The trick here is to recognize that in the c + n * dcdx calculations,
>>>> not only can the lower FIXED_ORDER bits not change (as the dcdx values
>>>> have those all zero) but that this means the sign bit of the
>>>> calculations
>>>> cannot be different as well, that is
>>>> sign(c + n*dcdx) == sign((c >> FIXED_ORDER) + n*(dcdx >> FIXED_ORDER)).
>>>> That shaves off more than enough bits to never require 64bit masks.
>>>> A shifted plane c value could still easily exceed 32 bits, however
>>>> since we
>>>> throw out planes which are trivial accept even before binning (and
>>>> similarly
>>>> don't even get to see tris for which there was a trivial reject
>>>> plane)) this
>>>> is never a problem.
>>>> The idea isnt't all that revolutionary, in fact something similar was
>>>> tried
>>>> ages ago (9773722c2b09d5f0615a47cecf4347859474dc56) back when the
>>>> values were
>>>> only 32 bit anyway. I believe now it didn't quite work then because the
>>>> adjustment needed for testing trivial reject / partial masks wasn't
>>>> handled
>>>> correctly.
>>>> This still keeps the separate 32/64 bit paths for now, as the 32 bit
>>>> one still
>>>> looks minimally simpler (and also because if we'd pass in dcdx/dcdy/eo
>>>> unscaled
>>>> from setup which would be a good reason to ditch the 32 bit path, we'd
>>>> need to
>>>> change the special-purpose rasterization functions for small tris).
>>>>
>>>> This passes piglit triangle-rasterization (-fbo -auto -max_size
>>>> -subpixelbits 8). It still fails triangle-rasterization-overdraw
>>>> -max_size
>>>> (no change, fails everything at position 2048 - interestingly for
>>>> softpipe,
>>>> nvidia maxwell 1 blob, and amd evergreen open-source drivers the test
>>>> fails
>>>> as well but at 4096 - seems like we're missing a float mantissa bit
>>>> somewhere!).
>>>
>>> I don't think that's how the test is supposed to be run.
>>>
>>> If you do an apitrace, you'll see the test creates a fbo with 1000x1000,
>>> a viewport with 16Kx16K, and does a readpixels of 4Kx4K...
>>
>> The problem is that the generic "-fbo" option is not useful for this, as
>> we can't reliably resize it after the fact.
>>
>> Take a look at tests/general/triangle-rasterization.cpp -- it has a
>> different option "-use_fbo" that creates its own fbo.
> OK I was running that the wrong way too I think. This one still passes
> with -max_size -use_fbo -subpixelbits 8 (takes _forever_ though - all
> due to convert_ubyte in readpixel path...)
>
> triangle-rasterization-overdraw with just -auto passes. The max_size
> parameter is a bit confusing since it won't do anything at
> all without -fbo as piglit_width/height will just get overwritten to
> window_width/height (and with fbo it will just fail badly). Increasing
> the window size manually to 8192/8192 won't really work neither as the
> size will be cut down to screen size. However, increasing this and then
> use -fbo actually does the right thing. And passes.

Sounds great then.

I can't spot anything wrong with the change:

Reviewed-by: Jose Fonseca <jfonseca at vmware.com>

> Would be nice if piglit could pick up those size parameters _after_
> piglit_init...

It might be worthwhile to modify the piglit test to bail when the passed 
options are bound to not work.

Jose

>
> Roland
>
>
>> Jose
>>
>>
>>>
>>> Jose
>>>
>>>> ---
>>>>    src/gallium/drivers/llvmpipe/lp_rast_tri.c     |  84
>>>> +++++++++----------
>>>>    src/gallium/drivers/llvmpipe/lp_rast_tri_tmp.h | 107
>>>> +++++++++++++++++++++----
>>>>    2 files changed, 133 insertions(+), 58 deletions(-)
>>>>
>>>> diff --git a/src/gallium/drivers/llvmpipe/lp_rast_tri.c
>>>> b/src/gallium/drivers/llvmpipe/lp_rast_tri.c
>>>> index c9b9221..a4dd6ef 100644
>>>> --- a/src/gallium/drivers/llvmpipe/lp_rast_tri.c
>>>> +++ b/src/gallium/drivers/llvmpipe/lp_rast_tri.c
>>>> @@ -64,43 +64,43 @@ block_full_16(struct lp_rasterizer_task *task,
>>>>    }
>>>>
>>>>    static inline unsigned
>>>> -build_mask_linear(int64_t c, int64_t dcdx, int64_t dcdy)
>>>> +build_mask_linear(int32_t c, int32_t dcdx, int32_t dcdy)
>>>>    {
>>>>       unsigned mask = 0;
>>>>
>>>> -   int64_t c0 = c;
>>>> -   int64_t c1 = c0 + dcdy;
>>>> -   int64_t c2 = c1 + dcdy;
>>>> -   int64_t c3 = c2 + dcdy;
>>>> -
>>>> -   mask |= ((c0 + 0 * dcdx) >> FIXED_SHIFT) & (1 << 0);
>>>> -   mask |= ((c0 + 1 * dcdx) >> FIXED_SHIFT) & (1 << 1);
>>>> -   mask |= ((c0 + 2 * dcdx) >> FIXED_SHIFT) & (1 << 2);
>>>> -   mask |= ((c0 + 3 * dcdx) >> FIXED_SHIFT) & (1 << 3);
>>>> -   mask |= ((c1 + 0 * dcdx) >> FIXED_SHIFT) & (1 << 4);
>>>> -   mask |= ((c1 + 1 * dcdx) >> FIXED_SHIFT) & (1 << 5);
>>>> -   mask |= ((c1 + 2 * dcdx) >> FIXED_SHIFT) & (1 << 6);
>>>> -   mask |= ((c1 + 3 * dcdx) >> FIXED_SHIFT) & (1 << 7);
>>>> -   mask |= ((c2 + 0 * dcdx) >> FIXED_SHIFT) & (1 << 8);
>>>> -   mask |= ((c2 + 1 * dcdx) >> FIXED_SHIFT) & (1 << 9);
>>>> -   mask |= ((c2 + 2 * dcdx) >> FIXED_SHIFT) & (1 << 10);
>>>> -   mask |= ((c2 + 3 * dcdx) >> FIXED_SHIFT) & (1 << 11);
>>>> -   mask |= ((c3 + 0 * dcdx) >> FIXED_SHIFT) & (1 << 12);
>>>> -   mask |= ((c3 + 1 * dcdx) >> FIXED_SHIFT) & (1 << 13);
>>>> -   mask |= ((c3 + 2 * dcdx) >> FIXED_SHIFT) & (1 << 14);
>>>> -   mask |= ((c3 + 3 * dcdx) >> FIXED_SHIFT) & (1 << 15);
>>>> +   int32_t c0 = c;
>>>> +   int32_t c1 = c0 + dcdy;
>>>> +   int32_t c2 = c1 + dcdy;
>>>> +   int32_t c3 = c2 + dcdy;
>>>> +
>>>> +   mask |= ((c0 + 0 * dcdx) >> 31) & (1 << 0);
>>>> +   mask |= ((c0 + 1 * dcdx) >> 31) & (1 << 1);
>>>> +   mask |= ((c0 + 2 * dcdx) >> 31) & (1 << 2);
>>>> +   mask |= ((c0 + 3 * dcdx) >> 31) & (1 << 3);
>>>> +   mask |= ((c1 + 0 * dcdx) >> 31) & (1 << 4);
>>>> +   mask |= ((c1 + 1 * dcdx) >> 31) & (1 << 5);
>>>> +   mask |= ((c1 + 2 * dcdx) >> 31) & (1 << 6);
>>>> +   mask |= ((c1 + 3 * dcdx) >> 31) & (1 << 7);
>>>> +   mask |= ((c2 + 0 * dcdx) >> 31) & (1 << 8);
>>>> +   mask |= ((c2 + 1 * dcdx) >> 31) & (1 << 9);
>>>> +   mask |= ((c2 + 2 * dcdx) >> 31) & (1 << 10);
>>>> +   mask |= ((c2 + 3 * dcdx) >> 31) & (1 << 11);
>>>> +   mask |= ((c3 + 0 * dcdx) >> 31) & (1 << 12);
>>>> +   mask |= ((c3 + 1 * dcdx) >> 31) & (1 << 13);
>>>> +   mask |= ((c3 + 2 * dcdx) >> 31) & (1 << 14);
>>>> +   mask |= ((c3 + 3 * dcdx) >> 31) & (1 << 15);
>>>>
>>>>       return mask;
>>>>    }
>>>>
>>>>
>>>>    static inline void
>>>> -build_masks(int64_t c,
>>>> -            int64_t cdiff,
>>>> -            int64_t dcdx,
>>>> -            int64_t dcdy,
>>>> -        unsigned *outmask,
>>>> -        unsigned *partmask)
>>>> +build_masks(int32_t c,
>>>> +            int32_t cdiff,
>>>> +            int32_t dcdx,
>>>> +            int32_t dcdy,
>>>> +            unsigned *outmask,
>>>> +            unsigned *partmask)
>>>>    {
>>>>       *outmask |= build_mask_linear(c, dcdx, dcdy);
>>>>       *partmask |= build_mask_linear(c + cdiff, dcdx, dcdy);
>>>> @@ -168,12 +168,12 @@ lp_rast_triangle_32_3_4(struct
>>>> lp_rasterizer_task *task,
>>>>
>>>>
>>>>    static inline void
>>>> -build_masks_32(int c,
>>>> -               int cdiff,
>>>> -               int dcdx,
>>>> -               int dcdy,
>>>> -               unsigned *outmask,
>>>> -               unsigned *partmask)
>>>> +build_masks_sse(int c,
>>>> +                int cdiff,
>>>> +                int dcdx,
>>>> +                int dcdy,
>>>> +                unsigned *outmask,
>>>> +                unsigned *partmask)
>>>>    {
>>>>       __m128i cstep0 = _mm_setr_epi32(c, c+dcdx, c+dcdx*2, c+dcdx*3);
>>>>       __m128i xdcdy = _mm_set1_epi32(dcdy);
>>>> @@ -214,7 +214,7 @@ build_masks_32(int c,
>>>>
>>>>
>>>>    static inline unsigned
>>>> -build_mask_linear_32(int c, int dcdx, int dcdy)
>>>> +build_mask_linear_sse(int c, int dcdx, int dcdy)
>>>>    {
>>>>       __m128i cstep0 = _mm_setr_epi32(c, c+dcdx, c+dcdx*2, c+dcdx*3);
>>>>       __m128i xdcdy = _mm_set1_epi32(dcdy);
>>>> @@ -474,8 +474,15 @@ lp_rast_triangle_32_3_4(struct lp_rasterizer_task
>>>> *task,
>>>>    #endif
>>>>
>>>>
>>>> +#ifdef PIPE_ARCH_SSE
>>>> +#define BUILD_MASKS(c, cdiff, dcdx, dcdy, omask, pmask)
>>>> build_masks_sse((int)c, (int)cdiff, dcdx, dcdy, omask, pmask)
>>>> +#define BUILD_MASK_LINEAR(c, dcdx, dcdy)
>>>> build_mask_linear_sse((int)c, dcdx, dcdy)
>>>> +#else
>>>>    #define BUILD_MASKS(c, cdiff, dcdx, dcdy, omask, pmask)
>>>> build_masks(c, cdiff, dcdx, dcdy, omask, pmask)
>>>>    #define BUILD_MASK_LINEAR(c, dcdx, dcdy) build_mask_linear(c, dcdx,
>>>> dcdy)
>>>> +#endif
>>>> +
>>>> +#define RASTER_64 1
>>>>
>>>>    #define TAG(x) x##_1
>>>>    #define NR_PLANES 1
>>>> @@ -512,12 +519,7 @@ lp_rast_triangle_32_3_4(struct lp_rasterizer_task
>>>> *task,
>>>>    #define NR_PLANES 8
>>>>    #include "lp_rast_tri_tmp.h"
>>>>
>>>> -#ifdef PIPE_ARCH_SSE
>>>> -#undef BUILD_MASKS
>>>> -#undef BUILD_MASK_LINEAR
>>>> -#define BUILD_MASKS(c, cdiff, dcdx, dcdy, omask, pmask)
>>>> build_masks_32((int)c, (int)cdiff, dcdx, dcdy, omask, pmask)
>>>> -#define BUILD_MASK_LINEAR(c, dcdx, dcdy) build_mask_linear_32((int)c,
>>>> dcdx, dcdy)
>>>> -#endif
>>>> +#undef RASTER_64
>>>>
>>>>    #define TAG(x) x##_32_1
>>>>    #define NR_PLANES 1
>>>> diff --git a/src/gallium/drivers/llvmpipe/lp_rast_tri_tmp.h
>>>> b/src/gallium/drivers/llvmpipe/lp_rast_tri_tmp.h
>>>> index e0aea94..05d2242 100644
>>>> --- a/src/gallium/drivers/llvmpipe/lp_rast_tri_tmp.h
>>>> +++ b/src/gallium/drivers/llvmpipe/lp_rast_tri_tmp.h
>>>> @@ -50,9 +50,15 @@ TAG(do_block_4)(struct lp_rasterizer_task *task,
>>>>       int j;
>>>>
>>>>       for (j = 0; j < NR_PLANES; j++) {
>>>> -      mask &= ~BUILD_MASK_LINEAR(c[j] - 1,
>>>> -                 -plane[j].dcdx,
>>>> -                 plane[j].dcdy);
>>>> +#ifdef RASTER_64
>>>> +      mask &= ~BUILD_MASK_LINEAR(((c[j] - 1) >> (int64_t)FIXED_ORDER),
>>>> +                                 -plane[j].dcdx >> FIXED_ORDER,
>>>> +                                 plane[j].dcdy >> FIXED_ORDER);
>>>> +#else
>>>> +      mask &= ~BUILD_MASK_LINEAR((c[j] - 1),
>>>> +                                 -plane[j].dcdx,
>>>> +                                 plane[j].dcdy);
>>>> +#endif
>>>>       }
>>>>
>>>>       /* Now pass to the shader:
>>>> @@ -79,17 +85,33 @@ TAG(do_block_16)(struct lp_rasterizer_task *task,
>>>>       partmask = 0;                /* outside one or more trivial
>>>> accept planes */
>>>>
>>>>       for (j = 0; j < NR_PLANES; j++) {
>>>> +#ifdef RASTER_64
>>>> +      int32_t dcdx = -plane[j].dcdx >> FIXED_ORDER;
>>>> +      int32_t dcdy = plane[j].dcdy >> FIXED_ORDER;
>>>> +      const int32_t cox = plane[j].eo >> FIXED_ORDER;
>>>> +      const int32_t ei = (dcdy + dcdx - cox) << 2;
>>>> +      const int32_t cox_s = cox << 2;
>>>> +      const int32_t co = (int32_t)(c[j] >> (int64_t)FIXED_ORDER) +
>>>> cox_s;
>>>> +      int32_t cdiff;
>>>> +      cdiff = ei - cox_s + ((int32_t)((c[j] - 1) >>
>>>> (int64_t)FIXED_ORDER) -
>>>> +                            (int32_t)(c[j] >> (int64_t)FIXED_ORDER));
>>>> +      dcdx <<= 2;
>>>> +      dcdy <<= 2;
>>>> +#else
>>>>          const int64_t dcdx = -IMUL64(plane[j].dcdx, 4);
>>>>          const int64_t dcdy = IMUL64(plane[j].dcdy, 4);
>>>>          const int64_t cox = IMUL64(plane[j].eo, 4);
>>>> -      const int64_t ei = plane[j].dcdy - plane[j].dcdx -
>>>> (int64_t)plane[j].eo;
>>>> +      const int32_t ei = plane[j].dcdy - plane[j].dcdx -
>>>> (int64_t)plane[j].eo;
>>>>          const int64_t cio = IMUL64(ei, 4) - 1;
>>>> +      int32_t co, cdiff;
>>>> +      co = c[j] + cox;
>>>> +      cdiff = cio - cox;
>>>> +#endif
>>>>
>>>> -      BUILD_MASKS(c[j] + cox,
>>>> -          cio - cox,
>>>> -          dcdx, dcdy,
>>>> -          &outmask,   /* sign bits from c[i][0..15] + cox */
>>>> -          &partmask); /* sign bits from c[i][0..15] + cio */
>>>> +      BUILD_MASKS(co, cdiff,
>>>> +                  dcdx, dcdy,
>>>> +                  &outmask,   /* sign bits from c[i][0..15] + cox */
>>>> +                  &partmask); /* sign bits from c[i][0..15] + cio */
>>>>       }
>>>>
>>>>       if (outmask == 0xffff)
>>>> @@ -179,14 +201,65 @@ TAG(lp_rast_triangle)(struct lp_rasterizer_task
>>>> *task,
>>>>          c[j] = plane[j].c + IMUL64(plane[j].dcdy, y) -
>>>> IMUL64(plane[j].dcdx, x);
>>>>
>>>>          {
>>>> -         const int64_t dcdx = -IMUL64(plane[j].dcdx, 16);
>>>> -         const int64_t dcdy = IMUL64(plane[j].dcdy, 16);
>>>> -         const int64_t cox = IMUL64(plane[j].eo, 16);
>>>> -         const int64_t ei = plane[j].dcdy - plane[j].dcdx -
>>>> (int64_t)plane[j].eo;
>>>> -         const int64_t cio = IMUL64(ei, 16) - 1;
>>>> -
>>>> -         BUILD_MASKS(c[j] + cox,
>>>> -                     cio - cox,
>>>> +#ifdef RASTER_64
>>>> +         /*
>>>> +          * Strip off lower FIXED_ORDER bits. Note that those bits from
>>>> +          * dcdx, dcdy, eo are always 0 (by definition).
>>>> +          * c values, however, are not. This means that for every
>>>> +          * addition of the form c + n*dcdx the lower FIXED_ORDER
>>>> bits will
>>>> +          * NOT change. And those bits are not relevant to the sign
>>>> bit (which
>>>> +          * is only what we need!) that is,
>>>> +          * sign(c + n*dcdx) == sign((c >> FIXED_ORDER) + n*(dcdx >>
>>>> FIXED_ORDER))
>>>> +          * This means we can get away with using 32bit math for the
>>>> most part.
>>>> +          * Only tricky part is the -1 adjustment for cdiff.
>>>> +          */
>>>> +         int32_t dcdx = -plane[j].dcdx >> FIXED_ORDER;
>>>> +         int32_t dcdy = plane[j].dcdy >> FIXED_ORDER;
>>>> +         const int32_t cox = plane[j].eo >> FIXED_ORDER;
>>>> +         const int32_t ei = (dcdy + dcdx - cox) << 4;
>>>> +         const int32_t cox_s = cox << 4;
>>>> +         const int32_t co = (int32_t)(c[j] >> (int64_t)FIXED_ORDER) +
>>>> cox_s;
>>>> +         int32_t cdiff;
>>>> +         /*
>>>> +          * Plausibility check to ensure the 32bit math works.
>>>> +          * Note that within a tile, the max we can move the edge
>>>> function
>>>> +          * is essentially dcdx * TILE_SIZE + dcdy * TILE_SIZE.
>>>> +          * TILE_SIZE is 64, dcdx/dcdy are nominally 21 bit (for 8192
>>>> max size
>>>> +          * and 8 subpixel bits), I'd be happy with 2 bits more too
>>>> (1 for
>>>> +          * increasing fb size to 16384, the required d3d11 value,
>>>> another one
>>>> +          * because I'm not quite sure we can't be _just_ above the
>>>> max value
>>>> +          * here). This gives us 30 bits max - hence if c would
>>>> exceed that here
>>>> +          * that means the plane is either trivial reject for the
>>>> whole tile
>>>> +          * (in which case the tri will not get binned), or trivial
>>>> accept for
>>>> +          * the whole tile (in which case plane_mask will not include
>>>> it).
>>>> +          */
>>>> +         assert((c[j] >> (int64_t)FIXED_ORDER) > (int32_t)0xb0000000 &&
>>>> +                (c[j] >> (int64_t)FIXED_ORDER) < (int32_t)0x3fffffff);
>>>> +         /*
>>>> +          * Note the fixup part is constant throughout the tile -
>>>> thus could
>>>> +          * just calculate this and avoid _all_ 64bit math in
>>>> rasterization
>>>> +          * (except exactly this fixup calc).
>>>> +          * In fact theoretically could move that even to setup,
>>>> albeit that
>>>> +          * seems tricky (pre-bin certainly can have values larger
>>>> than 32bit,
>>>> +          * and would need to communicate that fixup value through).
>>>> +          * And if we want to support msaa, we'd probably don't want
>>>> to do the
>>>> +          * downscaling in setup in any case...
>>>> +          */
>>>> +         cdiff = ei - cox_s + ((int32_t)((c[j] - 1) >>
>>>> (int64_t)FIXED_ORDER) -
>>>> +                               (int32_t)(c[j] >>
>>>> (int64_t)FIXED_ORDER));
>>>> +         dcdx <<= 4;
>>>> +         dcdy <<= 4;
>>>> +#else
>>>> +         const int32_t dcdx = -plane[j].dcdx << 4;
>>>> +         const int32_t dcdy = plane[j].dcdy << 4;
>>>> +         const int32_t cox = plane[j].eo << 4;
>>>> +         const int32_t ei = plane[j].dcdy - plane[j].dcdx -
>>>> (int32_t)plane[j].eo;
>>>> +         const int32_t cio = (ei << 4) - 1;
>>>> +         int32_t co, cdiff;
>>>> +         co = c[j] + cox;
>>>> +         cdiff = cio - cox;
>>>> +#endif
>>>> +         BUILD_MASKS(co, cdiff,
>>>>                         dcdx, dcdy,
>>>>                         &outmask,   /* sign bits from c[i][0..15] +
>>>> cox */
>>>>                         &partmask); /* sign bits from c[i][0..15] +
>>>> cio */
>>>>
>>>
>>
>