[Mesa-dev] [PATCH 3/3] llvmpipe: add sse code for fixed position calculation

Fri Jan 8 11:33:10 PST 2016

Am 07.01.2016 um 19:56 schrieb Roland Scheidegger:
> Am 07.01.2016 um 16:40 schrieb Jose Fonseca:
>> On 07/01/16 06:18, Roland Scheidegger wrote:
>>> Am 04.01.2016 um 20:38 schrieb Jose Fonseca:
>>>> On 02/01/16 20:39, sroland at vmware.com wrote:
>>>>> From: Roland Scheidegger <sroland at vmware.com>
>>>>>
>>>>> This is quite a few less instructions, albeit still do the 2 64bit muls
>>>>> with scalar c code (they'd need way more shuffles, plus fixup for the
>>>>> signed
>>>>> mul so it totally doesn't seem worth it - x86 can do 32x32->64bit
>>>>> signed
>>>>> scalar muls natively just fine after all (even on 32bit).
>>>>>
>>>>> (This still doesn't have a measurable performance impact in reality,
>>>>> although
>>>>> profiler seems to say time spent in setup indeed has gone down by 10%
>>>>> or so
>>>>> overall.)
>>>>> ---
>>>>>    src/gallium/drivers/llvmpipe/lp_setup_tri.c | 58
>>>>> +++++++++++++++++++++++++----
>>>>>    1 file changed, 50 insertions(+), 8 deletions(-)
>>>>>
>>>>> diff --git a/src/gallium/drivers/llvmpipe/lp_setup_tri.c
>>>>> b/src/gallium/drivers/llvmpipe/lp_setup_tri.c
>>>>> index cb1d715..fefd1c1 100644
>>>>> --- a/src/gallium/drivers/llvmpipe/lp_setup_tri.c
>>>>> +++ b/src/gallium/drivers/llvmpipe/lp_setup_tri.c
>>>>> @@ -65,11 +65,11 @@ fixed_to_float(int a)
>>>>>    struct fixed_position {
>>>>>       int32_t x[4];
>>>>>       int32_t y[4];
>>>>> -   int64_t area;
>>>>>       int32_t dx01;
>>>>>       int32_t dy01;
>>>>>       int32_t dx20;
>>>>>       int32_t dy20;
>>>>> +   int64_t area;
>>>>>    };
>>>>>
>>>>>
>>>>> @@ -866,29 +866,71 @@ static void retry_triangle_ccw( struct
>>>>> lp_setup_context *setup,
>>>>>
>>>>>    /**
>>>>>     * Calculate fixed position data for a triangle
>>>>> + * It is unfortunate we need to do that here (as we need area
>>>>> + * calculated in fixed point), as there's quite some code duplication
>>>>> + * to what is done in the jit setup prog.
>>>>>     */
>>>>>    static inline void
>>>>> -calc_fixed_position( struct lp_setup_context *setup,
>>>>> -                     struct fixed_position* position,
>>>>> -                     const float (*v0)[4],
>>>>> -                     const float (*v1)[4],
>>>>> -                     const float (*v2)[4])
>>>>> +calc_fixed_position(struct lp_setup_context *setup,
>>>>> +                    struct fixed_position* position,
>>>>> +                    const float (*v0)[4],
>>>>> +                    const float (*v1)[4],
>>>>> +                    const float (*v2)[4])
>>>>>    {
>>>>> +   /*
>>>>> +    * The rounding may not be quite the same with PIPE_ARCH_SSE
>>>>> +    * (util_iround right now only does nearest/even on x87,
>>>>> +    * otherwise nearest/away-from-zero).
>>>>> +    * Both should be acceptable, I think.
>>>>> +    */
>>>>> +#if defined(PIPE_ARCH_SSE)
>>>>> +   __m128d v0r, v1r, v2r;
>>>>> +   __m128 vxy0xy2, vxy1xy0;
>>>>> +   __m128i vxy0xy2i, vxy1xy0i;
>>>>> +   __m128i dxdy0120, x0x2y0y2, x1x0y1y0, x0120, y0120;
>>>>> +   __m128 pix_offset = _mm_set1_ps(setup->pixel_offset);
>>>>> +   __m128 fixed_one = _mm_set1_ps((float)FIXED_ONE);
>>>>> +   v0r = _mm_load_sd((const double *)v0[0]);
>>>>> +   v1r = _mm_load_sd((const double *)v1[0]);
>>>>> +   v2r = _mm_load_sd((const double *)v2[0]);
>>>>> +   vxy0xy2 = (__m128)_mm_unpacklo_pd(v0r, v2r);
>>>>> +   vxy1xy0 = (__m128)_mm_unpacklo_pd(v1r, v0r);
>>>>> +   vxy0xy2 = _mm_sub_ps(vxy0xy2, pix_offset);
>>>>> +   vxy1xy0 = _mm_sub_ps(vxy1xy0, pix_offset);
>>>>> +   vxy0xy2 = _mm_mul_ps(vxy0xy2, fixed_one);
>>>>> +   vxy1xy0 = _mm_mul_ps(vxy1xy0, fixed_one);
>>>>> +   vxy0xy2i = _mm_cvtps_epi32(vxy0xy2);
>>>>> +   vxy1xy0i = _mm_cvtps_epi32(vxy1xy0);
>>>>> +   dxdy0120 = _mm_sub_epi32(vxy0xy2i, vxy1xy0i);
>>>>> +   _mm_store_si128((__m128i *)&position->dx01, dxdy0120);
>>>>> +   /*
>>>>> +    * For the mul, would need some more shuffles, plus emulation
>>>>> +    * for the signed mul (without sse41), so don't bother.
>>>>> +    */
>>>>> +   x0x2y0y2 = _mm_shuffle_epi32(vxy0xy2i, _MM_SHUFFLE(3,1,2,0));
>>>>> +   x1x0y1y0 = _mm_shuffle_epi32(vxy1xy0i, _MM_SHUFFLE(3,1,2,0));
>>>>> +   x0120 = _mm_unpacklo_epi32(x0x2y0y2, x1x0y1y0);
>>>>> +   y0120 = _mm_unpackhi_epi32(x0x2y0y2, x1x0y1y0);
>>>>> +   _mm_store_si128((__m128i *)&position->x[0], x0120);
>>>>> +   _mm_store_si128((__m128i *)&position->y[0], y0120);
>>>>> +
>>>>> +#else
>>>>>       position->x[0] = subpixel_snap(v0[0][0] - setup->pixel_offset);
>>>>>       position->x[1] = subpixel_snap(v1[0][0] - setup->pixel_offset);
>>>>>       position->x[2] = subpixel_snap(v2[0][0] - setup->pixel_offset);
>>>>> -   position->x[3] = 0;
>>>>> +   position->x[3] = 0; // should be unused
>>>>>
>>>>>       position->y[0] = subpixel_snap(v0[0][1] - setup->pixel_offset);
>>>>>       position->y[1] = subpixel_snap(v1[0][1] - setup->pixel_offset);
>>>>>       position->y[2] = subpixel_snap(v2[0][1] - setup->pixel_offset);
>>>>> -   position->y[3] = 0;
>>>>> +   position->y[3] = 0; // should be unused
>>>>>
>>>>>       position->dx01 = position->x[0] - position->x[1];
>>>>>       position->dy01 = position->y[0] - position->y[1];
>>>>>
>>>>>       position->dx20 = position->x[2] - position->x[0];
>>>>>       position->dy20 = position->y[2] - position->y[0];
>>>>> +#endif
>>>>>
>>>>>       position->area = IMUL64(position->dx01, position->dy20) -
>>>>>             IMUL64(position->dx20, position->dy01);
>>>>>
>>>>
>>>> LGTM too.
>>>>
>>>> Reviewed-by: Jose Fonseca <jfonseca at vmware.com>
>>>
>>> Hmm actually I suppose I didn't do enough testing with that. This fails
>>> one piglit (completely unrelated to what the test actually wants to
>>> test), piglit/bin/glsl-1.50-geometry-end-primitive 128 -auto -fbo.
>>>
>>> This draws twice the same just shifted in x direction and expects
>>> results to be the same. But due to rounding that's apparently not quite
>>> the case. I am left wondering if the nearest-away-from-zero rounding we
>>> did before is really superior for subpixel snap or if that's just dumb
>>> luck it passes with that rounding (though util_iround previously used
>>> would have used the same nearest-even rounding on x86 without sse).
>>>
>>> GL spec of course is completely silent on how rasterization is done.
>>> d3d10 generally loves round-to-nearest-even for float->fixed point
>>> conversions but doesn't say if that applies to pixel coord snapping as
>>> well...
>>>
>>> Some analysis shows that the point x value in question is
>>> 7.90039825f - when we see it a second time it is 263.900391f (it is not
>>> exactly 256.0f more, but as exact as float math allows).
>>> So when we do the pixel offset adjust (sub 0.5f) * 256 in fixed point
>>> conversion, we get for the former: 1894.50183f. This will be rounded up
>>> of course, regardless the exact nearest rounding mode.
>>> For the second value however, we get: 67430.5f - rounded up with the
>>> util_iround (round nearest, away from zero) function, but rounded down
>>> with nearest/even rounding...
>>>
>>> So, I'd blame the test. With some different float values, it could
>>> easily fail no matter the exact rounding mode.
>>> (Both my nvidia and amd hardware (which should also have 8 subpixel bits
>>> accuracy) manage to pass the test, albeit it looks like both of them
>>> actually round down both values there, not up, so the result is
>>> different to what we get in any case - if they really do similar logic
>>> with subpixel snapping they must have calculated slightly different
>>> float values in the first place.)
>>>
>>> Still not entirely sure though what's the preferred rounding mode for
>>> subpixel snap, the assembly can of course be adjusted either way.
>>>
>>> Roland
>>
>> Yes, it sounds we want to modify the test to not rely on the
>> float->fixed rounding.
>>
> This potentially affects all tests which try to render twice the same
> thing just shifted by some amount. I suppose usually it shouldn't be a
> problem because most tests tend to use simple squares (so, 2 vertices on
> the same x and y axis, with one 45 degree each) which should be
> numerically very unproblematic as long as the verties don't exactly lie
> between two fragment centers.
> This test, though, gets its vertex positions from some more complex
> calculations, so not easily controlled. Not sure how exactly it should
> be fixed? Maybe would need to perform some sort of its own subpixel snap
> before converting back to floats.

I wasn't actually able to trigger the same issue with that test on real
hw. So I was wondering if hw actually does some rounding which makes
that impossible (I think this should be doable if they'd effectively
throw away the low-order bits for small values). However, that's clearly
not the case, as I hacked together some test which shows that indeed
this can fail on nvidia hw as well (basically, draw some tri with one
edge going from 1/256, 0 to 2/256, 1, with viewport width/height of 256
and starting at x,y 0 then draw the same thing again but with viewport
shifted by 256 in x direction - then loop over and over again with one
vertex x position adjusted by + 1/2^24 per loop iteration until it fails
- takes ages...).

However I'm actually beginning to think it would be desirable to have
some "nearest-floor" or "nearest-ceil" rounding instead of nearest-even
(to ensure even spacing, albeit that would only really be true if two
values have the same exponent in any case). So maybe it's indeed not
worth worrying about exact rounding.

Roland