[Mesa-dev] [PATCH 3/3] llvmpipe: add sse code for fixed position calculation

Thu Jan 7 07:40:33 PST 2016

On 07/01/16 06:18, Roland Scheidegger wrote:
> Am 04.01.2016 um 20:38 schrieb Jose Fonseca:
>> On 02/01/16 20:39, sroland at vmware.com wrote:
>>> From: Roland Scheidegger <sroland at vmware.com>
>>>
>>> This is quite a few less instructions, albeit still do the 2 64bit muls
>>> with scalar c code (they'd need way more shuffles, plus fixup for the
>>> signed
>>> mul so it totally doesn't seem worth it - x86 can do 32x32->64bit signed
>>> scalar muls natively just fine after all (even on 32bit).
>>>
>>> (This still doesn't have a measurable performance impact in reality,
>>> although
>>> profiler seems to say time spent in setup indeed has gone down by 10%
>>> or so
>>> overall.)
>>> ---
>>>    src/gallium/drivers/llvmpipe/lp_setup_tri.c | 58
>>> +++++++++++++++++++++++++----
>>>    1 file changed, 50 insertions(+), 8 deletions(-)
>>>
>>> diff --git a/src/gallium/drivers/llvmpipe/lp_setup_tri.c
>>> b/src/gallium/drivers/llvmpipe/lp_setup_tri.c
>>> index cb1d715..fefd1c1 100644
>>> --- a/src/gallium/drivers/llvmpipe/lp_setup_tri.c
>>> +++ b/src/gallium/drivers/llvmpipe/lp_setup_tri.c
>>> @@ -65,11 +65,11 @@ fixed_to_float(int a)
>>>    struct fixed_position {
>>>       int32_t x[4];
>>>       int32_t y[4];
>>> -   int64_t area;
>>>       int32_t dx01;
>>>       int32_t dy01;
>>>       int32_t dx20;
>>>       int32_t dy20;
>>> +   int64_t area;
>>>    };
>>>
>>>
>>> @@ -866,29 +866,71 @@ static void retry_triangle_ccw( struct
>>> lp_setup_context *setup,
>>>
>>>    /**
>>>     * Calculate fixed position data for a triangle
>>> + * It is unfortunate we need to do that here (as we need area
>>> + * calculated in fixed point), as there's quite some code duplication
>>> + * to what is done in the jit setup prog.
>>>     */
>>>    static inline void
>>> -calc_fixed_position( struct lp_setup_context *setup,
>>> -                     struct fixed_position* position,
>>> -                     const float (*v0)[4],
>>> -                     const float (*v1)[4],
>>> -                     const float (*v2)[4])
>>> +calc_fixed_position(struct lp_setup_context *setup,
>>> +                    struct fixed_position* position,
>>> +                    const float (*v0)[4],
>>> +                    const float (*v1)[4],
>>> +                    const float (*v2)[4])
>>>    {
>>> +   /*
>>> +    * The rounding may not be quite the same with PIPE_ARCH_SSE
>>> +    * (util_iround right now only does nearest/even on x87,
>>> +    * otherwise nearest/away-from-zero).
>>> +    * Both should be acceptable, I think.
>>> +    */
>>> +#if defined(PIPE_ARCH_SSE)
>>> +   __m128d v0r, v1r, v2r;
>>> +   __m128 vxy0xy2, vxy1xy0;
>>> +   __m128i vxy0xy2i, vxy1xy0i;
>>> +   __m128i dxdy0120, x0x2y0y2, x1x0y1y0, x0120, y0120;
>>> +   __m128 pix_offset = _mm_set1_ps(setup->pixel_offset);
>>> +   __m128 fixed_one = _mm_set1_ps((float)FIXED_ONE);
>>> +   v0r = _mm_load_sd((const double *)v0[0]);
>>> +   v1r = _mm_load_sd((const double *)v1[0]);
>>> +   v2r = _mm_load_sd((const double *)v2[0]);
>>> +   vxy0xy2 = (__m128)_mm_unpacklo_pd(v0r, v2r);
>>> +   vxy1xy0 = (__m128)_mm_unpacklo_pd(v1r, v0r);
>>> +   vxy0xy2 = _mm_sub_ps(vxy0xy2, pix_offset);
>>> +   vxy1xy0 = _mm_sub_ps(vxy1xy0, pix_offset);
>>> +   vxy0xy2 = _mm_mul_ps(vxy0xy2, fixed_one);
>>> +   vxy1xy0 = _mm_mul_ps(vxy1xy0, fixed_one);
>>> +   vxy0xy2i = _mm_cvtps_epi32(vxy0xy2);
>>> +   vxy1xy0i = _mm_cvtps_epi32(vxy1xy0);
>>> +   dxdy0120 = _mm_sub_epi32(vxy0xy2i, vxy1xy0i);
>>> +   _mm_store_si128((__m128i *)&position->dx01, dxdy0120);
>>> +   /*
>>> +    * For the mul, would need some more shuffles, plus emulation
>>> +    * for the signed mul (without sse41), so don't bother.
>>> +    */
>>> +   x0x2y0y2 = _mm_shuffle_epi32(vxy0xy2i, _MM_SHUFFLE(3,1,2,0));
>>> +   x1x0y1y0 = _mm_shuffle_epi32(vxy1xy0i, _MM_SHUFFLE(3,1,2,0));
>>> +   x0120 = _mm_unpacklo_epi32(x0x2y0y2, x1x0y1y0);
>>> +   y0120 = _mm_unpackhi_epi32(x0x2y0y2, x1x0y1y0);
>>> +   _mm_store_si128((__m128i *)&position->x[0], x0120);
>>> +   _mm_store_si128((__m128i *)&position->y[0], y0120);
>>> +
>>> +#else
>>>       position->x[0] = subpixel_snap(v0[0][0] - setup->pixel_offset);
>>>       position->x[1] = subpixel_snap(v1[0][0] - setup->pixel_offset);
>>>       position->x[2] = subpixel_snap(v2[0][0] - setup->pixel_offset);
>>> -   position->x[3] = 0;
>>> +   position->x[3] = 0; // should be unused
>>>
>>>       position->y[0] = subpixel_snap(v0[0][1] - setup->pixel_offset);
>>>       position->y[1] = subpixel_snap(v1[0][1] - setup->pixel_offset);
>>>       position->y[2] = subpixel_snap(v2[0][1] - setup->pixel_offset);
>>> -   position->y[3] = 0;
>>> +   position->y[3] = 0; // should be unused
>>>
>>>       position->dx01 = position->x[0] - position->x[1];
>>>       position->dy01 = position->y[0] - position->y[1];
>>>
>>>       position->dx20 = position->x[2] - position->x[0];
>>>       position->dy20 = position->y[2] - position->y[0];
>>> +#endif
>>>
>>>       position->area = IMUL64(position->dx01, position->dy20) -
>>>             IMUL64(position->dx20, position->dy01);
>>>
>>
>> LGTM too.
>>
>> Reviewed-by: Jose Fonseca <jfonseca at vmware.com>
>
> Hmm actually I suppose I didn't do enough testing with that. This fails
> one piglit (completely unrelated to what the test actually wants to
> test), piglit/bin/glsl-1.50-geometry-end-primitive 128 -auto -fbo.
>
> This draws twice the same just shifted in x direction and expects
> results to be the same. But due to rounding that's apparently not quite
> the case. I am left wondering if the nearest-away-from-zero rounding we
> did before is really superior for subpixel snap or if that's just dumb
> luck it passes with that rounding (though util_iround previously used
> would have used the same nearest-even rounding on x86 without sse).
>
> GL spec of course is completely silent on how rasterization is done.
> d3d10 generally loves round-to-nearest-even for float->fixed point
> conversions but doesn't say if that applies to pixel coord snapping as
> well...
>
> Some analysis shows that the point x value in question is
> 7.90039825f - when we see it a second time it is 263.900391f (it is not
> exactly 256.0f more, but as exact as float math allows).
> So when we do the pixel offset adjust (sub 0.5f) * 256 in fixed point
> conversion, we get for the former: 1894.50183f. This will be rounded up
> of course, regardless the exact nearest rounding mode.
> For the second value however, we get: 67430.5f - rounded up with the
> util_iround (round nearest, away from zero) function, but rounded down
> with nearest/even rounding...
>
> So, I'd blame the test. With some different float values, it could
> easily fail no matter the exact rounding mode.
> (Both my nvidia and amd hardware (which should also have 8 subpixel bits
> accuracy) manage to pass the test, albeit it looks like both of them
> actually round down both values there, not up, so the result is
> different to what we get in any case - if they really do similar logic
> with subpixel snapping they must have calculated slightly different
> float values in the first place.)
>
> Still not entirely sure though what's the preferred rounding mode for
> subpixel snap, the assembly can of course be adjusted either way.
>
> Roland

Yes, it sounds we want to modify the test to not rely on the 
float->fixed rounding.

Jose